If you’ve ever squinted at -O2 output and noticed that your carefully-constructed struct Point { int x, y; } has vanished entirely, the pass responsible is SROA — Scalar Replacement of Aggregates. SROA takes an alloca of a struct or array and splits it into smaller pieces (often individual scalar values), which can then be promoted to SSA registers. The struct stops existing in memory at all.

This post walks through how SROA does that: the slicing algorithm, the partitioning step, the rewriter, and the hand-off to mem2reg. All line numbers refer to llvm/lib/Transforms/Scalar/SROA.cpp. The header is llvm/include/llvm/Transforms/Scalar/SROA.h.

The problem

Consider this C code:

struct Point { int x, y; };

int getX() {
  Point p;
  p.x = 10;
  p.y = 20;
  return p.x;
}

Before SROA, Clang emits:

define i32 @getX() {
entry:
  %p = alloca { i32, i32 }
  %gep.x = getelementptr { i32, i32 }, ptr %p, i32 0, i32 0
  store i32 10, ptr %gep.x
  %gep.y = getelementptr { i32, i32 }, ptr %p, i32 0, i32 1
  store i32 20, ptr %gep.y
  %v.x = load i32, ptr %gep.x
  ret i32 %v.x
}

Every other optimization pass now has a problem: %p is a single 8-byte alloca, and the fields are accessed via GEPs. Without SROA, passes like mem2reg don’t try to reason about partial accesses to an alloca — they only promote allocas whose every load/store covers the whole allocation. So %p stays in memory, and the final binary has stack traffic for what should be a constant.

After SROA:

define i32 @getX() {
entry:
  %p.x = alloca i32
  %p.y = alloca i32
  store i32 10, ptr %p.x
  store i32 20, ptr %p.y
  %v.x = load i32, ptr %p.x
  ret i32 %v.x
}

Now each field has its own scalar alloca. mem2reg can immediately promote both to SSA values, and dead-code elimination throws away the store to %p.y (since nobody reads it). The final IR is just ret i32 10.

That’s the shape of the transformation. The rest of this post is how SROA actually gets there.

The main loop

SROA is a function pass. Its outer loop (SROA.cpp:6034-6057) is a classic worklist:

do {
  while (!Worklist.empty()) {
    auto [IterationChanged, IterationCFGChanged] =
        runOnAlloca(*Worklist.pop_back_val());
    Changed |= IterationChanged;
    CFGChanged |= IterationCFGChanged;
    Changed |= deleteDeadInstructions(DeletedAllocas);
    if (!DeletedAllocas.empty()) {
      Worklist.set_subtract(DeletedAllocas);
      PostPromotionWorklist.set_subtract(DeletedAllocas);
      PromotableAllocas.set_subtract(DeletedAllocas);
      DeletedAllocas.clear();
    }
  }
  Changed |= promoteAllocas();
  Worklist = PostPromotionWorklist;
  PostPromotionWorklist.clear();
} while (!Worklist.empty());

The loop:

  1. Takes an alloca from the worklist and processes it with runOnAlloca.
  2. When the queue drains, calls promoteAllocas — which hands everything marked “promotable” to mem2reg.
  3. New opportunities may appear after promotion (e.g., a struct alloca whose field was itself a struct); those go in PostPromotionWorklist and get a fresh pass.

This two-phase structure — slice/rewrite, then promote, then maybe slice again — is why SROA sometimes fires multiple times on deeply nested aggregates.

Step 1: build slices

The first real work is describing every access to the alloca as a byte range. That’s a Slice (SROA.cpp:526-592):

class Slice {
  uint64_t BeginOffset = 0;
  uint64_t EndOffset = 0;
  PointerIntPair<Use *, 1, bool> UseAndIsSplittable;
  Value *ProtectedFieldDisc;

  Slice(uint64_t BeginOffset, uint64_t EndOffset, Use *U, bool IsSplittable,
        Value *ProtectedFieldDisc)
      : BeginOffset(BeginOffset), EndOffset(EndOffset),
        UseAndIsSplittable(U, IsSplittable),
        ProtectedFieldDisc(ProtectedFieldDisc) {}
};

Each slice records:

To build the slice list, SROA walks every reachable use of the alloca pointer. That walk is implemented as a visitor subclass of PtrUseVisitor called SliceBuilder (SROA.cpp:1033-1470). Each relevant instruction contributes slices:

void visitLoadInst(LoadInst &LI) {
  if (!IsOffsetKnown)
    return PI.setEscapedReadOnly(&LI);

  TypeSize Size = DL.getTypeStoreSize(LI.getType());
  // ... scalable-vector handling omitted ...
  return handleLoadOrStore(LI.getType(), LI, Offset, Size.getFixedValue(),
                           LI.isVolatile());
}

For a load i32, ptr %gep.x at offset 0, this produces a slice [0, 4) pointing at the load’s pointer operand. For the load at offset 4, it produces a slice [4, 8). A memcpy(dst, src, 8) where dst is the alloca produces a single slice [0, 8) marked splittable (because the memcpy can later be broken into per-partition pieces).

When the walk finishes, the slices are sorted by offset and stored in AllocaSlices (SROA.cpp:1454-1477):

AllocaSlices::AllocaSlices(const DataLayout &DL, AllocaInst &AI) {
  SliceBuilder PB(DL, AI, *this);
  SliceBuilder::PtrInfo PtrI = PB.visitPtr(AI);
  if (PtrI.isEscaped() || PtrI.isAborted()) {
    PointerEscapingInstr = PtrI.getEscapingInst() ? PtrI.getEscapingInst()
                                                  : PtrI.getAbortingInst();
    return;
  }
  // ...
  llvm::erase_if(Slices, [](const Slice &S) { return S.isDead(); });
  llvm::stable_sort(Slices);
}

The isEscaped() check is crucial: if the alloca’s address got stored somewhere, passed to an unknown function, or used in a way SROA can’t follow, the alloca “escapes” and SROA bails out entirely. You can’t safely split an alloca whose address is observable from outside the function — some other code might be holding a pointer that depends on the struct layout.

Step 2: form partitions

With slices in hand, SROA groups them into partitions — contiguous regions of the alloca that can’t be split apart because they have overlapping accesses (SROA.cpp:743-806):

class Partition {
  uint64_t BeginOffset = 0, EndOffset = 0;
  iterator SI, SJ;
  SmallVector<Slice *, 4> SplitTails;

  uint64_t size() const {
    assert(BeginOffset < EndOffset);
    return EndOffset - BeginOffset;
  }
};

A partition is a window [Begin, End) plus the subset of slices that fall entirely inside it. If two slices overlap — say, a store i64 at offset 0 and a load i32 at offset 4 — they end up in the same partition, because you can’t split them cleanly into separate allocas.

For the Point example above, the slices are [0, 4) (store to x), [4, 8) (store to y), and [0, 4) (load from x). They naturally form two partitions: [0, 4) and [4, 8). Each partition will become its own new alloca.

Step 3: rewrite

For each partition, rewritePartition (SROA.cpp:5234-5360) creates a new, smaller alloca and walks every slice in the partition to update its IR to point at the new alloca instead of the old one:

std::pair<AllocaInst *, uint64_t>
SROA::rewritePartition(AllocaInst &AI, AllocaSlices &AS, Partition &P) {
  const DataLayout &DL = AI.getDataLayout();
  auto [PartitionTy, IsIntegerWideningViable, VecTy] =
      selectPartitionType(P, DL, AI, *C);

  AllocaInst *NewAI;
  if (PartitionTy == AI.getAllocatedType() && P.beginOffset() == 0) {
    NewAI = &AI;
  } else {
    const Align Alignment = commonAlignment(AI.getAlign(), P.beginOffset());
    NewAI = new AllocaInst(
        PartitionTy, AI.getAddressSpace(), nullptr,
        /* alignment logic */,
        AI.getName() + ".sroa." + Twine(P.begin() - AS.begin()),
        AI.getIterator());
    ++NumNewAllocas;
  }

  AllocaSliceRewriter Rewriter(DL, AS, *this, AI, *NewAI, PartitionTy, ...);
  // walk slices, rewrite each
}

The heavy lifting is in AllocaSliceRewriter (SROA.cpp:2721-3000+), which is an InstVisitor — one visit method per instruction kind. When it encounters a StoreInst whose pointer operand is a GEP into the original alloca, it computes the offset of that access relative to the new alloca’s base and rewrites the operand. A LoadInst gets the same treatment. A MemCpyInst that straddles partitions gets split into two smaller memcpys, one per partition. And so on.

The critical question the rewriter keeps asking is: “after I rewrite this partition, are all the uses of the new alloca amenable to promotion?” That means no pointer-escape, no weird type-punning, no address-taken PHI nodes SROA can’t reason about. If the answer is yes, the new alloca gets added to PromotableAllocas.

Step 4: selecting a type for the new alloca

When rewritePartition creates the new alloca, it needs to pick a type. This is selectPartitionType (SROA.cpp:5145-5221):

auto [CommonUseTy, LargestIntTy] =
    findCommonType(P.begin(), P.end(), P.endOffset());
if (CommonUseTy) {
  TypeSize CommonUseSize = DL.getTypeAllocSize(CommonUseTy);
  if (CommonUseSize.isFixed() && CommonUseSize.getFixedValue() >= P.size()) {
    if (VecTy)
      return {VecTy, false, VecTy};
    return {CommonUseTy, isIntegerWideningViable(P, CommonUseTy, DL),
            nullptr};
  }
}

SROA looks at the types of every access in the partition. If they all use the same type — say, every load and store in [0, 4) is an i32 — it picks i32. If some are i32 and some are float, it has to reason about which is the “canonical” type for that region. Worst case (types incompatible), it falls back to an [N x i8] and gives up on promoting this partition, but correctness is preserved.

This is also where a partition gets marked for integer widening — a trick where something like [0, 4) accessed as both i32 and as two i16s gets rewritten to one i32 alloca, with the i16 accesses turned into shifts and masks. That keeps the alloca promotable.

Step 5: hand to mem2reg

After every partition is rewritten, SROA calls promoteAllocas (SROA.cpp:5996-6010):

bool SROA::promoteAllocas() {
  if (PromotableAllocas.empty())
    return false;

  if (SROASkipMem2Reg) {
    LLVM_DEBUG(dbgs() << "Not promoting allocas with mem2reg!\n");
  } else {
    LLVM_DEBUG(dbgs() << "Promoting allocas with mem2reg...\n");
    NumPromoted += PromotableAllocas.size();
    PromoteMemToReg(PromotableAllocas.getArrayRef(), DTU->getDomTree(), AC);
  }

  PromotableAllocas.clear();
  return true;
}

PromoteMemToReg (from llvm/Transforms/Utils/PromoteMemToReg.h) is the classic mem2reg algorithm: insert phi nodes at dominance frontiers, replace each load with the incoming SSA value, and delete the alloca. It only works on allocas whose loads and stores each cover the entire allocation — which, thanks to SROA’s slicing, is exactly the property the new allocas have.

This is the key architectural idea: SROA doesn’t try to do SSA construction itself. It just massages messy aggregate allocas into clean scalar allocas, then defers to the existing, well-tested mem2reg machinery.

A tiny worked example

Here’s the simplest test from llvm/test/Transforms/SROA/basictest.ll:

define i32 @test1() {
entry:
  %X = alloca { i32, float }
  %Y = getelementptr { i32, float }, ptr %X, i64 0, i32 0
  store i32 0, ptr %Y
  %Z = load i32, ptr %Y
  ret i32 %Z
}

Trace through it:

  1. Slicing. The GEP at offset 0 is followed by a store and a load of i32. Two slices both at [0, 4), both use type i32. Nothing touches the float field.
  2. Partitioning. One partition covers [0, 4). The float field is in dead space that nobody reads, so it generates no partition.
  3. Rewriting. A new alloca i32 replaces the { i32, float } alloca. The GEP is folded away (it’s now GEP-identity). The store and load target the new alloca directly.
  4. Promotion. The new alloca has only full-width loads and stores, all in the same block, no escapes. mem2reg trivially replaces the load with the stored value (0) and deletes the alloca.
  5. Result: ret i32 0.

The entire stack allocation, GEP, store, and load vanish. This is why SROA is one of the first optimization passes after mem2reg — it’s a huge setup step for everything downstream.

What SROA refuses to do

The safety side of SROA is almost more interesting than the aggressive side. A few things will cause it to bail out:

These aren’t arbitrary limits — each corresponds to something in the C/C++ object model that would be broken by naively separating out fields.

Why the pipeline ordering matters

SROA runs early, right after mem2reg on the first round. The reason is that almost every downstream pass — GVN, InstCombine, LICM, LoopVectorize — assumes it’s working with scalars rather than aggregate allocas. If you left a struct field access as a getelementptr into a live alloca, those passes would have to do alias analysis through the GEP, which they mostly don’t.

So SROA’s job isn’t really “optimization” in the traditional sense. It’s canonicalization: making sure that every small piece of memory that could be a register is a register. The actual speedups come from the passes that run afterward on clean SSA.

The mental model I find useful: SROA is the plumbing that connects the frontend’s “variables live in memory, with named fields” view of the world to the optimizer’s “variables are SSA values flowing through a dataflow graph” view. Without it, most of LLVM’s middle-end would be blind to struct fields.