How LLVM's MemCpyOpt Eliminates Memory Copies || Packed Bits

Most people discover MemCpyOpt the same way: they look at the IR after opt -O2, notice that a chain of stores got fused into a single llvm.memset, and wonder which pass did that. The answer is MemCpyOpt — a single function pass that handles a grab-bag of optimizations around memcpy, memset, stores, and function-argument passing.

The pass is small by LLVM standards (about 2,200 lines) but it does a surprising amount of heavy lifting. It’s responsible for:

fusing adjacent stores into memset,
forwarding chained memcpys so the intermediate buffer disappears,
turning memcpy into memset when the source is a constant byte,
eliminating the copy in C++’s “return a struct by value” pattern (the famous call-slot optimization, which is what lets LLVM match what most people call NRVO),
and collapsing two stack allocas into one when only a memcpy connects them.

This post walks through each of those in turn. All line numbers refer to llvm/lib/Transforms/Scalar/MemCpyOptimizer.cpp (the header is llvm/include/llvm/Transforms/Scalar/MemCpyOptimizer.h). Test cases come from llvm/test/Transforms/MemCpyOpt/.

Where the pass lives

MemCpyOpt is a function pass (MemCpyOptimizer.h:23-53) that depends on several analyses:

class MemCpyOptPass : public PassInfoMixin<MemCpyOptPass> {
  TargetLibraryInfo *TLI = nullptr;
  AAResults *AA = nullptr;
  AssumptionCache *AC = nullptr;
  DominatorTree *DT = nullptr;
  PostDominatorTree *PDT = nullptr;
  MemorySSA *MSSA = nullptr;
  MemorySSAUpdater *MSSAU = nullptr;
  EarliestEscapeAnalysis *EEA = nullptr;
};

The four analyses worth knowing about before we dive in:

AAResults — alias analysis, used to answer “could these two pointers overlap?”. TBAA (the subject of my previous post) is one of its providers.
MemorySSA — a sparse representation of memory dependencies. Given a load, it can answer “which store last wrote to this location?” usually far more cheaply than rescanning the instructions in between. MemCpyOpt uses it constantly.
DominatorTree / PostDominatorTree — “does instruction A always execute before B?” sort of queries. Needed for safety proofs.
EarliestEscapeAnalysis — tracks where a pointer might first escape the current function (be captured, passed to unknown code, etc.).

The pass runs via run() (MemCpyOptimizer.cpp:2228) which delegates to runImpl() (MemCpyOptimizer.cpp:2246):

PreservedAnalyses MemCpyOptPass::run(Function &F, FunctionAnalysisManager &AM) {
  // ... obtain analyses ...
  bool MadeChange = runImpl(F, &TLI, AA, AC, DT, PDT, &MSSA->getMSSA());
  if (!MadeChange)
    return PreservedAnalyses::all();

  PreservedAnalyses PA;
  PA.preserveSet<CFGAnalyses>();
  PA.preserve<MemorySSAAnalysis>();
  return PA;
}

runImpl loops on iterateOnFunction until nothing changes — a classic fixed-point. Inside, iterateOnFunction (MemCpyOptimizer.cpp:2181-2226) is the dispatcher:

for (BasicBlock &BB : F) {
  if (!DT->isReachableFromEntry(&BB)) continue;

  for (BasicBlock::iterator BI = BB.begin(), BE = BB.end(); BI != BE;) {
    Instruction *I = &*BI++;
    bool RepeatInstruction = false;

    if (auto *SI = dyn_cast<StoreInst>(I))
      MadeChange |= processStore(SI, BI);
    else if (auto *M = dyn_cast<MemSetInst>(I))
      RepeatInstruction = processMemSet(M, BI);
    else if (auto *M = dyn_cast<MemCpyInst>(I))
      RepeatInstruction = processMemCpy(M, BI);
    else if (auto *M = dyn_cast<MemMoveInst>(I))
      RepeatInstruction = processMemMove(M, BI);
    else if (auto *CB = dyn_cast<CallBase>(I)) {
      for (unsigned i = 0, e = CB->arg_size(); i != e; ++i) {
        if (CB->isByValArgument(i))
          MadeChange |= processByValArgument(*CB, i);
        else if (CB->onlyReadsMemory(i))
          MadeChange |= processImmutArgument(*CB, i);
      }
    }
    // ...
  }
}

Every transformation below hangs off one of these process* entry points.

Optimization 1: fusing adjacent stores into `memset`

The simplest transformation to visualize. Consider this function from test/Transforms/MemCpyOpt/form-memset.ll:

; 19 sequential stores of the same byte value
store i8 %c, ptr %tmp,  align 1
store i8 %c, ptr %tmp5, align 1
store i8 %c, ptr %tmp9, align 1
; ... 15 more ...
store i8 %c, ptr %tmp73, align 1

After MemCpyOpt:

call void @llvm.memset.p0.i64(ptr align 1 %tmp, i8 %c, i64 19, i1 false)

All 19 stores collapsed into one memset.

The work happens in tryMergingIntoMemset (MemCpyOptimizer.cpp:352-501). The core loop scans forward from some starting store, trying to accumulate adjacent stores of the same byte value:

MemsetRanges Ranges(DL);

BasicBlock::iterator BI(StartInst);
MemoryUseOrDef *MemInsertPoint = nullptr;
for (++BI; !BI->isTerminator(); ++BI) {
  auto *CurrentAcc = cast_or_null<MemoryUseOrDef>(MSSA->getMemoryAccess(&*BI));
  if (CurrentAcc)
    MemInsertPoint = CurrentAcc;

  if (!isa<StoreInst>(BI) && !isa<MemSetInst>(BI)) {
    if (BI->mayWriteToMemory() || BI->mayReadFromMemory())
      break;       // any other memory op -> stop
    continue;      // harmless instruction, keep looking
  }

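  if (auto *NextStore = dyn_cast<StoreInst>(BI)) {
    // ByteVal is the byte pattern of the starting store, computed before the loop.
    Value *StoredVal = NextStore->getValueOperand();
    Value *StoredByte = isBytewiseValue(StoredVal, DL);
    if (ByteVal != StoredByte) break;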

    std::optional<int64_t> Offset =
        NextStore->getPointerOperand()->getPointerOffsetFrom(StartPtr, DL);
    if (!Offset) break;

    Ranges.addStore(*Offset, NextStore);
  }
}

Two pieces deserve a second look.

isBytewiseValue(StoredVal, DL) asks “is every byte of this value the same?”. A constant like 0x42 trivially yields 0x42. A constant like 0x4242424242424242 (an i64) also yields 0x42. A non-constant like %c (an i8 variable) yields %c itself — because an i8 has only one byte. This is the property memset needs: a single byte pattern repeated.
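To make the “every byte is the same” test concrete, here is a minimal sketch for plain 64-bit constants. The name bytewiseValue is hypothetical and this is not LLVM's implementation, which works on IR values of any type rather than raw integers.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical helper (not LLVM's API): returns the repeated byte if all
// eight bytes of V are identical, or std::nullopt otherwise.
std::optional<std::uint8_t> bytewiseValue(std::uint64_t V) {
  const std::uint8_t Byte = static_cast<std::uint8_t>(V & 0xFF);
  for (int I = 1; I < 8; ++I) {
    // Extract byte I and compare it with the lowest byte.
    if (static_cast<std::uint8_t>((V >> (8 * I)) & 0xFF) != Byte)
      return std::nullopt;
  }
  return Byte;
}
```

With this sketch, 0x4242424242424242 yields 0x42, while a value with a single differing byte yields no result, so no memset could represent it.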

MemsetRanges (MemCpyOptimizer.cpp:91-199) is a small interval-merging data structure that tracks which byte ranges have been stored to. When the scan finishes, it checks whether any range is worth materializing as a memset — the heuristic is roughly “at least 4 stores or at least 16 bytes” (see isProfitableToUseMemset).
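The interval merging can be sketched in a few lines. ToyMemsetRanges, ByteRange, and worthMemset are hypothetical names for illustration; the real MemsetRanges also tracks alignment and the instructions themselves, and the real profitability check has more cases than the rough rule quoted above.

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-in for one merged range: half-open byte
// interval [Start, End) plus the number of stores folded into it.
struct ByteRange {
  std::int64_t Start;
  std::int64_t End;
  unsigned NumStores;
};

class ToyMemsetRanges {
  std::vector<ByteRange> Ranges; // kept sorted by Start, non-overlapping

public:
  // Record a store of Size bytes at Offset, merging it with every existing
  // range that it overlaps or touches.
  void addStore(std::int64_t Offset, std::int64_t Size) {
    ByteRange New{Offset, Offset + Size, 1};
    std::vector<ByteRange> Out;
    for (const ByteRange &R : Ranges) {
      if (R.End < New.Start || New.End < R.Start) {
        Out.push_back(R); // neither overlapping nor adjacent: keep as is
      } else {
        New.Start = std::min(New.Start, R.Start);
        New.End = std::max(New.End, R.End);
        New.NumStores += R.NumStores;
      }
    }
    Out.push_back(New);
    std::sort(Out.begin(), Out.end(),
              [](const ByteRange &A, const ByteRange &B) {
                return A.Start < B.Start;
              });
    Ranges = std::move(Out);
  }

  const std::vector<ByteRange> &ranges() const { return Ranges; }
};

// The rough rule from the text: at least 4 stores or at least 16 bytes.
bool worthMemset(const ByteRange &R) {
  return R.NumStores >= 4 || R.End - R.Start >= 16;
}
```

Feeding it 19 one-byte stores at offsets 0 through 18, in any order, leaves a single range covering 19 bytes, which is the shape the form-memset.ll example turns into one memset.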

Why this is safe. The loop bails out as soon as it sees any instruction that reads or writes memory other than the stores it’s collecting. So by construction, nothing between the first and last store can observe the intermediate state — collapsing them into one bulk store is invisible to the rest of the program.

Optimization 2: `load; store` → `memcpy`

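Some frontends copy a large aggregate by emitting a load of the whole struct followed by a store. (Clang usually emits a memcpy directly for *dst = *src on a struct, but the load/store shape still shows up from other frontends.) MemCpyOpt turns that pair into a proper memcpy so the backend can use block-copy instructions.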

The function is processStoreOfLoad (MemCpyOptimizer.cpp:631-743). The core builder code (MemCpyOptimizer.cpp:674-684):

IRBuilder<> Builder(P);
Value *Size = Builder.CreateTypeSize(Builder.getInt64Ty(),
                                     DL.getTypeStoreSize(T));
Instruction *M;
if (UseMemMove)
  M = Builder.CreateMemMove(SI->getPointerOperand(), SI->getAlign(),
                            LI->getPointerOperand(), LI->getAlign(), Size);
else
  M = Builder.CreateMemCpy(SI->getPointerOperand(), SI->getAlign(),
                           LI->getPointerOperand(), LI->getAlign(), Size);

Notice the UseMemMove flag: if alias analysis can’t prove src and dst don’t overlap, the pass emits memmove instead of memcpy. memcpy requires disjoint regions; memmove handles overlap correctly. This “pick the weaker intrinsic when we can’t prove the stronger one” pattern shows up everywhere in MemCpyOpt.
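The difference between the two intrinsics is easy to see with the C library functions of the same names. The sketch below is only an analogy: the pass decides statically from alias analysis, whereas this hypothetical copyBytes helper checks the addresses at run time.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: true if [A, A+N) and [B, B+N) share at least one byte.
bool rangesOverlap(const void *A, const void *B, std::size_t N) {
  const auto PA = reinterpret_cast<std::uintptr_t>(A);
  const auto PB = reinterpret_cast<std::uintptr_t>(B);
  return PA < PB + N && PB < PA + N;
}

// Use the stronger primitive (memcpy) only when disjointness is established;
// otherwise fall back to memmove, which tolerates overlap.
void copyBytes(void *Dst, const void *Src, std::size_t N) {
  if (rangesOverlap(Dst, Src, N))
    std::memmove(Dst, Src, N);
  else
    std::memcpy(Dst, Src, N);
}
```

Shifting "abcdef" one byte to the right within the same buffer takes the memmove path and produces "aabcde"; calling memcpy on those overlapping ranges would be undefined behavior.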

Safety. The load must have a single use (the store). The load and store must be in the same basic block, and nothing between them may write to the source location. All of this is checked via alias queries and MemorySSA walks before the rewrite is emitted.

Optimization 3: forwarding chained `memcpy`s

This is the one you’ll encounter most in real C++ code, because it’s what cleans up after temporaries. From test/Transforms/MemCpyOpt/memcpy.ll:

; Before
%memtmp = alloca %0, align 16
call void @llvm.memcpy.p0.p0.i32(ptr align 16 %memtmp, ptr align 16 %P, i32 32, i1 false)
call void @llvm.memcpy.p0.p0.i32(ptr align 16 %Q, ptr align 16 %memtmp, i32 32, i1 false)

The first copy fills a temporary alloca; the second copies that temporary to %Q. MemCpyOpt rewrites the second copy to read directly from %P:

; After
call void @llvm.memmove.p0.p0.i32(ptr align 16 %Q, ptr align 16 %P, i32 32, i1 false)

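(The rewritten call is a memmove rather than a memcpy because the pass cannot prove that %P and %Q are disjoint. The intermediate alloca and first memcpy get cleaned up by later passes like DSE once nothing reads from them.)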

The transformation is in processMemCpyMemCpyDependence (MemCpyOptimizer.cpp:1102-1267). The interesting bit is the offset handling — the second memcpy might not start at the beginning of the first memcpy’s destination:

IRBuilder<> Builder(M);
auto *CopySource = MDep->getSource();
Instruction *NewCopySource = nullptr;

if (MForwardOffset > 0) {
  std::optional<int64_t> MDestOffset =
      M->getRawDest()->getPointerOffsetFrom(MDep->getRawSource(), DL);
  if (MDestOffset == MForwardOffset)
    CopySource = M->getDest();
  else {
    CopySource = Builder.CreateInBoundsPtrAdd(
        CopySource, Builder.getInt64(MForwardOffset));
    NewCopySource = dyn_cast<Instruction>(CopySource);
  }
}
if (writtenBetween(MSSA, BAA, MCopyLoc, MSSA->getMemoryAccess(MDep),
                   MSSA->getMemoryAccess(M)))
  return false;
Instruction *NewM = UseMemMove
    ? Builder.CreateMemMove(...)
    : Builder.CreateMemCpy(...);

The offset case is tested in test/Transforms/MemCpyOpt/memcpy-memcpy-offset.ll:

; Before
call void @llvm.memcpy.p0.p0.i64(ptr %cpy_tmp, ptr %src, i64 7, i1 false)
%cpy_tmp_offset = getelementptr inbounds i8, ptr %cpy_tmp, i64 1
call void @llvm.memcpy.p0.p0.i64(ptr %dest, ptr %cpy_tmp_offset, i64 6, i1 false)

; After
call void @llvm.memcpy.p0.p0.i64(ptr %cpy_tmp, ptr %src, i64 7, i1 false)
%src.offset = getelementptr inbounds i8, ptr %src, i64 1
call void @llvm.memmove.p0.p0.i64(ptr %dest, ptr %src.offset, i64 6, i1 false)

The second memcpy was reading bytes 1 through 6 from %cpy_tmp, which the first memcpy filled from %src. So it’s equivalent to reading bytes 1 through 6 from %src, hence the new GEP of src + 1.

Safety. The critical call is writtenBetween(...): we need to prove that nothing modifies %src (the original source) between the first copy and the second. If something did, the second copy would see stale data after forwarding. MemorySSA makes this a cheap walk.
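The offset example and its safety condition can be replayed in ordinary C++. The function names below are hypothetical, and the sizes (7 bytes, then 6 bytes at offset 1) mirror the memcpy-memcpy-offset.ll snippet above.

```cpp
#include <array>
#include <cstring>

using Buf6 = std::array<unsigned char, 6>;

// Shape before forwarding: 7 bytes src -> tmp, then 6 bytes tmp+1 -> dest.
Buf6 copyViaTemp(const unsigned char *Src) {
  unsigned char Tmp[7];
  Buf6 Dest{};
  std::memcpy(Tmp, Src, 7);
  std::memcpy(Dest.data(), Tmp + 1, 6);
  return Dest;
}

// Shape after forwarding: read the same 6 bytes straight from src+1.
Buf6 copyForwarded(const unsigned char *Src) {
  Buf6 Dest{};
  std::memmove(Dest.data(), Src + 1, 6);
  return Dest;
}

// The hazard writtenBetween guards against: src changes between the copies,
// so the temporary holds the old bytes and forwarding would read new ones.
Buf6 copyViaTempWithClobber(unsigned char *Src) {
  unsigned char Tmp[7];
  Buf6 Dest{};
  std::memcpy(Tmp, Src, 7);
  Src[1] = 0xFF; // intervening write to the original source
  std::memcpy(Dest.data(), Tmp + 1, 6);
  return Dest;
}
```

With no intervening write the first two functions return identical bytes. With the clobber, the temporary-based version still returns the old value of byte 1 while a forwarded read would return 0xFF, which is why the pass has to give up in that case.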

Optimization 4: `memcpy` from constant → `memset`

A neat little folding. If the source of a memcpy is memory that we know was filled with a repeating byte, the copy itself can become a memset.

The function is performMemCpyToMemSetOptzn (MemCpyOptimizer.cpp:1429-1502):

IRBuilder<> Builder(MemCpy);
Value *DestPtr = MemCpy->getRawDest();
MaybeAlign Align = MemCpy->getDestAlign();
if (MOffset < 0) {
  DestPtr = Builder.CreatePtrAdd(DestPtr, Builder.getInt64(-MOffset));
  if (Align)
    Align = commonAlignment(*Align, -MOffset);
}

Instruction *NewM = Builder.CreateMemSet(DestPtr, MemSet->getOperand(1),
                                         CopySize, Align);
auto *LastDef = cast<MemoryDef>(MSSA->getMemoryAccess(MemCpy));
auto *NewAccess = MSSAU->createMemoryAccessAfter(NewM, nullptr, LastDef);
MSSAU->insertDef(cast<MemoryDef>(NewAccess), /*RenameUses=*/true);

Nothing fancy in the builder; the work is in proving the preconditions. The source must be a single-byte repeated pattern (either an earlier memset, or a constant that isBytewiseValue accepts), and the copy must not be larger than that initialized region. When it fires, the result is:

; Before: copying from a zero-initialized source
call void @llvm.memset.p0.i64(ptr %src, i8 0, i64 32, ...)
call void @llvm.memcpy.p0.p0.i64(ptr %dst, ptr %src, i64 32, ...)

; After: direct memset to dst
call void @llvm.memset.p0.i64(ptr %src, i8 0, i64 32, ...)
call void @llvm.memset.p0.i64(ptr %dst, i8 0, i64 32, ...)

Later passes will often notice the original memset into %src is dead and remove it.
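The same equivalence in plain C++, as a small sketch with hypothetical function names: filling a source buffer and copying it produces exactly the bytes that a direct fill of the destination would.

```cpp
#include <array>
#include <cstring>

using Buf32 = std::array<unsigned char, 32>;

// Before: fill src with one byte, then copy all of it into dst.
Buf32 memsetThenCopy(unsigned char Byte) {
  Buf32 Src;
  Buf32 Dst;
  std::memset(Src.data(), Byte, Src.size());
  std::memcpy(Dst.data(), Src.data(), Dst.size());
  return Dst;
}

// After: fill dst directly with the same byte.
Buf32 memsetDirect(unsigned char Byte) {
  Buf32 Dst;
  std::memset(Dst.data(), Byte, Dst.size());
  return Dst;
}
```

The two functions agree for every byte value because the copy covers no more than the region the first memset initialized, which is the precondition described above.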

Optimization 5: call-slot optimization

This is the most important transformation in the pass. It’s what makes C++ return-by-value cheap. Conceptually:

// C++ source
Big f();
void g() {
  Big x = f();
}

Clang lowers this to something like:

%tmp = alloca %struct.Big        ; temporary for f's return value
call void @f(ptr sret %tmp)      ; f writes into %tmp via the sret parameter
%x   = alloca %struct.Big        ; x's storage
call void @llvm.memcpy.p0.p0.i64(ptr %x, ptr %tmp, i64 N, i1 false)

Call-slot optimization notices that %tmp is only used as the destination of f’s sret write and as the source of the memcpy, so we can just hand %x to f directly:

%x = alloca %struct.Big
call void @f(ptr sret %x)        ; f writes into x directly

No temporary, no copy. The effect is the same as C++ copy elision (RVO/NRVO), except that here the optimizer does it at the IR level, regardless of whether the frontend performed it.

The work is in performCallSlotOptzn (MemCpyOptimizer.cpp:842-1098). It’s 250 lines, almost all safety checks. The ones worth understanding:

(1) The source is a fresh alloca that is no larger than the copy. If the alloca were larger than the copy, the call might have written bytes beyond what the memcpy reads, and redirecting it to %x would stomp memory outside %x. (MemCpyOptimizer.cpp:867-878)

(2) No one accesses the destination between the call and the memcpy. If somebody reads or writes %x in between, the destination has a different value after the optimization than before. (MemCpyOptimizer.cpp:903-907)

if (accessedBetween(BAA, DestLoc, MSSA->getMemoryAccess(C),
                    MSSA->getMemoryAccess(cpyStore), &SkippedLifetimeStart)) {
  LLVM_DEBUG(dbgs() << "Call Slot: Dest pointer modified after call\n");
  return false;
}

(3) The destination is writable and dereferenceable for the full size. Otherwise redirecting the call could trap. (MemCpyOptimizer.cpp:923-929)

if (!isWritableObject(getUnderlyingObject(cpyDest),
                      ExplicitlyDereferenceableOnly) ||
    !isDereferenceableAndAlignedPointer(cpyDest, Align(1),
                                        APInt(64, cpySize), DL, C, AC, DT)) {
  return false;
}

(4) The source alloca is used only by the call and the memcpy. (MemCpyOptimizer.cpp:964-979)

SmallVector<User *, 8> srcUseList(srcAlloca->users());
while (!srcUseList.empty()) {
  User *U = srcUseList.pop_back_val();
  if (isa<AddrSpaceCastInst>(U)) {
    append_range(srcUseList, U->users());
    continue;
  }
  if (isa<LifetimeIntrinsic>(U)) continue;
  if (U != C && U != cpyLoad) {
    return false;   // someone else is looking at this alloca, abort
  }
}

If anything else observes the temporary (a debug intrinsic storing its address, a load from it, an escape), we can’t redirect safely.

(5) The call doesn’t access the destination for some other reason. Consider f(sret *ret, Big *other) — if we redirect ret to %x but the call already had %x as its other argument, we’ve just created aliasing that wasn’t there before. (MemCpyOptimizer.cpp:1047-1053)

MemoryLocation DestWithSrcSize(cpyDest, LocationSize::precise(srcSize));
ModRefInfo MR = BAA.getModRefInfo(C, DestWithSrcSize);
if (isModOrRefSet(MR))
  MR = BAA.callCapturesBefore(C, DestWithSrcSize, DT);
if (isModOrRefSet(MR)) return false;
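Check (5) is easiest to appreciate with a callee that both writes its return slot and reads another argument. The sketch below uses hypothetical names (Big, produce, callerWithTemp, callerRedirected) and ordinary pointers in place of sret.

```cpp
struct Big {
  int V[4];
};

// Hypothetical sret-style callee: clears *Ret, then reads *Other.
void produce(Big *Ret, const Big *Other) {
  for (int &E : Ret->V)
    E = 0;
  Ret->V[0] = Other->V[0] + 1;
}

// Before: the result lands in a temporary and is then copied into X.
void callerWithTemp(Big &X) {
  Big Tmp;
  produce(&Tmp, &X);
  X = Tmp;
}

// Naive redirect: pass X as the return slot even though the call also reads
// X through Other. The callee now clears X before it reads it.
void callerRedirected(Big &X) {
  produce(&X, &X);
}
```

Starting from X.V[0] == 41, the temporary-based caller leaves 42 in X.V[0], while the redirected caller leaves 1. The mod/ref query above exists to reject exactly this kind of redirect.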

A test case from test/Transforms/MemCpyOpt/callslot.ll shows the optimization firing even when the destination is a GEP of another alloca:

; Before
%dest    = alloca [16 x i8]
%src     = alloca [8 x i8]
%dest.i8 = getelementptr [16 x i8], ptr %dest, i64 0, i64 8
call void @accept_ptr(ptr %src) nounwind
call void @llvm.memcpy.p0.p0.i64(ptr %dest.i8, ptr %src, i64 8, i1 false)

; After
%dest    = alloca [16 x i8]
%src     = alloca [8 x i8]             ; now unused, removed by a later pass
%dest.i8 = getelementptr [16 x i8], ptr %dest, i64 0, i64 8
call void @accept_ptr(ptr %dest.i8) nounwind
ret void

The call now writes directly into the second half of %dest, and the memcpy is gone. That’s one stack alloca’s worth of memory and one full memcpy saved per call.

Optimization 6: stack-move — merging two allocas

Sometimes you end up with two allocas where one is only alive long enough to be memcpy’d into the other. Example from test/Transforms/MemCpyOpt/stack-move-offset.ll:

%src = alloca [16 x i8], align 4
%dest = alloca [8 x i8], align 8
%src.gep = getelementptr inbounds i8, ptr %src, i64 8
store i64 42, ptr %src.gep
call void @use_nocapture(ptr %src.gep)
call void @llvm.memcpy.p0.p0.i64(ptr %dest, ptr %src.gep, i64 8, i1 false)
call void @use_nocapture(ptr %dest)

%dest is just a mirror of the upper half of %src. The stack-move optimization merges them — every use of %dest is rewritten to a GEP of %src, and %dest is deleted:

%src = alloca [16 x i8], align 8        ; alignment raised to max of both
%src.gep = getelementptr inbounds i8, ptr %src, i64 8
store i64 42, ptr %src.gep
call void @use_nocapture(ptr %src.gep)
; memcpy gone
call void @use_nocapture(ptr %src.gep)  ; what used to be "use %dest"

The function performStackMoveOptzn (MemCpyOptimizer.cpp:1516-1777) is the longest in the pass. A sketch of the preconditions (MemCpyOptimizer.cpp:1548-1569):

auto DestOffset = DestPtr->getPointerOffsetFrom(DestAlloca, DL);
if (!DestOffset) return false;
auto SrcOffset = SrcPtr->getPointerOffsetFrom(SrcAlloca, DL);
if (!SrcOffset || *SrcOffset < *DestOffset || *SrcOffset < 0)
  return false;
// Offset difference must preserve dest alloca's alignment
if ((*SrcOffset - *DestOffset) % DestAlloca->getAlign().value() != 0)
  return false;
// Copy must cover the whole dest alloca, starting at its base
if (Size != *DestSize || *DestOffset != 0) {
  return false;
}

And the rewrite itself (MemCpyOptimizer.cpp:1723-1777):

if (MoveSrc)
  SrcAlloca->moveBefore(DestAlloca->getIterator());
SrcAlloca->setAlignment(
    std::max(SrcAlloca->getAlign(), DestAlloca->getAlign()));
// ...
Value *NewDestPtr = SrcAlloca;
if (*SrcOffset != *DestOffset) {
  IRBuilder<> Builder(DestAlloca);
  NewDestPtr = Builder.CreateInBoundsPtrAdd(
      SrcAlloca, Builder.getInt64(*SrcOffset - *DestOffset));
}
DestAlloca->replaceAllUsesWith(NewDestPtr);
eraseInstruction(DestAlloca);

Why is the alignment tweak necessary? The merged alloca now has to satisfy the alignment requirements of both original allocas. So %src’s alignment is raised to max(src_align, dest_align) — 4 became 8 in the test above.
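The offset and alignment arithmetic from the precondition sketch can be tried on the numbers in the test. ToyAlloca, canMerge, and mergedAlign are hypothetical names; the model covers only the checks quoted above, not the capture or lifetime analysis.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical model of an alloca: size and alignment in bytes.
struct ToyAlloca {
  std::uint64_t Size;
  std::uint64_t Align;
};

// Mirrors the quoted checks: the source offset must not be negative or
// below the dest offset, the offset difference must keep the dest's
// alignment, and the copy must cover the whole dest alloca from its base.
bool canMerge(std::int64_t SrcOffset, const ToyAlloca &Dest,
              std::int64_t DestOffset, std::uint64_t CopySize) {
  if (SrcOffset < DestOffset || SrcOffset < 0)
    return false;
  if ((SrcOffset - DestOffset) % static_cast<std::int64_t>(Dest.Align) != 0)
    return false;
  if (CopySize != Dest.Size || DestOffset != 0)
    return false;
  return true;
}

// The merged alloca must satisfy both original alignment requirements.
std::uint64_t mergedAlign(const ToyAlloca &Src, const ToyAlloca &Dest) {
  return std::max(Src.Align, Dest.Align);
}
```

For the example, a 16-byte source aligned to 4 and an 8-byte destination aligned to 8 with the copy starting at source offset 8 pass the checks, and the merged alignment comes out as 8. A source offset of 4 would fail, because the rewritten destination pointer would no longer be 8-byte aligned.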

Safety. The really subtle requirement is that neither alloca’s address escapes the function in a way that could be inspected. If some other code has captured a pointer to %dest, you can’t merge it with %src without changing what that pointer observes. The pass spends most of its safety budget (MemCpyOptimizer.cpp:1577-1721) in a CaptureTrackingWithModRef lambda that walks every user of both allocas and verifies none of them escape or produce a mod/ref conflict.

Optimization 7: byval argument forwarding

When you pass a struct by value in C, the ABI often requires the callee to see its own copy. At the IR level, byval arguments do this implicitly. If the caller memcpy’d some source into a stack temporary just to pass it to a byval parameter, MemCpyOpt can often remove that copy and pass the original directly:

; Before
%a = alloca %struct.S
call void @llvm.memcpy.p0.p0.i64(ptr %a, ptr %b, i64 N, i1 false)
call void @g(ptr byval(%struct.S) %a)

; After
call void @g(ptr byval(%struct.S) %b)

The function g will still receive its own private copy (that’s what byval means), so the observable semantics are preserved — we’re just letting the calling convention do the copy instead of the IR.
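C++ pass-by-value behaves the same way at the source level, which makes for a small illustration. The struct S and the three functions are hypothetical; the point is only that the callee's private copy makes the caller-side temporary redundant.

```cpp
struct S {
  int A;
  int B;
};

// The callee receives its own copy, so mutating Arg is invisible outside.
int g(S Arg) {
  Arg.A += 1;
  return Arg.A + Arg.B;
}

// Before: copy into a temporary, then pass the temporary by value.
int callViaTemp(const S &Original) {
  S Tmp = Original;
  return g(Tmp);
}

// After: pass the original by value directly; the call makes the copy.
int callDirect(const S &Original) {
  return g(Original);
}
```

Both callers return the same result and neither changes the original object.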

The implementation is processByValArgument (MemCpyOptimizer.cpp:2007-2074). The key safety check is the familiar “nothing wrote to %b between the memcpy and the call” test (MemCpyOptimizer.cpp:2061-2063):

if (writtenBetween(MSSA, BAA, MemoryLocation::getForSource(MDep),
                   MSSA->getMemoryAccess(MDep), CallAccess))
  return false;

There’s a matching transformation processImmutArgument (MemCpyOptimizer.cpp:2090-2168) for parameters marked readonly nocapture, where the same logic applies: the callee promises not to write, so we can skip the defensive copy.

The common safety pattern

Every one of these transformations relies on the same shape of safety argument:

Between the “source” event (a memset, a memcpy, a call that writes) and the “destination” event (the memcpy or use we’re rewriting), no other memory operation can interfere.

That question is exactly what MemorySSA was built to answer cheaply. The helper writtenBetween, which appears in processMemCpyMemCpyDependence, performMemCpyToMemSetOptzn, processByValArgument, and others, wraps an MSSA clobber walk with a fallback to alias analysis. Without MemorySSA, each check would need its own scan over the intervening instructions, which gets expensive quickly and pushes the pass toward conservative bail-outs.
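As a mental model of the question being asked, here is a deliberately naive version that scans a straight-line list of accesses. ToyAccess and toyWrittenBetween are hypothetical, and the linear scan is a stand-in for illustration, not how the pass does it: the real helper walks MemorySSA and consults alias analysis instead of comparing byte offsets.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical record of one memory access over the byte range [Start, End).
struct ToyAccess {
  bool IsWrite;
  std::int64_t Start;
  std::int64_t End;
};

// True if any access strictly between positions First and Last writes a byte
// inside [LocStart, LocEnd).
bool toyWrittenBetween(const std::vector<ToyAccess> &Ops, std::size_t First,
                       std::size_t Last, std::int64_t LocStart,
                       std::int64_t LocEnd) {
  for (std::size_t I = First + 1; I < Last && I < Ops.size(); ++I) {
    const ToyAccess &A = Ops[I];
    if (A.IsWrite && A.Start < LocEnd && LocStart < A.End)
      return true; // an overlapping write sits between the two events
  }
  return false;
}
```

Reads never count as interference in this model, and neither do writes to unrelated ranges; only an overlapping write between the two events blocks the rewrite.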

The other shared pattern is “use memmove when you can’t prove memcpy”. memcpy requires disjoint source and destination; memmove doesn’t. When MemCpyOpt can’t prove disjointness it simply emits memmove. If alias analysis later proves the regions disjoint, the pass’s own processMemMove can narrow the call back to memcpy.

What to read next

If you want to keep going, the test directory llvm/test/Transforms/MemCpyOpt/ has about 120 .ll files, each exercising one specific corner case. memcpy.ll, form-memset.ll, callslot.ll, and stack-move-offset.ll are the best starting points — between them they cover most of what this post discussed. Read one, then open the corresponding process* or perform* function and trace through it. The combination of small IR and well-named safety checks makes this one of the more pleasant optimization passes to study.

The seven transformations above are also a useful reference point for understanding what other optimization passes can rely on. By the time your IR reaches a late pass such as LoopVectorize, MemCpyOpt has typically already turned your store-of-load into a memcpy and your chained copies into single ones, and downstream logic can rely on that canonicalization.
