InstCombine: The Worklist, the Folds, and the 'Net Simplification' Rule || Packed Bits

If you’ve ever looked at LLVM IR before and after -O2 and wondered how %z = add i32 %x, %x became %z = shl i32 %x, 1, or how a chain of add i32 %n, 1; add i32 %prev, 1; ... got collapsed into a single add, the answer is InstCombine. It’s one of the pipeline’s most-run passes and it does something conceptually simple: walk a function, match patterns, and rewrite them into something cheaper.

The devil is in the details. InstCombine has to avoid infinite loops while being aggressive, it has to handle hundreds of pattern-specific folds without turning into spaghetti, and it has to integrate with value tracking so folds can depend on known-bits properties. This post walks through each of those.

All line numbers refer to files under llvm/lib/Transforms/InstCombine/. The main driver is InstructionCombining.cpp; the pattern library is llvm/include/llvm/IR/PatternMatch.h.

The opening comment

Before anything else, read the file header of InstructionCombining.cpp (lines 9-31):

// InstructionCombining - Combine instructions to form fewer, simple
// instructions. This pass does not modify the CFG. This pass is where
// algebraic simplification happens.
//
// This pass combines things like:
//    %Y = add i32 %X, 1
//    %Z = add i32 %Y, 1
// into:
//    %Z = add i32 %X, 2
//
// This is a simple worklist driven algorithm.

Two promises in that comment are load-bearing:

“Does not modify the CFG.” Unlike GVN and LICM, InstCombine never adds, removes, or reroutes basic blocks. It only rewrites instructions in place. This property is what makes it safe to run very often — other passes don’t have to redo CFG-based analyses.
“Worklist driven algorithm.” The design is a classic worklist: pop an instruction, try to fold it, and if the fold succeeds, add affected instructions back. This pattern is everywhere in compilers, but InstCombine has a couple of wrinkles worth understanding.

The main loop

The core of the pass is InstCombinerImpl::run (InstructionCombining.cpp:5763-5950). Stripped to its skeleton:

bool InstCombinerImpl::run() {
  while (!Worklist.isEmpty()) {
    while (Instruction *I = Worklist.popDeferred()) {
      if (isInstructionTriviallyDead(I, &TLI)) {
        eraseInstFromFunction(*I);
        ++NumDeadInst;
        continue;
      }
      Worklist.push(I);
    }

    Instruction *I = Worklist.removeOne();
    if (I == nullptr) continue;

    // ... visit(*I) ...
  }
}

The “deferred” and “main” lists deserve explanation.

Two queues: `Deferred` and `Worklist`

InstructionWorklist (llvm/include/llvm/Transforms/Utils/InstructionWorklist.h:25-130) has two storage areas:

Deferred — instructions added via add(I). These accumulate during the processing of one instruction. They’re drained into the main stack in reverse order between main-loop iterations.
Worklist — instructions added via push(I). LIFO. This is the actual “what do I process next” stack.

Why two? The motivation is ordering. During a single fold, you might touch lots of instructions: the one being folded, its operands, its users. If all of them go directly onto the main stack, the order of processing becomes unpredictable. Instead, they go to Deferred, which gets drained in a controlled order (reversed, so they come out FIFO relative to each other) between top-level iterations.

In practice this means: when a fold creates a new instruction, both it and its users go on the worklist:

// Inside the run loop, after a successful fold
Worklist.pushUsersToWorkList(*Result);
Worklist.push(Result);

So chains of folds propagate cleanly: fold add (add X, 1) 1 into add X, 2, the users of the original outer add are now staring at add X, 2 and get a chance to combine further.

Reverse post-order initialization

The worklist isn’t populated arbitrarily. prepareWorklist (InstructionCombining.cpp:6009-6140) iterates the function in reverse post-order:

for (BasicBlock *BB : RPOT) {
  // collect instructions, folding constants and removing dead edges
}

The reason: we want to process an instruction after its operands have been constant-folded and canonicalized. Reverse post-order visits blocks such that each block is visited before blocks it can reach — so defs before uses in the dominator-tree sense. Constants propagate naturally as we walk.

The unwritten “net simplification” rule

This is the single most important invariant of InstCombine, and it’s documented in the header comment by example. Every fold must strictly reduce instruction count or produce a clearly better canonical form — never merely shuffle between equivalent shapes.

Why? Consider two possible folds:

sub X, C → add X, -C (canonical form: constants on RHS)
add X, -C → sub X, C (maybe preferred for codegen?)

If InstCombine implemented both, it would oscillate forever. One fold undoes the other, the worklist never empties. The solution is to pick a canonical form and only fold toward it.

The opening comment of InstructionCombining.cpp (lines 22-31) even lists some canonicalization rules:

//    1. If a binary operator has a constant operand, it is moved to the RHS
//    2. Bitwise operators with constant operands are always grouped so that
//       shifts are performed first, then or's, then and's, then xor's.
//    3. Compare instructions are converted from <,>,<=,>= to ==,!= if possible
//    4. All cmp instructions on boolean values are replaced with logical ops
//    5. add X, X is represented as (X*2) => (X << 1)
//    6. Multiplies with a power-of-two constant argument are transformed into shifts.

These rules are the pass’s compass. If you’re adding a new fold, you have to know: does it produce fewer instructions? If not, does it produce a canonically-preferred form? If neither, you can’t add it.

PatternMatch: declarative IR matching

InstCombine’s folds are written against PatternMatch.h, a C++ template library for matching IR patterns declaratively. Instead of:

if (auto *BO = dyn_cast<BinaryOperator>(V)) {
  if (BO->getOpcode() == Instruction::Neg) {
    Value *Inner = BO->getOperand(0);
    if (auto *Add = dyn_cast<BinaryOperator>(Inner)) {
      if (Add->getOpcode() == Instruction::Add) {
        // ...
      }
    }
  }
}

you write:

Value *X, *Y;
if (match(V, m_Neg(m_Add(m_Value(X), m_Value(Y))))) {
  // X and Y are now bound to the add's operands
}

The underlying template machinery is in PatternMatch.h:1230-1277. A BinaryOp_match is essentially:

template <typename LHS_t, typename RHS_t, unsigned Opcode,
          bool Commutable = false>
struct BinaryOp_match {
  LHS_t L;
  RHS_t R;

  template <typename OpTy> inline bool match(unsigned Opc, OpTy *V) const {
    if (V->getValueID() == Value::InstructionVal + Opc) {
      auto *I = cast<BinaryOperator>(V);
      return (L.match(I->getOperand(0)) && R.match(I->getOperand(1))) ||
             (Commutable && L.match(I->getOperand(1)) &&
              R.match(I->getOperand(0)));
    }
    return false;
  }
};

And m_Add is just a convenience factory (PatternMatch.h:1256-1277):

template <typename LHS, typename RHS>
inline BinaryOp_match<LHS, RHS, Instruction::Add> m_Add(const LHS &L,
                                                        const RHS &R) {
  return BinaryOp_match<LHS, RHS, Instruction::Add>(L, R);
}

Hundreds of these matchers exist: m_Value, m_ConstantInt, m_APInt, m_Specific, m_OneUse, m_Neg, m_Not, m_Select, m_c_Add (commutative variant), and so on. Composed together, they make folds read almost like the mathematical identity they’re implementing.

A real fold, read in detail

Here’s a fragment of visitAdd from InstCombineAddSub.cpp:1583-1602:

Value *A, *B;
if (match(LHS, m_Neg(m_Value(A)))) {
  // -A + -B --> -(A + B)
  if (match(RHS, m_Neg(m_Value(B))))
    return BinaryOperator::CreateNeg(Builder.CreateAdd(A, B));

  // -A + B --> B - A
  auto *Sub = BinaryOperator::CreateSub(RHS, A);
  auto *OB0 = cast<OverflowingBinaryOperator>(LHS);
  Sub->setHasNoSignedWrap(I.hasNoSignedWrap() && OB0->hasNoSignedWrap());
  return Sub;
}

// A + -B  -->  A - B
if (match(RHS, m_Neg(m_Value(B)))) {
  auto *Sub = BinaryOperator::CreateSub(LHS, B);
  // ...
  return Sub;
}

Read it top-to-bottom:

If the LHS is a negation (-A) and the RHS is also a negation (-B), fold the pair into -(A+B). Net change: we had two negs and one add (3 ops); now we have one add and one neg (2 ops). Strict reduction.
If only the LHS is a negation, fold -A + B into B - A. Net change: we had one neg and one add (2 ops); now we have one sub (1 op). Strict reduction.
If only the RHS is a negation, fold A + -B into A - B. Same savings.

Notice the setHasNoSignedWrap bit: InstCombine preserves overflow flags across folds. If the original add was nsw and the original negation also couldn’t overflow, the new sub inherits the nsw flag. This is careful, detail-level work that matters for downstream passes that reason about integer wrapping.

ComputeKnownBits: folds with preconditions

Some folds are only valid when you can prove something about the bits of a value. InstCombine uses the ValueTracking utility for this.

Example from InstCombineAddSub.cpp:954-960:

// If X has no high-bits set above an xor mask:
// add (xor X, LowMaskC), C --> sub (LowMaskC + C), X
if (C2->isMask()) {
  KnownBits LHSKnown = computeKnownBits(X, &Add);
  if ((*C2 | LHSKnown.Zero).isAllOnes())
    return BinaryOperator::CreateSub(ConstantInt::get(Ty, *C2 + *C), X);
}

The fold add (xor X, MASK), C → sub (MASK+C), X is only legal when X’s high bits (above the mask) are provably zero. Otherwise, the xor could flip bits outside the mask and the arithmetic wouldn’t agree.

computeKnownBits returns a KnownBits struct containing Zero (bits known to be 0) and One (bits known to be 1). The check (*C2 | LHSKnown.Zero).isAllOnes() asks: “do the mask’s bits plus the known-zero bits cover all of them?”. If so, no bit of X outside the mask can be set, and the fold is safe.

Other value-tracking functions get used similarly: isKnownNonZero, ComputeNumSignBits, MaskedValueIsZero. They’re all precondition checks that let InstCombine apply a more aggressive fold when the IR guarantees the preconditions.

DCE lives inside the worklist

InstCombine does inline dead-code elimination. Three places (InstructionCombining.cpp:5767-5776, 5784-5789, 5937-5942):

// At the start of draining Deferred
if (isInstructionTriviallyDead(I, &TLI)) {
  eraseInstFromFunction(*I);
  ++NumDeadInst;
  continue;
}

// Before visiting an instruction
if (isInstructionTriviallyDead(I, &TLI)) {
  eraseInstFromFunction(*I);
  ++NumDeadInst;
  continue;
}

// After a combine, if Result itself became dead

“Trivially dead” means the instruction has no users and no side effects. Loads, stores, calls, and volatile ops are never trivially dead; adds, ands, geps and similar are candidates.

The benefit of inline DCE is that after a fold deletes or replaces one instruction, its operands may now have no users and become dead. By erasing them immediately, later worklist iterations don’t waste time trying to fold something that’s about to vanish.

Code sinking (optional)

Under -instcombine-code-sinking (InstructionCombining.cpp:5794-5862), InstCombine will also try to move an instruction closer to its uses — sink it — when the sink destination is a smaller scope. This exposes more folding opportunities. A sunk instruction is usually visible to a narrower set of blocks, where constants are more concrete and patterns more matchable.

A representative test case

From llvm/test/Transforms/InstCombine/add.ll:

; Test 8: disjoint masks
define i32 @test8(i32 %A, i32 %B) {
  %A1 = and i32 %A, 7
  %B1 = and i32 %B, 128
  %C = add i32 %A1, %B1
  ret i32 %C
}

After InstCombine:

define i32 @test8(i32 %A, i32 %B) {
  %A1 = and i32 %A, 7
  %B1 = and i32 %B, 128
  %C = or disjoint i32 %A1, %B1
  ret i32 %C
}

What happened: (A & 7) + (B & 128) — the two values can’t have any overlapping bits because 7 & 128 == 0. When two operands have disjoint bit patterns, addition is equivalent to bitwise-or. The fold replaces add with or, and annotates the or as disjoint (so downstream passes know the operands have no shared bits).

Instruction count didn’t change, but the new form is canonical (OR is preferred when disjoint) and easier for downstream passes to reason about — it’s a pure bit-level operation now instead of a potentially-carrying arithmetic.

Another quick example from the same file:

; Test 6: constant folding plus strength reduction
define i32 @test6(i32 %A) {
  %B = mul i32 7, %A
  %C = add i32 %B, %A   ; 7*A + A = 8*A
  ret i32 %C
}

; After InstCombine
define i32 @test6(i32 %A) {
  %C = shl i32 %A, 3    ; 8*A = A << 3
  ret i32 %C
}

Two folds in sequence:

7*A + A = (7+1)*A = 8*A (distributive pattern, collapses mul+add into one mul)
mul A, 8 where 8 is a power of 2 → shl A, 3 (strength reduction)

Three instructions became one, and the resulting shift is cheaper on most architectures.

The architecture, summarized

Input function
   ↓
prepareWorklist (RPOT traversal, constant fold, initial DCE)
   ↓
Main loop:
   Pop instruction from Worklist
      ↓
   DCE if trivially dead
      ↓
   Optionally sink it closer to uses
      ↓
   Visit it (dispatch to visitAdd, visitAnd, visitLoad, etc.)
      ↓
   Each visitor uses PatternMatch + KnownBits to find a fold
      ↓
   If folded:
      - Push users back onto Worklist
      - Push new Result onto Worklist
      - Delete old instruction
   Repeat until Worklist is empty
   ↓
Outer fixed-point check; iterate if changes were made.

Why InstCombine is special

More than any other LLVM pass, InstCombine is a canonicalizer. Other passes rely on its output:

GVN hashes expressions using their canonical form; without InstCombine, two semantically equal expressions might hash differently.
LoopVectorize and other loop passes pattern-match on canonical shapes (e.g., mul ... (zext ... x) vs mul (sext ... x) ...).
Target lowering expects inputs in a specific shape (constants on the RHS, sub rather than add with a negative constant, etc.).

So when you look at InstCombine and think “this fold seems trivial” — it probably is, on its own. But the cumulative effect of thousands of small canonicalizations is that the rest of the middle-end becomes drastically simpler to write. Each pass can assume input is in canonical form, which means each pass writes fewer cases.

This is also why InstCombine is run multiple times in the standard pipeline: after inlining, after SROA, after loop unrolling, etc. Each of those transformations produces IR that isn’t yet canonical, so InstCombine gets another shot at cleaning up before the next heavyweight pass runs.

InstCombine: The Worklist, the Folds, and the 'Net Simplification' Rule