Inside Clang's New Constant Interpreter: A Bytecode VM for constexpr || Packed Bits

If you compile with clang -fexperimental-new-constant-interpreter, every constexpr in your code stops going through Clang’s classic recursive-descent constant evaluator (clang/lib/AST/ExprConstant.cpp) and instead runs on a stack-based bytecode virtual machine. That VM has been in tree since 2019 and has been quietly growing into a full replacement for ExprConstant.cpp ever since.

This post is a thorough walk through that VM. We’ll cover the dispatch from ExprConstant.cpp, the TableGen-driven opcode set, how AST nodes are lowered to bytecode by the Compiler template, the two emitter backends (one buffering, one immediate), how the interpreter dispatches opcodes (a loop over a function-pointer table, or a musttail-based threaded interpreter under [[clang::preserve_none]]), the operand stack, the Block/Pointer model that powers constexpr lvalues, and finally a worked Add opcode showing how everything fits together. There is no JIT here (the VM is interpreted end-to-end), but the design has more in common with a small stack machine like CPython’s than with a tree walker.

All paths are relative to clang/lib/AST/ByteCode/ unless noted, and all line numbers are from current main.

Why a new interpreter?

The classical evaluator in ExprConstant.cpp is a tree walker. Each Evaluate* function recursively descends the AST, carrying a partial result in stack-allocated LValue / APValue objects. It is well tested (years of conformance bug reports have hardened it), but two things go wrong in practice:

Loops. Evaluating a constexpr for loop with N iterations means walking the same AST subtree N times, allocating temporary APValues every iteration. There is no caching.
Functions. Calls into constexpr functions push fresh evaluator state on the host C++ stack and re-walk the callee’s AST. There is no compiled representation that can be reused.

The new interpreter solves both: each function is compiled to bytecode once, then the interpreter loop runs that bytecode. The bytecode is dense (one contiguous buffer of std::byte), the operand stack is a chunked bump allocator with pointer-aligned slots, and where the toolchain supports it the dispatcher uses musttail so that each opcode handler jumps straight to the next one. From the user’s point of view, only the -fexperimental-new-constant-interpreter flag changes.

The doc at clang/docs/ConstantInterpreter.rst (the original RFC, lightly maintained) puts it like this:

The constexpr interpreter aims to replace the existing tree evaluator in clang, improving performance on constructs which are executed inefficiently by the evaluator.

Where the flag lives

The driver flag is declared in clang/include/clang/Options/Options.td:2162-2165:

def fexperimental_new_constant_interpreter : Flag<["-"], "fexperimental-new-constant-interpreter">, Group<f_Group>,
  HelpText<"Enable the experimental new constant interpreter">,
  Visibility<[ClangOption, CC1Option]>,
  MarshallingInfoFlag<LangOpts<"EnableNewConstInterp">>;

That MarshallingInfoFlag plumbs the value to LangOptions::EnableNewConstInterp (declared in clang/include/clang/Basic/LangOptions.def:388), which the classical evaluator checks at every entry point. From ExprConstant.cpp:21123-21128:

if (Info.EnableNewConstInterp) {
  if (!Info.Ctx.getInterpContext().evaluateAsRValue(Info, E, Result))
    return false;
  return CheckConstantExpression(Info, E->getExprLoc(), E->getType(), Result,
                                 ConstantExprKind::Normal);
}

So the flag is a routing decision at every public Evaluate-as-X entry. When set, control transfers to interp::Context, which owns the VM. The fallback path through ::Evaluate (the tree walker) is left intact.
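To see the routing from the outside, a small input like the one below can be compiled with and without the flag. This is only a sketch: sumTo and fib are hypothetical names chosen for this post, not anything from the Clang tree, and the file needs C++14 or later because of the loop.

```cpp
// Hypothetical test input; sumTo and fib are illustrative names.

// A loop-heavy constexpr function (needs C++14 or later).
constexpr unsigned long long sumTo(unsigned N) {
  unsigned long long Total = 0;
  for (unsigned I = 1; I <= N; ++I)
    Total += I;
  return Total;
}

// A recursive constexpr function, to exercise the call path.
constexpr unsigned fib(unsigned N) {
  return N < 2 ? N : fib(N - 1) + fib(N - 2);
}

// static_assert forces both through the constant evaluator.
static_assert(sumTo(1000) == 500500ULL, "loop evaluated at compile time");
static_assert(fib(15) == 610, "recursion evaluated at compile time");
```

Running clang++ -std=c++17 -fsyntax-only on this file, once plainly and once with -fexperimental-new-constant-interpreter added, should accept it both times; only the evaluator behind the static_asserts differs.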

Three layers

The interpreter is split into three layers:

Compiler (Compiler.h, Compiler.cpp): an AST visitor that walks Stmt/Expr nodes and emits opcodes. It is templated on an Emitter type so the same compilation logic can drive two backends.
Emitter: either ByteCodeEmitter (buffers a SmallVector<std::byte> of opcodes for later execution) or EvalEmitter (interprets opcodes the moment they’re emitted, no buffer).
Interpreter (Interp.h, Interp.cpp): the opcode implementations and dispatch loop. Context::Run uses it to execute the bytecode that ByteCodeEmitter produced. EvalEmitter reuses individual opcode functions but runs its own control-flow loop in C++.

The Compiler<Emitter> template means EvalEmitter and ByteCodeEmitter share their entire AST-handling code; the only difference is what emitOp does at the bottom — append bytes or call the opcode immediately.
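The shape of that sharing can be sketched in a few lines. Everything here is a stand-in invented for illustration (MiniCompiler, BufferingEmitter, ImmediateEmitter, and the two-opcode Op enum are not Clang names), and deriving the front end from the emitter is an assumption about the general pattern rather than a copy of Compiler.h.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical two-opcode instruction set, for illustration only.
enum class Op : std::uint8_t { PushInt, Add };

// Backend 1: append instructions to a buffer for later execution.
struct BufferingEmitter {
  std::vector<std::int64_t> Code;
  bool emitOp(Op O, std::int64_t Arg = 0) {
    Code.push_back(static_cast<std::int64_t>(O));
    if (O == Op::PushInt)
      Code.push_back(Arg); // only PushInt carries an operand
    return true;
  }
};

// Backend 2: execute each instruction the moment it is emitted.
struct ImmediateEmitter {
  std::vector<std::int64_t> Stack;
  bool emitOp(Op O, std::int64_t Arg = 0) {
    if (O == Op::PushInt) {
      Stack.push_back(Arg);
      return true;
    }
    assert(O == Op::Add);
    if (Stack.size() < 2)
      return false; // malformed program
    std::int64_t RHS = Stack.back();
    Stack.pop_back();
    Stack.back() += RHS;
    return true;
  }
};

// Shared front end: the lowering logic is written once against emitOp.
template <class Emitter> struct MiniCompiler : Emitter {
  // Lowers the expression A + B.
  bool compileSum(std::int64_t A, std::int64_t B) {
    return this->emitOp(Op::PushInt, A) && this->emitOp(Op::PushInt, B) &&
           this->emitOp(Op::Add);
  }
};
```

MiniCompiler<ImmediateEmitter> leaves the sum on its Stack as soon as compileSum returns, while MiniCompiler<BufferingEmitter> leaves five words in Code and computes nothing; the lowering code is identical in both instantiations.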

Why two emitters?

Quoting ConstantInterpreter.rst:21-30:

The compiler has two different backends: one to generate bytecode for functions (ByteCodeEmitter) and one to directly evaluate expressions as they are compiled, without generating bytecode (EvalEmitter). All functions are compiled to bytecode, while toplevel expressions used in constant contexts are directly evaluated since the bytecode would never be reused.

Concretely:

static constexpr int x = f(g(h())); — the toplevel evaluation is a one-shot. There is no point allocating a SmallVector<std::byte>, packing opcodes into it, then running them once. EvalEmitter runs each opcode against a real InterpState immediately after emission.
constexpr int f(int n) { ... } — f may be called from many places. Compiling it to bytecode amortizes the codegen cost across all call sites. ByteCodeEmitter produces a Function object whose Code is a contiguous bytecode buffer.

Context.cpp:73-101 shows evaluateAsRValue choosing EvalEmitter:

bool Context::evaluateAsRValue(State &Parent, const Expr *E, APValue &Result) {
  ++EvalID;
  // ...
  Compiler<EvalEmitter> C(*this, *P, Parent, Stk);
  auto Res = C.interpretExpr(E, /*ConvertResultToRValue=*/E->isGLValue());
  // ...
  Result = Res.stealAPValue();
  return true;
}

By contrast, Context::isPotentialConstantExpr (used by Sema for the C++14 “is this body usable as constexpr?” check) instantiates Compiler<ByteCodeEmitter> and feeds the resulting Function to Context::Run:

Compiler<ByteCodeEmitter>(*this, *P).compileFunc(FD, const_cast<Function *>(Func));
// ...
return Run(Parent, Func);

Opcodes: TableGen-generated, type-monomorphized

The opcode set lives in Opcodes.td. There are roughly 250 opcode “templates” listed there. After TableGen expansion, the actual interpreter has many more — because most opcodes are typed. For example, in Opcodes.td:599:

def Add  : AluOpcode;

AluOpcode extends Opcode with Types = [AluTypeClass] and HasGroup = 1, where AluTypeClass lists Sint8, Uint8, Sint16, Uint16, Sint32, Uint32, Sint64, Uint64, IntAP, IntAPS, Bool, FixedPoint. After TableGen expansion you get separate enum entries OP_AddSint8, OP_AddUint8, …, OP_AddIntAPS, OP_AddBool, OP_AddFixedPoint — each a fully type-specialized opcode with its own dispatcher and its own Add<PT_Sint8>(...) template instantiation.

That expansion is driven by clang/utils/TableGen/ClangOpcodesEmitter.cpp. The key helper is Enumerate (ClangOpcodesEmitter.cpp:58-82) — a recursive cartesian-product walker over the opcode’s type lists:

void Enumerate(const Record *R, StringRef N,
               std::function<void(ArrayRef<const Record *>, Twine)> &&F) {
  // walks every combination of types in R.Types and calls F with
  // a synthesized name like "AddSint8", "AddUint8", ...
}
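The cartesian-product walk is easy to picture with plain strings. The sketch below is not the TableGen code: enumerateNames, expand, and TypeList are hypothetical helpers, and real records carry far more than a name.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

using TypeList = std::vector<std::string>;

// Calls F once per combination, passing the opcode name followed by one
// type name from each list (a cartesian product over the lists).
void enumerateNames(const std::string &Prefix,
                    const std::vector<TypeList> &Lists, std::size_t Index,
                    const std::function<void(const std::string &)> &F) {
  if (Index == Lists.size()) {
    F(Prefix); // every list has contributed one type
    return;
  }
  for (const std::string &Ty : Lists[Index])
    enumerateNames(Prefix + Ty, Lists, Index + 1, F);
}

// Collects every synthesized name for one opcode template.
std::vector<std::string> expand(const std::string &OpName,
                                const std::vector<TypeList> &Lists) {
  std::vector<std::string> Names;
  enumerateNames(OpName, Lists, 0,
                 [&Names](const std::string &N) { Names.push_back(N); });
  return Names;
}
```

An opcode with one type list of twelve entries expands to twelve names; an opcode with two lists, such as a cast, expands to the product of their sizes, which is how a few hundred templates become a much larger concrete opcode set.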

Each enumeration step generates:

An entry in the Opcode enum: OP_AddSint8, …
A “dispatcher” static bool Interp_AddSint8(InterpState &S, CodePtr &PC) that reads the opcode’s arguments from the bytecode stream, calls Add<PT_Sint8>(S, OpPC, ...), and, in tail-call builds, finishes by tail-calling InterpNext.
An emitAddSint8(...) member on the emitter that writes the opcode + args into the buffer.
A case OP_AddSint8: entry in the disassembler.
For opcodes with HasGroup = 1, a group dispatcher emitAdd(PrimType T0, ...) that switches on the runtime PrimType and forwards to emitAddSint8 / emitAddUint8 / …

That last point is what makes the compiler convenient to use: when Compiler::VisitBinaryOperator knows that the result type is a 32-bit signed integer (*T == PT_Sint32), it can write this->emitAdd(*T, E) and the group dispatcher routes to emitAddSint32.

This type-monomorphization at the opcode level matters for performance. The single-typed Add<PT_Sint32> becomes:

template <PrimType Name, class T = typename PrimConv<Name>::T>
bool Add(InterpState &S, CodePtr OpPC) {
  const T &RHS = S.Stk.pop<T>();
  const T &LHS = S.Stk.pop<T>();
  // ...
}

with T = Integral<32, true> substituted in. The compiler sees a tight, monomorphic, inlinable function — not a switch (Type) { case Int32: ... case Int64: ... } ladder fired at every iteration of the loop.

Primitive types

PrimType.h:34-50:

enum PrimType : uint8_t {
  PT_Sint8 = 0, PT_Uint8 = 1, PT_Sint16 = 2, PT_Uint16 = 3,
  PT_Sint32 = 4, PT_Uint32 = 5, PT_Sint64 = 6, PT_Uint64 = 7,
  PT_IntAP = 8, PT_IntAPS = 9, PT_Bool = 10,
  PT_FixedPoint = 11, PT_Float = 12,
  PT_Ptr = 13, PT_MemberPtr = 14,
};

Fifteen primitive types covering the entire space of values the VM stores. IntAP{S} are arbitrary-but-fixed precision integers backed by APInt, used for target integer types the host can’t handle natively. Floating wraps APFloat. Pointer is more elaborate (more on it below) and MemberPointer handles C++ pointer-to-member.

PrimConv (PrimType.h:150-195) is a small trait that maps each PrimType enum value to a C++ type:

template <> struct PrimConv<PT_Sint32> { using T = Integral<32, true>; };
template <> struct PrimConv<PT_Float>  { using T = Floating; };
template <> struct PrimConv<PT_Ptr>    { using T = Pointer; };
// ...

Every templated opcode you see — Add<PT_Sint32>, Ret<PT_Ptr>, Cast<PT_Sint32, PT_Bool> — is monomorphized through PrimConv. And the macros TYPE_SWITCH / INT_TYPE_SWITCH in the same header dispatch runtime PrimType values into compile-time T = PrimConv<PT>::T blocks, used wherever the bytecode instruction stream carries a PrimType byte.
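A stripped-down version of the trait and of a type switch might look like this. It is a sketch under loud assumptions: the enum is a three-entry subset, the mapped types are plain built-ins rather than Integral<...> wrappers, and sizeOfPrim and primSize are hypothetical helpers that do not exist in PrimType.h.

```cpp
#include <cstdint>
#include <type_traits>

// Illustrative subset of the enum; the real one has fifteen entries.
enum PrimType : std::uint8_t { PT_Sint32, PT_Uint8, PT_Bool };

// Trait mapping each enum value to a C++ type. Built-in types stand in
// for the interpreter's wrapper classes.
template <PrimType> struct PrimConv;
template <> struct PrimConv<PT_Sint32> { using T = std::int32_t; };
template <> struct PrimConv<PT_Uint8> { using T = std::uint8_t; };
template <> struct PrimConv<PT_Bool> { using T = bool; };

// A monomorphic helper: T is fixed at compile time by Name.
template <PrimType Name, class T = typename PrimConv<Name>::T>
constexpr unsigned sizeOfPrim() {
  return sizeof(T);
}

// Runtime-to-compile-time bridge, in the spirit of TYPE_SWITCH: one
// switch picks the instantiation, and no further type checks follow.
constexpr unsigned primSize(PrimType PT) {
  switch (PT) {
  case PT_Sint32:
    return sizeOfPrim<PT_Sint32>();
  case PT_Uint8:
    return sizeOfPrim<PT_Uint8>();
  case PT_Bool:
    return sizeOfPrim<PT_Bool>();
  }
  return 0; // unreachable for valid enum values
}

static_assert(std::is_same<PrimConv<PT_Sint32>::T, std::int32_t>::value,
              "trait maps the enum value to the expected type");
```

The switch runs once, at the boundary where a PrimType arrives as data; everything behind it is an ordinary template instantiation the host compiler can inline.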

Bytecode layout and CodePtr

A function’s bytecode is a SmallVector<std::byte> (Function::Code). Each opcode is encoded as a 16-bit Opcode enum padded to pointer alignment, followed by its arguments — also padded.

The reader is CodePtr in Source.h:30-71:

class CodePtr final {
public:
  CodePtr &operator+=(int32_t Offset) { Ptr += Offset; return *this; }
  template <typename T> std::enable_if_t<!std::is_pointer<T>::value, T> read() {
    assert(aligned(Ptr));
    using namespace llvm::support;
    T Value = endian::read<T, llvm::endianness::native>(Ptr);
    Ptr += align(sizeof(T));
    return Value;
  }
private:
  const std::byte *Ptr = nullptr;
};

Every read advances by align(sizeof(T)) (where align rounds up to alignof(void*)). This wastes a few bytes per opcode but means every read is naturally aligned and every opcode boundary is void*-aligned — which the assert(aligned(Ptr)) enforces.
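The padding rule can be tried in isolation. The following is a minimal sketch, not CodePtr itself: alignUp, emitValue, and Reader are hypothetical names, unsigned char stands in for std::byte, and memcpy replaces the endian helpers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <vector>

// Rounds a size up to the next multiple of alignof(void *).
constexpr std::size_t alignUp(std::size_t Size) {
  return (Size + alignof(void *) - 1) / alignof(void *) * alignof(void *);
}

// Writer: appends Value in a slot padded to pointer alignment.
template <typename T>
void emitValue(std::vector<unsigned char> &Code, T Value) {
  static_assert(std::is_trivially_copyable<T>::value, "raw copy only");
  std::size_t Pos = Code.size();
  assert(Pos % alignof(void *) == 0); // every slot starts aligned
  Code.resize(Pos + alignUp(sizeof(T))); // padding bytes are zero
  std::memcpy(Code.data() + Pos, &Value, sizeof(T));
}

// Reader: mirrors the writer, advancing by the padded size.
class Reader {
  const unsigned char *Ptr;

public:
  explicit Reader(const unsigned char *P) : Ptr(P) {}
  template <typename T> T read() {
    T Value;
    std::memcpy(&Value, Ptr, sizeof(T));
    Ptr += alignUp(sizeof(T));
    return Value;
  }
  const unsigned char *pos() const { return Ptr; }
};
```

On a 64-bit host a 16-bit opcode followed by a 32-bit operand occupies 16 bytes in this scheme, which is the space-for-alignment trade described above.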

Pointers in the bytecode (e.g. const FunctionDecl * arguments) get a 32-bit ID instead. Program::getOrCreateNativePointer interns the host pointer in a side-table; printArg<T*> (Disasm.cpp:36-43) reverses that. The encoded operand is therefore a 32-bit value regardless of sizeof(void*), before alignment padding; nothing here is ever written to disk. Code offsets are 32-bit as well: LabelOffsets and LabelRelocs use int32_t, and ByteCodeEmitter::emit (ByteCodeEmitter.cpp:134-161) bails out (Success = false) the moment a function would exceed numeric_limits<unsigned>::max() bytes.

Jumps are PC-relative int32_t offsets. The emitter computes them with getOffset (ByteCodeEmitter.cpp:117-130):

int32_t ByteCodeEmitter::getOffset(LabelTy Label) {
  const int64_t Position =
      Code.size() + align(sizeof(Opcode)) + align(sizeof(int32_t));
  // If target is known, compute jump offset.
  if (auto It = LabelOffsets.find(Label); It != LabelOffsets.end())
    return It->second - Position;
  // Otherwise, record relocation and return dummy offset.
  LabelRelocs[Label].push_back(Position);
  return 0ull;
}

Forward jumps are emitted with a placeholder zero, then patched in emitLabel once the target’s offset is known. This is the classic backpatching trick that lets a single-pass assembler handle forward references.
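A toy assembler makes the relocation bookkeeping concrete. This is a sketch with simplifying assumptions: JumpAssembler is a hypothetical class, the code is a vector of 32-bit words rather than padded bytes, and the member names LabelOffsets and LabelRelocs are borrowed from the text only to show the correspondence.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

using Label = unsigned;

struct JumpAssembler {
  std::vector<std::int32_t> Code;                         // one word per slot
  std::map<Label, std::int32_t> LabelOffsets;             // resolved labels
  std::map<Label, std::vector<std::int32_t>> LabelRelocs; // pending jumps
  Label NextLabel = 0;

  Label getLabel() { return NextLabel++; }
  void emitNop() { Code.push_back(0); }

  // Emits a jump. The operand is relative to the position just past the
  // operand slot, i.e. where the PC sits after the operand is read.
  void emitJump(Label L) {
    Code.push_back(1); // stand-in jump opcode
    std::int32_t Position = static_cast<std::int32_t>(Code.size()) + 1;
    auto It = LabelOffsets.find(L);
    if (It != LabelOffsets.end()) {
      Code.push_back(It->second - Position); // backward jump: known now
      return;
    }
    LabelRelocs[L].push_back(Position); // forward jump: patch later
    Code.push_back(0);                  // placeholder operand
  }

  // Binds L to the current position and patches every pending jump.
  void emitLabel(Label L) {
    assert(LabelOffsets.count(L) == 0 && "label bound twice");
    std::int32_t Target = static_cast<std::int32_t>(Code.size());
    LabelOffsets[L] = Target;
    auto It = LabelRelocs.find(L);
    if (It == LabelRelocs.end())
      return;
    for (std::int32_t Position : It->second)
      Code[Position - 1] = Target - Position; // operand sits before Position
    LabelRelocs.erase(It);
  }
};
```

Backward jumps are resolved at emission time and forward jumps when their label is bound, so the code is written exactly once and never re-scanned.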

Compiling the AST: visitIfStmt

The AST visitor lives in Compiler.cpp. It’s mechanical but instructive. visitIfStmt (Compiler.cpp:6128-6206) is a textbook example of structured-control-flow lowering:

template <class Emitter> bool Compiler<Emitter>::visitIfStmt(const IfStmt *IS) {
  // ... handle init / condition variable / consteval ...

  if (std::optional<bool> BoolValue = getBoolValue(IS->getCond())) {
    if (*BoolValue) return visitChildStmt(IS->getThen());
    if (const Stmt *Else = IS->getElse())
      return visitChildStmt(Else);
    return true;
  }

  // Compile the condition, leaving a Bool on the stack.
  if (!this->visitBool(IS->getCond()))
    return false;
  // ...
  if (const Stmt *Else = IS->getElse()) {
    LabelTy LabelElse = this->getLabel();
    LabelTy LabelEnd  = this->getLabel();
    if (!this->jumpFalse(LabelElse, IS)) return false;
    if (!visitChildStmt(IS->getThen())) return false;
    if (!this->jump(LabelEnd, IS)) return false;
    this->emitLabel(LabelElse);
    if (!visitChildStmt(Else)) return false;
    this->emitLabel(LabelEnd);
  } else {
    LabelTy LabelEnd = this->getLabel();
    if (!this->jumpFalse(LabelEnd, IS)) return false;
    if (!visitChildStmt(IS->getThen())) return false;
    this->emitLabel(LabelEnd);
  }
  return true;
}

The static branch elimination at the top is a small but pleasant optimization: if the condition is a ConstantExpr whose result is already known, codegen for the dead branch is skipped entirely. getBoolValue only consults ConstantExprs that Sema has already evaluated, so the check is cheap, and in practice many conditions of the form if (some_constexpr_var) take this path.

Expressions are similar but typed. VisitBinaryOperator (Compiler.cpp:1064) is the workhorse — about 200 lines covering everything from pointer arithmetic to complex multiplication to the <=> spaceship operator. The core for plain integer arithmetic is the bottom switch:

switch (E->getOpcode()) {
case BO_Add:
  if (E->getType()->isFloatingType())
    return Discard(this->emitAddf(getFPOptions(E), E));
  return Discard(this->emitAdd(*T, E));
// ...
}

*T is a runtime PrimType (PT_Sint32, PT_Uint64, …), and emitAdd is the group emitter generated by TableGen. It switches on *T once, picks the type-specialized emitAddSint32 or emitAddUint64, and writes the appropriate opcode plus arguments into the stream. At run time the dispatcher reads back that same opcode and calls the matching Interp_AddSint32 dispatcher, which in turn runs the Add<PT_Sint32> instantiation.

The dispatch loop

Interp.cpp:2803-2825:

bool Interpret(InterpState &S) {
  assert(!S.Current->isRoot());
  CodePtr PC = S.Current->getPC();

#if USE_TAILCALLS
  return InterpNext(S, PC);
#else
  while (true) {
    auto Op = PC.read<Opcode>();
    auto Fn = InterpFunctions[Op];
    if (!Fn(S, PC)) return false;
    if (OpReturns(Op)) break;
  }
  return true;
#endif
}

There are two dispatch strategies, picked at compile time. The fallback, which this post calls the switch dispatcher, is really a plain loop over a table: read an opcode, index into a function-pointer table, call, repeat. The fast path is the tail-call loop:

PRESERVE_NONE static bool InterpNext(InterpState &S, CodePtr &PC) {
  auto Op = PC.read<Opcode>();
  auto Fn = InterpFunctions[Op];
  MUSTTAIL return Fn(S, PC);
}

Each opcode dispatcher ends with MUSTTAIL return InterpNext(S, PC);. This turns the interpreter into a chain of tail calls — every opcode handler jumps directly to the next without unwinding the stack. Combined with [[clang::preserve_none]] (Interp.h:44-50), which tells the compiler that no callee-saved registers need to be preserved, this gives the dispatcher a very tight, predictable code path. The TableGen-generated dispatcher (ClangOpcodesEmitter.cpp:113-197) is what actually wires InterpNext into every opcode:

PRESERVE_NONE
static bool Interp_AddSint32(InterpState &S, CodePtr &PC) {
  CodePtr OpPC = PC;
  if (!Add<PT_Sint32>(S, OpPC))
    return false;
#if USE_TAILCALLS
  MUSTTAIL return InterpNext(S, PC);
#else
  return true;
#endif
}

The USE_TAILCALLS macro is set per-platform in Interp.cpp:43-50:

#if defined(_MSC_VER) || defined(__powerpc__) || !defined(MUSTTAIL) ||         \
    defined(__i386__) || defined(__sparc__)
#undef MUSTTAIL
#define MUSTTAIL
#define USE_TAILCALLS 0
#else
#define USE_TAILCALLS 1
#endif

PPC, MSVC, i386, SPARC, and any toolchain where MUSTTAIL is not defined fall back to the switch dispatcher. The switch path is correct but slower: every opcode handler does a normal function return, and the dispatcher loop then reads the next opcode and indexes the table again.

The OpReturns check at the bottom of the switch loop is needed because in switch mode the dispatcher can’t simply “stop” — it has to detect when the current opcode was a RetX and break out. OpReturns is hand-written (Interp.cpp:2766-2774) and there’s a comment acknowledging this is sub-optimal:

// FIXME: Would be nice to generate this instead of hardcoding it here.
constexpr bool OpReturns(Opcode Op) {
  return Op == OP_RetVoid || Op == OP_RetValue || Op == OP_NoRet ||
         Op == OP_RetSint8 || Op == OP_RetUint8 || ...
}

The TableGen records mark return-shaped opcodes with CanReturn = 1, which the tail-call dispatcher uses to skip the trailing MUSTTAIL return InterpNext(...). The switch-mode loop just re-derives that information at runtime.
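The fallback loop is small enough to reproduce as a toy. The sketch below uses a hypothetical three-opcode set and a plain array of 64-bit words as the program; State, Handler, Handlers, and opReturns are illustrative names, and only the loop shape is meant to match Interpret.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical three-opcode instruction set.
enum Opcode : std::uint16_t { OP_Push, OP_Add, OP_Ret };

struct State {
  std::vector<std::int64_t> Stack;
};

// Each handler may advance PC past its own operands.
using Handler = bool (*)(State &, const std::int64_t *&);

static bool doPush(State &S, const std::int64_t *&PC) {
  S.Stack.push_back(*PC++); // one immediate operand
  return true;
}
static bool doAdd(State &S, const std::int64_t *&) {
  if (S.Stack.size() < 2)
    return false;
  std::int64_t RHS = S.Stack.back();
  S.Stack.pop_back();
  S.Stack.back() += RHS;
  return true;
}
static bool doRet(State &, const std::int64_t *&) { return true; }

// Table indexed by opcode value, like InterpFunctions in the text.
static const Handler Handlers[] = {doPush, doAdd, doRet};

constexpr bool opReturns(Opcode Op) { return Op == OP_Ret; }

// Table-driven loop: read, index, call, and stop after a return opcode.
bool interpret(State &S, const std::int64_t *PC) {
  while (true) {
    Opcode Op = static_cast<Opcode>(*PC++);
    assert(Op <= OP_Ret && "opcode out of range");
    if (!Handlers[Op](S, PC))
      return false;
    if (opReturns(Op))
      break;
  }
  return true;
}
```

In the tail-call configuration the while loop disappears and each handler ends by jumping to the next one; the table and the handlers stay the same.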

The operand stack

InterpStack.h:25-208. Despite the name “stack”, it is not a fixed array. It’s a linked list of 1 MiB chunks (ChunkSize = 1024 * 1024):

struct StackChunk {
  StackChunk *Next;
  StackChunk *Prev;
  uint32_t Size;
  // ... data follows in memory ...
};

Pushes go through grow():

template <size_t Size> void *grow() {
  if (LLVM_UNLIKELY(!Chunk)) {
    Chunk = new (std::malloc(ChunkSize)) StackChunk(Chunk);
  } else if (LLVM_UNLIKELY(Chunk->size() >
                           ChunkSize - sizeof(StackChunk) - Size)) {
    if (Chunk->Next) {
      Chunk = Chunk->Next;
    } else {
      StackChunk *Next = new (std::malloc(ChunkSize)) StackChunk(Chunk);
      Chunk->Next = Next;
      Chunk = Next;
    }
  }
  // bump Chunk->Size and return the slot
}

Two design choices worth noting. Chunks are kept on shrink (peekData / shrink walk back into earlier chunks if needed), so a sequence of pop / push doesn’t thrash the allocator — only a chunk whose predecessor is also empty gets freed. Slots are aligned to alignof(void*): every push rounds the object size up to pointer alignment, so heterogeneous types can sit next to each other without gymnastics.

The other unusual thing is ItemTypes:

/// SmallVector recording the type of data we pushed into the stack.
/// We don't usually need this during normal code interpretation but
/// when aborting, we need type information to call the destructors
/// for what's left on the stack.
llvm::SmallVector<PrimType> ItemTypes;

Hot-path pushes/pops do not read ItemTypes — the bytecode itself encodes which PrimType is on top, so pop<Pointer>() and pop<Integral<32, true>>() know what to expect statically. But on an aborted evaluation, the stack may have arbitrary leftover values whose types the surrounding context has forgotten. ItemTypes lets clearTo() walk down through the chunks calling the right destructors — important because Pointer, Floating, and MemberPointer aren’t trivially destructible.
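The division of labour between typed pops and the side vector can be sketched with a single fixed buffer. Assumptions are plentiful here: MiniStack and ItemKind are hypothetical, there is one chunk instead of a linked list, and clearTo takes an item count rather than whatever the real signature uses.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <string>
#include <utility>
#include <vector>

enum ItemKind { IK_Int, IK_String }; // hypothetical two-type universe

class MiniStack {
  alignas(std::max_align_t) unsigned char Buf[1024];
  std::size_t Top = 0;             // bytes in use
  std::vector<ItemKind> ItemTypes; // one entry per live item

  // Every slot is rounded up to pointer alignment.
  static std::size_t slot(std::size_t N) {
    return (N + alignof(void *) - 1) / alignof(void *) * alignof(void *);
  }

public:
  MiniStack() = default;
  MiniStack(const MiniStack &) = delete;
  MiniStack &operator=(const MiniStack &) = delete;
  ~MiniStack() { clearTo(0); }

  template <typename T> void push(T Value, ItemKind K) {
    static_assert(alignof(T) <= alignof(void *), "slot alignment too small");
    assert(Top + slot(sizeof(T)) <= sizeof(Buf));
    new (Buf + Top) T(std::move(Value));
    Top += slot(sizeof(T));
    ItemTypes.push_back(K);
  }

  // Hot path: the caller already knows T, so ItemTypes is never read.
  template <typename T> T pop() {
    T *P = reinterpret_cast<T *>(Buf + Top - slot(sizeof(T)));
    T Value = std::move(*P);
    P->~T();
    Top -= slot(sizeof(T));
    ItemTypes.pop_back();
    return Value;
  }

  // Abort path: nobody knows the types any more, so ItemTypes decides
  // which destructor each leftover item needs.
  void clearTo(std::size_t NewItemCount) {
    while (ItemTypes.size() > NewItemCount) {
      if (ItemTypes.back() == IK_String)
        (void)pop<std::string>();
      else
        (void)pop<int>();
    }
  }

  std::size_t items() const { return ItemTypes.size(); }
  std::size_t bytes() const { return Top; }
};
```

Only clearTo reads the side vector; a normal pop merely keeps it in sync, which is why the bookkeeping costs so little on the hot path.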

The Block / Pointer model

Now the most distinctive part of the design — and the part that makes this a constexpr VM rather than a generic toy interpreter.

A Block (InterpBlock.h:44) is a contiguous chunk of “VM memory” backing a single allocation: a local variable, a global, a heap allocation, or a temporary. Each block has a Descriptor* describing its type, alignment, layout, and lifetime. The block layout (InterpBlock.h:30-43):

Block*        rawData()                  data()
│               │                         │
▼               ▼                         ▼
┌───────────────┬─────────────────────────┬─────────────────┐
│ Block         │ Metadata                │ Data            │
│ sizeof(Block) │ Desc->getMetadataSize() │ Desc->getSize() │
└───────────────┴─────────────────────────┴─────────────────┘

The Block object carries the type descriptor, an EvalID (so we can detect “this block was allocated by an earlier evaluation and shouldn’t survive into this one”), a pointer chain of all live pointers into it (used to invalidate them when the block dies), and access flags (extern/dead/weak/dummy). The actual data lives after Block’s metadata in the same allocation — data() returns it.

A Pointer (Pointer.h:97) is the constexpr analogue of an lvalue. It’s not just a void*. From Pointer.h:84-96:

Pointee                      Offset
│                              │
▼                              ▼
┌───────┬────────────┬─────────┬────────────────────────────┐
│ Block │ InlineDesc │ InitMap │ Actual Data                │
└───────┴────────────┴─────────┴────────────────────────────┘
                     ▲
                     │
                     Base

A pointer carries:

Pointee: the Block* the pointer is rooted in.
Base: the offset, in bytes, into the block where the current subfield starts. This is what tracks “I’m the .y member of the Point at BS.Pointee”.
Offset: the offset within that subfield. For a primitive the offset is 0 or 1 (one-past-end); for an array it’s the element index times the element size.
Storage tag: blocks aren’t the only kind of pointer. A Pointer can also be an IntPointer (an integer cast to a pointer), a FunctionPointer, or a TypeidPointer. The Storage enum (Pointer.h:65) selects between them.

This split between Base and Offset is what lets the interpreter answer questions like “is this pointer one-past-the-end?” or “is this access into a flexible array member?” without doing arithmetic on raw addresses, and it’s also why narrow()/expand() exist — they re-root the pointer at a sub-object boundary or back at its containing array.
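To illustrate why the split helps, here is a deliberately simplified model that follows the description above rather than the real class. ArrayView and all of its members are hypothetical; the actual Pointer also carries the block, the storage tag, and much more.

```cpp
#include <cstddef>

// Hypothetical stand-in for a pointer into an array subobject.
struct ArrayView {
  std::size_t Base;     // where the array subobject starts in its block
  std::size_t Offset;   // byte offset inside that subobject
  std::size_t ElemSize; // size of one element
  std::size_t NumElems; // number of elements in the array

  constexpr std::size_t index() const { return Offset / ElemSize; }

  // Both checks compare offsets only; no host address is involved.
  constexpr bool isOnePastEnd() const {
    return Offset == ElemSize * NumElems;
  }
  constexpr bool isDereferenceable() const {
    return Offset < ElemSize * NumElems;
  }

  // Moves to element I while staying rooted at the same subobject.
  constexpr ArrayView atIndex(std::size_t I) const {
    return ArrayView{Base, I * ElemSize, ElemSize, NumElems};
  }

  // Position of the pointed-to element relative to the block start.
  constexpr std::size_t blockOffset() const { return Base + Offset; }
};
```

Because Base never changes as the index moves, a bounds question reduces to comparing Offset against the array extent, and two pointers with different Base values are known to designate different subobjects.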

The InlineDescriptor embedded in front of every composite array element / struct field (Descriptor.h:62-119) is what actually tracks “is this field initialized?”, “is this the active member of this union?”, “is this a base-class subobject?”, “is this object’s lifetime started?”. An InlineDescriptor is ~24 bytes of metadata per subobject — expensive in absolute terms, but exactly the metadata you need to enforce C++’s constexpr rules:

struct InlineDescriptor {
  unsigned Offset;
  unsigned IsConst : 1;
  unsigned IsInitialized : 1;
  unsigned IsBase : 1;
  unsigned IsActive : 1;       // active union member
  unsigned InUnion : 1;
  unsigned IsFieldMutable : 1;
  // ...
  Lifetime LifeState;          // Started/NotStarted/Destroyed/Ended
  const Descriptor *Desc;
};

For primitive arrays (e.g. int[10]) you get an InitMapPtr instead of one InlineDescriptor per element: a single bitmap tracking which array elements are initialized (InitMap.h:22). When all elements become initialized, the InitMap is freed and replaced with a sentinel value AllInitializedValue (InitMap.h:84), avoiding the cost of carrying the bitmap around for fully-initialized arrays.
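The drop-the-bitmap idea fits in a small class. This is a sketch, not InitMap: MiniInitMap is a hypothetical name, std::vector<bool> stands in for the bitmap, and a null pointer plays the role that the sentinel value plays in the real code.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

class MiniInitMap {
  // Null means "every element is initialized" (the sentinel state).
  std::unique_ptr<std::vector<bool>> Bits;
  std::size_t Remaining; // elements still uninitialized

public:
  explicit MiniInitMap(std::size_t N)
      : Bits(N ? new std::vector<bool>(N, false) : nullptr), Remaining(N) {}

  bool isInitialized(std::size_t I) const { return !Bits || (*Bits)[I]; }
  bool allInitialized() const { return Remaining == 0; }
  bool holdsBitmap() const { return Bits != nullptr; }

  // Marks element I; frees the bitmap once nothing is left to track.
  void initialize(std::size_t I) {
    assert(!Bits || I < Bits->size());
    if (!Bits || (*Bits)[I])
      return; // already initialized
    (*Bits)[I] = true;
    if (--Remaining == 0)
      Bits.reset();
  }
};
```

After the last element is initialized, queries are answered from the null check alone, so a fully initialized array pays nothing for per-element tracking.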

Function frames and calls

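InterpFrame (InterpFrame.h:27) is the VM’s call frame. Each frame is placement-constructed into a heap buffer of InterpFrame::allocSize(Func) bytes, as Context::Run and Call below show. Its layout in memory: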

+-- InterpFrame --+--- locals ---+--- args ---+
|  fields, etc.   |  (frame      |  (argument |
|                 |   slots)     |   slots)   |
+-----------------+--------------+------------+

Context::Run (Context.cpp:500-516) shows the bottom-frame creation:

bool Context::Run(State &Parent, const Function *Func) {
  InterpState State(Parent, *P, Stk, *this, Func);
  auto Memory = std::make_unique<char[]>(InterpFrame::allocSize(Func));
  InterpFrame *Frame = new (Memory.get()) InterpFrame(
      State, Func, /*Caller=*/nullptr, CodePtr(), Func->getArgSize());
  State.Current = Frame;

  if (Interpret(State)) {
    assert(Stk.empty());
    return true;
  }
  // ...
}

Argument passing is via the operand stack. Function::ParamDescriptor records, for each parameter, the offset in the caller’s stack region from which the callee should fetch it. The diagram in Function.h:91-98:

   Stack position when calling  ─────┐
   this Function                     │
                                     ▼
┌─────┬──────┬────────┬────────┬─────┬────────────────────┐
│ RVO │ This │ Param1 │ Param2 │ ... │                    │
└─────┴──────┴────────┴────────┴─────┴────────────────────┘

The optional RVO slot at the front is for return-value-optimization: when a constexpr function returns a non-primitive (struct, array), the caller pre-allocates space for the result and passes a Pointer to it as an implicit first argument. The function constructs into that pointer instead of returning a value through the stack. This mirrors how Itanium ABI handles non-trivially-copyable returns and avoids needing the VM stack to ever hold a struct value.

Call (Interp.cpp:1747-1837) is the implementation:

bool Call(InterpState &S, CodePtr OpPC, const Function *Func, uint32_t VarArgSize) {
  // ... safety/validity checks ...

  if (!Func->isFullyCompiled())
    compileFunction(S, Func);

  // ... more checks ...

  auto Memory = new char[InterpFrame::allocSize(Func)];
  auto NewFrame = new (Memory) InterpFrame(S, Func, OpPC, VarArgSize);
  InterpFrame *FrameBefore = S.Current;
  S.Current = NewFrame;

  bool Success = Interpret(S);
  // ...
  return true;
}

Two interesting moves here: compileFunction is called lazily on first use (Func->isFullyCompiled() is the gate), and Interpret(S) is recursively called within Call. So the host C++ stack mirrors the VM call stack 1:1 — the VM doesn’t have its own scheduler. A constexpr recursion N levels deep eats N host stack frames, plus N InterpFrame allocations on the heap, plus whatever bytecode is executing at each level. CheckCallDepth (called immediately before frame allocation) bounds this to LangOptions::ConstexprCallDepth.
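The consequence of recursing through the host stack, and the need for a depth guard, shows up even in a toy. In the sketch below MiniState and callSum are hypothetical, the evaluated function is a fixed sum(N) = N + sum(N - 1), and the limit is an arbitrary number passed in by the caller.

```cpp
#include <cassert>

struct MiniState {
  unsigned CallDepth = 0;
  unsigned MaxCallDepth;
  explicit MiniState(unsigned Max) : MaxCallDepth(Max) {}
};

// Evaluates sum(N) = N + sum(N - 1), with sum(0) = 0. Each VM-level call
// is one host-level call, so the depth check also protects the host stack.
bool callSum(MiniState &S, unsigned N, unsigned long long &Result) {
  if (S.CallDepth >= S.MaxCallDepth)
    return false; // depth limit exceeded before any frame is created
  ++S.CallDepth;  // "push" the frame

  bool Ok = true;
  unsigned long long Inner = 0;
  if (N != 0)
    Ok = callSum(S, N - 1, Inner); // nested interpretation

  assert(S.CallDepth > 0);
  --S.CallDepth; // "pop" the frame on success and on failure alike
  if (!Ok)
    return false;
  Result = N + Inner;
  return true;
}
```

A failure deep in the recursion unwinds through ordinary C++ returns, and the depth counter is back at zero by the time the outermost call reports the error.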

Add: a worked opcode

Putting the pieces together. When the compiler sees a + b for two ints:

Compiler::VisitBinaryOperator (Compiler.cpp:1064) classifies both operands as PT_Sint32, visits the LHS (which leaves an Integral<32, true> on the operand stack), visits the RHS, then calls this->emitAdd(PT_Sint32, E).
The TableGen-generated group emitter switches on PT_Sint32 and routes to emitAddSint32, which writes the OP_AddSint32 opcode into the bytecode buffer (no arguments, since Add has no immediate operands).
At interpretation time, InterpNext reads OP_AddSint32, indexes InterpFunctions[OP_AddSint32] to find Interp_AddSint32, and tail-calls it.
Interp_AddSint32 calls Add<PT_Sint32>(S, OpPC).

The Add template is in Interp.h:380-396:

template <PrimType Name, class T = typename PrimConv<Name>::T>
bool Add(InterpState &S, CodePtr OpPC) {
  const T &RHS = S.Stk.pop<T>();
  const T &LHS = S.Stk.pop<T>();
  const unsigned Bits = RHS.bitWidth() + 1;

  if constexpr (isIntegralOrPointer<T>()) {
    if (LHS.isNumber() != RHS.isNumber())
      return AddSubNonNumber<T, std::plus>(S, OpPC, LHS, RHS);
    else if (LHS.isNumber() && RHS.isNumber())
      ; // Fall through to proper addition below.
    else
      return false;
  }

  return AddSubMulHelper<T, T::add, std::plus>(S, OpPC, Bits, LHS, RHS);
}

The interesting case is Integral-shaped types that might be pointer-like (because integers and integers-cast-from-pointers share a representation). For real integers, control falls through to AddSubMulHelper (Interp.h:303-352):

template <typename T, bool (*OpFW)(T, T, unsigned, T *),
          template <typename U> class OpAP>
bool AddSubMulHelper(InterpState &S, CodePtr OpPC, unsigned Bits, const T &LHS,
                     const T &RHS) {
  // Fast path - add the numbers with fixed width.
  T Result;
  if (!OpFW(LHS, RHS, Bits, &Result)) {
    S.Stk.push<T>(Result);
    return true;
  }
  // If we got here, fixed-width add overflowed.
  S.Stk.push<T>(Result);
  // ...
  if (S.Current->getExpr(OpPC)->getType().isWrapType())
    return true;

  // Slow path - compute the result using another bit of precision.
  APSInt Value = OpAP<APSInt>()(LHS.toAPSInt(Bits), RHS.toAPSInt(Bits));
  // ... emit the overflow diagnostic via the Expr's source location ...
  if (!handleOverflow(S, OpPC, Value)) {
    S.Stk.pop<T>();
    return false;
  }
  return true;
}

The fast path is T::add: for Integral<32, true> this is a native 32-bit add with overflow detection. If it reports no overflow, the result goes back on the stack and we’re done. If it overflows, the helper still pushes the wrapped result and returns early when the expression’s type is a wrapping type. Otherwise it takes the slow path: re-compute with APSInt at one extra bit of precision, ask handleOverflow whether the language permits this overflow at this expression (signed arithmetic in C++? UB; with -fwrapv? defined; in __builtin_add_overflow? caller wants the wrap), and either emit a diagnostic via S.Current->getExpr(OpPC) (the AST node the opcode was emitted from) or accept the wrapped value. When handleOverflow rejects the overflow, the helper pops the wrapped value again and returns false.

Notice how the AST is still in the picture at runtime — S.Current->getExpr(OpPC) looks up the original Expr* for the current PC via Function::SrcMap (Source.h:98, populated by the emitter on every emitOp). This is what gives the new interpreter excellent diagnostics: every opcode knows which AST node produced it, so any UB diagnostic carries the original source location and expression text.
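The fast-path and slow-path split can be imitated with built-in integers. This is a sketch under stated assumptions: addFixed, checkedAdd, and AddOutcome are hypothetical names, int64_t stands in for the extra bit of APSInt precision, a WrapAllowed flag replaces the wrap-type check, and no diagnostic machinery is modelled.

```cpp
#include <cassert>
#include <cstdint>

// Fixed-width add. Returns true on overflow; *Result always receives the
// wrapped sum. The unsigned-to-signed cast is modular on mainstream
// compilers (and guaranteed to be from C++20 on).
inline bool addFixed(std::int32_t L, std::int32_t R, std::int32_t *Result) {
  std::uint32_t Wrapped =
      static_cast<std::uint32_t>(L) + static_cast<std::uint32_t>(R);
  *Result = static_cast<std::int32_t>(Wrapped);
  // Overflow iff the result's sign differs from the sign of both operands.
  return ((L ^ *Result) & (R ^ *Result)) < 0;
}

struct AddOutcome {
  bool Ok;            // false means "would be diagnosed"
  std::int32_t Value; // wrapped 32-bit result
  std::int64_t Exact; // mathematically exact sum
};

AddOutcome checkedAdd(std::int32_t L, std::int32_t R, bool WrapAllowed) {
  std::int32_t Result;
  // Slow-path value, computed up front here for simplicity.
  std::int64_t Exact = static_cast<std::int64_t>(L) + R;
  if (!addFixed(L, R, &Result))
    return {true, Result, Exact}; // fast path: no overflow
  assert(Exact != Result);        // overflow means the wrap lost bits
  if (WrapAllowed)
    return {true, Result, Exact}; // wrapping is the defined behaviour
  return {false, Result, Exact};  // caller would report Exact
}
```

The exact value only matters on the overflow branch, where it is what an overflow note would print; the common case needs nothing beyond the 32-bit add and a sign test.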

Two subtleties: speculation and step counting

Step counting. Constexpr evaluation has a step limit (default 1,048,576, configurable via -fconstexpr-steps=N). The interpreter charges steps in InterpState::noteStep (InterpState.cpp:160-169):

bool InterpState::noteStep(CodePtr OpPC) {
  if (InfiniteSteps) return true;
  --StepsLeft;
  if (StepsLeft != 0) return true;
  FFDiag(Current->getSource(OpPC), diag::note_constexpr_step_limit_exceeded);
  return false;
}

noteStep is called only at jumps — Jmp, Jt, Jf (Interp.cpp:60-77). That is, one step per backedge or branch, not one step per opcode. Linear sequences are free; loops cost what they should. The classical evaluator counts steps differently (every statement), so the same code may hit the limit at different points under the two evaluators — something to keep in mind when comparing them.
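The effect of charging only at jumps is easy to simulate. In this sketch StepBudget and runLoop are hypothetical, the loop body is modelled as three uncharged opcodes, and noteStep mirrors the decrement-then-compare shape shown above without any diagnostics.

```cpp
#include <cassert>

struct StepBudget {
  bool InfiniteSteps;
  unsigned StepsLeft; // assumed to start above zero

  // Charges one step; returns false once the budget is used up.
  bool noteStep() {
    if (InfiniteSteps)
      return true;
    assert(StepsLeft > 0);
    --StepsLeft;
    return StepsLeft != 0;
  }
};

// Simulates a loop whose body is three straight-line opcodes followed by
// one jump back to the loop head. Only the jump is charged.
bool runLoop(StepBudget &B, unsigned Iterations, unsigned &BodyOps) {
  for (unsigned I = 0; I < Iterations; ++I) {
    BodyOps += 3; // straight-line work, free of charge
    if (!B.noteStep())
      return false; // budget exhausted on a backedge
  }
  return true;
}
```

With a budget of 4, the fourth backedge fails after twelve body opcodes have run, so the number of opcodes executed before the limit trips depends on how much straight-line code sits between jumps.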

__builtin_constant_p / speculation. __builtin_constant_p(x) needs to ask “would evaluating x succeed in a constexpr context?” without committing to it, and without emitting any diagnostics that the failure path would emit. The new interpreter handles this with a speculation mechanism centered on the BCP opcode (Interp.cpp:2837-2914):

PRESERVE_NONE static bool BCP(InterpState &S, CodePtr &RealPC, int32_t Offset,
                              PrimType PT) {
  size_t StackSizeBefore = S.Stk.size();
  CodePtr PC = RealPC;
  auto SpeculativeInterp = [&S, &PC]() -> bool {
    PushIgnoreDiags(S, PC);
    auto _ = llvm::scope_exit([&]() { PopIgnoreDiags(S, PC); });
    // ... run the speculation ...
  };

  if (SpeculativeInterp()) {
    // Pop the result and push 1.
    S.Stk.push<Integral<32, true>>(Integral<32, true>::from(1));
  } else {
    EndSpeculation(S, RealPC);
    if (!S.inConstantContext())
      return Invalid(S, RealPC);
    S.Stk.clearTo(StackSizeBefore);
    S.Stk.push<Integral<32, true>>(Integral<32, true>::from(0));
  }
  // ...
  RealPC += Offset - ParamSize;
  return true;
}

BCP is a “branch on speculation result”: it runs the embedded subprogram with diagnostics suppressed (the PushIgnoreDiags/PopIgnoreDiags opcodes increment and decrement a counter on InterpState), and on failure snaps the stack back to its pre-speculation height before continuing past the speculation block. The Offset is the bytecode distance to the post-__builtin_constant_p continuation, computed by the emitter.
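The suppress-then-restore pattern can be shown without any bytecode. The sketch below is an analogy only: MiniState, IgnoreDiagsScope, and constantP are hypothetical, the speculated subprogram is a std::function, and the non-constant-context early exit from the real BCP is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

struct MiniState {
  std::vector<long> Stack;
  std::vector<std::string> Diags;
  unsigned IgnoreDiags = 0; // above zero: diagnostics are dropped

  void diag(const std::string &Msg) {
    if (IgnoreDiags == 0)
      Diags.push_back(Msg);
  }
};

// RAII guard: bumps the counter for the duration of a speculation.
struct IgnoreDiagsScope {
  MiniState &St;
  explicit IgnoreDiagsScope(MiniState &State) : St(State) { ++St.IgnoreDiags; }
  ~IgnoreDiagsScope() {
    assert(St.IgnoreDiags > 0);
    --St.IgnoreDiags;
  }
};

// Runs Body speculatively, then pushes 1 if it succeeded and 0 otherwise.
// Whatever Body left on the stack is discarded either way.
void constantP(MiniState &S, const std::function<bool(MiniState &)> &Body) {
  std::size_t SizeBefore = S.Stack.size();
  bool Ok;
  {
    IgnoreDiagsScope Guard(S);
    Ok = Body(S);
  }
  S.Stack.resize(SizeBefore); // snap back to the pre-speculation height
  S.Stack.push_back(Ok ? 1 : 0);
}
```

A counter rather than a boolean lets speculations nest: an inner speculation ending does not switch diagnostics back on for the outer one.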

Why is it still “experimental”?

Two reasons:

Coverage gaps. A handful of corner cases of C++ constant evaluation aren’t yet implemented (some atomic ops, certain vector intrinsics, specific MSVC constexpr extensions). When the new interpreter hits one, it either falls back gracefully or emits an error — the Unsupported opcode (Opcodes.td:837) is exactly the marker for “compile-time bail”.
Diagnostics parity. The classical evaluator has had ten years of users filing bugs against its diagnostics. The new one is mostly there but not quite — every release narrows the gap a little. Tests under clang/test/AST/ByteCode/ mirror the corresponding clang/test/SemaCXX/ cases, comparing output between the two paths.

The flag is opt-in for now, with the eventual goal stated in the original RFC: replace ExprConstant.cpp’s evaluator wholesale.

What to read next

clang/docs/ConstantInterpreter.rst — the original (slightly out-of-date) RFC. Good for the high-level motivation and the type-system overview.
Opcodes.td — the entire opcode set is here. Reading it top-to-bottom is the fastest way to see what the VM can do.
Interp.h — the implementation of every opcode. Every bool OpName(InterpState &, CodePtr OpPC, ...) you see is a real handler reachable from a real bytecode byte.
clang/test/AST/ByteCode/ — small C++ snippets paired with FileCheck assertions. The tests exercise the corner cases of every opcode group; reading them is how you understand “what does this actually do for a class with a virtual base?”.

The new interpreter is a great example of a “small VM” embedded in a much larger codebase — strongly typed, sparse on metadata only where needed (InitMap for primitive arrays vs. per-field InlineDescriptor for composite ones), and shaped end-to-end by what C++ constant evaluation has to be able to express. If you’ve read this far, the next time you see -fexperimental-new-constant-interpreter in a build log, you’ll know exactly which directory the work is happening in.