Control Flow Integrity in LLVM: How the Compiler Fences Off Your Function Pointers || Packed Bits

If an attacker gets memory-corruption primitives in your C++ program — a buffer overflow, a use-after-free, a type confusion — one of the most powerful things they can do is overwrite a function pointer or a vtable pointer. The program then dutifully calls the attacker’s chosen address, and now they’re running code of their choosing. Return-oriented programming (ROP), jump-oriented programming (JOP), and most modern exploitation chains start with exactly this step.

Control Flow Integrity (CFI) is a defense that asks, at every indirect call: “is this pointer a legitimate target for this call site?”. If the pointer isn’t on a whitelist of functions that could legitimately be called through this callsite’s type, the program traps before executing the hijacked call. The attacker’s primitive is still there, but the payoff is gone.

Clang’s CFI implementation (behind -fsanitize=cfi) is a nice thing to study because it sits at the intersection of frontend, middle-end, and linker: Clang emits intrinsics, LLVM lowers them into fast bitset checks, and the linker is responsible for ensuring the bitset reflects every valid target. This post walks through each piece.

All line numbers reference files under /data/dev/llvm-project/. The big files are:

clang/lib/CodeGen/CGExpr.cpp and ItaniumCXXABI.cpp — where Clang emits the CFI intrinsics.
clang/lib/CodeGen/CodeGenModule.cpp — where type identifiers get created.
llvm/lib/Transforms/IPO/LowerTypeTests.cpp — where the middle-end turns intrinsics into runtime checks.
llvm/lib/Transforms/IPO/WholeProgramDevirt.cpp — a related pass that uses the same type metadata for devirtualization.
compiler-rt/lib/cfi/cfi.cpp and compiler-rt/lib/ubsan/ubsan_handlers.cpp — the runtime.

The threat model

The attack CFI is designed to stop looks like this. You have a C++ program:

struct Shape {
  virtual void draw() = 0;
};

struct Circle : Shape {
  void draw() override { ... }
};

void render(Shape *s) {
  s->draw();   // indirect call through vtable
}

Somewhere in the program, a memory safety bug lets an attacker overwrite the bytes at *(void**)s (the vtable pointer). They change it to point into memory they control — a fake vtable whose first slot is the address of some useful gadget in the program, or even into shellcode they’ve injected. When render runs s->draw(), the compiler loads the vtable slot at offset 0 and does an indirect call through it, and now attacker code is executing.

CFI breaks this sequence by inserting a check: “before you call through this pointer, verify that the pointer is a valid vtable for a Shape subclass”. A forged pointer to a fake vtable won’t match; the program traps.

The same idea applies to plain function pointers:

void (*fp)() = get_callback();
fp();   // if fp has been overwritten, we jump wherever attacker says

Here CFI (specifically -fsanitize=cfi-icall) checks “is fp a function pointer of the right signature?”. Only functions that have been declared with the matching signature can satisfy the check; anything else traps.
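
As a rough mental model only (the real check is a bitset lookup, covered later), the guarded call behaves like an allowlist lookup keyed by signature. Everything in this sketch is hypothetical: `AllowedTargets`, `checked_call`, and the callback names are made up for illustration, and real CFI executes a trap instruction where this code throws.

```cpp
#include <cassert>
#include <set>
#include <stdexcept>

using VoidFn = void (*)();

// Hypothetical stand-in for "every function whose type is void()".
// In real CFI this set is read-only data built at link time.
struct AllowedTargets {
  std::set<VoidFn> targets;
  bool contains(VoidFn fp) const { return targets.count(fp) != 0; }
};

static int g_calls = 0;
void on_click() { ++g_calls; }
void on_close() { ++g_calls; }
void takes_int(int) {}  // different signature, so never in the void() set

// Model of a guarded indirect call: refuse anything outside the set.
void checked_call(const AllowedTargets &allowed, VoidFn fp) {
  if (!allowed.contains(fp))
    throw std::runtime_error("CFI check failed");
  fp();
}
```

Both `on_click` and `on_close` belong in the set because they share a signature; a pointer forged from `takes_int` does not, and the call is refused before it happens.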

The CFI flag family

Clang exposes CFI as several overlapping sanitizer modes:

cfi-vcall: check virtual calls (the vtable pointer must be valid for the static type of the object).
cfi-nvcall: check non-virtual member function calls made on an object whose vtable pointer has the wrong dynamic type.
cfi-icall: check indirect calls through plain function pointers.
cfi-mfcall: check indirect calls through member function pointers.
cfi-derived-cast: check that a base-to-derived cast (for example static_cast<Derived*>(base_ptr)) is applied to an object whose vtable is a valid Derived vtable.
cfi-unrelated-cast: check casts from void* or from an unrelated class type to a class pointer against the object’s dynamic type.
cfi-cast-strict: stricter cast policy (catches some cases cfi-derived-cast allows).

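-fsanitize=cfi turns on all of them except cfi-cast-strict, which has to be requested explicitly. All of the schemes need LTO (-flto), because the checks are built from whole-program type information. The underlying mechanism (emit an llvm.type.test intrinsic, let LowerTypeTests lower it) is the same across all of them. The differences are what type identifier is used and where the intrinsic gets inserted.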

The central intrinsic: `llvm.type.test`

Everything flows through one LLVM intrinsic. From llvm/include/llvm/IR/Intrinsics.td:2645:

def int_type_test : DefaultAttrsIntrinsic<[llvm_i1_ty],
                                          [llvm_ptr_ty, llvm_metadata_ty],
                                          [IntrNoMem, IntrSpeculatable]>;

Its signature is:

declare i1 @llvm.type.test(ptr %ptr, metadata %type_id)

The intrinsic answers the question “is %ptr in the set of valid targets identified by %type_id?”. Before LowerTypeTests runs, there’s no real implementation; it’s just a marker the frontend placed for the backend to lower later. After LowerTypeTests runs, the intrinsic is replaced by a sequence of real instructions that performs the bitset check.

The second argument is metadata, not a value — it’s a MDString that names the set. Typical identifiers:

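!"_ZTSFvE": the identifier Clang emits for the unprototyped C function type void(). Every function with that type is a valid target of calls through a matching function pointer. (In C++, where void() means “takes no arguments”, the Itanium mangling is _ZTSFvvE.)
!"_ZTS6Circle": the mangled type of class Circle. Every vtable for Circle or a Circle subclass is a valid vtable for it.
!"all-vtables": a catch-all that asks “is this any vtable in the program at all?”. As far as I can tell from NeedAllVtablesTypeId(), Clang adds it when a vtable-based check runs in diagnostic (non-trapping) mode, so the failure report can distinguish a vtable of the wrong type from something that isn’t a vtable at all.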

The interesting thing about the identifiers is that they’re strings. During the LTO link, LowerTypeTests gathers all functions and vtables that carry metadata tagged with a given string, builds the bitset for that string, and rewrites each check to read from it. This is the “whole program” part of the system: the identifier is a global name that every translation unit agrees on.

There’s also a slightly fancier sibling (Intrinsics.td:2649):

declare { ptr, i1 } @llvm.type.checked.load(ptr %ptr, i32 %offset,
                                            metadata %type_id)

This one loads the pointer at ptr + offset and validates it as a single operation (fused in the IR, not “atomic” in the concurrency sense). It’s used for virtual calls where the load and the check are tightly coupled. The return value is a pair: the loaded function pointer, and a boolean saying whether the load was valid.

How Clang labels your vtables

For the intrinsic to mean anything, each valid target has to be labeled with the matching identifier. For vtables, that labeling happens in CodeGenModule::AddVTableTypeMetadata (clang/lib/CodeGen/CodeGenModule.cpp:8398-8414):

void CodeGenModule::AddVTableTypeMetadata(llvm::GlobalVariable *VTable,
                                          CharUnits Offset,
                                          const CXXRecordDecl *RD) {
  CanQualType T = getContext().getCanonicalTagType(RD);
  llvm::Metadata *MD = CreateMetadataIdentifierForType(T);
  VTable->addTypeMetadata(Offset.getQuantity(), MD);

  if (CodeGenOpts.SanitizeCfiCrossDso)
    if (auto CrossDsoTypeId = CreateCrossDsoCfiTypeId(MD))
      VTable->addTypeMetadata(Offset.getQuantity(),
                              llvm::ConstantAsMetadata::get(CrossDsoTypeId));

  if (NeedAllVtablesTypeId()) {
    llvm::Metadata *MD = llvm::MDString::get(getLLVMContext(), "all-vtables");
    VTable->addTypeMetadata(Offset.getQuantity(), MD);
  }
}

The call VTable->addTypeMetadata(offset, MD) attaches a metadata node of the form !{offset, MD} to the vtable global. For a class like struct Circle : Shape {}, the Clang-emitted vtable looks roughly like:

@_ZTV6Circle = constant { ... } {
  ptr null,                    ; offset-to-top slot
  ptr @_ZTI6Circle,            ; RTTI slot
  ptr @_ZN6Circle4drawEv,      ; Circle::draw
  ...
}, !type !{i64 16, !"_ZTS6Circle"},
   !type !{i64 16, !"_ZTS5Shape"},
   !type !{i64 16, !"all-vtables"}

Three metadata entries at offset 16 (the vtable’s “address point”, just past the two 8-byte header slots for offset-to-top and RTTI). The entries say:

At offset 16, there’s a vtable compatible with type _ZTS6Circle.
At offset 16, there’s a vtable compatible with type _ZTS5Shape (since Circle : Shape, any use that requires a Shape vtable is satisfied too).
At offset 16, there’s a vtable of some kind (the generic all-vtables tag, which AddVTableTypeMetadata adds whenever NeedAllVtablesTypeId() is true).

Because each class’s vtable is labeled with every type in its hierarchy, the linker can assemble the bitset for _ZTS5Shape to include every derived-class vtable in the program, for every derivation.
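
To make the labeling concrete, here is a small model of what the !type entries record. This is a sketch, not Clang code: `TypeRegistry`, `AddressPoint`, and `buildExample` are hypothetical names, and the `_ZTV5Shape` entry assumes Shape’s own vtable is emitted too.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <utility>

// An address point: (vtable symbol, byte offset into that global).
using AddressPoint = std::pair<std::string, unsigned>;

// Hypothetical registry: type identifier -> every address point tagged
// with it, mimicking what the !type metadata records.
struct TypeRegistry {
  std::map<std::string, std::set<AddressPoint>> members;

  // Mirrors the effect of VTable->addTypeMetadata(Offset, TypeId).
  void addTypeMetadata(const std::string &vtable, unsigned offset,
                       const std::string &typeId) {
    members[typeId].insert({vtable, offset});
  }

  bool isValid(const std::string &typeId, const AddressPoint &ap) const {
    auto it = members.find(typeId);
    return it != members.end() && it->second.count(ap) != 0;
  }
};

// Each vtable carries its own type plus every base type.
TypeRegistry buildExample() {
  TypeRegistry r;
  r.addTypeMetadata("_ZTV5Shape", 16, "_ZTS5Shape");
  r.addTypeMetadata("_ZTV6Circle", 16, "_ZTS6Circle");
  r.addTypeMetadata("_ZTV6Circle", 16, "_ZTS5Shape");
  return r;
}
```

Asking for `_ZTS5Shape` accepts Circle’s address point, while asking for `_ZTS6Circle` rejects Shape’s: the set for a base class is always a superset of the sets for its derived classes.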

For functions, the labeling happens via function-level !type metadata, generated from CreateMetadataIdentifierForFnType (CodeGenModule.cpp:8361-8368):

llvm::Metadata *CodeGenModule::CreateMetadataIdentifierForFnType(QualType T) {
  assert(isa<FunctionType>(T));
  T = GeneralizeFunctionType(
      getContext(), T, getCodeGenOpts().SanitizeCfiICallGeneralizePointers);
  if (getCodeGenOpts().SanitizeCfiICallGeneralizePointers)
    return CreateMetadataIdentifierGeneralized(T);
  return CreateMetadataIdentifierForType(T);
}

Every void(int) function in the program gets the same identifier; every int(char*, int) gets a different one; and so on. The canonical type is what determines the string (via the Itanium mangler).
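
The grouping can be imitated in plain C++ by keying functions on their function type. This is only an analogy: `std::type_index` stands in for the mangled string, and `tag` plus the three sample functions are hypothetical.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <typeindex>
#include <typeinfo>
#include <vector>

void log_int(int) {}
void set_level(int) {}
int parse(char *, int) { return 0; }

using Groups = std::map<std::type_index, std::vector<std::string>>;

// Record `name` under the function type Fn. Two functions land in the
// same group exactly when their function types are identical.
template <typename Fn>
void tag(Groups &groups, Fn &, const std::string &name) {
  groups[std::type_index(typeid(Fn))].push_back(name);
}
```

Tagging `log_int` and `set_level` puts both under `void(int)`, while `parse` gets a group of its own, which is the partition the type identifiers induce on the program’s functions.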

Where Clang emits the check: virtual calls

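For an ordinary virtual call such as s->draw(), Clang tests the vtable pointer itself (see CodeGenFunction::EmitVTablePtrCheckForCall in CGClass.cpp). The closely related code for calls through member function pointers lives in clang/lib/CodeGen/ItaniumCXXABI.cpp:700-820 and shows the pattern compactly; there the test is applied to the address of the vtable slot. The core snippet (ItaniumCXXABI.cpp:766-774):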

if (ShouldEmitCFICheck || ShouldEmitWPDInfo) {
  llvm::Value *VFPAddr = Builder.CreateGEP(CGF.Int8Ty, VTable, VTableOffset);
  llvm::Intrinsic::ID IID = CGM.HasHiddenLTOVisibility(RD)
                                ? llvm::Intrinsic::type_test
                                : llvm::Intrinsic::public_type_test;
  CheckResult = Builder.CreateCall(CGM.getIntrinsic(IID), {VFPAddr, TypeId});
}

For a call s->draw(), the IR ends up looking like:

%vtable = load ptr, ptr %s
%cfi_ok = call i1 @llvm.type.test(ptr %vtable, metadata !"_ZTS5Shape")
br i1 %cfi_ok, label %cont, label %trap

trap:
  call void @llvm.ubsantrap(i8 2)
  unreachable

cont:
  %vfn_addr = getelementptr inbounds ptr, ptr %vtable, i64 0   ; draw is slot 0
  %vfn = load ptr, ptr %vfn_addr
  call void %vfn(ptr %s)

The check is a boolean: “is %vtable one of the legitimate address points for Shape or a class derived from it?”. Only if yes does execution fall through to the real virtual call. The test is applied to the vtable pointer itself, which is the address that the !type entries at offset 16 of each vtable global describe; draw is the first virtual function, so its slot sits at offset 0 from that address point.

The sister intrinsic llvm.type.checked.load fuses the load and the check. It shows up when Clang is also targeting Virtual Function Elimination (VFE) optimizations. From ItaniumCXXABI.cpp:749-762:

if (ShouldEmitVFEInfo) {
  llvm::Value *VFPAddr = Builder.CreateGEP(CGF.Int8Ty, VTable, VTableOffset);
  llvm::Value *CheckedLoad = Builder.CreateCall(
      CGM.getIntrinsic(llvm::Intrinsic::type_checked_load),
      {VFPAddr, llvm::ConstantInt::get(CGM.Int32Ty, 0), TypeId});
  CheckResult = Builder.CreateExtractValue(CheckedLoad, 1);
  VirtualFn = Builder.CreateExtractValue(CheckedLoad, 0);
}

This collapses %vfn_addr = gep ..., %cfi_ok = call @type.test ..., and %vfn = load ... into a single intrinsic. Middle-end optimizations have an easier time reasoning about one intrinsic than about a pointer-load-and-check pattern.

Where Clang emits the check: indirect calls

Plain indirect calls go through CodeGenFunction::EmitCallee and friends in CGExpr.cpp. The CFI insertion is at clang/lib/CodeGen/CGExpr.cpp:6969-6997:

if (SanOpts.has(SanitizerKind::CFIICall) &&
    (!TargetDecl || !isa<FunctionDecl>(TargetDecl)) && !CFIUnchecked) {
  auto CheckOrdinal = SanitizerKind::SO_CFIICall;
  auto CheckHandler = SanitizerHandler::CFICheckFail;
  SanitizerDebugLocation SanScope(this, {CheckOrdinal}, CheckHandler);
  EmitSanitizerStatReport(llvm::SanStat_CFI_ICall);

  llvm::Metadata *MD = CGM.CreateMetadataIdentifierForFnType(QualType(FnType, 0));
  llvm::Value *TypeId = llvm::MetadataAsValue::get(getLLVMContext(), MD);
  llvm::Value *CalleePtr = Callee.getFunctionPointer();
  llvm::Value *TypeTest = Builder.CreateCall(
      CGM.getIntrinsic(llvm::Intrinsic::type_test), {CalleePtr, TypeId});

  auto CrossDsoTypeId = CGM.CreateCrossDsoCfiTypeId(MD);
  llvm::Constant *StaticData[] = {
      llvm::ConstantInt::get(Int8Ty, CFITCK_ICall),
      EmitCheckSourceLocation(E->getBeginLoc()),
      EmitCheckTypeDescriptor(QualType(FnType, 0)),
  };
  if (CGM.getCodeGenOpts().SanitizeCfiCrossDso && CrossDsoTypeId) {
    EmitCfiSlowPathCheck(CheckOrdinal, TypeTest, CrossDsoTypeId, CalleePtr,
                         StaticData);
  } else {
    EmitCheck(std::make_pair(TypeTest, CheckOrdinal), CheckHandler,
              StaticData, {CalleePtr, llvm::UndefValue::get(IntPtrTy)});
  }
}

The type identifier comes from CreateMetadataIdentifierForFnType(QualType(FnType, 0)). If the call site sees a void (*)(int), the type id is the mangling of void(int). Only functions labeled with that same mangling are legal targets.

Here’s an example from the test suite (clang/test/CodeGen/cfi-icall.c):

void f() {}
void xf();

void g(int b) {
  void (*fp)() = b ? f : xf;
  fp();
}

IR after Clang:

define void @f() !type !0 !type !1 { ret void }
declare void @xf() !type !0 !type !1

define void @g(i32 %b) {
  %cond = icmp ne i32 %b, 0
  %fp = select i1 %cond, ptr @f, ptr @xf
  %cfi_ok = call i1 @llvm.type.test(ptr %fp, metadata !"_ZTSFvE")
  br i1 %cfi_ok, label %cont, label %trap

trap:
  call void @llvm.ubsantrap(i8 2)
  unreachable

cont:
  call void %fp()
  ret void
}

!0 = !{i64 0, !"_ZTSFvE"}
!1 = !{i64 0, !"_ZTSFvE.generalized"}

f and xf both got labeled with _ZTSFvE (the identifier Clang uses for the unprototyped C type void()), so they’re both legal targets for the indirect call. A function of any other type, say void(int), wouldn’t carry this label and would fail the check.

How LowerTypeTests lowers the intrinsic

The intrinsic is just a placeholder. The pass LowerTypeTests (llvm/lib/Transforms/IPO/LowerTypeTests.cpp) is what turns @llvm.type.test into actual instructions. It runs late (typically during LTO) because it needs to see the whole module’s type metadata to build the bitsets.

The core algorithm is bitset-based. For a type identifier T:

Collect every global (function or vtable) tagged with !type !{offset, T} — the set of valid targets.
Arrange those globals’ addresses in a compact layout (they don’t need to be adjacent in the source, but they can be made adjacent via a linker-coordinated combined global).
Build a bitset where each bit represents one possible address-slot at the chosen alignment.
Replace the @llvm.type.test call with: “compute offset from the start of the combined global, bounds-check, then look up the bit in the bitset”.
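
The four steps above can be sketched in a few lines. This is a simplified model, not the pass’s code: `SimpleBitSet` and `buildBitSet` are hypothetical names, the input is a list of byte offsets inside the combined global, and bits are indexed forward from the first target for readability.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

struct SimpleBitSet {
  uint64_t byteOffset = 0;  // offset of the first valid target
  unsigned alignLog2 = 0;   // log2 of the alignment all targets share
  uint64_t bitSize = 0;     // number of slots covered
  std::set<uint64_t> bits;  // slots that hold a valid target

  bool contains(uint64_t offset) const {
    if (offset < byteOffset) return false;
    uint64_t delta = offset - byteOffset;
    if (delta & ((uint64_t(1) << alignLog2) - 1)) return false;  // misaligned
    uint64_t slot = delta >> alignLog2;
    return slot < bitSize && bits.count(slot) != 0;
  }
};

SimpleBitSet buildBitSet(const std::vector<uint64_t> &offsets) {
  assert(!offsets.empty());
  uint64_t lo = *std::min_element(offsets.begin(), offsets.end());
  uint64_t hi = *std::max_element(offsets.begin(), offsets.end());
  // OR together the distances from the first target; the count of trailing
  // zero bits is the largest power-of-two alignment they all share.
  uint64_t mask = 0;
  for (uint64_t o : offsets) mask |= o - lo;
  unsigned alignLog2 = 0;
  while (mask != 0 && ((mask >> alignLog2) & 1) == 0) ++alignLog2;

  SimpleBitSet bs;
  bs.byteOffset = lo;
  bs.alignLog2 = alignLog2;
  bs.bitSize = ((hi - lo) >> alignLog2) + 1;
  for (uint64_t o : offsets) bs.bits.insert((o - lo) >> alignLog2);
  return bs;
}
```

For targets at offsets 16, 48, and 144 the shared alignment is 32 bytes, so the region collapses to five slots of which three are set.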

The bitset data structure is BitSetInfo (LowerTypeTests.cpp:136-200):

struct BitSetInfo {
  uint64_t ByteOffset;
  uint64_t AlignLog2;
  uint64_t BitSize;
  std::set<uint64_t> Bits;

  bool containsGlobalOffset(uint64_t Offset) const {
    if (Offset < ByteOffset) return false;
    if ((Offset - ByteOffset) % (uint64_t(1) << AlignLog2) != 0) return false;
    uint64_t BitOffset = (Offset - ByteOffset) >> AlignLog2;
    if (BitOffset >= BitSize) return false;
    return Bits.count(BitSize - 1 - BitOffset);
  }
};

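Reading that: given an offset, first check it’s at or past ByteOffset (start of the bitset’s covered region), then check it’s properly aligned (AlignLog2 bits of trailing zero), then check it’s within BitSize slots, then look up the bit. The lookup index is BitSize - 1 - BitOffset, so the bits are stored in reverse order; that lines up with the lowered check, which subtracts the pointer from a reference address rather than the other way around.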

The AlignLog2 trick compresses the bitset. If every valid target is 16-byte aligned, we only need one bit per 16 bytes of address space, not one bit per byte. This is why, for a vtable or jump-table region that might span many kilobytes of address space, the bitset itself can be tens of bytes.
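
The arithmetic is easy to check. A minimal helper, with the hypothetical name `bitsetBytes`, computes how many bytes a bitset needs for a region of a given size at a given alignment:

```cpp
#include <cassert>
#include <cstdint>

// Bytes needed for a bitset that spends one bit per 2^alignLog2-byte slot
// over a region spanning spanBytes bytes.
uint64_t bitsetBytes(uint64_t spanBytes, unsigned alignLog2) {
  assert(alignLog2 < 64);
  uint64_t slotSize = uint64_t(1) << alignLog2;
  uint64_t slots = (spanBytes + slotSize - 1) / slotSize;  // round up
  return (slots + 7) / 8;                                  // 8 bits per byte
}
```

An 8 KiB region at 16-byte alignment has 512 slots and needs 64 bytes of bitset; the same region at byte granularity would need 1024 bytes.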

The rotate trick

The generated check uses a clever sequence. From lowerTypeTestCall (LowerTypeTests.cpp:735-820):

Value *PtrOffset = B.CreateSub(OffsetedGlobalAsInt, PtrAsInt);

Value *BitOffset = B.CreateIntrinsic(IntPtrTy, Intrinsic::fshr,
                                     {PtrOffset, PtrOffset, TIL.AlignLog2});
Value *OffsetInRange = B.CreateICmpULE(BitOffset, TIL.SizeM1);

What’s going on: after subtracting the tested pointer from the base, we get an offset that should be a multiple of the alignment. To check alignment and range at once, the code does a funnel shift right (fshr) by AlignLog2 bits. This rotates the low-order bits (which should be zero) into the high-order bits.

If the pointer was aligned, the high bits after the rotate are zero (they came from the zero-valued low bits).
If the pointer was misaligned, the high bits after the rotate are non-zero (they came from the non-zero low bits of the offset).

A subsequent <= comparison against SizeM1 (the bitset size minus 1) then catches both “misaligned” (would be a huge value with high bits set) and “out of range” (would be > SizeM1) with a single unsigned compare. Elegant.
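
Here is the trick in isolation, as a sketch you can run. The names `rotr64` and `inRangeAndAligned` are hypothetical, and the sketch assumes the reference address is that of the last valid slot, so valid pointers produce small non-negative offsets.

```cpp
#include <cassert>
#include <cstdint>

// Rotate right by n bits: what fshr(x, x, n) computes for a 64-bit value.
uint64_t rotr64(uint64_t x, unsigned n) {
  n &= 63;
  return n == 0 ? x : (x >> n) | (x << (64 - n));
}

// Combined alignment and range check with a single unsigned compare.
// lastTarget: address of the last valid slot (assumption of this sketch).
// sizeM1:     number of slots minus one.
bool inRangeAndAligned(uint64_t ptr, uint64_t lastTarget,
                       unsigned alignLog2, uint64_t sizeM1) {
  uint64_t ptrOffset = lastTarget - ptr;  // wraps to huge if ptr > lastTarget
  uint64_t bitOffset = rotr64(ptrOffset, alignLog2);
  return bitOffset <= sizeM1;
}
```

With four 16-byte slots at 0x1000 through 0x1030, the pointer 0x1008 rotates its stray low bits into the top of the word and fails the compare, exactly like the out-of-range pointers 0x0ff0 and 0x1040 do.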

If the check passes, the generated code then either:

Reads the actual bit from a byte-array bitset: createBitSetTest (LowerTypeTests.cpp:667-692), or
Inlines the bitset as a constant when it’s small (64 bits or less), or
Just returns true if the range is fully valid (the AllOnes case, where every address in the range is valid).

The final test is maybe 3-5 machine instructions on x86: sub, ror, cmp, possibly a mov and a test. Cheap enough to insert at every indirect call without destroying performance.

Building bitsets from multiple globals

Because the bitset works off pointer-subtraction, the valid targets have to be laid out somewhere the pass knows about. buildBitSetsFromGlobalVariables (LowerTypeTests.cpp:824-894) does exactly that:

void LowerTypeTestsModule::buildBitSetsFromGlobalVariables(
    ArrayRef<Metadata *> TypeIds, ArrayRef<GlobalTypeMember *> Globals) {
  std::vector<Constant *> GlobalInits;
  const DataLayout &DL = M.getDataLayout();
  DenseMap<GlobalTypeMember *, uint64_t> GlobalLayout;
  Align MaxAlign;
  uint64_t CurOffset = 0;
  uint64_t DesiredPadding = 0;
  for (GlobalTypeMember *G : Globals) {
    auto *GV = cast<GlobalVariable>(G->getGlobal());
    Align Alignment =
        DL.getValueOrABITypeAlignment(GV->getAlign(), GV->getValueType());
    MaxAlign = std::max(MaxAlign, Alignment);
    uint64_t GVOffset = alignTo(CurOffset + DesiredPadding, Alignment);
    GlobalLayout[G] = GVOffset;
    // ... padding and init insertion ...
  }
  Constant *NewInit = ConstantStruct::getAnon(M.getContext(), GlobalInits);
  auto *CombinedGlobal = new GlobalVariable(M, NewInit->getType(), true,
                         GlobalValue::PrivateLinkage, NewInit);
  lowerTypeTestCalls(TypeIds, CombinedGlobal, GlobalLayout);
}

The pass physically merges the individual globals into one big combined global, with careful padding so every original global starts at a power-of-two-aligned offset. Then it creates aliases from the original symbols into the combined global, so existing references still work. The bitset’s ByteOffset is the start of the combined global; AlignLog2 is the chosen alignment; the set of one-bits corresponds to which offsets contain valid globals.
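
The offset assignment at the heart of that loop can be sketched on its own. This version is deliberately simplified: it honors each global’s alignment but leaves out the extra padding the pass inserts, and `GlobalDesc` and `layout` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct GlobalDesc {
  uint64_t size;   // size of the global in bytes
  uint64_t align;  // required alignment, a power of two
};

// Round value up to the next multiple of align.
uint64_t alignTo(uint64_t value, uint64_t align) {
  assert(align != 0 && (align & (align - 1)) == 0);
  return (value + align - 1) & ~(align - 1);
}

// Give each global an offset inside one combined global, in order.
std::vector<uint64_t> layout(const std::vector<GlobalDesc> &globals) {
  std::vector<uint64_t> offsets;
  uint64_t cur = 0;
  for (const GlobalDesc &g : globals) {
    uint64_t off = alignTo(cur, g.align);
    offsets.push_back(off);
    cur = off + g.size;
  }
  return offsets;
}
```

Three globals of sizes 24, 40, and 8 with alignments 8, 16, and 8 end up at offsets 0, 32, and 72; those offsets are what the bitset is then built over.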

Functions and jump tables

For functions, LowerTypeTests uses a different strategy: a jump table. buildBitSetsFromFunctions (LowerTypeTests.cpp:1390) emits a small table of jmp instructions, one per valid function, all aligned. Then every call through a valid function pointer goes through the jump table, and the bitset check becomes “is this pointer in the jump table’s address range?”.

On x86-64, each jump-table entry looks like:

jmp func@plt
int3
int3
int3

Eight bytes, aligned. The bitset is dense (every 8-byte slot is a valid target), so for function-pointer CFI the test often degenerates to “is pointer an entry in [jump_table_start, jump_table_end)?”, which the rotate trick above answers with a subtract, a rotate, and one unsigned compare: a handful of cycles.
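
Written out without the rotate, the dense case is just index arithmetic. A sketch, assuming the 8-byte entry size described above; `jumpTableIndex` is a hypothetical helper, not something the pass emits.

```cpp
#include <cassert>
#include <cstdint>

const uint64_t kEntrySize = 8;  // bytes per x86-64 jump-table entry

// Returns the index of the entry that ptr points at, or -1 if ptr is not
// the start of an entry in [tableStart, tableStart + count * kEntrySize).
long jumpTableIndex(uint64_t ptr, uint64_t tableStart, uint64_t count) {
  uint64_t delta = ptr - tableStart;       // wraps if ptr < tableStart
  if (delta % kEntrySize != 0) return -1;  // points into the middle of an entry
  uint64_t index = delta / kEntrySize;
  return index < count ? static_cast<long>(index) : -1;
}
```

A pointer below the table wraps around to an enormous index and is rejected by the same comparison that rejects pointers past the end.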

An amusing side effect: if you dump the disassembly of a CFI-enabled binary, there’s a huge region of jmp func; int3; int3; int3 sequences. That’s the jump table, and it’s what CFI uses to constrain which functions can be reached via indirect call.

Whole-program devirtualization uses the same metadata

There’s a companion pass, WholeProgramDevirt (llvm/lib/Transforms/IPO/WholeProgramDevirt.cpp), that uses the same !type metadata for a different purpose: to devirtualize virtual calls.

When the whole program is visible (LTO), and a virtual call site has only one possible implementation across the entire program, WPD can replace the indirect call with a direct call. The resolution kinds are defined in ModuleSummaryIndex.h (around line 1292):

enum Kind {
  Indir,        // Regular indirect call, no optimization
  SingleImpl,   // Replace with direct call to the single implementation
  BranchFunnel, // Retpoline-safe variant
};

Plus per-argument optimizations:

enum Kind {
  UniformRetVal,    // All implementations return the same value
  UniqueRetVal,     // Use vtable identity to pick the right value
  VirtualConstProp, // Embed computed return values in the vtable
};

SingleImpl is the big one. Consider:

struct Shape { virtual double area() const = 0; };
struct Circle : Shape { double r; double area() const override { return 3.14 * r * r; } };

// Used as: for (auto &s : shapes) total += s.area();

If Circle is the only subclass of Shape in the program, WPD can replace s.area() with a direct call to Circle::area. Which means: no vtable load, no indirect call, no CFI check, no branch misprediction overhead from the indirect call.
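
You can write the effect of a SingleImpl resolution by hand to see what changes. This is an illustration of the transformation’s result, not of how WPD performs it; `totalVirtual` and `totalDevirtualized` are hypothetical names.

```cpp
#include <cassert>
#include <vector>

struct Shape {
  virtual ~Shape() = default;
  virtual double area() const = 0;
};

struct Circle : Shape {
  double r;
  explicit Circle(double radius) : r(radius) {}
  double area() const override { return 3.14 * r * r; }
};

// What the source says: one virtual call per element.
double totalVirtual(const std::vector<Circle> &shapes) {
  double total = 0;
  for (const Shape &s : shapes) total += s.area();
  return total;
}

// Hand-written equivalent of SingleImpl: the qualified name binds the call
// directly to Circle::area, with no vtable load. Only correct if Circle is
// the sole implementation in the whole program.
double totalDevirtualized(const std::vector<Circle> &shapes) {
  double total = 0;
  for (const Shape &s : shapes)
    total += static_cast<const Circle &>(s).Circle::area();
  return total;
}
```

Both functions compute the same sum; the second one gives the optimizer a direct call it can inline.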

WPD and CFI are complementary. CFI protects the indirect calls you have. WPD removes the indirect calls it can prove are unnecessary. Both rely on the same type metadata to reason about “which functions are valid at this call site?”.

Cross-DSO CFI

CFI within a single compilation (or a single LTO unit) is straightforward because the linker sees everything. But what about calls across shared library boundaries? Your main executable knows nothing about the type metadata in libfoo.so; the fast-path bitset can’t include functions it can’t see.

The solution is cross-DSO CFI. When cross-DSO is enabled, each DSO emits its own __cfi_check function — a weakly-linked symbol that knows its own type metadata. On a CFI check where the target is in another DSO:

The caller does its normal fast-path bitset check against its own type metadata.
If the check fails, instead of immediately trapping, the caller calls __cfi_slowpath.
__cfi_slowpath looks up which DSO the target pointer belongs to (via the runtime’s book of loaded libraries), then calls that DSO’s __cfi_check.
The target’s __cfi_check validates the pointer against its own metadata.
If the validation passes, execution returns and the call goes through. If it fails, __cfi_check_fail is called, which ultimately traps or reports.
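
The dispatch in those steps can be modeled in a few lines. This is a toy: `Dso`, `slowPath`, and `crossDsoCheck` are hypothetical names, the per-DSO callback stands in for `__cfi_check`, and the sketch rejects addresses that belong to no known DSO, which is a simplification and not a claim about what the real runtime does.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical model of one loaded DSO: its address range and a callback
// standing in for that DSO's __cfi_check.
struct Dso {
  uint64_t begin;
  uint64_t end;  // range is [begin, end)
  std::function<bool(uint64_t typeId, uint64_t target)> cfiCheck;
};

// Model of __cfi_slowpath: find the DSO that owns target and delegate.
bool slowPath(const std::vector<Dso> &loaded, uint64_t typeId,
              uint64_t target) {
  for (const Dso &d : loaded)
    if (target >= d.begin && target < d.end)
      return d.cfiCheck(typeId, target);
  return false;  // simplification: unknown addresses are rejected
}

// Fast path first; only a failed local check pays for the slow path.
bool crossDsoCheck(bool fastPathOk, const std::vector<Dso> &loaded,
                   uint64_t typeId, uint64_t target) {
  return fastPathOk || slowPath(loaded, typeId, target);
}
```

The shape to notice is that the caller never needs the other library’s bitsets; it only needs a way to find the owner of an address and ask it.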

The stub definition is in CodeGenFunction::EmitCfiCheckStub (clang/lib/CodeGen/CGExpr.cpp:4329-4361):

llvm::Function *F = llvm::Function::Create(
    llvm::FunctionType::get(VoidTy, {Int64Ty, VoidPtrTy, VoidPtrTy}, false),
    llvm::GlobalValue::WeakAnyLinkage, "__cfi_check", M);

llvm::BasicBlock *BB = llvm::BasicBlock::Create(Ctx, "entry", F);
SmallVector<llvm::Value*> Args{F->getArg(2), F->getArg(1)};
llvm::CallInst::Create(M->getFunction("__cfi_check_fail"), Args, "", BB);
llvm::ReturnInst::Create(Ctx, nullptr, BB);

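Each compilation unit contributes a weak __cfi_check. At link time they collapse into one per DSO. The stub above only forwards to __cfi_check_fail; during LTO the CrossDSOCFI pass replaces its body with a dispatch over the type identifiers that DSO knows about. Each DSO’s __cfi_check therefore knows its own type metadata, and the runtime wiring routes cross-DSO calls through the right one.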

This scheme has a performance cost — the slow path is slower — but it’s correct across dynamic linking, and the fast path is unaffected for same-DSO calls (the common case).

The failure path

When a CFI check fails (the bitset says the pointer isn’t valid), you end up in one of two places depending on configuration:

Trap mode (the default for CFI; spelled explicitly as -fsanitize-trap=cfi): @llvm.ubsantrap(i8 N) is invoked, which compiles to a trapping instruction (an undefined-instruction trap such as ud1 or ud2 on x86-64). The process dies on the spot, typically with SIGILL on Linux. Fast, silent, and no runtime dependencies.
Diagnostic mode (-fno-sanitize-trap=cfi): Clang emits a call to __ubsan_handle_cfi_check_fail (defined in compiler-rt/lib/ubsan/ubsan_handlers.cpp:923-940):

void __ubsan::__ubsan_handle_cfi_check_fail(CFICheckFailData *Data,
                                             ValueHandle Value,
                                             ValueHandle ValidVtable) {
  handleCFICheckFail(Data, Value, ValidVtable, Opts);
}

The handler inspects the Data struct (which encodes the source location and the kind of CFI check), formats a diagnostic message (for an indirect call, something along the lines of “control flow integrity check for type 'void (int)' failed during indirect function call”), and then either continues (for handle_cfi_check_fail) or aborts (for handle_cfi_check_fail_abort).

For production, trap mode is the common choice: tiny code size overhead, no runtime dependency, deterministic behavior on failure. For debug builds or sanitizer runs, diagnostic mode gives you a report pointing at the call site where the check failed, which is where the corruption was detected and not necessarily where it happened.

The pieces, put together

Let me try to sketch the full lifecycle of one CFI-protected indirect call:

┌─ Your C++ code ─┐
│  void (*fp)();  │
│  fp();          │
└────────┬────────┘
         │
         ▼ Clang frontend
┌────────────────────────────┐
│ call i1 @llvm.type.test(   │
│    ptr %fp,                │
│    metadata !"_ZTSFvE")    │
│ br i1 %res, %ok, %trap     │
└────────┬───────────────────┘
         │
         ▼ LLVM middle-end
         │ (possibly WPD first — devirtualizes what it can)
         │
         ▼ LowerTypeTests (in LTO)
┌────────────────────────────────┐
│ (collect all !type !"_ZTSFvE") │
│ (emit jump table or bitset)    │
│ (lower @type.test to:          │
│   ptrtoint, sub, ror, cmp, br) │
└────────┬───────────────────────┘
         │
         ▼ Backend + linker
┌────────────────────────────────┐
│ Real machine instructions      │
│ that check: is %fp in the      │
│ jump table for void(void)?     │
│ If yes, call it.               │
│ If no, ud2 (trap).             │
└────────────────────────────────┘

At runtime, each indirect call now has a handful of extra cycles of check. The overhead is measured in percentage points, not multiples. For the security guarantee (an attacker can’t redirect an indirect or virtual call to an arbitrary address, no matter how much memory they corrupt) it’s an excellent trade. Return addresses are a separate problem that these checks don’t cover.

Why it works

The safety property CFI relies on is that the data the checks consult lives in read-only memory. The bitsets, jump tables, and vtables derived from the !type annotations end up in read-only sections of the linked binary. An attacker with a heap overflow or a dangling pointer can’t modify those sections (they’re write-protected). So the bitset is trustworthy: its contents reflect what was determined at link time by the compiler and linker working together.

The attacker’s options are reduced to: find a pointer in the bitset that happens to be useful for exploitation. For function-pointer CFI, that’s “find a function with my target’s signature that does something useful”, which is a far smaller pool than arbitrary gadgets and often contains nothing usable. For vtable CFI, that’s “find a subclass vtable that has my desired method at the right offset”, which is rarely enough for the full chain the attacker needs.

There are real attacks against CFI (control-flow bending, data-only attacks that don’t hijack control flow, etc.), but raising the bar from “any gadget anywhere” to “must be a specific type of function” eliminates the vast majority of off-the-shelf exploitation techniques.
