How LLMs Reason About Code Vulnerabilities

This is my first note in what I hope becomes a running record of things I work through, understand, and sometimes get wrong before getting right.

The core question

Can a model that learned from text predict whether a piece of code is vulnerable — without being told the rules?

The short answer is: sometimes, and in ways that are still surprising even to researchers in the field. The longer answer is what I want to work through here.

Pattern matching vs. reasoning

The dominant failure mode I see in LLM-based vulnerability detection is shallow pattern matching. A model trained on enough CVE descriptions and patch commits learns to recognize surface-level signals: a strcpy call, an unchecked return value, a malloc without a corresponding free. These signals correlate with bugs, but the correlation is loose: plenty of strcpy calls are bounded, and plenty of real vulnerabilities carry none of these markers.
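To make that concrete, here is a toy pair of functions (my own illustration with made-up names, not something from a benchmark). Both contain the textbook strcpy signal; only the first is vulnerable, but a detector keyed on the surface pattern flags both.

```c
#include <stdio.h>
#include <string.h>

#define BUF_LEN 64

/* Surface signal and a real bug: strcpy into a fixed-size stack buffer
 * with no length check, so any input of BUF_LEN bytes or more overflows. */
void copy_unchecked(const char *src) {
    char buf[BUF_LEN];
    strcpy(buf, src);               /* classic overflow pattern */
    puts(buf);
}

/* Same surface signal, no bug: the length is validated before the copy,
 * so a detector that flags every strcpy reports a false positive here. */
int copy_checked(const char *src) {
    char buf[BUF_LEN];
    if (strlen(src) >= BUF_LEN)
        return -1;                  /* reject oversized input */
    strcpy(buf, src);               /* bounded by the check above */
    puts(buf);
    return 0;
}
```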

The harder problem is semantic understanding — can the model trace the flow of tainted data from a user-supplied input to a sensitive sink? Can it reason about inter-procedural control flow? Does it understand when an apparent use-after-free is actually guarded by a flag set elsewhere?
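The guarded use-after-free case is worth spelling out with a toy sketch (again my own, hypothetical names). Read in isolation, log_payload looks like it might touch freed memory; read alongside release_buffer, the flag discipline makes it safe. Seeing that requires the inter-procedural view, not the local pattern.

```c
#include <stdio.h>
#include <stdlib.h>

/* A connection whose buffer may be released early, e.g. under memory
 * pressure. The `freed` flag records whether `data` is still valid. */
struct conn {
    char *data;
    int   freed;
};

void release_buffer(struct conn *c) {
    free(c->data);
    c->freed = 1;          /* every free site sets the flag */
}

/* Looks like a potential use-after-free if you only see this function,
 * since `data` may have been freed elsewhere. The guard on `freed`
 * makes it safe, but recognizing that requires cross-function reasoning. */
void log_payload(const struct conn *c) {
    if (c->freed)
        return;
    printf("payload: %s\n", c->data);
}
```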

In my experience working on VulnLLMEval, the answer is: frontier models (GPT-4, Claude 3.5) show genuine reasoning on simple cases, but degrade quickly with complexity, especially across function boundaries.

Why zero-shot is harder than it looks

In our recent preprint, we push models to detect vulnerabilities with no examples — just a description of the task and the code. What we find is that chain-of-thought prompting helps significantly: asking the model to explain what could go wrong before giving a verdict forces it to generate an intermediate representation that resembles a manual code review.

But the gains are uneven. Buffer overflows and format string bugs benefit from CoT. Logic errors and time-of-check-to-time-of-use (TOCTOU) races are nearly unchanged — the model still guesses.
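For context on that last class, here is the canonical TOCTOU shape in generic textbook form (not one of our benchmark cases). No single line looks wrong; the bug lives in the gap between two correct-looking calls, which is exactly the kind of thing a line-by-line narration tends to miss.

```c
#include <fcntl.h>
#include <unistd.h>

/* Time-of-check-to-time-of-use: the access() check and the open() happen
 * at different times. An attacker who can swap the path for a symlink in
 * between defeats the check, even though each call is individually fine. */
int open_if_allowed_racy(const char *path) {
    if (access(path, R_OK) != 0)    /* check */
        return -1;
    return open(path, O_RDONLY);    /* use: the path may now mean something else */
}

/* One common mitigation for the symlink variant: drop the separate check,
 * rely on the permission enforcement inside open(), and refuse symlinks. */
int open_if_allowed(const char *path) {
    return open(path, O_RDONLY | O_NOFOLLOW);
}
```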

What I think is actually happening

My working hypothesis is that LLMs are doing something like retrieval-augmented pattern matching at the semantic level. They have internalized enough code and enough descriptions of bug classes that they can produce fluent, plausible-sounding analysis. But fluency is not fidelity.

The models that do best are the ones that have been exposed to the most diverse code corpora — not because diversity is valuable in itself, but because diverse code forces the learned representation to generalize beyond surface syntax.

Where this leaves us

Vulnerability detection with LLMs is genuinely useful today as a triage tool: flag the top-N suspicious functions for a human reviewer. It is not yet reliable enough for gate-keeping (i.e., letting a merge through because the model says the code is safe). The false negative rate is too high.

Getting there probably requires better benchmarks, better training signal (span-level annotations rather than just binary labels), and models that can maintain context across an entire file or module.

That’s what I’m working on. More notes as I make progress.