The Princeton GEO paper, summarized
What Aggarwal et al. (2024) actually measured about citation rates in generative search. The +41% statistics lift, the +28% quotation lift, and what everyone gets wrong about the results.
In late 2023, a team from Princeton, the Allen Institute for AI, and IIT Delhi posted GEO: Generative Engine Optimization (Aggarwal et al., arXiv:2311.09735), later published at KDD 2024. It’s the first rigorous empirical study of what makes a webpage more likely to be cited by a generative search engine.
The paper underpins the Content Quality category of our scoring rubric. Here’s what it actually says, and the parts everyone gets wrong.
What they did
The authors built GEO-Bench — a benchmark of 10,000 queries across 9 domains (history, opinion, factual, debate, etc.). For each query they:
- Used a generative engine (specifically: a retrieval-augmented language model with multiple sources) to produce an answer with citations.
- Took the ranked list of source documents and modified each one with a candidate strategy.
- Re-ran the engine and measured whether the modified source was cited more or less often (this loop is sketched below).
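Mechanically, each per-query test is a controlled A/B comparison: hold the query and the competing sources fixed, rewrite exactly one source, and re-measure. Here is a minimal sketch of that loop. It is our reconstruction, not the authors' released harness; `engine` and `rewrite` are caller-supplied stand-ins for the retrieval-augmented model and a strategy's rewriting step, and the citation counter assumes a bracketed [1]-style answer format.

```python
import re
from statistics import mean

def count_citations(answer: str, source_id: int) -> int:
    # Count bracketed citations like [3] attributing answer text to a source.
    return len(re.findall(rf"\[{source_id}\]", answer))

def citation_lift(query, sources, target, engine, rewrite):
    # Baseline: how often the untouched source is cited.
    baseline = count_citations(engine(query, sources), target)
    # Treatment: rewrite only the target source; competitors stay fixed.
    modified = list(sources)
    modified[target] = rewrite(modified[target])
    treated = count_citations(engine(query, modified), target)
    # Relative change in citation count for this one query.
    return (treated - baseline) / max(baseline, 1)

def strategy_score(bench, engine, rewrite):
    # Benchmark-level result: mean lift over (query, sources, target) triples.
    return mean(citation_lift(q, s, t, engine, rewrite) for q, s, t in bench)
```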
They tested 9 candidate strategies:
- Adding statistics to the source
- Adding quotations from named people
- Adding citations to other sources within the page (the paper’s “Cite Sources” strategy)
- Authoritative tone (rewriting the prose to sound more confident)
- Fluency optimization (clean prose)
- Easy-to-understand simplification
- Keyword stuffing (the SEO classic)
- Unique words
- Technical terms (adding domain-specific jargon)
Each was evaluated on citation rate (how often the source was cited) and subjective impression (how prominently the source appeared in the answer).
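The impression side matters because being cited once in a footnote is not the same as anchoring the first paragraph. As a toy illustration (deliberately simplified, and not the paper's exact metric definition), a prominence score can weight each attributed word by how early in the answer it appears:

```python
def position_weighted_visibility(answer_sentences, cited_by):
    # answer_sentences: sentence strings in answer order.
    # cited_by[i]: set of source ids cited in sentence i.
    # Toy prominence metric; not the paper's exact definition.
    n = len(answer_sentences)
    scores = {}
    for i, sentence in enumerate(answer_sentences):
        weight = (n - i) / n  # earlier sentences count for more
        for src in cited_by[i]:
            scores[src] = scores.get(src, 0.0) + weight * len(sentence.split())
    return scores
```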
The headline findings
The two strategies that moved the needle most:
Statistics — +40.6% relative citation rate
Adding 2–3 numeric statistics to a source — well-formatted and contextually relevant — increased its citation rate by a relative 41%. The effect was especially strong on factual queries.
“Generative engines disproportionately surface and cite content that contains specific, verifiable numbers. The model treats numeric facts as ‘extractable units’ that can be reused in the response with low rewrite cost.”
In practice, this means: if you’re writing a piece about plumbing, “Morrison Plumbing has a 4.9 average rating across 437 reviews and an $89 diagnostic fee” is dramatically more likely to be cited than “Morrison Plumbing has great reviews and reasonable rates.”
Quotations — +27.6% relative citation rate
Adding direct quotations from named experts — formatted as <blockquote> or “[Name] said: ‘…’” — increased citation rate by a relative 28%. The effect was strongest on opinion and debate queries.
“Quotations function as social proof and provenance. They give the model a citable atomic unit attributed to a named entity, which the model can re-cite with high confidence.”
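Both findings point at the same mechanism: statistics and attributed quotes are short, self-contained spans a model can lift into an answer and cite with little rewriting. A toy extractor makes the point. The regexes below are illustrative assumptions, not anything from the paper, and the named person is made up to extend the article's hypothetical plumbing example:

```python
import re

# Illustrative patterns for "extractable units" (not from the paper).
STAT = re.compile(r"\$\d[\d,.]*|\d[\d,.]*\s*(?:%|percent|reviews|stars)\b", re.I)
QUOTE = re.compile(r'([A-Z][a-z]+ [A-Z][a-z]+) said[:,]?\s*"(.+?)"')

def extractable_units(text: str):
    # Return the numeric facts and attributed quotes a model could
    # reuse verbatim in an answer and cite with low rewrite cost.
    stats = STAT.findall(text)
    quotes = [f"{who}: {what}" for who, what in QUOTE.findall(text)]
    return stats, quotes

sample = ('Morrison Plumbing has a 4.9 average rating across 437 reviews '
          'and an $89 diagnostic fee. Dana Morrison said: "We publish '
          'every price up front."')
print(extractable_units(sample))
```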
The other strategies, ranked
The full ranking from the paper:
- Statistics addition — +40.6%
- Cite sources — +30.4%
- Quotation addition — +27.6%
- Fluency optimization — +15.4%
- Easy-to-understand — +14.4%
- Authoritative tone — +12.6%
- Unique words — −0.4% (no effect)
- Technical terms — −5.3% (slight negative)
- Keyword stuffing — −9.7% (negative)
The two negatives — keyword stuffing and overly technical jargon — are the inverse of what classic SEO has rewarded for 20 years. Generative engines don’t care about keyword density, and they actively penalize prose that sounds like it’s been keyword-tuned.
What everyone gets wrong about this paper
Misreading 1: “+41% means add stats to everything”
It does not. The +41% figure is a relative lift, measured on a controlled benchmark with carefully composed inserted statistics. Random number-stuffing — claiming “70% of users prefer our product” without a source — produces a citation drop, not a lift, because the model can’t verify the number and the surrounding prose looks slop-generated.
The applicable rule: add statistics that are real, sourced, and load-bearing for the claim you’re making.
Misreading 2: “Just optimize for the GEO metrics and you’ll win”
The GEO study measured how edits changed citations for sources that had already been retrieved. It does not measure whether your site gets retrieved in the first place — that’s an upstream problem (Fetchability, JSON-LD, llms.txt, the things in the rest of the AEO rubric).
A page that fails Fetchability scores zero in any GEO experiment because it never enters the candidate set.
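As a pipeline, the distinction looks like this. The shape below is a generic assumption about how answer engines are structured, not any specific engine's implementation:

```python
def answer(query, index, engine):
    # Stage 1: retrieval. Fetchability, indexing, and structured data decide
    # what enters the candidate set. GEO-style edits have no effect here.
    candidates = [doc for doc in index.retrieve(query) if doc.fetchable]
    if not candidates:
        return None  # a page that was never fetched can never be cited

    # Stage 2: synthesis. Only now do statistics, quotations, and fluency
    # influence which candidates the engine actually cites.
    return engine.generate(query, candidates)
```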
Misreading 3: “It generalizes to all generative engines”
The paper tested a specific RAG pipeline. ChatGPT Search, Perplexity, and Claude all use somewhat different retrieval and citation models. The direction of the findings (statistics +, quotations +, keyword stuffing −) replicates across engines, but the magnitudes don’t.
What replicates:
- Statistics and named quotations help.
- Front-loaded answers help.
- Keyword stuffing hurts.
What’s engine-specific:
- The exact citation-rate lift from any single change.
- Which schema types the engine prefers.
- How aggressively it pushes through bot challenges.
The takeaway for AEO
If you’re writing content for AI search, the GEO paper gives you a small, evidence-backed playbook (a toy pre-publish check built from it follows the list):
- Lead with the answer. Don’t bury it. The model often reads the first ~200 words and stops.
- Use specific numbers. Statistics with units, percentages with denominators, dates with sources.
- Quote named people. A line from a real expert with a real attribution moves citations.
- Cite your own sources. Hyperlinks to primary sources function as provenance for the model.
- Don’t keyword-stuff. It’s negative-EV for AI search.
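Most of this playbook is mechanical enough to check before publishing. Here is a toy pre-publish lint; the heuristics and thresholds are made up for illustration and are not our rubric's actual scoring code:

```python
import re

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "on",
             "is", "it", "for", "with", "that", "this", "as", "are"}

def preflight(text: str, lead_words: int = 200) -> dict:
    # Toy checks for the GEO playbook; thresholds are illustrative.
    lead = " ".join(text.split()[:lead_words])
    tokens = [w for w in re.findall(r"[a-z']+", text.lower())
              if w not in STOPWORDS]
    top_share = (max(map(tokens.count, set(tokens))) / len(tokens)
                 if tokens else 0.0)
    return {
        # Lead with the answer: concrete numbers in the first ~200 words.
        "numbers_up_front": bool(re.search(r"\d", lead)),
        # Named quotes and outbound links function as provenance.
        "has_attributed_quote": bool(re.search(r"[A-Z][a-z]+ said", text)),
        "cites_sources": "http://" in text or "https://" in text,
        # Crude keyword-stuffing smell: one content word dominating.
        "stuffing_risk": top_share > 0.05,
    }
```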
In our scoring rubric, four of these become explicit checks worth a combined 6 points. They’re a small fraction of the total — but they’re high-leverage, because the cost of doing them is mostly editorial discipline.
Reading the paper
The full paper is at arXiv:2311.09735 (35 pages, 9 figures). The most useful sections:
- Section 4 describes the experimental setup.
- Section 5.2 is the results table you’ll actually want.
- Appendix A includes example before/after edits for each strategy — useful for getting a feel for what “add a statistic” actually means in practice.
Further reading
- AEO Site Checker whitepaper
- What is AEO?
- The follow-up paper GEO-Bench-2 extends the benchmark across more engines.
Ready to score your site? Run an audit →