# AEO Site Checker — full reference

> A free Lighthouse-style auditor that grades any URL on **Answer Engine Optimization** — how well a page is set up to be cited by AI search engines (ChatGPT, Claude, Perplexity, Google AI Overviews). Scores 0–100 across 27 weighted checks in five categories.

This document is a self-contained reference. It describes every check the auditor runs, the exact weight of each, the API surface, and the limitations. AI agents and humans alike should be able to read it end-to-end without needing to clone the repository.

Public URL: https://aeositechecker.huzi.party

---

## What "AEO" means in this tool

When somebody asks ChatGPT (or Claude, Perplexity, or Google AI Overviews) for a recommendation, the response includes one to three cited sources. The model picks those citations from a different mix of signals than Google's PageRank-era SEO checklist:

- **Crawler reachability.** The model has to be able to fetch the page in the first place. Cloudflare bot challenges, 403 walls, and JS-only single-page apps are the most common reasons a site is invisible in AI answers despite ranking on Google.
- **Permission to crawl.** The AI bot has to be allowed in `robots.txt`. Many sites accidentally block `OAI-SearchBot`, `Claude-User`, or `PerplexityBot` while leaving Googlebot allowed.
- **Parseable structure.** Semantic HTML, JSON-LD, and Mozilla-Readability-friendly article structure are easier for an LLM to extract from.
- **The `/llms.txt` convention.** A small markdown file at the site root that gives an AI agent a curated index of important pages — like `sitemap.xml`, but optimized for LLMs to read rather than for crawlers to follow.
- **Content shape.** Princeton's GEO paper measured what actually moves the needle in generative answers: front-loaded direct answers, statistics density (+41% citation lift), quotations (+28%), and citations to authorities. These checks are heuristic but directionally correct.
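To make the crawl-permission signal concrete, here is a minimal `robots.txt` sketch that explicitly allows some of the AI crawlers this tool checks for. The bot names come from the `robots_ai_bots` check; the sitemap URL is a placeholder.

```
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Claude-User
Allow: /

User-agent: PerplexityBot
Allow: /

Sitemap: https://example.com/sitemap.xml
```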
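And a minimal `/llms.txt` of the shape the `aeo_llms_txt` check rewards — an H1 plus at least one markdown link-list entry. The site name, description, and URLs are placeholders, not a prescribed format beyond what https://llmstxt.org/ specifies.

```markdown
# Example Site

> One-sentence description of what the site is about.

## Pages

- [Pricing](https://example.com/pricing): plans and what they cost
- [Docs](https://example.com/docs): product documentation
```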
The tool produces a single 0–100 score and a letter grade so you can track changes over time.

---

## Scoring rubric: 100 points across 27 checks in 5 categories (+1 type-conditional)

Letter grade: **A** ≥ 90 · **B** ≥ 80 · **C** ≥ 70 · **D** ≥ 60 · **F** otherwise.

### Fetchability — 43 points

| Check ID | Weight | What it measures |
|---|---:|---|
| `fetch_direct` | 18 | Direct (no-proxy) fetch returns 2xx/3xx with no bot challenge. This is the heaviest single check because it is the one an LLM crawler has to pass too. |
| `https` | 4 | URL uses HTTPS. |
| `page_size` | 3 | HTML body is under ~2.5 MB. |
| `robots_ai_bots` | 10 | `robots.txt` allows the critical AI bots: `ChatGPT-User`, `OAI-SearchBot`, `Claude-User`, `Claude-SearchBot`, `PerplexityBot`, `Perplexity-User`, `Google-Extended`. Each blocked critical bot costs 3 points. |
| `ssr_content` | 8 | Body has at least 50 words of server-rendered text, not an empty JS-only app shell. |
### Core SEO — 21 points

| Check ID | Weight | What it measures |
|---|---:|---|
| `seo_title` | 4 | `<title>` between 25 and 65 characters. Half credit for any non-empty title. |
| `seo_meta_description` | 4 | `<meta name="description">` between 80 and 175 characters. Half credit for any non-empty description. |
| `seo_canonical` | 3 | `<link rel="canonical">` present. |
| `seo_opengraph` | 3 | `og:title`, `og:description`, `og:image`. One point per tag, max 3. |
| `seo_twitter_card` | 2 | `<meta name="twitter:card">` present. |
| `seo_lang` | 1 | `<html lang="…">` set. |
| `sitemap` | 4 | `sitemap.xml` (or `sitemap-index.xml`, `sitemap_index.xml`, `sitemap-0.xml`) reachable and well-formed. Full credit only if it is also referenced in `robots.txt`. |

### Semantic HTML — 13 points

| Check ID | Weight | What it measures |
|---|---:|---|
| `semantic_h1` | 3 | Exactly one `<h1>`. Partial credit if more than one. |
| `semantic_heading_hierarchy` | 3 | Headings progress by at most one level (no `<h1>` followed directly by an `<h3>`). |
| `semantic_landmarks` | 4 | Page has `<main>` plus (`<header>` or `<nav>`) plus `<footer>`. |
| `semantic_alt_text` | 3 | At least 95% of `<img>` elements have an `alt` attribute (an empty `alt=""` counts as decorative and is OK). |

### Answer Engine signals — 22 points

| Check ID | Weight | What it measures |
|---|---:|---|
| `aeo_llms_txt` | 5 | `/llms.txt` exists, has an H1, and has at least one markdown link-list entry. Partial credit (2 pts) for malformed files. |
| `aeo_llms_full_txt` | 1 | `/llms-full.txt` exists with at least 200 bytes of body. Bonus, not required. |
| `aeo_structured_data` | 6 | JSON-LD with high-impact types: `Article` / `LocalBusiness` / `Organization` (4 pts), `FAQPage` / `HowTo` (1), `BreadcrumbList` (0.5), `WebSite` (0.5), capped at 6. |
| `aeo_authority` | 3 | Author byline (article `author` or `Person` schema) plus `Organization.sameAs` links pointing to Wikipedia or social profiles. |
| `aeo_freshness` | 2 | `dateModified` in structured data is within the last 12 months. Half credit between 12 and 24 months; zero past 24 months. Default partial credit (1 pt) if no `dateModified` is present (acceptable for evergreen pages). |
| `aeo_readability` | 5 | Mozilla Readability extracts a readable article. 5 pts at 1500+ characters, 4 at 500+, 2 at 1+, 0 if Readability returns nothing. |

### Content quality — 6 points

| Check ID | Weight | What it measures |
|---|---:|---|
| `content_front_loaded_answer` | 2 | Opening paragraph is 25–220 words and doesn't start with a fluff opener like "In today's…", "When it comes to…", or "Welcome to…". Half credit for any 15+ word opener. |
| `content_question_headings` | 1 | At least one `<h2>`/`<h3>` shaped like a user question (starts with what / how / why / when / etc., ends in `?`). |
| `content_statistics` | 2 | Body text contains numerals plus units (`%`, `million`, `reviews`, `bedrooms`, `years`, etc.). 2 pts at 4+ matches, 1 at 2+. |
| `content_quotations` | 1 | Page contains at least one `<blockquote>` or `<q>` element. |

---

## How a single audit runs

1. **`smartFetch(url)`**: undici GET with a desktop-Chrome user agent, follows up to 5 redirects, decompresses gzip/brotli/deflate manually (undici's `request()` doesn't auto-decompress). Reads up to 4 MB of body.
2. **Bot-block detection**: looks for a `cf-mitigated` header, HTTP 403/503/429 from a Cloudflare-fronted host, and a battery of body signatures (`Just a moment...`, `__cf_chl_opt`, `cf-browser-verification`, `_pxCaptcha`, etc.). If any match, the fetch is marked `blocked`.
3. **Fallback**: if the direct fetch fails or is blocked, retry through BrightData Web Unlocker. The audit records which mode succeeded; sites that needed the unlocker lose the major `fetch_direct` credit.
4. **Run all checks in parallel**: `robots.txt`, `llms.txt`, `llms-full.txt`, and `sitemap.xml` are fetched concurrently. The HTML is parsed once with cheerio and once with jsdom (for Readability).
5. **Score**: total earned ÷ total weight, normalized to 0–100, with the letter grade applied.
6. **Persist**: every audit is saved to SQLite with a cuid, so results have a shareable permalink at `/audit/:id`.

---

## API

| Method | Path | Body / params | Returns |
|---|---|---|---|
| `GET` | `/api/health` | — | `{ status: "ok" }` |
| `POST` | `/api/audits` | `{ "url": "https://…" }` | Full `AuditResult` with permalink `id` |
| `GET` | `/api/audits/:id` | — | A previously saved audit |
| `GET` | `/api/audits/recent` | — | Last 25 audits (id, url, score, grade, mode, createdAt) |

The hosted instance is open and unauthenticated. There is no rate limiting at the application layer.

---

## Limitations and non-goals

- **No JavaScript rendering.** The auditor reads server-rendered HTML only. Sites that need JS to populate content fail `ssr_content`; this is intentional, because LLM crawlers also don't run JS reliably. If you need a Lighthouse-style headless-browser audit, this isn't that tool.
- **Single URL per audit.** No site-wide crawl, no `<a>` link following.
- **Heuristic content checks.** "Front-loaded answer" and "question-shaped heading" are regex heuristics, not language understanding. They're useful directional signals, not absolute truth.
- **No historical comparisons.** Saved audits are independent rows; there is no diff view.

---

## References

- llms.txt specification — https://llmstxt.org/
- Princeton GEO paper (Generative Engine Optimization) — https://arxiv.org/abs/2311.09735
- Cloudflare bot-signal reference — https://developers.cloudflare.com/cloudflare-challenges/challenge-types/challenge-pages/detect-response/
- OpenAI crawler list — https://platform.openai.com/docs/bots
- Anthropic crawler list — https://docs.anthropic.com/en/docs/agents-and-tools/web-search-and-fetch
- Perplexity crawler list — https://docs.perplexity.ai/guides/bots
- Google AI crawlers — https://developers.google.com/search/docs/crawling-indexing/google-controlled-crawlers
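The score step and the grade cutoffs from the rubric can be sketched in a few lines. This is an illustrative sketch, not the repository's code; the function name `scoreToGrade` is hypothetical.

```javascript
// Illustrative sketch of the documented scoring step (not the actual codebase):
// earned points are normalized against the total available weight, then mapped
// to a letter grade using the rubric's cutoffs (A >= 90, B >= 80, C >= 70, D >= 60).
function scoreToGrade(earned, totalWeight) {
  const score = Math.round((earned / totalWeight) * 100); // normalize to 0-100
  if (score >= 90) return { score, grade: "A" };
  if (score >= 80) return { score, grade: "B" };
  if (score >= 70) return { score, grade: "C" };
  if (score >= 60) return { score, grade: "D" };
  return { score, grade: "F" };
}

// e.g. 88 earned out of 105 possible points -> score 84, grade B
console.log(scoreToGrade(88, 105));
```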