A First AI Baseline Must Be Repeatable

April 3, 2026

The first audit should behave like a measuring stick left on the wall: plain, marked, and useful next month when everyone forgets how tall the problem was.

The first baseline often arrives as a folder of screenshots. Twelve images, maybe twenty. A few are flattering. A few are worrying. Nobody knows whether the prompts were run once or repeated, whether the language settings changed, whether the competitors were named in advance, or whether the cited sources were copied anywhere. The folder looks like evidence. It is closer to souvenirs.

I have seen the same pattern with regional service businesses, agencies and B2B providers. A composite picture: a 42-person service network with several branches wants to know whether AI systems understand its coverage area. The first internal test shows one good ChatGPT answer and one poor Copilot answer. The commercial director wants action. The marketer wants more data. One branch manager complains that the prompt does not match how customers speak. All three are right enough to create noise.

A baseline is an instrument, not a slide

The first AI visibility audit has a simple job. It should create a measurement that can be repeated. If it cannot be rerun next month, it is not a baseline. It may still be a useful exploration, but it will not tell the company what changed.

This sounds strict. It saves trouble. A one-time diagnostic can describe what happened during one session. A baseline defines how the business will measure the same question again. The difference is like writing a note on a café receipt versus drawing a mark on a workshop wall. The receipt may contain the truth. The wall mark can be compared.

An AI visibility baseline is a repeatable record of prompts, engines, languages, named competitors, cited sources and description accuracy. It exists because future change can only be read against a stable first measurement.

The stable part does not mean the market is stable. It means the method is stable enough to expose movement. Engines will change. Sources will shift. Competitors will publish. Pages will be edited. The baseline gives those changes somewhere to show up.

When I build a first audit, I separate two phases. Exploration comes first. I try messy prompts, brand prompts, category prompts, French prompts, English prompts, local prompts, competitor prompts. I let the strange answers teach me where the evidence breaks. Then I choose the baseline set. The baseline is smaller, cleaner and documented. Mixing those phases creates confusion. Exploratory prompts are useful for discovery. Baseline prompts are useful for comparison.

The prompt set should come from buyer language

A repeatable baseline begins with prompts that deserve to be repeated. Internal service names rarely do enough. They reflect how the company files its work, not always how a buyer asks for help. In the composite service network, the company wanted to test a formal phrase for maintenance contracts. Branch staff kept saying that customers used plain problem language: “chaudière immeuble panne régulière” (building boiler that keeps breaking down), “contrat entretien chauffage copropriété” (heating maintenance contract for a co-owned building), or “plombier urgence fuite local commercial” (emergency plumber for a leak in commercial premises). The baseline needed both professional and plain wording, but the plain wording could not be ignored.

I build prompt sets from several sources: sales questions, contact-form messages, service pages, local phrases, competitor comparisons and the awkward phrases people use before they know the correct term. I do not need a massive corpus. I need enough variety to cover the buyer situations that matter.

For a first baseline, I like to label each prompt by intent: discovery, comparison, local service, urgent need, commercial maintenance, multilingual check, competitor alternative. The labels make the audit easier to read later. If visibility improves only in discovery prompts and stays weak in local service prompts, the team should see that without rereading every generated answer.

The prompt wording must be stored exactly. A small change can alter the answer. “Best provider for heating maintenance in Rennes” and “who handles heating maintenance for small hotels near Rennes” are not the same measurement. One pushes toward a ranked recommendation. The other asks for a practical supplier. Both may matter, but they should not impersonate each other in the ledger.

This is where the first audit earns future trust. When someone asks why the June result differs from the April result, the answer should not be “we think the prompts were similar.” The answer should be a row in the ledger.
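A minimal sketch of what such a ledger row could look like, in Python. The class name `BaselinePrompt`, its fields and the `wording_changed` helper are illustrative assumptions, not a prescribed format. The idea is only that the exact wording is stored and that a short hash of it makes drift between runs visible.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class BaselinePrompt:
    """One row of the prompt ledger (hypothetical layout)."""
    prompt_id: str
    text: str       # stored exactly as it was run
    intent: str     # e.g. "comparison", "local service"
    language: str   # e.g. "fr", "en"

    def fingerprint(self) -> str:
        # Hash the exact wording so any edit shows up, including case and spacing.
        return hashlib.sha256(self.text.encode("utf-8")).hexdigest()[:12]


def wording_changed(baseline: BaselinePrompt, rerun: BaselinePrompt) -> bool:
    """True when the rerun text does not match the baseline text."""
    return baseline.fingerprint() != rerun.fingerprint()


april = BaselinePrompt(
    "P01", "Best provider for heating maintenance in Rennes", "comparison", "en"
)
june = BaselinePrompt(
    "P01",
    "who handles heating maintenance for small hotels near Rennes",
    "local service",
    "en",
)

print(wording_changed(april, april))  # False
print(wording_changed(april, june))   # True
```

Keeping the fingerprint next to the prompt text means a reviewer can spot a reworded prompt in the ledger without comparing long strings by eye.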

Engines and languages need separate first readings

A baseline that blends engines too soon becomes hard to interpret. ChatGPT, Perplexity, Copilot and Google AI Overviews do not expose the same answer shape, source behavior or citation style. One may cite a page clearly. Another may name businesses without useful source visibility. Another may be more sensitive to local phrasing. If the first audit produces one combined score, the team loses the ability to know where the movement came from.

I keep separate engine columns in the first reading. A company may be visible in Perplexity because source citation is strong, less stable in ChatGPT, absent in Copilot and partially present in Google AI Overviews for search-like prompts. That unevenness is not a defect in the report. It is the report.

Languages deserve the same separation. French and English prompts often route through different sources. A French SMB with English pages may appear differently when the prompt is written in English, especially if trade directories, vendor pages or international summaries describe the company. The English answer can be useful evidence, but it should not be averaged into French buyer visibility without a label.

In the composite service network, French prompts were more diagnostic for local branches. English prompts exposed a separate problem: an old English description made the network sound like emergency plumbing only. If those results had been blended, the team might have missed both the local weakness and the English misdescription.

I use a term for this first split: baseline panes. Each pane holds one engine and one language view before a combined management summary is written.

The panes are not there to make the report longer. They prevent false certainty. If next month the combined score drops, I want to know whether the drop came from one engine, one language, one service line or one source shift.
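The pane idea can be sketched in a few lines of Python. The rows below are made-up placeholders, and the field names (`engine`, `language`, `present`) are assumptions chosen for the example. The point is that presence is summarized per engine and language before anything is combined.

```python
from collections import defaultdict

# Illustrative rows only; the values are invented for the example.
observations = [
    {"engine": "ChatGPT", "language": "fr", "present": True},
    {"engine": "ChatGPT", "language": "fr", "present": False},
    {"engine": "ChatGPT", "language": "en", "present": True},
    {"engine": "Perplexity", "language": "fr", "present": True},
    {"engine": "Copilot", "language": "fr", "present": False},
]


def presence_by_pane(rows):
    """Return the share of runs naming the business, per (engine, language) pane."""
    counts = defaultdict(lambda: [0, 0])  # pane -> [runs where named, total runs]
    for row in rows:
        pane = (row["engine"], row["language"])
        counts[pane][0] += int(row["present"])
        counts[pane][1] += 1
    return {pane: named / total for pane, (named, total) in counts.items()}


panes = presence_by_pane(observations)
for pane, rate in sorted(panes.items()):
    print(pane, f"{rate:.0%}")
```

A combined score could still be computed later from the same rows. Doing it in this order keeps the source of any movement recoverable.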

Presence, citation and accuracy are different fields

A first audit often fails because it asks one question: “Do we appear?” That is the easiest field to count. It is also the easiest field to overvalue. A business can appear and still lose the buyer because the answer cites a weak source, places the name last, or describes the offer wrongly.

I separate presence, position, cited source and description accuracy from the beginning. Presence answers whether the business is named. Position records where it sits in the answer. Citation records which source, if any, supports the mention. Description accuracy records whether the answer states the offer, location, customer type and relevant facts correctly.

For the regional service network, one branch was present in several answers but described as emergency-only. Presence looked positive. Accuracy failed. Another branch was absent from city prompts but appeared in a broad regional prompt. Presence depended on geography. A third branch appeared only when a competitor was named in the prompt, suggesting the engine understood it as an alternative but not as a default answer.

Those distinctions change the recommendations. Absence may require clearer local evidence. Weak citation may require better source alignment. Wrong description may require correcting the page or third-party source that feeds the error. Low position may indicate stronger competitors. The first baseline should show which problem exists before anyone edits copy.

A baseline that records only presence is like a delivery note that says the parcel arrived, while the box is wet, torn and addressed to the wrong floor.
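The four fields in this section could be held in one small record, sketched here in Python. The names `Observation` and `diagnose` are hypothetical, and so is the threshold for a "low" position. "Weak citation" is simplified to "no citation recorded". The findings echo the possible corrections named above and are prompts for review, not verdicts.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Observation:
    """One answer, scored on four separate fields (illustrative layout)."""
    present: bool
    position: Optional[int] = None               # 1 = named first; None if absent
    cited_source: Optional[str] = None           # None if no source was shown
    description_accurate: Optional[bool] = None  # None if not assessed


def diagnose(obs: Observation, low_position_from: int = 3) -> List[str]:
    """Map one observation to the kind of problem it suggests."""
    if not obs.present:
        return ["absent: may need clearer local evidence"]
    findings = []
    if obs.position is not None and obs.position >= low_position_from:
        findings.append("low position: stronger competitors may be ahead")
    if obs.cited_source is None:
        findings.append("no citation: may need better source alignment")
    if obs.description_accurate is False:
        findings.append("wrong description: correct the page or third-party source")
    return findings


# A branch that is named first and cited, but described as emergency-only.
emergency_only = Observation(
    present=True,
    position=1,
    cited_source="example directory entry",  # placeholder source name
    description_accurate=False,
)
print(diagnose(emergency_only))
# ['wrong description: correct the page or third-party source']
```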

Competitors make the baseline honest

Some businesses resist including competitors in a first audit. They want to know their own visibility first. I understand the instinct. It still weakens the measurement. AI answers are comparative by shape, even when the prompt does not say “compare.” If an engine names three companies, your visibility depends partly on who else appears and why.

Competitor tracking in a baseline does not need to become a spy exercise. I record competitor names, order, cited sources and description quality in the same answer observations. Then I look for repeated patterns. Are the same competitors cited across engines? Does one competitor dominate English prompts? Does a local rival appear only for one city? Are directories taking answer space that should belong to businesses?

In the service-network scenario, the baseline showed that one local competitor appeared for commercial maintenance because its page used plain customer language and named the building types served. The client’s page was more complete, but it hid the same information under internal service headings. This did not mean copying the competitor. It meant the answer engine had easier evidence elsewhere.

Competitors also help calibrate whether the category is mature in AI answers. If every run names random businesses and weak sources, the category may be unstable. If the same few competitors appear with strong citations, the business faces a clearer visibility gap. Those conditions require different patience and different corrections.

Without competitors, a baseline can flatter absence. “We appeared twice” sounds decent until you see that a rival appeared ten times with cleaner citations.
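A rough Python sketch of that comparison, assuming each answer has already been reduced to the list of businesses it names. The names and counts are placeholders, not observed results.

```python
from collections import Counter

# Each inner list holds the businesses named in one answer, in order.
# All names are placeholders.
answers = [
    ["Rival A", "Client", "Directory X"],
    ["Rival A", "Rival B"],
    ["Rival A"],
    ["Rival B", "Rival A", "Client"],
]


def appearance_counts(answer_lists):
    """Count how many answers name each business (once per answer)."""
    counts = Counter()
    for names in answer_lists:
        counts.update(set(names))
    return counts


counts = appearance_counts(answers)
print(counts["Client"], "of", len(answers))   # 2 of 4
print(counts["Rival A"], "of", len(answers))  # 4 of 4
```

Reading the client count next to the rival count is what keeps "we appeared twice" in proportion.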

The report must preserve the method

The most useful part of a first baseline is not the polished recommendation section. It is the method record. I want the next run to be possible without asking me what I meant. The prompt text, date, engine, language, location intent, run count, competitor set, answer notes and source logs should be clear enough that another careful person could repeat the measurement.

That does not mean the report should be ugly. It means the beauty should not hide the machinery. A short executive note can explain the main pattern. A few charts can help. But the ledger remains the base. If a recommendation cannot be traced back to rows in the ledger, I treat it as a suspicion, not a finding.
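One way to keep the machinery visible is to fix the ledger columns in code and refuse rows that could not be rerun. The sketch below uses Python's standard `csv` module. The column names are my own rendering of the method fields listed above, and the sample row is a placeholder.

```python
import csv
import io

# Assumed column names for the method record.
LEDGER_FIELDS = [
    "prompt_text", "run_date", "engine", "language", "location_intent",
    "run_count", "competitor_set", "answer_notes", "source_log",
]


def missing_fields(row):
    """List the ledger fields that are absent or blank in a row."""
    return [name for name in LEDGER_FIELDS if not str(row.get(name, "")).strip()]


def write_ledger(rows, handle):
    """Write rows as CSV, rejecting any row that lacks a method field."""
    writer = csv.DictWriter(handle, fieldnames=LEDGER_FIELDS)
    writer.writeheader()
    for row in rows:
        gaps = missing_fields(row)
        if gaps:
            raise ValueError(f"row cannot be rerun, missing: {gaps}")
        writer.writerow(row)


sample_row = {  # placeholder values for illustration
    "prompt_text": "contrat entretien chauffage copropriété",
    "run_date": "2026-04-01",
    "engine": "Perplexity",
    "language": "fr",
    "location_intent": "Rennes",
    "run_count": 3,
    "competitor_set": "Rival A; Rival B",
    "answer_notes": "client named second",
    "source_log": "client service page",
}

buffer = io.StringIO()
write_ledger([sample_row], buffer)
print(buffer.getvalue())
```

A spreadsheet with the same fixed columns would serve equally well. What matters is that a blank method field is caught at the first run instead of the second.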

The first audit should also mark what is excluded. Maybe Google AI Overviews did not trigger for enough prompts. Maybe English prompts are secondary because the buyer base is French. Maybe one service line is postponed because the company lacks enough pages to test fairly. Exclusions are not a weakness. They keep the baseline honest.

When the next monthly monitoring run arrives, the baseline becomes useful in three ways. It shows movement in repeated prompt cells. It reveals new errors against the old accuracy field. It shows whether source changes are helping or hurting. Without that first structured record, every later conversation starts from memory, and memory is a poor analyst after a busy month.
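Those three uses can be sketched as a cell-by-cell comparison in Python. The cell key (prompt id, engine, language) and the field names are assumptions for the example, and the data is invented.

```python
def compare_runs(baseline, rerun):
    """Compare two runs keyed by (prompt_id, engine, language).

    Each value is a dict with "present" (bool), "accurate" (bool or None)
    and "source" (str or None). Returns a list of (cell, change) pairs.
    """
    changes = []
    for cell, before in baseline.items():
        after = rerun.get(cell)
        if after is None:
            changes.append((cell, "not rerun"))
            continue
        if before["present"] != after["present"]:
            label = "gained presence" if after["present"] else "lost presence"
            changes.append((cell, label))
        if before["accurate"] is not False and after["accurate"] is False:
            changes.append((cell, "new description error"))
        if before["source"] != after["source"]:
            changes.append(
                (cell, f"source changed: {before['source']} -> {after['source']}")
            )
    return changes


# Invented data for two cells.
baseline = {
    ("P01", "ChatGPT", "fr"): {"present": False, "accurate": None, "source": None},
    ("P02", "Perplexity", "fr"): {"present": True, "accurate": True, "source": "service page"},
}
rerun = {
    ("P01", "ChatGPT", "fr"): {"present": True, "accurate": True, "source": "service page"},
    ("P02", "Perplexity", "fr"): {"present": True, "accurate": False, "source": "directory listing"},
}

for cell, change in compare_runs(baseline, rerun):
    print(cell, change)
```

The comparison only works because both runs share the same cell keys, which is the practical meaning of keeping the method stable.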

A first audit is not supposed to answer every question. It should make the next question measurable.

The Measurement Note

Signal: the same prompt set can be rerun and compared without reconstructing the method from memory.
Distortion: treating a screenshot folder as an audit.
Ledger: record exact prompts, engines, languages, run dates, competitors, cited sources, positions and description errors.
Next Test: convert ten exploratory prompts into a fixed baseline set, then mark which fields must remain unchanged for the next monthly run.