Different AI Engines Need Separate Visibility Ledgers

Engines do not merely answer in different voices. They expose different evidence habits. Fold them into one score too early and you sand away the very marks that explain why a business appears, disappears, or gets described badly.

A facilities manager asks a practical question: “Which company handles heating maintenance for several sites in western France?” In one answer engine, a regional plumbing and heating network appears near the top. In another, it is absent. In a third, it appears only after a directory page frames it as an emergency plumber. No page on the company website changed between those checks. The prompt did not become clever. The engine changed the path.

That picture is a composite scenario from local-service measurement work, and it contains the small imperfection that keeps returning in real ledgers: the company is known, but known through the wrong doorway. For one city, the answer looks acceptable. For a nearby town, a competitor wins. For English wording, old evidence becomes louder. If the team asks only, “Are we visible in AI?” the answer is mush. If it asks, “Where does each engine see us, cite us, and misread us?” the work begins.

An engine column is not a decoration

Many AI visibility reports put engine names across the top because the table looks more complete that way. ChatGPT, Perplexity, Copilot, Google AI Overviews. Four labels, some checks, perhaps a score. The format is familiar, and that familiarity is part of the risk. A column should not exist merely to look diligent. It should preserve a difference that changes the recommendation.

Different answer engines may retrieve, summarize, cite, and rank evidence in different ways. I use the word “may” deliberately, because these systems change and because outside observers rarely see every internal mechanism. Still, from a measurement standpoint, the result is plain enough: the same business can receive different verdicts across engines for the same buyer prompt. That difference is not noise to erase. It is evidence to read.

A separate engine ledger is a measurement table that keeps each answer engine’s presence, position, cited source and description accuracy apart, because mixed scores hide the source path that produced the answer. That is my working definition. It is less exciting than a dashboard, but more faithful to what the buyer sees.
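For readers who keep their ledgers in code rather than a spreadsheet, the definition above can be sketched as a small record type. This is a minimal sketch, not a prescribed schema: the class name `LedgerRow`, the helper `split_by_engine`, and the field types are my own assumptions. Only the field list itself (engine, prompt family, language, location, presence, position, cited source, description accuracy) comes from the ledger described in this piece.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LedgerRow:
    """One observed answer for one prompt on one engine (hypothetical schema)."""
    engine: str                           # e.g. "ChatGPT", "Perplexity"
    prompt_family: str                    # auditor's label, e.g. "multi-site maintenance"
    language: str                         # e.g. "fr", "en"
    location: str                         # city, town, or region named in the prompt
    present: bool                         # was the business named at all?
    position: Optional[int]               # 1 = first business named; None when absent
    cited_source: Optional[str]           # visible source label or URL, if any
    description_accurate: Optional[bool]  # None when absent or not yet judged


def split_by_engine(rows):
    """Return one ledger (a list of rows) per engine, never a merged list."""
    ledgers = defaultdict(list)
    for row in rows:
        ledgers[row.engine].append(row)
    return dict(ledgers)
```

The design choice worth noting is that `split_by_engine` returns a dictionary keyed by engine, so any later averaging has to be written on purpose rather than happening by default.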

When the regional service network appears in one engine and not another, I do not immediately ask which engine is “right.” I ask what each engine had available, what it seemed to reward, and which source carried the description. The answer might be that the company’s own maintenance pages are thin, while directory pages about emergency plumbing are plentiful. It might be that the English version of the site lacks local coverage. It might be that competitors have better category pages. The engine split gives the investigation its first set of fingerprints.

ChatGPT may show recognition before source strength

In many measurement runs, ChatGPT can produce a fluent answer that feels like recognition. It may name a business, place it in a plausible category, and explain it in smooth prose. Smoothness is seductive. A buyer may trust it. A marketer may screenshot it. I still want the ledger fields.

The question is not whether the answer reads well. The question is whether the business appears repeatedly, in the right place, with an accurate description and a defensible evidence path where sources are visible or inferable. ChatGPT-style answers can sometimes make weak evidence sound stronger than it is. That is not a moral failing of the tool. It is a measurement hazard.

In the composite heating-network case, ChatGPT might name the company for “heating maintenance western France” and describe it as serving local branches. Good. Another run might narrow the description to emergency repairs because that wording is common in public listings. A third might omit the commercial maintenance angle. If the team only records presence, ChatGPT looks friendly. If it records description accuracy, the pattern becomes less comfortable.
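The gap between presence and description accuracy is easy to show with arithmetic. The sketch below is illustrative only: the function name `run_rates` and the tuple layout are assumptions of mine, and the three runs simply mirror the composite case above (named each time, described correctly once).

```python
def run_rates(runs):
    """Summarize repeated runs of one prompt on one engine.

    runs: list of (present, description_accurate) tuples, one per run.
    Accuracy is computed only over the runs where the business was named.
    """
    total = len(runs)
    if total == 0:
        return {"presence_rate": None, "accuracy_rate": None}
    named = [run for run in runs if run[0]]
    accurate = [run for run in named if run[1]]
    return {
        "presence_rate": len(named) / total,
        "accuracy_rate": len(accurate) / len(named) if named else None,
    }


# Composite case: named in all three runs, but only the first description
# keeps the commercial maintenance angle intact.
composite_runs = [(True, True), (True, False), (True, False)]
rates = run_rates(composite_runs)
```

Here the presence rate is three out of three while the accuracy rate is one out of three, which is the less comfortable pattern a presence-only log would hide.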

This is why I do not let one engine’s fluency set the tone for the whole audit. A clean paragraph can cover a weak source seam. The ledger has to keep the seam visible.

Perplexity often makes source habits easier to inspect

Perplexity is useful in measurement partly because its answer format tends to foreground sources more clearly. That does not make every answer correct. It makes certain failures easier to see. When a business is named, I can often inspect which cited page carried the weight and whether that page deserves it.

For a French SMB, this source visibility can be bracing. The team may discover that the engine is not reading the carefully edited service page they expected. It may be leaning on a directory, an old partner listing, a trade article, a review platform, or a thin branch page. Sometimes the cited source is not wrong; it is merely too narrow. A directory page that emphasizes emergency plumbing can feed an answer that underplays planned maintenance. The business appears, yet the commercial category is bent.

This creates a useful kind of discomfort. The team stops arguing about the generated sentence and starts reading the cited evidence. In my ledgers, Perplexity rows often become source-tracing rows. Which source is cited? What claim does it support? Does the page mention location, service, customer type, and current offer? Is the source controlled by the company, influenced by the company, or external? Those distinctions shape the correction loop.

I call this failure “citation tilt.” Citation tilt occurs when an engine names the right business while leaning on a source that tilts the description toward the wrong service, location, or customer type. The term is clumsy enough to stick. It also prevents the lazy reading that any citation is a good citation.
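A citation-tilt check can be sketched as a comparison between what the company actually offers and what the cited page emphasizes. Everything here is an assumption for illustration: the `Profile` type, the `citation_tilt` function, and the example values are mine, and the source profile is meant to be filled in by hand after an auditor reads the cited page. The exact-match comparison is a crude stand-in for that human judgment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    """Hand-labeled emphasis of an offer or a cited page (hypothetical type)."""
    service: str
    location: str
    customer_type: str


def citation_tilt(offer, source):
    """Return the dimensions where the cited source leans away from the offer."""
    tilted = []
    for field in ("service", "location", "customer_type"):
        offered = getattr(offer, field).strip().lower()
        cited = getattr(source, field).strip().lower()
        if offered != cited:
            tilted.append(field)
    return tilted


# Illustrative labels only; none of these values are measured data.
offer = Profile("planned heating maintenance", "western France", "multi-site businesses")
directory_page = Profile("emergency plumbing", "western France", "households")
tilt = citation_tilt(offer, directory_page)
```

With these labels the check reports a tilt on service and customer type but not on location: the right business is named, yet the description is bent by the source that carried it.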

Copilot can expose business-context gaps

Copilot often sits inside a different user routine from a standalone answer engine. People may encounter it while searching, working, comparing, or asking a practical question with a web habit already in place. For measurement, I treat it as its own environment, not as a second flavor of ChatGPT. The same prompt can surface a different mix of names and evidence.

The interesting part is how quickly business-context gaps show. A company with decent category evidence may still lose if its public pages do not connect service, geography, and buyer type clearly. A regional heating network that says plenty about emergency repairs and general plumbing may be less legible for “maintenance across six branches.” Copilot may return larger competitors, directories, or pages that state the multi-site context more plainly.

I do not claim this as a fixed rule about the product. It is an observation from ledgers: when the public evidence is scattered, some engines punish the scatter more visibly than others. The answer may not be wrong. It may be selecting the source that best matches the question. If your best page never says the buyer’s problem in a complete way, the engine has to build the bridge itself. Sometimes it chooses another bridge.

This is why separate ledgers matter. If Copilot is weak on multi-site prompts while ChatGPT is acceptable on general category prompts, the fix is not “do more AI visibility.” The fix may be a clearer source page that states the maintenance service, coverage area, branch logic, and customer type in language close to the prompt. That recommendation comes from the split.

Google AI Overviews belong to the search layer

Google AI Overviews should be measured with respect for their search setting. They are not just another chat box. They appear in a search environment where query wording, search results, local signals, and page eligibility all matter. For a French SMB, that makes them especially important and especially awkward to compare directly with chat engines.

A buyer using Google may phrase the question differently from a buyer inside a chat interface. The search may carry local assumptions. The answer overview may appear for some queries and not for others. It may cite pages that already match the shape of the query. From the outside, the safest approach is humble: log what appears, when it appears, which sources are shown, and how the business is described. Do not force the result into the same interpretation used for a chat answer.

In the composite service network, an AI Overview for a local query might privilege pages with strong local relevance. A broader maintenance question may show no overview or cite a competitor’s better structured service page. A nearby-town query may reveal weak branch evidence. These are search-layer findings. They should sit in their own ledger before anyone combines them with ChatGPT or Perplexity.

The unpleasant truth is that a business can be visible in conversational answers and still weak in AI answers attached to search. That split matters for French SMBs because many buyers do not wake up deciding which AI surface they are using. They ask where they already are.

Combined scores should arrive late

A combined score is not forbidden. Managers need summary views. Agencies need reporting language. Owners need to know whether the situation is improving. The danger is building the score before the separate engine patterns have been read.

I prefer a two-step rhythm. First, inspect each engine ledger on its own terms: presence, position, cited source, description accuracy, prompt family, language, and location. Then, after the differences are visible, create a summary that does not erase them. The summary might say that visibility is strong in Perplexity because its answers consistently cite the company’s maintenance pages, moderate in ChatGPT because descriptions drift, weak in Copilot for multi-location prompts, and unproven in Google AI Overviews for nearby towns. That is a useful executive view because the joints remain visible.

The worst summary says “AI visibility: 62” and leaves everyone guessing. Sixty-two of what? Which engine failed? Which prompt family? Which source? Which language? Which competitor took the slot? A score without a readable underside is a sealed box with a number painted on it.
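The two-step rhythm can be sketched in code so that the combined number never travels without its underside. This is one possible shape, under stated assumptions: the function names, the dictionary keys, and the choice of an unweighted mean of presence rates are all mine, not a standard formula.

```python
from collections import defaultdict


def engine_summaries(rows):
    """Step one: per-engine, per-prompt-family rates.

    rows: dicts with keys engine, prompt_family, present, description_accurate.
    """
    grouped = defaultdict(list)
    for row in rows:
        grouped[(row["engine"], row["prompt_family"])].append(row)

    summary = {}
    for (engine, family), group in sorted(grouped.items()):
        named = [r for r in group if r["present"]]
        accurate = [r for r in named if r["description_accurate"]]
        summary.setdefault(engine, {})[family] = {
            "runs": len(group),
            "presence_rate": len(named) / len(group),
            "accuracy_rate": len(accurate) / len(named) if named else None,
        }
    return summary


def combined_view(summary):
    """Step two: an overall figure that always carries its breakdown."""
    rates = [
        cell["presence_rate"]
        for families in summary.values()
        for cell in families.values()
    ]
    # Unweighted mean across engine/family cells: an assumption, not a rule.
    overall = sum(rates) / len(rates) if rates else None
    return {"overall_presence": overall, "by_engine": summary}
```

Because `combined_view` takes the per-engine summary as its input, the overall figure cannot be computed before the separate patterns exist, and anyone reading it can open `by_engine` to see which engine and which prompt family pulled it down.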

For the regional service network, the separate ledger might reveal that the company does not need a full website rewrite. It may need stronger maintenance evidence for each location, clearer French wording around commercial contracts, and a correction of directory profiles that overstate emergency work. Another engine may require nothing yet; it already reads the firm correctly. Separate ledgers prevent the team from sanding every surface because one corner is rough.

The point is diagnosis, not engine gossip

I sometimes hear teams talk about engines as if choosing a favorite matters. “Perplexity likes us.” “Copilot ignores us.” “ChatGPT understands us.” Those phrases are understandable, but they blur the work. The engine is not a colleague with a preference. It is an answer surface producing traces from available evidence, query wording, and its own changing system.

The measurement question is more practical: what does each surface repeat often enough for a buyer to see, and which source path explains it? That question keeps the audit inside business reality. The company cannot control every model shift. It can improve public evidence, clarify pages, strengthen sources, watch competitors, and retest.

A separate engine ledger gives you a way to do that without superstition. It shows where the business is named, where it is absent, where it is cited badly, and where it is described through a source that no longer matches the offer. It also shows where no action is needed. That last point matters. Measurement should stop unnecessary work as often as it starts necessary work.

The Measurement Note — Signal: each engine can reveal a different evidence path for the same buyer prompt. Distortion: averaging ChatGPT, Perplexity, Copilot and Google AI Overviews before reading their separate failures. Ledger: record engine, prompt family, language, location, presence, position, cited source and description accuracy. Next Test: run one customer prompt across four engines and write the source difference before assigning any combined score.
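The Next Test can be sketched as a plain set comparison: for one customer prompt, which cited sources are shared across engines and which belong to a single engine. The function name `source_differences` and the source labels are hypothetical, and the sketch assumes the auditor has already normalized source names by hand. An engine that shows no sources for the prompt is better left out of the input than entered as an empty list, since an empty list would empty the shared set.

```python
def source_differences(cited_by_engine):
    """Compare cited sources for one prompt across engines.

    cited_by_engine: dict mapping engine name to an iterable of source labels.
    Returns the sources every engine cited and the sources unique to each engine.
    """
    sets = {engine: set(sources) for engine, sources in cited_by_engine.items()}
    shared = set.intersection(*sets.values()) if sets else set()

    only = {}
    for engine, sources in sets.items():
        others = set().union(*(s for e, s in sets.items() if e != engine))
        only[engine] = sorted(sources - others)
    return {"shared": sorted(shared), "only": only}


# Illustrative labels only; not measured data.
cited = {
    "Perplexity": ["company maintenance page", "directory listing"],
    "Copilot": ["directory listing", "competitor service page"],
    "Google AI Overviews": ["directory listing"],
}
diff = source_differences(cited)
```

In this made-up run the directory listing is the one source every engine shares, which is exactly the kind of sentence worth writing down before any combined score is assigned.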