Direct answer
An AI visibility measurement is reliable only if it is repeated. An AI rarely answers exactly the same way twice: querying a model a single time doesn't give you a measurement, it gives you a snapshot. Reliability comes from repetition (asking the same question many times), from computing the results mechanically, and from a stable protocol that is sealed and verifiable. Without these conditions, a "visibility score" isn't necessarily wrong; it's simply unreliable, because nothing tells you how stable it is.
The problem
It's the objection everyone has in mind, and it's a fair one: "if the AI changes its answer every time, how could you possibly measure anything?"
It's fair because it's true. Ask ChatGPT the same question twice and you'll often get two different answers. Put the same question to Claude and Gemini, and the gap between models can be enormous. It's the very nature of these models: they don't recite a stored truth, they generate a likely answer, and what's likely keeps shifting.
Most tools on the market dodge this difficulty instead of facing it. They ask the question once, grab an answer, turn it into a score, and display it to you as if it were carved in stone. That's comfortable to sell. It's misleading to use.
The idea to grasp
Let's use an image that speaks for itself. In a courtroom, a conviction should not rest on a single witness, however convincing. Testimonies are cross-checked, accounts are compared, the evidence is verified. A lone witness on the stand, however brilliant, does not establish a judicial truth.
An AI answer is that witness. Extraordinarily convincing, and alone on the stand. Taking it at its word means convicting on a single testimony.
The solution isn't to abandon measurement, it's to cross-check. In practice, measuring a brand's visibility reliably requires three steps:
- Repeat: ask the same query a large number of times (say 20), to see not one answer but the distribution of answers.
- Quantify the spread: if the brand appears in 7 answers out of 20, that's a presence of 35%. Because this rate comes from a sample of 20, it is an estimate with its own margin of error: a 95% Wilson interval for 7 out of 20 runs from roughly 18% to 57%.
- Compute mechanically: count the appearances, the ranks, the citations — without asking an AI to "judge" the result (which would add instability on top of instability).
A model's instability doesn't disappear. But once measured and quantified, it becomes useful information in its own right: a brand whose presence varies from 35% to 100% depending on how it's queried learns something important about its visibility.
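The three steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the actual instrument: `ask_model` is a hypothetical stand-in for whatever client queries the AI, the substring match is a deliberately naive way to detect a brand mention, and the Wilson score interval is one common way to state the uncertainty of a rate measured on a small sample (the text does not say which method is used).

```python
import math
from typing import Callable, List, Tuple


def collect_answers(ask_model: Callable[[str], str], query: str, n: int = 20) -> List[str]:
    """Ask the same query n times and keep every raw answer."""
    return [ask_model(query) for _ in range(n)]


def presence_rate(answers: List[str], brand: str) -> Tuple[int, float]:
    """Count answers mentioning the brand (naive case-insensitive substring match)."""
    hits = sum(1 for answer in answers if brand.lower() in answer.lower())
    return hits, hits / len(answers)


def wilson_interval(hits: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z=1.96 is roughly 95% coverage)."""
    p = hits / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Scripted fake model for illustration only: 7 of its 20 answers mention "Acme".
scripted = iter(["Acme is a good option."] * 7 + ["Consider other vendors."] * 13)
answers = collect_answers(lambda query: next(scripted), "best vendor?", n=20)

hits, rate = presence_rate(answers, "Acme")
low, high = wilson_interval(hits, len(answers))
print(f"{hits}/{len(answers)} = {rate:.0%}, 95% interval {low:.0%} to {high:.0%}")
# prints: 7/20 = 35%, 95% interval 18% to 57%
```

Nothing here asks a model to judge anything: the only AI call is the query itself, and everything after it is counting and arithmetic.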
What you hear everywhere
"AIs hallucinate and change all the time, so it's impossible to measure." A false shortcut. It's precisely because it varies that you have to measure rigorously, instead of looking just once. We measure the weather, which is unstable, perfectly well — because we repeat the observations and model the uncertainty.
"Just ask the question and see for yourself." See, yes. Measure, no. A manual query gives you an impression. An impression is not data you can base a budget decision on.
"Our score is reliable, it's updated continuously." Continuously updating a number that isn't repeated is just instability displayed in real time. Refresh frequency is no substitute for repeating the measurement.
And that's exactly where my stance comes in: trust neither the AI nor the vendor, only the facts. An AI answer is not a fact. A repeated, quantified measurement, with its uncertainty stated, is beginning to be one.
My take: instability is measured, not ignored
From here on, the register changes: we describe the instrument.
Making an AI visibility measurement reliable rests on explicit methodological choices:
- n=20: each query is asked twenty times in production, to capture the real distribution of answers, not an isolated case.
- Stable protocol: the measurement follows an identical protocol every time, validated by our convergence tests — that's what makes it reproducible.
- Mechanical computation: all aggregation (presence, rank, share of voice, stability) is done in code. The AI never scores anything.
- Measuring stability itself: the consistency of answers becomes an indicator in its own right (a brand can be highly present and highly unstable — that's information).
- Sealing: the report is dated and signed, so any later alteration is detectable and the result is defensible.
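As an illustration of what "done in code" can look like, here is a minimal sketch of the aggregation step. The metric definitions are assumptions made for the example, since the text names the indicators without defining them: each run is taken to be the ordered list of brands cited in one answer (each brand at most once per run), share of voice is the brand's mentions divided by all mentions, and `rank_spread` is one hypothetical stability indicator among many possible ones.

```python
from collections import Counter
from statistics import mean, pstdev
from typing import Dict, List


def aggregate(runs: List[List[str]], brand: str) -> Dict[str, float]:
    """Aggregate one brand's metrics over repeated runs, by counting only.

    Each run is the ordered list of brands cited in one answer, assumed to be
    extracted upstream by deterministic code rather than by an AI.
    """
    if not runs:
        raise ValueError("at least one run is required")
    # 1-based position of the brand in every run where it appears.
    ranks = [run.index(brand) + 1 for run in runs if brand in run]
    # Mentions of every brand across all runs.
    mentions = Counter(name for run in runs for name in run)
    total = sum(mentions.values())
    return {
        "presence": len(ranks) / len(runs),
        "mean_rank": mean(ranks) if ranks else float("nan"),
        "share_of_voice": mentions[brand] / total if total else 0.0,
        # Hypothetical stability indicator: population standard deviation of
        # the rank (0.0 means the brand always appears at the same position).
        "rank_spread": pstdev(ranks) if ranks else float("nan"),
    }


# Four illustrative runs (made-up brand names).
runs = [["Acme", "Beta"], ["Beta"], ["Beta", "Acme", "Gamma"], ["Acme"]]
print(aggregate(runs, "Acme"))
```

With these four runs, "Acme" is present in 3 of 4 answers (0.75) with a mean rank of about 1.33. A brand can score high on presence and still show a large `rank_spread`, which is the "present but unstable" case described above.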
In concrete terms, what makes the measurement reliable? The same question is asked n=20 times, under strictly identical conditions, following a stable protocol validated by our convergence tests. The rate obtained (35%, for example) is an observed fact: what the AI answered, at a given moment, across those 20 queries — not an impression drawn from a single attempt. The more consistent the answers, the more stable the measurement.
Why it matters. A number pulled from a single attempt passes itself off as a certainty. A value observed over n=20 queries, on the other hand, tells you what the AI actually answered — that's the difference between an impression and a measurement.
How it's guaranteed. Everything is aggregated mechanically in code, never by the AI. The result is then sealed (dated, HMAC signature) and publicly verifiable, which makes it tamper-evident and defensible.
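For readers who want to see what an HMAC seal involves, here is a minimal sketch using Python's standard library. The field names (`sealed_at`, `signature`), the JSON serialization and the choice of SHA-256 are assumptions for illustration; the text only states that the report is dated and HMAC-signed. Note that HMAC is a keyed, symmetric scheme: checking a signature requires the same secret key that produced it, so how verification is opened to the public is outside the scope of this sketch.

```python
import hashlib
import hmac
import json
from datetime import datetime, timezone


def _canonical_bytes(data: dict) -> bytes:
    """Serialize with sorted keys and fixed separators so equal content gives equal bytes."""
    return json.dumps(data, sort_keys=True, separators=(",", ":")).encode("utf-8")


def seal_report(report: dict, key: bytes) -> dict:
    """Add a UTC timestamp to the report, then sign it with HMAC-SHA256."""
    dated = dict(report, sealed_at=datetime.now(timezone.utc).isoformat())
    signature = hmac.new(key, _canonical_bytes(dated), hashlib.sha256).hexdigest()
    return {"report": dated, "signature": signature}


def verify_report(sealed: dict, key: bytes) -> bool:
    """Recompute the signature and compare it in constant time."""
    expected = hmac.new(key, _canonical_bytes(sealed["report"]), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, sealed["signature"])


# Illustrative values only; a real key would be long, random and kept secret.
key = b"example-secret-key"
sealed = seal_report({"brand": "Acme", "n": 20, "presence": 0.35}, key)
print(verify_report(sealed, key))   # True

sealed["report"]["presence"] = 0.90  # tamper with the sealed number
print(verify_report(sealed, key))   # False
```

The timestamp is included in the signed payload, so changing the date invalidates the seal just as changing a metric does.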