Are AI visibility measurements reliable?

Clearing up doubts about the reliability of AI visibility measurements.

Direct answer

An AI visibility measurement is reliable provided it is repeated. An AI rarely answers exactly the same way twice, so querying a model a single time gives you a snapshot, not a measurement. Reliability comes from repetition (asking the same question many times), from computing the results mechanically, and from a stable protocol that is sealed and verifiable. Without these conditions, a "visibility score" is not necessarily wrong; it is unreliable, because it doesn't tell you how stable it is.

The problem

It's the objection everyone has in mind, and it's a fair one: "if the AI changes its answer every time, how could you possibly measure anything?"

It's fair because it's true. Ask ChatGPT the same question twice and you'll often get two different answers. Put the same question to Claude and to Gemini, and the gap between them can be enormous. That is the very nature of these models: they don't recite a stored truth, they generate a likely answer, and what's likely keeps shifting.

Most tools on the market dodge this difficulty instead of facing it. They ask the question once, grab an answer, turn it into a score, and display it to you as if it were carved in stone. That's comfortable to sell. It's misleading to use.

The idea to grasp

Let's use an image that speaks for itself. In a courtroom, no one is convicted on the basis of a single witness, however convincing. Testimonies are cross-checked, accounts are compared against each other, the evidence is verified. A lone witness on the stand, however brilliant, does not establish a judicial truth.

An AI answer is that witness. Extraordinarily convincing, and alone on the stand. Taking it at its word means convicting on a single testimony.

The solution isn't to abandon measurement, it's to cross-check. In practice, measuring a brand's visibility reliably requires three steps:

Repeat: ask the same query a large number of times (say 20), to see not one answer but the distribution of answers.
Quantify the spread: if the brand appears in 7 answers out of 20, that's a presence of 35%. This fraction is an estimate drawn from a sample of 20 answers, so it carries a sampling uncertainty that has to be reported alongside it.
Compute mechanically: count the appearances, the ranks, the citations — without asking an AI to "judge" the result (which would add instability on top of instability).
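The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not mAIr's actual implementation: the `ask` callable (whatever sends the query to a model and returns its text), the substring test for presence, and the choice of a Wilson score interval to express the uncertainty are all assumptions made for the example.

```python
import math
from typing import Callable


def brand_present(answer: str, brand: str) -> bool:
    """Mechanical check: case-insensitive substring match, no AI judgment."""
    return brand.casefold() in answer.casefold()


def wilson_interval(hits: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = hits / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def measure_presence(ask: Callable[[str], str], query: str, brand: str, n: int = 20):
    """Ask the same query n times and count the answers that mention the brand.

    `ask` is a hypothetical stand-in for a real model call.
    """
    hits = sum(brand_present(ask(query), brand) for _ in range(n))
    return hits / n, wilson_interval(hits, n)
```

With 7 appearances out of 20, this returns a presence of 0.35 and an interval of roughly 18% to 57%. The width of that interval is the point: the 35% is a real observation, but on 20 answers it is far from a precise constant.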

A model's instability doesn't disappear. But once measured and quantified, it becomes useful information in its own right: a brand whose presence varies from 35% to 100% depending on how it's queried learns something important about its visibility.

What you hear everywhere

"AIs hallucinate and change all the time, so it's impossible to measure." A false shortcut. It's precisely because it varies that you have to measure rigorously, instead of looking just once. We measure the weather, which is unstable, perfectly well — because we repeat the observations and model the uncertainty.

"Just ask the question and see for yourself." See, yes. Measure, no. A manual query gives you an impression. An impression is not data you can base a budget decision on.

"Our score is reliable, it's updated continuously." Continuously updating a number that isn't repeated is just instability displayed in real time. Refresh frequency is no substitute for repeating the measurement.

And that's exactly where my stance comes in: no trust in the AI or in the vendor, only in the facts. An AI answer is not a fact. A repeated, quantified measurement, with its uncertainty stated, is starting to be one.

My take: instability is measured, not ignored

From here on, the register changes: we describe the instrument.

Making an AI visibility measurement reliable rests on explicit methodological choices:

n=20: each query is asked twenty times in production, to capture the real distribution of answers, not an isolated case.
Stable protocol: the measurement follows an identical protocol every time, validated by our convergence tests — that's what makes it reproducible.
Mechanical computation: all aggregation (presence, rank, share of voice, stability) is done in code. The AI never scores anything.
Measuring stability itself: the consistency of answers becomes an indicator in its own right (a brand can be highly present and highly unstable — that's information).
Sealing: the report is dated and signed, so it can be verified after the fact and defended.
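To make "done in code" concrete, here is a minimal sketch of what a mechanical aggregation could look like. The input format (each run reduced to the ordered list of brands cited in one answer) and the exact definitions of share of voice and stability are assumptions chosen for illustration; the text above does not specify how mAIr defines them.

```python
from statistics import mean


def aggregate(runs: list[list[str]], brand: str) -> dict:
    """Aggregate n runs; each run is the ordered list of brands cited in one answer.

    Assumes each brand appears at most once per run.
    """
    n = len(runs)
    if n == 0:
        raise ValueError("at least one run is required")
    # 1-based position of the brand in every run where it appears
    ranks = [run.index(brand) + 1 for run in runs if brand in run]
    total_mentions = sum(len(run) for run in runs)
    presence = len(ranks) / n
    return {
        "n": n,
        "presence": presence,
        "mean_rank": mean(ranks) if ranks else None,
        # Assumed definition: this brand's mentions over all brand mentions
        "share_of_voice": len(ranks) / total_mentions if total_mentions else 0.0,
        # Assumed definition: share of runs that agree with the majority outcome
        "stability": max(presence, 1 - presence),
    }
```

Nothing in this function calls a model: given the same runs, it always returns the same numbers, which is the property the list above is after.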

In concrete terms, what makes the measurement reliable? The same question is asked n=20 times, under strictly identical conditions, following a stable protocol validated by our convergence tests. The rate obtained (35%, for example) is an observed fact: what the AI answered, at a given moment, across those 20 queries — not an impression drawn from a single attempt. The more consistent the answers, the more stable the measurement.

Why it matters. A number pulled from a single attempt passes itself off as a certainty. A value observed over n=20 queries, on the other hand, tells you what the AI actually answered — that's the difference between an impression and a measurement.

How it's guaranteed. Everything is aggregated mechanically in code, never by the AI, then the result is sealed (HMAC signature, dated) and publicly verifiable — reproducible and defensible.
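As an illustration of the sealing step, the sketch below signs a report with HMAC-SHA256 using only Python's standard library. The report fields, the `sealed_at` name, the canonical JSON serialisation and the key handling are assumptions for the example, not a description of mAIr's actual format.

```python
import hashlib
import hmac
import json
from datetime import datetime, timezone


def _canonical_bytes(payload: dict) -> bytes:
    # Sorted keys and fixed separators: the same content always gives the same bytes.
    return json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")


def seal_report(report: dict, secret_key: bytes) -> dict:
    """Return the report with a UTC timestamp and an HMAC-SHA256 signature."""
    payload = dict(report, sealed_at=datetime.now(timezone.utc).isoformat())
    signature = hmac.new(secret_key, _canonical_bytes(payload), hashlib.sha256).hexdigest()
    return {"payload": payload, "signature": signature}


def verify_report(sealed: dict, secret_key: bytes) -> bool:
    """Recompute the signature and compare it in constant time."""
    expected = hmac.new(secret_key, _canonical_bytes(sealed["payload"]), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, sealed["signature"])
```

Any change to the payload after sealing, even a single digit of the presence rate, makes `verify_report` return `False`. One design point worth noting: checking an HMAC requires the same secret key that produced it, so a third party cannot verify it alone. Public verification therefore has to go through something like a verification service that holds the key; the exact mechanism is not described here.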

Where LirenPrism stands

LirenPrism has made measurement reliability the core of mAIr, because it's the condition of its credibility as a neutral third party. Where a conventional tool displays a score, mAIr displays a score and its uncertainty, obtained through repetition and mechanical computation, in a sealed report.

A concrete example from a real measurement: for the same brand, the measured presence rate at one provider could be 35% in answers without web search, and 100% with web search. A tool that queries only once would have displayed one or the other, depending on the luck of the moment. The repeated measurement, by contrast, reveals the gap — and that gap is precisely the information that matters.

In brief

An AI rarely answers the same way twice: a single query is not a reliable measurement.
Reliability comes from repetition (n=20), from mechanical computation, and from a stable, sealed and verifiable protocol.
Instability shouldn't be ignored: once measured, it becomes useful information.
mAIr displays the score and its uncertainty, in a dated and sealed report.

Frequently asked questions

If the AI changes its answer, what's the point of measuring?

To turn an unstable impression into quantified data. By repeating, you measure the distribution of answers and their stability, which tells you more than a single query precisely because the answers shift.

How many times do you have to query for it to be reliable?

There's no magic number, but a single measurement is never enough. mAIr uses n=20 in production and measures the spread of answers, for a stable and reproducible result.
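To see why there is no magic number, it helps to look at how the uncertainty shrinks as n grows. The sketch below uses the simple normal approximation for a proportion, taken at its worst case (a true rate of 50%); this is an illustrative choice, crude for very small n, and not a description of mAIr's own calculation.

```python
import math


def worst_case_margin(n: int, z: float = 1.96) -> float:
    """Normal-approximation half-width at p = 0.5, where the margin is largest."""
    if n <= 0:
        raise ValueError("n must be positive")
    return z * math.sqrt(0.25 / n)


for n in (1, 5, 20, 100):
    print(f"n={n:>3}  margin = +/-{worst_case_margin(n):.0%}")
```

Under this approximation the margin is about 22 points at n=20 and about 10 points at n=100. It shrinks only with the square root of n, so each extra point of precision costs more queries than the last. Choosing n is a trade-off between cost and precision, not a threshold beyond which a measurement suddenly becomes "true".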

Can a brand be both highly visible and unstable?

Yes, and it's common. A brand can appear almost always in one query mode and rarely in another. That gap is measurable — and it's often the most actionable information.