The day a refactor passed on my laptop and failed on yours

Most of the code being written right now is not being written. It is being generated, glanced at, then merged. The reviewer is tired. The diff is large. Increasingly the reviewer is itself a language model summarising the work of another language model. Somewhere in that loop there is supposed to be a moment where someone confirms the change did what it claimed. Often there isn't.

I wanted a small, boring tool to fill that gap. Take a function from before a refactor and after. Run both on the same inputs. Tell me plainly whether the behaviour changed. Not an opinion. Not a confidence score. A result I could rerun next week to the same answer, byte for byte. If a teammate ran it on their machine they should get my exact result, not something close.

That last sentence sounds trivial. It is the entire problem. This is the story of where it broke and why the fix turned out to be the most important design decision in the whole tool.

Why rerunning it is the only claim worth making

There is no shortage of tools that review your pull request. The newer ones are language models with a nice interface. They are useful. They are also the same kind of thing that wrote the code: a probabilistic system giving you its impression. Ask the same one twice and you can get two different reviews. In a world where a model wrote the diff, a model reviewing the diff is the same fallible loop checking its own work.

So I did not want to add another opinion. I wanted a verdict with a property no opinion has: you can reproduce it. Run the check. Get a result. That result is a function of the inputs and nothing else. No wall clock. No network. No luck particular to one machine. Same inputs in, same answer out, on any computer.

If you have that, you can sign it and hand it to someone who does not trust you. They rerun it and confirm it themselves. The trust comes from reproduction, not from my reputation or my model's confidence. That is the whole pitch. It only works if the reproduction is real.

Where it broke: a function that returned a float

Early on the tool handled integers, strings, lists of integers. Clean, exact, the same on every machine. Then I pointed it at a numerical function. A refactor of an averaging routine. The kind of change an AI assistant produces ten times a day.

On my Mac the check said the two versions diverged on one input. On a Linux box in CI it said they were identical. Same code. Same inputs. Two different verdicts.

This is the nightmare for a tool whose only selling point is reproducibility. A verdict that depends on the machine is not a verdict. It is a rumour.

The cause is not a bug in my tool. It is the nature of floating point arithmetic. It is worth understanding, since almost every "we test your AI code" tool will hit it and most will quietly paper over it.

What IEEE 754 promises and what it does not

Floating point numbers follow a standard called IEEE 754. The standard is precise about which operations are guaranteed to give the same answer everywhere. That guarantee is narrower than people assume.

The basic operations are correctly rounded. Addition. Subtraction. Multiplication. Division. Square root. The fused multiply add. Each is required to return the single nearest representable result, every time, on every conforming machine. At double precision with the default rounding mode these operations are identical bit for bit whether you run them on an Apple chip or an Intel server. There is no ambiguity. There is no luck of the platform. Two different expressions built only from these operations will agree across machines or disagree across machines consistently.

The functions you reach for next are not covered. Sine. Cosine. Exponential. Logarithm. Raising to a fractional power. For these the standard only recommends correct rounding. It does not require it. The reason is a genuinely hard maths problem, sometimes called the table maker's dilemma: computing the last bit correctly for these functions can need enormous intermediate precision. Implementations make different tradeoffs. The C maths library on macOS and the one on Linux can legitimately return results that differ in the final bit.

That final bit is exactly what bit me. My averaging refactor touched a function whose two versions agreed to the last bit under one maths library and disagreed under another. Neither machine was wrong. The standard permits both. My tool was trying to render a global verdict on a quantity that is, by design, local.

The decision: refuse what you cannot reproduce, by name

There were two tempting fixes. Both are traps.

The first is to round the results before comparing. Compare to twelve decimal places and call it equal. This feels reasonable. It is not safe. A real difference in the last bit can sit right on the rounding boundary. One machine rounds up. The other rounds down. Rounding does not remove the disagreement. It hides it sometimes and invents it other times. You have traded a guarantee for a coin flip.

The second is to compare with a tolerance. Equal if within some epsilon. Now your tool no longer answers the question it was asked. "Did this refactor preserve the behaviour" has quietly become "is the new behaviour close enough for my taste." For a tool whose only asset is a precise reproducible verdict, that is the asset gone.

The fix that actually holds is less clever and more honest. The tool admits a floating point function only when its computation stays inside the correctly rounded operations. Those are reproducible across machines, since the standard makes them so. The moment a function reaches for a transcendental, the tool does not guess and does not round. It refuses, by name. It says so:

clamp_average  REFUSED  depends on a platform-variable transcendental (math.exp);
                        a cross-host reproducible verdict is not possible here.

Agreement across machines comes from restriction, not from cleverness. Inside the admissible set the raw bits are already identical everywhere. The tool records the result as its exact bit pattern, with no rounding and no massaging. A NaN is normalised to a single canonical form. A NaN payload is not observable behaviour. The sign of a zero is preserved exactly. The sign of a zero is observable: dividing by positive zero and by negative zero gives positive and negative infinity. The details matter. The rule behind all of them is one sentence.

A value is admissible only if the verdict it produces is identical on every machine. Everything else is refused, out loud.

Why refusing is a feature, not a weakness

It is uncomfortable to ship a tool that says "I will not judge this." The instinct is to maximise coverage so the tool looks capable. That instinct is how you end up with a tool that confidently lies a small fraction of the time, which is worse than useless for anything you would actually rely on.

The refusal is the thing that makes every other answer trustworthy. When the tool says two versions are equivalent, it is staking that claim on a verdict it can reproduce anywhere. When it cannot make that promise it tells you. Then you reach for a human or a different technique. You are never handed a green light that was really a shrug.

This is the opposite of the marketing reflex, which is to claim more. The claim here is deliberately small and completely solid: these specific behaviours were checked on these specific inputs, the result reproduces to identical bytes everywhere, here is everything I declined to check. Small and true beats broad and shaky. That is true above all for the one job where you are trying to replace a rubber stamp with something you can stand behind.

What this is, plainly

The tool is called equiv. It runs a changed function and its previous version on the same generated inputs. It reports whether they diverged, with the exact input that broke them when they do. It produces a signed receipt of what was checked, addressed by its content, which anyone can rerun to the same bytes. It is not a prover. It is bounded testing: a pass means no divergence was found on the inputs it tried, not that none exists. It says so. It checks mechanical behaviour, never intent or architecture. It tells you that too.

That is the honest shape of it. In a field full of tools that review your code by having a model form an impression, the contribution here is not intelligence. It is the refusal to pretend. A verdict you can reproduce. A clear list of what was not checked. A flat "no" whenever a yes would not survive being run on a different machine.

The hard part was never generating inputs or comparing outputs. It was deciding, before writing the code, exactly which questions the tool is allowed to answer with certainty, then being willing to say nothing about the rest.