Visual Regression: Why Pixel-by-Pixel Comparison Lets Real Changes Slip Through

Most open-source visual regression tools (BackstopJS, Wraith, or the screenshot checks in Playwright and Cypress) rely on the same engine: take a screenshot before and after, compare the two images pixel by pixel, then count the pixels that changed.

It's simple, robust, and perfect for catching a major layout break. But this approach carries a structural trade-off that few people measure. We measured it, on cases chosen to expose it, and the result is clear-cut.

Two settings you must not confuse

The confusion is everywhere, so let's clear it up. This kind of tool has two distinct settings:

  1. Per-pixel tolerance: how big a color gap before a pixel is considered to have "really" changed. It's a color-sensitivity dial, applied pixel by pixel.
  2. The decision threshold: once the differing pixels are counted, you look at what percentage of the image changed to decide whether the test passes or fails. The reference tool BackstopJS sets it by default to 0.1% of the pixels.

Our test follows exactly this logic: we count the differing pixels (tool set to its default sensitivity), then we apply the 0.1% decision threshold to say pass or fail.
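
To keep the two settings apart, here is a minimal Python sketch. It is an illustration, not the code of any tool: the function names, the four-pixel "image", and the color distance (largest per-channel gap, scaled to 0..1) are assumptions made for the example, and real tools use a perceptual color distance instead.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each 0..255


def color_gap(a: Pixel, b: Pixel) -> float:
    """Simplified color distance in 0..1 (assumption: largest channel gap)."""
    return max(abs(x - y) for x, y in zip(a, b)) / 255


def count_differing_pixels(before: List[Pixel], after: List[Pixel],
                           pixel_tolerance: float) -> int:
    """Setting 1: a pixel counts as changed only if its gap exceeds the tolerance."""
    return sum(1 for a, b in zip(before, after) if color_gap(a, b) > pixel_tolerance)


def verdict(differing: int, total: int, threshold_percent: float) -> str:
    """Setting 2: compare the share of changed pixels with the decision threshold."""
    percent = 100 * differing / total
    return "fail" if percent > threshold_percent else "pass"


before = [(255, 255, 255), (255, 255, 255), (0, 128, 0), (0, 0, 0)]
after = [(255, 255, 255), (250, 250, 250), (128, 0, 128), (0, 0, 0)]

# 250 vs 255 is a gap of about 0.02: ignored at tolerance 0.1, counted at tolerance 0.
lenient = count_differing_pixels(before, after, pixel_tolerance=0.1)
strict = count_differing_pixels(before, after, pixel_tolerance=0.0)
print(lenient, strict)
print(verdict(lenient, len(before), threshold_percent=0.1))
```

The first dial changes how many pixels are counted; the second only decides what to do with that count.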

The threshold dilemma

This percentage threshold locks the whole-image comparison into a trade-off it cannot win:

  • With a percentage threshold (default setting, 0.1%) → you let small real changes slip through. A button that changes color, a border that rounds off, a cell status going "OK" → "ERROR": each of these accounts for a tiny fraction of the pixels. Below 0.1%, the test goes green even though the page has visibly changed. A real change goes unnoticed.
  • With no tolerance at all (the other extreme, Playwright's default setting) → you fail on the slightest pixel of difference, including plain edge smoothing: those semi-transparent pixels at the edges of letters and shapes, which vary slightly from one render to the next. The result: false alerts galore, and the team ends up ignoring the tool.
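
The trade-off shows up in simple arithmetic. The sketch below assumes a hypothetical 1280 × 4000 full-page capture and two made-up diffs: a 40 × 20 status badge that really changed, and 30 pixels of edge-smoothing jitter. None of these figures comes from our measurements.

```python
PAGE_PIXELS = 1280 * 4000  # hypothetical full-page capture: 5,120,000 pixels


def verdict(differing: int, threshold_percent: float) -> str:
    """Fail when the share of differing pixels exceeds the decision threshold."""
    return "fail" if 100 * differing / PAGE_PIXELS > threshold_percent else "pass"


real_change = 40 * 20  # a small badge that changed color: 800 pixels
smoothing_noise = 30   # a few edge pixels that vary between renders

for threshold in (0.1, 0.0):
    print(f"threshold {threshold}%:",
          "real change ->", verdict(real_change, threshold),
          "| noise ->", verdict(smoothing_noise, threshold))
```

With these assumed numbers, 0.1% of the page is 5,120 pixels, so the 800-pixel badge passes at the default threshold, while at 0% even the 30 noisy pixels fail the test.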

Real tools offer guardrails — masking zones, ignoring regions, cropping. They work, but you have to configure them in advance, by hand, zone by zone. Localized sensitivity is not automatic.
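
To make the manual step concrete, here is a minimal sketch of a zone mask. The `Rect` type, the coordinates, and the `count_outside_masks` helper are hypothetical and do not reproduce any tool's configuration format; the point is only that every rectangle has to be known and written down before the run.

```python
from typing import List, NamedTuple, Tuple


class Rect(NamedTuple):
    x: int
    y: int
    width: int
    height: int

    def contains(self, px: int, py: int) -> bool:
        return (self.x <= px < self.x + self.width
                and self.y <= py < self.y + self.height)


def count_outside_masks(diff_pixels: List[Tuple[int, int]], masks: List[Rect]) -> int:
    """Count differing pixels that are not covered by any hand-declared mask."""
    return sum(1 for px, py in diff_pixels
               if not any(m.contains(px, py) for m in masks))


# Hypothetical mask over a rotating banner, declared in advance by hand.
masks = [Rect(x=0, y=0, width=1280, height=90)]

# Coordinates of pixels that differ between the two captures (made-up values).
diff_pixels = [(10, 20), (640, 45), (300, 500)]

print(count_outside_masks(diff_pixels, masks))  # only (300, 500) lies outside the mask
```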

The test in real conditions

We compared, on exactly the same page (rendered by a browser, frozen reproducibly, 1280px screen width, full-page capture), two approaches:

  • Delta-QA: our comparator, which compares element by element (it matches elements between the two versions and only compares pixels at the finest level);
  • whole-image comparison: we compare the two screenshots pixel by pixel, then apply the 0.1% decision threshold.

Five cases, chosen to expose the blind spot:

| Case | Change | Delta-QA (element by element) | Whole-image comparison (% of pixels changed, verdict at the 0.1% threshold) |
|---|---|---|---|
| Move | a triangle moves within the page | 2 signals: gone + appeared on the element | 0.005% → NOT DETECTED |
| Reorder | cards and menu reordered | 13 localized signals (which cards, which items) | 0.63% → fails, but diffuse blob, nothing specified |
| Subtle change | border rounded from 12px to 30px | 1 signal on the affected card | 0.011% → NOT DETECTED |
| Localized color | a card header green → purple | the right header (intensity 0.996) + 2 weak parent signals | 5.03% → fails, but diffuse blob, nothing specified |
| Table cell | a row's status FAIL → WARN | 2 strong signals on the right rows | 0.036% → NOT DETECTED |

Across these five cases, three real changes — an element move, a border rounding, a cell status — fall below the standard 0.1% decision threshold. A tool configured with defaults declares them "no change". Yet these are exactly the regressions a visual test is supposed to catch.
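
The verdicts can be re-derived from the percentages in the table. The short Python check below uses only those five figures and the 0.1% threshold; the variable names are ours.

```python
DECISION_THRESHOLD_PERCENT = 0.1  # BackstopJS default, as used in our test

# Share of pixels changed for each case, taken from the table above.
measured_percent = {
    "Move": 0.005,
    "Reorder": 0.63,
    "Subtle change": 0.011,
    "Localized color": 5.03,
    "Table cell": 0.036,
}

missed = [case for case, percent in measured_percent.items()
          if percent <= DECISION_THRESHOLD_PERCENT]

for case, percent in measured_percent.items():
    status = "NOT DETECTED" if case in missed else "fails"
    print(f"{case}: {percent}% -> {status}")

print(f"{len(missed)} of {len(measured_percent)} real changes pass as 'no change'")
```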

For the two cases the pixel-by-pixel comparison does "detect" (reorder, color), it only says one thing: "X% of pixels changed somewhere". No idea which element, nor what kind of change. Delta-QA, on the other hand, names the exact element and qualifies the change (moved, added, removed, modified).

Why the element level changes everything

Delta-QA does not compare one big image. It:

  1. rebuilds the tree of the page's elements;
  2. matches each element between the two versions (by its content, then its position);
  3. only compares pixels at the finest level, and spots a block's own changes by ignoring the zones of its already-changed sub-elements;
  4. sets aside edge smoothing from the count of genuinely differing pixels.

The consequence: it can be highly sensitive (catch a 1px border on a large block) without drowning in smoothing variations, because that noise is set aside and every signal is tied to a specific element. A move isn't a "red blob": it's an element flagged gone at the old position and appeared at the new one. Localized sensitivity is automatic, with no mask to prepare up front.
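
The idea can be illustrated with a deliberately simplified sketch. It is not Delta-QA's implementation: the `Element` type, the `fingerprint` field (standing in for an element's own pixels), and the matching rule (same content at the same position) are assumptions chosen to keep the example short. It only shows why signals come out tied to elements rather than to a share of the image.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Element:
    content: str               # text or other identifying content
    position: Tuple[int, int]  # top-left corner (x, y)
    fingerprint: str           # hypothetical stand-in for the element's own pixels


def compare(before: List[Element], after: List[Element]) -> List[Tuple[str, str]]:
    """Return (kind, content) signals, one per affected element."""
    old: Dict[tuple, Element] = {(e.content, e.position): e for e in before}
    new: Dict[tuple, Element] = {(e.content, e.position): e for e in after}
    signals = []
    for key, element in old.items():
        if key not in new:
            signals.append(("gone", element.content))
        elif new[key].fingerprint != element.fingerprint:
            signals.append(("modified", element.content))
    for key, element in new.items():
        if key not in old:
            signals.append(("appeared", element.content))
    return signals


before = [
    Element("triangle", (100, 200), "tri-v1"),
    Element("card A", (0, 400), "radius-12px"),
    Element("footer", (0, 900), "footer-v1"),
]
after = [
    Element("triangle", (300, 200), "tri-v1"),   # moved
    Element("card A", (0, 400), "radius-30px"),  # border rounded further
    Element("footer", (0, 900), "footer-v1"),    # unchanged
]

for signal in compare(before, after):
    print(signal)
```

In this toy version the moved triangle yields a "gone" and an "appeared" signal and the card yields a "modified" one, whatever share of the page those elements occupy.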

Methodology — and its limits

We care about rigor, so here are the gray areas:

  • The same page for both. Both approaches start from exactly the same page, rendered and frozen — no display bias.
  • Numbers checked against the reference tool. Our test bench recomputes the color difference the same way as the most widely used pixel-by-pixel comparison tool. We cross-checked the 5 cases with that official tool: on the bold color change, the two give 5.036% vs 5.034%, which is nearly identical. On the other cases, the reference tool counts even fewer pixels (it ignores edge smoothing), so it is even more prone to letting small changes slip through. The figures in the table are the reference tool's.
  • Delta-QA over-reports (and we own it). On the color change, it emits 3 signals: the real one (the header, intensity 0.996) + 2 very weak parent signals (0.005 and 0.001). This is deliberate: we surface everything, and the UI's sensitivity slider hides those weak signals by default. But to be clear: the raw count is not "1 change = 1 signal".
  • A single test context. These measurements are made at a single screen size, page at rest, on controlled test pages. We claim nothing about multiple screen sizes, interactive states (hover, focus) or genuinely noisy pages; those are separate projects.
  • Reorder. Delta-QA classified the reordered cards as "modified" rather than "moved", but it still localized each change by element, which is far more useful than a diffuse blob.
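
The over-reporting point can be made concrete with the three intensities quoted above. The sketch below is an assumption about how such a filter could work, not the UI's code; the element labels are placeholders, and the 0.01 cutoff is a hypothetical value, since the slider's default is not stated here.

```python
from typing import List, Tuple

# (element, intensity) for the color-change case; intensities are the ones quoted above.
raw_signals: List[Tuple[str, float]] = [
    ("card header", 0.996),
    ("parent 1", 0.005),
    ("parent 2", 0.001),
]


def visible_signals(signals: List[Tuple[str, float]],
                    min_intensity: float) -> List[Tuple[str, float]]:
    """Keep only signals at or above the chosen sensitivity cutoff."""
    return [s for s in signals if s[1] >= min_intensity]


HYPOTHETICAL_CUTOFF = 0.01  # assumption: the real slider default is not given

print(len(raw_signals), "raw signals")
print(visible_signals(raw_signals, HYPOTHETICAL_CUTOFF))
```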

And to be fair: comparing the whole image is simple, needs only one capture per page, stays excellent for a major break, and its zone masks work. The problem isn't that it's bad. The problem is that it forces you to trade off between letting subtle changes slip through and raising false alerts on noise, and to set precision by hand.

Key takeaways

If your visual regression test relies on a whole-image comparison with a percentage threshold at the default setting, it probably lets through changes your users see: moves, small style changes, a status flipping in a table cell. Lowering the threshold catches them, but wakes up the false alerts; masks help, but are configured zone by zone, in advance.

Comparing element by element is not a setting: it's a different architecture, one that gives back both sensitivity and precision — with, as a bonus, the element name and the nature of the change.

Reproducible test: the "whole-image" comparison was produced with the reference open-source tool (the pixelmatch package, Node/npm), set to its default sensitivity, then with a 0.1% decision threshold like BackstopJS — on exactly the same frozen page as Delta-QA.