Flaky Visual Tests: Why They Destroy Your QA and How to Stabilize Them

A flaky test (or unstable test) is a test that produces different results — pass or fail — for the same code and the same configuration, without any modification to the system under test.

Here's an opinion that might bother you: a flaky visual test is worse than no test at all. And it's not gratuitous provocation. An absent test costs nothing day-to-day. It doesn't block your pipeline. It doesn't consume your team's time. It doesn't destroy confidence in your test infrastructure. A flaky test does all of that, every day, insidiously — because each false failure looks just enough like a real one to require investigation.

Google's data on its own test systems is telling: roughly 1.5% of all test runs report a flaky result, yet these tests consume a disproportionate share of engineering time. If a company of Google's size and technical depth hasn't eliminated the problem, it isn't a trivial one. And in the specific domain of visual testing, the problem is amplified by the very nature of what's being compared.

Why visual tests are particularly vulnerable to instability

Automated visual testing introduces a layer of non-determinism that unit and functional tests don't have. A unit test verifies that a function returns the right value. A functional test verifies that a button triggers the right action. Those results are binary and deterministic.

Visual testing verifies that your interface's rendering matches a reference image. And a web page's rendering is the result of a complex chain of processes, each introducing its own variability: HTML parsing, CSS application, JavaScript execution, external resource loading, layout calculation, rasterization, and compositing.

The four main causes of flaky visual tests

Timing: the problem you can't ignore

The web is asynchronous by nature. When you ask the browser to capture a screenshot, is the page truly ready? The answer is almost always: it depends.

Page loading isn't a single event; it's a cascade of events. HTML is parsed, CSS is applied, scripts are executed, images are loaded, web fonts are applied, API requests return data. Each step has a variable duration. The classic strategy of waiting for "ready" (DOMContentLoaded, the load event, or network idle) doesn't guarantee that visual rendering is complete.

Result: your screenshot sometimes captures a complete page, sometimes a page mid-render.
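
To make the race concrete, here's a minimal Playwright sketch of the naive approach (the URL and file name are placeholders): the load event has fired, yet web fonts and late images may still be rendering when the capture happens.

```ts
import { chromium } from 'playwright';

const browser = await chromium.launch();
const page = await browser.newPage();

// 'load' fires when static resources finish downloading; it says nothing
// about whether fonts are applied or post-load JavaScript has finished painting.
await page.goto('https://example.com', { waitUntil: 'load' });
await page.screenshot({ path: 'capture.png' }); // may race with rendering

await browser.close();
```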

Animations: movement in a static medium

A screenshot is a fixed image. An animation is continuous change. The two are fundamentally incompatible for automated comparison. If your page contains a 300 ms animation that starts on load, the exact moment of capture relative to the animation's start varies between runs.

Infinite animations (spinners, skeleton loaders) are even worse: there's no "stable" moment to capture the screenshot.

Dynamic content: everything that changes without you

Dates, times, ads, randomly generated avatars, real-time notifications, visitor counters — all vary between test runs. Each variation is detected as a visual difference. Each difference fails the test.

Network and infrastructure: variables you don't control

Your test runs in an environment that depends on external resources: API servers, CDNs for images and fonts, third-party services. Latency varies between runs. In a CI/CD pipeline, the problem is amplified: your CI runner shares resources with other jobs.

The real cost of flaky tests

The most visible cost is triage time. Each flaky test failure requires investigation: someone must look at the results, compare screenshots manually, decide whether the failure is real, and relaunch if needed.

But the most destructive cost is invisible: loss of confidence. When a team learns visual tests fail "all the time" for no reason, they develop an automatic relaunch reflex. The day a test fails due to a real bug, the reflex is the same: relaunch. And the bug reaches production.

This phenomenon has a name: the "cry wolf effect." Once established, it's very difficult to reverse.

Stabilization strategies that work

Control the render environment

Use a headless browser in a controlled container with a fixed resolution, pre-installed fonts, and a deterministic network configuration. Freeze the browser version, disable GPU rendering, and configure a fixed viewport size.
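
As a sketch, here's what pinning the environment can look like with Playwright. The flags and values are illustrative, not a definitive recipe; pin the Playwright version itself in package.json to freeze the browser build.

```ts
import { chromium } from 'playwright';

const browser = await chromium.launch({
  // Illustrative Chromium flags: software rendering avoids GPU-dependent output
  args: ['--disable-gpu', '--font-render-hinting=none'],
});
const page = await browser.newPage({
  viewport: { width: 1280, height: 720 }, // fixed resolution
  deviceScaleFactor: 1,                   // no DPI-dependent rasterization
  timezoneId: 'UTC',                      // deterministic dates in the page
  locale: 'en-US',                        // deterministic number/date formats
});
```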

Neutralize animations

Inject a stylesheet that forces all animations and transitions to zero duration. This instantly freezes all animated elements in their final state.
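
A minimal version of that stylesheet, injected with Playwright's addStyleTag after navigation (the caret rule is an extra guard, not required):

```ts
await page.addStyleTag({
  content: `
    *, *::before, *::after {
      animation-duration: 0s !important;
      animation-delay: 0s !important;
      transition-duration: 0s !important;
      transition-delay: 0s !important;
      caret-color: transparent !important; /* hide blinking text cursors */
    }
  `,
});
```

Playwright's screenshot API also exposes an animations: 'disabled' option, which fast-forwards finite animations to their final state and cancels infinite ones.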

Stabilize dynamic content

Freeze dates and times, disable third-party widgets, mock API data, and replace generated avatars with static images in test fixtures. The goal is an environment where the only variable is your interface code.
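
A sketch of the first two techniques with Playwright; the endpoint and fixture payload are hypothetical placeholders for your own data.

```ts
// Freeze the clock before any page script runs. Overriding Date.now is a
// naive freeze; dedicated clock-mocking libraries go further (timers,
// new Date(), etc.).
await page.addInitScript(() => {
  const fixed = new Date('2024-01-15T10:00:00Z').valueOf();
  Date.now = () => fixed;
});

// Serve a deterministic response for a dynamic endpoint (placeholder URL).
await page.route('**/api/notifications', (route) =>
  route.fulfill({ json: { items: [] } })
);
```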

Wait intelligently

Instead of fixed delays ("wait 3 seconds"), use state-based waiting strategies: wait for critical elements to be visible, images to be loaded, fonts to be applied, and network requests to be completed.
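
With Playwright, a state-based sequence might look like this (the selector is a placeholder for your own critical element):

```ts
await page.waitForLoadState('networkidle'); // no in-flight requests
await page.locator('#main-content').waitFor({ state: 'visible' });
await page.evaluate(async () => {
  await document.fonts.ready; // web fonts applied
});
await page.evaluate(async () => {
  // Wait until every <img> on the page has finished loading (or failed).
  await Promise.all(
    Array.from(document.images)
      .filter((img) => !img.complete)
      .map((img) => new Promise((done) => { img.onload = img.onerror = done; }))
  );
});
```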

Adopt a comparison that tolerates noise

Pixel-by-pixel comparison is the most sensitive to render non-determinism. Perceptual algorithms (SSIM, pHash) are more tolerant. The structural approach, which compares the DOM and computed CSS properties rather than pixels, is the most resistant to render noise, because it natively ignores the sub-pixel variations that cause most flaky failures.
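
For illustration, here's a tolerant pixel diff using the pixelmatch library (file names are placeholders): the threshold and anti-aliasing options absorb the sub-pixel noise that strict equality would flag.

```ts
import fs from 'node:fs';
import { PNG } from 'pngjs';
import pixelmatch from 'pixelmatch';

const baseline = PNG.sync.read(fs.readFileSync('baseline.png'));
const current = PNG.sync.read(fs.readFileSync('current.png'));
const diff = new PNG({ width: baseline.width, height: baseline.height });

const changedPixels = pixelmatch(
  baseline.data, current.data, diff.data,
  baseline.width, baseline.height,
  {
    threshold: 0.1,   // per-pixel color tolerance (0 = strict equality)
    includeAA: false, // detect and ignore anti-aliased pixels
  }
);
console.log(`${changedPixels} pixels differ`);
```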

No-code visual testing as a maintenance solution

Code-based visual tests (Playwright, Cypress, Selenium) require scripts that navigate, interact, and capture. These scripts are themselves a source of instability: a CSS selector that no longer finds its element, or click timing that misses its target.

No-code tools like Delta-QA eliminate this fragility layer. You don't write scripts — you configure tests visually. The tool handles loading, waiting, stabilization, and comparison. When an element changes selector, the tool adapts without intervention.

When to delete a flaky test

If a visual test fails intermittently despite all stabilization attempts, the bravest — and often wisest — decision is to delete it. A flaky test nobody fixes actively degrades your pipeline. It trains your team to ignore failures.

Delete it, document why, and replace it with a more targeted check. The goal isn't maximum tests — it's tests your team trusts.

FAQ

What's the difference between a flaky test and a false positive?

A false positive signals a problem that doesn't exist, and a single run is enough to produce one. A flaky test produces inconsistent results from run to run for the same code. In other words, a flaky test is one that produces false positives intermittently.

How do you measure your visual tests' flakiness rate?

Run the same visual test suite several times without any code change and count the tests whose results differ between runs. Five consecutive runs are usually enough to surface the most unstable tests.
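
A small sketch of the counting step, assuming you've extracted one pass/fail map per run from your runner's report (the RunResults shape is an assumption, not a standard format):

```ts
type RunResults = Record<string, 'passed' | 'failed'>;

// A test is flaky if its outcome differs across runs of identical code.
function flakyTests(runs: RunResults[]): string[] {
  const names = new Set(runs.flatMap((run) => Object.keys(run)));
  return [...names].filter(
    (name) => new Set(runs.map((run) => run[name])).size > 1
  );
}

// e.g. flakyTests([run1, run2, run3, run4, run5]) after five identical executions
```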

Are no-code visual tests less flaky than coded tests?

They eliminate a category of flakiness — test script fragility (brittle selectors, navigation timing, state management). But they're still subject to the same browser rendering constraints.

Should you automatically retry failed visual tests?

Retry is a band-aid, not a solution. It masks the problem. If you must enable retries, limit them to one and flag tests that needed a retry for investigation.
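
With Playwright Test, that policy is a one-line config: tests that pass only on retry are reported with status "flaky", which gives you the investigation list for free (the reporter choice is illustrative).

```ts
// playwright.config.ts
import { defineConfig } from '@playwright/test';

export default defineConfig({
  retries: 1, // a single retry, not a blanket safety net
  reporter: [
    ['html'],
    ['json', { outputFile: 'results.json' }], // filter for status 'flaky'
  ],
});
```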

What is an acceptable flakiness threshold in CI/CD?

Aim for below 1% of your total test suite. Above 3%, the productivity impact becomes measurable. Above 5%, your team almost certainly develops the reflex of systematically relaunching failed pipelines.

Does Delta-QA help stabilize visual tests?

Delta-QA reduces flakiness at the source by using a structural approach rather than pixel-by-pixel comparison. The sub-pixel rendering variations, antialiasing, and timing issues that cause most intermittent failures are natively ignored. Combined with a no-code approach that eliminates fragile test scripts, Delta-QA produces reliable, reproducible test results without complex configuration.


A visual test only has value if your team trusts it. Flaky tests destroy that trust day by day. Instead of spending time stabilizing fragile scripts and sorting false failures, adopt a tool designed for reliability from the start.

Try Delta-QA for Free →