By Jochen D.
What is AI test failure analysis?
Today we are introducing TestingBot AI Insights, our AI-powered test failure analysis feature. When an automated test fails, AI Insights reads the test's logs and reports, in plain language, the most likely root cause, a timeline of what happened, the supporting log evidence, a suggested fix and a confidence score. It works across Selenium, Appium, Playwright, Puppeteer, Cypress, Espresso, XCUITest and Maestro.
AI test failure analysis uses an AI model to automatically read a test's logs, status messages and stacktraces, identify the most likely root cause and classify the failure as an application bug, a test-script problem, an environment issue or a flaky test. Instead of scrolling through a long command log to work out what went wrong, you read a short, structured explanation of why the test failed and what to do next.
In one sentence
AI Insights turns a failed test into a short, structured explanation: summary, root cause, likely owner, confidence, timeline, evidence and a suggested fix. Read the full TestingBot AI Insights documentation for the complete reference.
How does AI find the root cause of a test failure?
AI Insights follows a curated, single-shot process rather than running an open-ended agent that pokes at your test. When you open the AI Analysis tab on a failed test, TestingBot assembles a small slice of the artifacts you already have, masks anything sensitive, sends it to the AI model once and streams the explanation back section by section.
- Collect: gather the failing step and its neighbours, the relevant tail of the driver or device log, native runner results and stacktraces, and the test's status, message and environment.
- Mask: redact detectable secrets and personal data before anything leaves TestingBot servers.
- Analyze: send the curated, text-only slice to the AI model for a single, focused analysis.
- Explain: return a structured verdict with a confidence score, classify the likely owner and rank one to three concrete fixes.
The result is cached with the test, so reopening it is instant and costs nothing extra. If you change the test and want a fresh read, Re-analyze runs it again.
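The four steps and the cache can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not TestingBot's implementation: `InsightsPipeline`, `Analysis` and the three callables are hypothetical names, and the model call is replaced by a local stub.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Analysis:
    """Explanation stored for one failed test (hypothetical shape)."""
    test_id: str
    text: str


class InsightsPipeline:
    """Collect -> mask -> analyze -> explain, with one cached result per test."""

    def __init__(
        self,
        collect: Callable[[str], str],   # test id -> curated, text-only slice
        mask: Callable[[str], str],      # slice -> slice with secrets redacted
        analyze: Callable[[str], str],   # masked slice -> explanation (single call)
    ) -> None:
        self._collect = collect
        self._mask = mask
        self._analyze = analyze
        self._cache: Dict[str, Analysis] = {}

    def get(self, test_id: str) -> Analysis:
        # Reopening an analyzed test returns the cached result without a new call.
        if test_id not in self._cache:
            masked = self._mask(self._collect(test_id))
            self._cache[test_id] = Analysis(test_id, self._analyze(masked))
        return self._cache[test_id]

    def reanalyze(self, test_id: str) -> Analysis:
        # Drop the cached result and run the whole pipeline again.
        self._cache.pop(test_id, None)
        return self.get(test_id)


# Stubs standing in for the real collection, masking and model steps.
calls: List[str] = []


def fake_model(masked_slice: str) -> str:
    calls.append(masked_slice)
    return f"analysis of: {masked_slice}"


pipeline = InsightsPipeline(
    collect=lambda test_id: f"log tail for {test_id} token=abc123",
    mask=lambda text: text.replace("abc123", "<redacted:token>"),
    analyze=fake_model,
)
first = pipeline.get("t-1")
second = pipeline.get("t-1")  # served from the cache
```

In this sketch the stubbed model only ever sees the masked slice, and the second `get` does not trigger another call.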
What data is sent, and what gets masked?
Privacy is the part most failure-analysis tools gloss over, so it is the part we want to be most explicit about. AI Insights sends a small, curated, text-only slice of the test (no screenshots or video in this version). It is built from artifacts already stored on TestingBot:
- The failing step and the steps around it.
- A filtered, compressed tail of the most relevant driver or device log (Selenium, Appium, Playwright, logcat or iOS), focused on errors, warnings and the failure window.
- For native runners (Espresso, XCUITest, Maestro), the failing test results and their stacktraces or error messages, plus the Maestro flow YAML.
- The test's status, status message and termination reason, and the environment (browser or device, version and OS).
Before any of that leaves our servers, it passes through an automated masking step that redacts detectable secrets and personal data and replaces them with placeholders such as <redacted:token>. It covers, field by field:
- API keys and tokens, including JWT, AWS, Stripe, GitHub and Google key formats.
- Passwords, including values typed into password fields during the test.
- Authorization, Cookie and similar sensitive HTTP headers.
- Private keys and other high-entropy secrets.
- Email addresses and card-number-shaped values.
Masking is best-effort, not a free pass
This is pseudonymization, not anonymization. It significantly reduces what is shared, but the safest practice is still to keep secrets out of your logs in the first place. See our guidance on masking sensitive data in tests, and the exact list of fields in the "what data is sent" section of the documentation.
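To make the idea of pattern-based masking concrete, here is a rough sketch. It is not the masking step TestingBot runs: the regular expressions, the `mask` function and every placeholder other than `<redacted:token>` are assumptions chosen for illustration, and it covers only a few of the categories listed above.

```python
import re

# (pattern, replacement) pairs, applied in order. Headers go first so the
# whole header value is replaced before the narrower token rules run.
RULES = [
    (re.compile(r"^(authorization|cookie|set-cookie):[^\r\n]*", re.I | re.M),
     r"\1: <redacted:header>"),
    (re.compile(r"\beyJ[\w-]+\.[\w-]+\.[\w-]+"), "<redacted:token>"),   # JWT-shaped
    (re.compile(r"\bAKIA[0-9A-Z]{16}\b"), "<redacted:token>"),          # AWS access key ID
    (re.compile(r"\bghp_[A-Za-z0-9]{36}\b"), "<redacted:token>"),       # GitHub token
    (re.compile(r"\b[\w.%+-]+@[\w.-]+\.[A-Za-z]{2,}\b"), "<redacted:email>"),
    (re.compile(r"\b\d(?:[ -]?\d){12,18}\b"), "<redacted:card>"),       # 13-19 digits
]


def mask(text: str) -> str:
    """Replace every match of every rule with its placeholder."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text


log = "\n".join([
    "POST /login user=jane.doe@example.com",
    "Authorization: Bearer eyJhbGciOi.eyJzdWIiOi.c2lnbmF0dXJl",
    "aws_key=AKIAIOSFODNN7EXAMPLE",
    "card=4111 1111 1111 1111",
])
masked = mask(log)
print(masked)
```

Pattern rules like these are why masking is best-effort. A secret in an unexpected format passes through, and a harmless 13-digit number is redacted as if it were a card number.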
Who runs the analysis, and is my data used for training?
We name the AI provider, because you should know where your failure logs go. Analysis is performed by Anthropic (United States), our AI sub-processor, under their standard Data Processing Addendum with EU Standard Contractual Clauses. Anthropic does not use data sent through their commercial API to train their models, and inputs and outputs are deleted within 30 days by default. The generated analysis is stored with the test on TestingBot and removed when the test is pruned, in line with our standard 30-day retention of logs and assets.
AI Insights is off by default. Because logs go to a third-party AI provider, the account owner has to opt in explicitly and can disable it again at any time, which immediately stops any further data being sent. Anthropic is listed as a sub-processor in the TestingBot Trust Center.
Can it tell a real bug from a flaky test or an environment issue?
Yes, that classification is one of the core outputs. Every analysis assigns a likely owner so you know which team or area to route the failure to. When confidence is low, the owner is shown as a tentative best guess rather than a firm verdict, which is exactly when you want the tool to say so.
| Likely owner | What it means | Typical signals |
|---|---|---|
| Application | A genuine defect in the app under test. | HTTP 500s, JavaScript errors, a broken page state. |
| Test script | An issue in the test itself. | Stale or fragile locators, a missing wait, a wrong assertion. |
| Environment | Something outside the app and the test. | Network errors, a down dependency, capacity or capability issues. |
| Flaky | Non-deterministic, likely to pass on retry. | Timing races, async waits, intermittent visual differences. |
| Unknown | Not enough evidence to be sure. | Sparse logs, ambiguous termination, low confidence. |
AI Insights is an assistant, not an oracle. It is reliable for clear-cut failures such as assertion errors, element-not-found, timeouts and HTTP errors. It is less certain for ambiguous cases such as flaky timing or visual diffs, where it tells you the confidence is low. Always verify before acting on a suggestion.
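If you want to route failures by owner in your own tooling, the output described above could be modelled roughly as follows. The field names, the `Verdict` class and the `LOW_CONFIDENCE` threshold are hypothetical; the post does not publish a schema or say where the low-confidence cut-off lies.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Owner(Enum):
    APPLICATION = "Application"
    TEST_SCRIPT = "Test script"
    ENVIRONMENT = "Environment"
    FLAKY = "Flaky"
    UNKNOWN = "Unknown"


LOW_CONFIDENCE = 50  # assumed threshold, for illustration only


@dataclass
class Verdict:
    summary: str
    root_cause: str
    owner: Owner
    confidence: int  # 0 to 100
    timeline: List[str] = field(default_factory=list)
    evidence: List[str] = field(default_factory=list)
    suggested_fixes: List[str] = field(default_factory=list)  # ranked, one to three

    def owner_label(self) -> str:
        # A low-confidence owner is presented as a tentative best guess.
        if self.confidence < LOW_CONFIDENCE:
            return f"{self.owner.value} (tentative)"
        return self.owner.value


clear = Verdict("Checkout button not found", "Missing explicit wait",
                Owner.TEST_SCRIPT, confidence=88)
unsure = Verdict("Intermittent visual difference", "Possible timing race",
                 Owner.FLAKY, confidence=35)
print(clear.owner_label())
print(unsure.owner_label())
```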
What does an analysis actually look like?
Here is an illustrative example for a common Selenium failure: a checkout button that could not be located because the test moved on before the cart finished loading. This is a constructed sample to show the shape of the output, not a real customer test.
| Section | Example output |
|---|---|
| Summary | The test tried to click the checkout button on the cart page, but the element was not yet present in the DOM when the lookup ran, so the locator failed before the button appeared. |
| Root cause | A NoSuchElementException on #checkout-btn, because the cart was still loading via XHR when the lookup ran. The element appeared roughly 1.2s after the lookup was issued. |
| Likely owner | Test script (a missing explicit wait). |
| Confidence | High (88 / 100). |
| Timeline | Navigate to /cart, click "Update quantity", an XHR to /cart.json starts, the test issues findElement on #checkout-btn, element not present, the retry loop exhausts the implicit wait, the command fails. |
| Evidence | 14:02:11 findElement css=#checkout-btn -> no such element followed by 14:02:12 DOM mutation: button#checkout-btn added. |
| Suggested fix | Wait for the element explicitly with WebDriverWait on presenceOfElementLocated then elementToBeClickable(#checkout-btn) after the cart XHR resolves, rather than looking it up immediately after the quantity update. |
AI Insights turns minutes of log reading into a few seconds spent reading a verdict you can act on, backed by the exact log lines that justify it, instead of a raw transcript you have to interpret yourself.
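The suggested fix in the sample is an explicit wait: poll for a condition until it holds or a timeout expires, instead of looking the element up once. The sketch below shows that mechanism without a browser. `wait_until` and `FakeCartPage` are hypothetical stand-ins, not Selenium APIs; the fake page simply makes the button appear after a delay, the way the cart XHR does in the sample.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def wait_until(condition: Callable[[], Optional[T]],
               timeout: float = 10.0, poll: float = 0.1) -> T:
    """Call condition() repeatedly until it returns a truthy value.

    Raises TimeoutError if the condition is still falsy after `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout}s")
        time.sleep(poll)


class FakeCartPage:
    """Stand-in page: the checkout button exists only after `ready_after` seconds."""

    def __init__(self, ready_after: float) -> None:
        self._ready_at = time.monotonic() + ready_after

    def find(self, selector: str) -> Optional[str]:
        if selector == "#checkout-btn" and time.monotonic() >= self._ready_at:
            return "button#checkout-btn"
        return None


page = FakeCartPage(ready_after=0.3)
immediate = page.find("#checkout-btn")  # the failing pattern: a single early lookup
button = wait_until(lambda: page.find("#checkout-btn"), timeout=2.0, poll=0.05)
print(immediate, button)
```

In a real Selenium test with the Python bindings, the same idea is usually written as `WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "#checkout-btn")))`.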
How does it compare to other cloud-grid failure tools?
Most major cloud testing platforms now offer some form of AI failure analysis, and the broad idea is similar across all of them: read the logs, explain the failure, suggest a fix. Where TestingBot differs is in the things teams usually only discover after they adopt a tool: what happens to your data, how many frameworks are covered without bolt-on products, and how the cost behaves. The comparison below is framed around dimensions where we are deliberately strong. These contrasts reflect what the major vendors publish on their own product and documentation pages as of mid-2026, and we have deliberately not reproduced their numbers or named individual tools.
| Dimension | TestingBot AI Insights | Typical cloud-grid AI failure tools |
|---|---|---|
| Secret and PII masking | Explicit masking of tokens, passwords, headers, private keys, email addresses and card-number-shaped values before data leaves our servers, documented field by field. | Often unstated on feature pages, or limited to a general "we do not train on your data" line. |
| Model transparency | Provider named (Anthropic, United States), under a DPA with EU Standard Contractual Clauses and listed as a sub-processor in the Trust Center. | Some platforms do not name the LLM vendor at all; others disclose it but route data to several external AI providers. |
| First-failure analysis | Analyzes a single failed run, with no history of repeat failures required. | Some statistical approaches need several failures of the same test before patterns form. |
| Framework breadth | Selenium, Appium, Playwright, Puppeteer, Cypress and the native mobile runners Espresso, XCUITest and Maestro, plus codeless AI tests, on one platform. | Some ML failure features cover only a subset of frameworks, exclude native mobile apps, or require a specific SDK. |
| Cost model | Runs on demand from the AI Analysis tab, results cached for a free reopen, bounded by per-account daily and monthly limits. | Frequently gated behind enterprise sales or metered quotas, with limited public pricing. |
| Honesty about limits | States a confidence score on every result; low-confidence verdicts are flagged as a best guess up front, with an explicit reminder to verify before acting. | Marketing tends to lead with headline speed numbers; an up-front, per-result confidence score and an explicit "verify before acting" statement are uncommon on the feature pages themselves. |
AI Insights plugs directly into the test results, logs and environments you already run on TestingBot, alongside codeless AI test creation and AI Chat (natural-language browser control). It is one feature in our wider set of AI testing features, not a separate product you have to bolt on.
Which frameworks and platforms does it work with?
AI failure analysis works across our full automation stack, so you do not need a different debugging tool per framework. For web and Appium tests it reads the command log and driver logs; for native mobile runners it reads the test results, stacktraces and device log; and for Maestro it also reads the flow definition.
- Web: Selenium WebDriver testing, Playwright testing, Cypress testing and Puppeteer.
- Mobile: Appium mobile testing, Espresso (Android) testing, XCUITest (iOS) testing and Maestro mobile UI testing.
- Codeless: AI-generated tests created from natural-language intent.
What are the limits of AI test failure analysis?
A few practical boundaries keep the feature predictable and the cost bounded. There are per-account daily and monthly limits. Tests older than 30 days cannot be analyzed because their logs have already been pruned, and tests that are still running cannot be analyzed until they complete. Results are cached, so reopening an analyzed test is instant and free.
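These boundaries amount to a short eligibility check. The sketch below is one way to express them; the function name, its parameters and the limit values in the usage lines are assumptions, since the post does not state the actual per-account limits.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

RETENTION = timedelta(days=30)  # the post states logs are pruned after 30 days


def blocked_reason(
    finished_at: Optional[datetime],  # None while the test is still running
    now: datetime,
    used_today: int,
    daily_limit: int,
    used_this_month: int,
    monthly_limit: int,
) -> Optional[str]:
    """Return why a test cannot be analyzed, or None if it can be."""
    if finished_at is None:
        return "test is still running"
    if now - finished_at > RETENTION:
        return "logs already pruned"
    if used_today >= daily_limit:
        return "daily limit reached"
    if used_this_month >= monthly_limit:
        return "monthly limit reached"
    return None


now = datetime(2026, 6, 30, tzinfo=timezone.utc)
print(blocked_reason(now - timedelta(days=2), now, 3, 20, 40, 200))
print(blocked_reason(now - timedelta(days=45), now, 3, 20, 40, 200))
```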
How do I enable AI test failure analysis on TestingBot?
Because your logs are sent to a third-party AI provider, enabling AI Insights is an account-level decision that the owner makes once.
- Sign in as the account owner and open Account Settings.
- In the AI Analysis section, read what is sent, tick the consent checkbox and save.
- The AI Analysis tab now appears on your test detail pages. Open any failed test and it analyzes on first view, streaming in section by section.
Try it on your next red build
Turn on AI Analysis in Account Settings, open a failed test, and read the verdict instead of the log. Full setup, the complete data list and the FAQ are in the TestingBot AI Insights documentation.
Frequently asked questions
What is AI test failure analysis?
AI test failure analysis uses an AI model to automatically read a failed test's logs, status messages and stacktraces, identify the most likely root cause, and classify the failure as an application bug, a test-script problem, an environment issue or a flaky test. TestingBot AI Insights does this for Selenium, Appium, Playwright, Puppeteer, Cypress, Espresso, XCUITest and Maestro tests, returning a summary, root cause, likely owner, confidence score, timeline, evidence and a suggested fix.
How does AI find the root cause of a test failure?
It collects a curated slice of the test (the failing step and its neighbours, the relevant tail of the driver or device log, native runner results and stacktraces, plus status and environment), masks anything sensitive, sends it to the AI model once, and returns a structured explanation. The result includes a ranked set of one to three suggested fixes and a confidence score, and is cached with the test so reopening it is instant.
What data does AI use to analyze a failed test?
AI Insights sends a small, text-only slice: the failing step and its neighbours, a filtered tail of the most relevant log (Selenium, Appium, Playwright, logcat or iOS), native runner stacktraces, the Maestro flow YAML, the test status, message and termination reason, the environment, and any codeless test intent. No screenshots or video are sent in this version. Detectable secrets and personal data are masked before anything leaves TestingBot servers. The exact list is in the "what data is sent" section of the documentation.
Can AI tell the difference between a real bug, a flaky test and an environment issue?
Yes. Every analysis assigns a likely owner: application, test script, environment, flaky or unknown. This classification is the part that tells you who should look at the failure. When the evidence is ambiguous, for example a timing race or a visual difference, the confidence score drops and the owner is shown as a tentative best guess rather than a firm verdict.
Does it work with Selenium, Cypress, Playwright and Appium?
Yes. AI Insights works with Selenium, Appium, Playwright, Puppeteer and Cypress, as well as the native mobile runners Espresso, XCUITest and Maestro, and codeless AI tests. For web and Appium tests it reads the command log and driver logs; for native runners it reads the test results, stacktraces and device log; and for Maestro it also reads the flow definition.
Why did my Selenium test fail when nothing in my code changed?
A test can fail without a code change because of timing, the environment or a real defect that just surfaced. Common causes are an async update that the test did not wait for (a race condition), a fragile locator that the app's markup moved, a slow or down dependency, or an actual application bug exposed by new data. AI Insights inspects the failing run's logs and tells you which of these is most likely, so you do not have to guess whether it is flaky or a genuine regression.
How is AI test failure analysis different from a normal test report?
A normal test report tells you that a test failed and gives you the raw artifacts: logs, a stack trace, maybe a screenshot. You still have to read them and infer the cause. AI test failure analysis interprets those artifacts for you, stating the most likely root cause in plain language, classifying the likely owner, citing the specific log lines as evidence, and suggesting a fix, with a confidence score that tells you how much to trust the verdict.
Is my data used to train the AI model?
No. Analysis is performed by Anthropic in the United States, our AI sub-processor, under their Data Processing Addendum with EU Standard Contractual Clauses. Anthropic does not use data sent through their commercial API to train their models, and deletes inputs and outputs within 30 days. The feature is off until the account owner opts in, and Anthropic is listed as a sub-processor in the TestingBot Trust Center.