My company has been going through a turbulent experience over the past three months with integrating Agentic AI into our automation testing process. What we want to describe here is how far the testing process can realistically be automated with Agentic AI. Frankly, at the beginning we were hesitant and lacked confidence in our ability to drive this transition, particularly the migration from traditional, manually scripted automation to a workflow that is largely generated or assisted by AI.

However, over time we managed to reach a point where nearly 90% of our UI automation testing scripts were generated with AI assistance. We ran several rounds of research and proof-of-concept experiments before ultimately deciding to adopt tools such as OpenCode and Claude, with Playwright as our primary testing framework. These have effectively become our de facto tools for AI-assisted automation testing.
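
For context, most of what the AI produces for us looks like an ordinary Playwright spec. Below is a minimal, hypothetical sketch in TypeScript of the kind of script we end up reviewing; the URL, labels, and flow are placeholders, not our actual product.

```typescript
// Hypothetical example of an AI-generated Playwright UI test.
// The URL, labels, and credentials below are illustrative placeholders.
import { test, expect } from '@playwright/test';

test('user can log in with valid credentials', async ({ page }) => {
  await page.goto('https://staging.example.com/login');

  // We ask the AI to prefer role/label-based locators over brittle CSS selectors.
  await page.getByLabel('Email').fill('qa-user@example.com');
  await page.getByLabel('Password').fill('not-a-real-password');
  await page.getByRole('button', { name: 'Sign in' }).click();

  // Assert the user lands on the dashboard after login.
  await expect(page).toHaveURL(/dashboard/);
  await expect(page.getByRole('heading', { name: 'Dashboard' })).toBeVisible();
});
```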

That being said, our primary focus was not on automating ongoing or upcoming features immediately. Instead, we took a gradual approach and first rewrote and migrated our existing automation scripts. Once that process proved feasible, we began bringing recently deployed features into our automation pipeline.

After working this way for an entire quarter, the general consensus across the company, especially within the Quality Engineering team, was that AI still has a long way to go before it can perform the kind of agentic testing we envisioned. At its current stage, it cannot reliably transform product requirement documents (PRDs) into proper, production-ready test cases automatically, even when provided with well-structured context and carefully designed skills or sets of instructions.

We have concluded that a human-in-the-loop (HiTL) process remains essential. AI can generate the initial automation scripts, but they still require thorough and rigorous manual review¹. From a technical standpoint, we initially built our setup within a single repository for one team to evaluate its effectiveness. After observing positive results, we gradually rolled it out to other teams by providing them with scaffolded projects containing predefined contextual information documented in files such as Agents.md and Skills.md.
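
For a rough idea of what that scaffolding looks like, here is a stripped-down, hypothetical excerpt of an Agents.md from one of these projects; the sections and wording are illustrative, not our actual file.

```markdown
# Agents.md (excerpt)

## Product context
- Web app under test: <product name>, staging environment only
- Primary user flows: login, search, checkout

## Conventions
- Framework: Playwright with TypeScript; specs live under tests/ui/
- Prefer role- and test-id-based locators; never use raw XPath
- Keep specs independent and idempotent; one file per user flow

## Review rules
- Every generated spec goes through human review before merge
```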

At this point, the system has been implemented across four different teams, with the entire Quality Engineering organization actively contributing to and improving the shared framework.

The biggest challenges during the setup process, and even after we reached a usable state, were:

  • managing the AI's contextual understanding of the product or feature being tested
  • the limitations imposed by token constraints
  • and last but not least… the impossible requests and expectations from higher-ups about what the AI should be able to do HAHAHA

I’m not sure whether this reflects a skill issue on our side or limitations in the AI itself, but whenever the models encounter flaky tests or failed test cases during development, they eventually begin to “mumble around” the problem and overcomplicate the solution. What surprises me the most is that this rarely happens on the product/backend/frontend development side, but it happens frequently on the testing side.

This happens even though we have established strict rules such as:

  • instructing the AI to enter planning mode first
  • providing precise UI elements for interaction (see the sketch after this list)
  • and explicitly directing it to avoid unnecessary complexity
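
To illustrate the second rule: we keep the “precise UI elements” in small page-object style modules and tell the AI to use only those locators. A simplified, hypothetical sketch follows; the class name, flows, and test IDs are illustrative, not our real codebase.

```typescript
// Hypothetical page object handed to the AI so it reuses known-good locators
// instead of inventing its own selectors. Names and test IDs are illustrative.
import { type Page, type Locator } from '@playwright/test';

export class CheckoutPage {
  readonly page: Page;
  readonly addToCartButton: Locator;
  readonly cartBadge: Locator;
  readonly checkoutButton: Locator;

  constructor(page: Page) {
    this.page = page;
    // Stable, data-testid based locators agreed with the frontend team.
    this.addToCartButton = page.getByTestId('add-to-cart');
    this.cartBadge = page.getByTestId('cart-badge');
    this.checkoutButton = page.getByTestId('checkout');
  }

  async addFirstItemToCart() {
    await this.addToCartButton.first().click();
  }
}
```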

Regardless of whether we use the most advanced models available or try different approaches, the outcome often deteriorates in the same way after a certain point.

A few questions that probably never get asked:

  • How about API or integration testing? Our API testing framework is rarely generated with AI assistance, since it involves a lot of technical detail and requires a deeper understanding of our distributed system architecture, which is something the AI still struggles to grasp². The existing API testing platform we built in-house is also kind of an overlapping mix of deterministic/non-deterministic testing, database testing, probabilistic tests, fuzzing, contract testing, and so on, so we decided to keep it as it is for now
  • What was the impact after implementing AI-assisted automation testing? Honestly, we haven’t seen a significant improvement except in terms of test coverage and the number of test cases generated. The quality of the generated scripts and their results, however, is still not up to the mark and remains questionable. On average, across the four teams, we have observed a 30-40% increase in flaky tests and failed test cases (on a weekly basis) after implementing AI-assisted automation testing, which is quite concerning for us…
  • How about UI automation tests on mobile apps? We decided to adopt AI assistance for this last year with the help of Maestro MCP, but the scripts generated are roughly a 50-50 split between human-written and AI-assisted. To be frank, we don’t have any complaints about this: mobile app testing is more complex and has a lot of confirmation layers, so we think the current state of AI-assisted automation testing is quite good for mobile

Verdict: 6/10, with a lot of room for improvement

Footnotes

  1. In a way, this has also backfired on us, since we have to spend more time reviewing all the s*** from the AI and fixing the scripts it generates, which is ironically more time-consuming than writing the scripts manually in the first place. To be frank, we also have an internal AI-assisted code review system that we built in-house, but surprisingly it suffers from the same issue… again, perhaps this is also a skill issue on our side

  2. I’m not trying to say that the latest AI models are bad at understanding complex technical details, but in my experience so far they still struggle to generate proper results and fall far short of what I’m expecting. They were surprisingly good at generating broader edge cases and scenarios, but the output also has a lot of loopholes and disjointed test scenarios that are not really relevant