I’ve been investigating some flaky tests on one of the projects I worked on. I wanted to make sure I’d got them all, and what better way than getting a computer to check. We upload our test results for every build step as (JUnit) XML artifacts, so it wasn’t too hard to use the Buildkite API and pybuildkite to write a Jupyter notebook that examines each of them to find flaky tests in two ways:
- failures within a build that passed, meaning that a human retried some of the steps and it eventually worked, without the code changing
- multiple builds of a single commit, where some builds passed and some failed (this works best if the configuration between builds isn’t too different)
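The core of both checks is the same idea: collect the set of failing tests from each JUnit XML run, and flag any test that both failed and passed across runs of the same code. A minimal standard-library sketch of that comparison, with made-up test names and inline XML standing in for downloaded artifacts (this isn’t the notebook’s exact code):

```python
import xml.etree.ElementTree as ET


def all_tests(junit_xml: str) -> set:
    """All test case names in a JUnit XML report, as 'classname.name'."""
    root = ET.fromstring(junit_xml)
    return {f"{c.get('classname')}.{c.get('name')}" for c in root.iter("testcase")}


def failed_tests(junit_xml: str) -> set:
    """Test cases containing a <failure> or <error> child element."""
    root = ET.fromstring(junit_xml)
    return {
        f"{c.get('classname')}.{c.get('name')}"
        for c in root.iter("testcase")
        if c.find("failure") is not None or c.find("error") is not None
    }


def flaky_tests(runs: list) -> set:
    """Tests that failed in at least one run but passed in at least one other."""
    ever_failed, ever_passed = set(), set()
    for xml in runs:
        failed = failed_tests(xml)
        ever_failed |= failed
        ever_passed |= all_tests(xml) - failed
    return ever_failed & ever_passed


# Two runs of the same code: 't.flaky' fails in the first, passes in the second.
run_a = """<testsuite>
  <testcase classname="t" name="stable"/>
  <testcase classname="t" name="flaky"><failure message="boom"/></testcase>
</testsuite>"""
run_b = """<testsuite>
  <testcase classname="t" name="stable"/>
  <testcase classname="t" name="flaky"/>
</testsuite>"""

print(flaky_tests([run_a, run_b]))  # → {'t.flaky'}
```

The same function covers both cases above: for retried steps, `runs` is the reports from one build; for the multiple-builds case, it’s the reports from every build of a commit.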
I’ve uploaded the notebook at: https://gist.github.com/huonw/5b15172499251ce88ac42a6a926e6162 including example output from https://github.com/stellargraph/stellargraph. (It may need edits to work with other projects or variants of JUnit XML.)
To run it on another project, the notebook can be:
- downloaded via “Raw”: https://gist.githubusercontent.com/huonw/5b15172499251ce88ac42a6a926e6162/raw/9b335f7a8b5f04e8a16f15bceaa63697065d41fa/flaky%20tests.ipynb
- run online in Google Colab: https://colab.research.google.com/gist/huonw/5b15172499251ce88ac42a6a926e6162/flaky-tests.ipynb
Hope this helps someone!