Videos provided by OpenStack Summit via OpenStack Foundation YouTube Channel
During the Havana release cycle we discovered that Tempest was
getting comprehensive enough to expose interesting timing problems
in OpenStack through the OpenStack gate. Developers were used to
calling these "flaky tests" and ignoring the negative results;
however, we saw a pattern emerge where the same failure could be
seen multiple times.
These "statistical failures", where a given scenario will fail 1%
of the time, become real issues when you end up with 60+ of them in
the code base, and when you create 30,000 clouds per week.
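To make the scale concrete, here is a rough back-of-the-envelope sketch. It assumes (these are simplifications, not figures from the talk) that the 60 races are independent, that each fires 1% of the time, that every run exercises all of them, and that each of the 30,000 clouds corresponds to one test run. The function name `p_any_failure` is hypothetical.

```python
def p_any_failure(n_races: int, p_each: float) -> float:
    """Probability that at least one of n independent races fires in one run."""
    # A run is clean only if every race stays quiet: (1 - p_each) ** n_races.
    return 1.0 - (1.0 - p_each) ** n_races


# Assumed inputs taken from the numbers quoted above.
N_RACES = 60            # distinct low-percentage failures in the code base
P_EACH = 0.01           # each one fails about 1% of the time
RUNS_PER_WEEK = 30_000  # assumes one test run per cloud created

p_run = p_any_failure(N_RACES, P_EACH)
affected_runs = RUNS_PER_WEEK * p_run

print(f"{p_run:.1%} of runs hit at least one race")
print(f"about {affected_runs:,.0f} affected runs per week")
```

Under those assumptions, a bit under half of all runs would hit at least one spurious failure, which is why individually rare races stop being ignorable.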
We believed we had a couple of interesting race conditions to nail
down, and started building a system based on Elasticsearch to be
able to automatically identify these things. This system first
started reporting back data to developers at the very end of the
This talk will discuss the whole problem space of finding
low-percentage failures in the code base: the toolchain we build
upon for the problem, the fingerprinting and reporting approach that
we use to help the OpenStack development community prioritize these
issues, how this is informing our thinking about the OpenStack
gating system, and where we are headed in the future. Because this
whole toolchain is open source, it's something we expect others
might extend to their own projects as well.