Chaos Testing: Your Software is Broken
What’s the problem?
Any software development team understands that testing is crucial to a successful release. Finding critical bugs in the first few days of a major release is embarrassing, greatly reduces the confidence users have in the release, and delays its adoption. Delayed adoption leaves users exposed to security vulnerabilities and known bugs that the new release fixes, and without the features that make both users’ and developers’ lives easier.
To combat releasing broken versions of software, we’ve designed a testing process that we lovingly call “Chaos Testing”.
What is Chaos Testing?
Chaos testing starts with the assumption that your software is broken. It is not designed around software passing a release checklist; instead, a successful chaos test is one that exposes bugs. To pass chaos testing, the team must uncover a certain number of bugs, and that number should be determined by your development team.
Chaos testing is free form. There are no testing manuals or release checklists. Manuals and checklists encourage a very specific and repeated usage pattern of the software. If you use the software in the same way every time, you’re only testing that the software works in the way the developers want it to work.
When Should We Chaos Test?
Chaos testing should occur for every major and minor release. Hotfix releases can be exempted from chaos testing, as the goal of a hotfix is to patch a broken release quickly.
Chaos testing is meant to reduce the need for hotfixes.
Before starting chaos testing, run through the release checklist once. Executing the release checklist first validates that the new features work and that no obvious regressions have been introduced.
There is no point in beginning chaos testing if the software cannot pass a release checklist. We want chaos testing to uncover bugs that would not be found during a release checklist.
Chaos Testing – Demo Day
Chaos testing always begins with a demonstration day. Invite all stakeholders and leave an open invitation to anyone else interested. Developers should take turns demonstrating the functionality they personally worked on.
Demo day offers two major benefits:
- Gives stakeholders and other developers a chance to view and critique work they might not otherwise have seen.
- Exposes potential weaknesses in the code.
Let’s consider that last point. When software developers demonstrate their software, they always have a list of steps they execute in a very specific order. During the demonstration, the developer is always secretly nervous that some aspect of the demo will fail or that someone watching will ask them to do something they’re unsure about. Those nervous moments point to prime candidates for chaos testing.
As a witness to the demo, it is your job to pay attention to these moments and suggest that the demonstrator deviate from the script to expose new bugs. In addition, demonstrators should take note of the parts of the demonstration that made them nervous and prioritize those parts in their own chaos testing.
By the end of demo day, a list of notes and bugs should have been recorded. Finish the day by reviewing the list and creating new tickets in your issue tracking system.
After demo day has occurred, it is time to begin formal chaos testing. Again, chaos testing is free form. Checklists and testing manuals should be discarded.
While testing, developers should consider the following quote:
| “Programming today is a race between software engineers striving to build bigger and better idiot-proof programs, and the Universe trying to produce bigger and better idiots. So far, the Universe is winning.”
― Rick Cook, The Wizardry Compiled
Our goal during testing is to become the bigger and better idiot. That means completely abusing the software in any way we can imagine. Some suggestions:
- What happens if I repeatedly click on a button 1000 times?
- What happens if I press every key on the keyboard in a text field?
- What happens if I generate completely unreasonable configurations?
- What happens if external hardware disconnects/connects rapidly?
- What happens if I disconnect/connect the internet constantly?
- What happens if the connection is terrible and I’m seeing major packet loss?
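Some of this abuse can be scripted to supplement (not replace) hands-on chaos testing. The following is a minimal sketch of the "press every key in a text field" idea: it throws random printable strings at a function and records every input that raises. The names `handle_text_input` and `abuse` are hypothetical, and `handle_text_input` is a stand-in with a deliberately planted bug, not real product code.

```python
import random
import string


def handle_text_input(text: str) -> str:
    """Hypothetical stand-in for the code behind a text field.

    Planted bug for illustration: it assumes the input is never
    empty once surrounding whitespace has been stripped.
    """
    cleaned = text.strip()
    return cleaned[0].upper() + cleaned[1:]


def abuse(target, attempts: int = 1000, seed: int = 0):
    """Call target with random printable strings; return the inputs that raised."""
    rng = random.Random(seed)  # fixed seed so any failure can be reproduced
    failures = []
    for _ in range(attempts):
        length = rng.randint(0, 20)
        text = "".join(rng.choice(string.printable) for _ in range(length))
        try:
            target(text)
        except Exception as exc:  # any exception is a candidate bug report
            failures.append((text, type(exc).__name__))
    return failures


failures = abuse(handle_text_input)
print(f"{len(failures)} failing inputs found")
```

Each entry in `failures` pairs the offending input with the exception type, which is enough to open a ticket and reproduce the problem later.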
The sky is the limit when chaos testing. Do anything you can to break the software. It does not matter if you’re 100% certain no user would ever take certain actions, because you’re wrong. They will.
If you want to focus your efforts, think back to demonstration day. What made you nervous during your demo? What made the other presenters nervous? These are areas to focus on.
Chaos testing should last a minimum of two days but will NOT end until a certain number of bugs have been found (as defined by your team). Your software IS broken. There ARE bugs. If you’re not finding them, chaos testing continues.
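The exit rule above combines two conditions, and both must hold. As a small sketch (the function and parameter names are hypothetical, and `bug_target` is whatever number your team has agreed on):

```python
def chaos_testing_complete(days_elapsed: float, bugs_found: int,
                           bug_target: int, min_days: float = 2) -> bool:
    """Return True only when the minimum duration AND the bug target are both met."""
    return days_elapsed >= min_days and bugs_found >= bug_target
```

For example, with a team target of 5 bugs, hitting 10 bugs on day one does not end testing, and neither does reaching day three with only 2 bugs found.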
Once you’ve completed the chaos testing, it is time to move into prioritization. Not every bug the team finds is going to be critical or software breaking. Some will be minor visual bugs; others will have near zero risk to the user.
As a team, define which bugs are critical and which are minor. All critical bugs should be fixed immediately (delaying the release if necessary). Minor bugs can be fixed prior to release or moved to a future release as determined by the team.
After demonstration day, chaos testing, and prioritization have all completed, your team should begin executing release checklists and testing manuals. If the software fails the release checklist, fix the issues, and run the entire checklist again. Repeat until the release checklist passes.
Remember, developers should not be executing checklist items that apply to things they worked on.
Chaos testing has dramatically improved the quality of our software and firmware releases and has greatly reduced the number of issues our customers discover. However, it is important to remember the chaos testing creed: your software is broken. Even though you made it through chaos testing and released a new version of your software, it is still broken, and you should still expect bug reports to come in.