Failure Analysis: Is it Important?
Yes.
When I was a kid, my dream job was to be an investigator. That’s not actually true, but I feel like that’s a pretty good opening line for a blog post about failure analysis. I was more of a Space Jam kid and wanted to be a basketball player when I grew up because – and I remember this very distinctly – “they can jump high.”
I am currently in my late 20s, I have not signed any million-dollar contracts, and I am not best friends with Michael Jordan (MJ, if you’re reading this: hmu, we can get tacos).
And I can’t jump high.
Not a lot of investigative work to do there.
However, the design and development of any product often involves failures that necessitate a fair amount of investigative effort into what failed, what caused it to fail, and how to address why it failed. In general, the most difficult part of a failure analysis (FA) is in determining the root cause of a failure; once the cause is known, corrective actions can be taken to improve the design. Therein lies the value of failure analysis: A successful FA identifies weaknesses and prescribes improvements that result in a more robust design.
It might be tempting to think that having to do an FA is a bad thing and, in a sense, it is: There would be no need for an analysis if there were never a failure to begin with. However, in practice it is nearly impossible for an initial design to accurately capture all conditions and address all potential modes of failure; especially in the early stages of development, technology that is totally free from some kind of failure due to design oversights is about as rare as a steak that is cooked very rare[1].
With the correct mindset, having to perform an FA is not necessarily a bad thing – it becomes an invaluable asset in developing technology and taking it from being a proof of concept to being a reliable tool.
Unfortunately, unlike the case of “who murdered Nathan’s childhood dreams,” the investigations into tool failures are generally more involved than simply trying to identify a specific flavor of Blue Bell (it was Cookie Two Step). Depending on the failure, it can sometimes be a straightforward analysis: High levels of shock and vibration caused a wire to break, which caused the electronics to lose power; Solution: Add mechanical reinforcement to the wire.
Other times, the root cause of the failure is more complex: Event A caused Events B and C which, in the presence of X, could result in Event D, which contributed to the failure, but was not solely responsible for it; Solution: It’s complicated, cue Avril.
Regardless of the complexity of a given failure, there are a few elements that are common to many (not all) failure analyses. The list that follows is a summary of some lessons I learned in the early days of my NBA (Now it’s Broken Analysis) career:
- Understand the Failure – This sounds like it should be obvious; however, as a system gets more complicated and as the number of different people handling the system increases, the details of what exactly happened can quickly become murky.
In an ideal world, when a component in a system fails, detailed notes are taken on the symptoms of the failure, the conditions and circumstances around when the failure occurred, and any other relevant information; then, the component in question is isolated and returned to the engineer who is to perform the analysis in a controlled environment.
Generally, the symptoms of a failure are fairly apparent; unfortunately, however, notes on the circumstances (how the tool was being used, where it was being used, whether there were any unexpected environmental factors present, etc) are often vague or even unknown. It becomes part of the analysis to determine what the tool was exposed to that may or may not have contributed to the failure. Neglecting to explore the context of the failure can often result in a loss of valuable information, an incomplete analysis, or sometimes even a cold case.
- Characterize the Failed Component – When a component fails in the field and is returned to be analyzed, one of the very first steps should be to document the state of the component upon reception. Is there any physical damage? Take pictures. What happens when you try powering it up? Write down what happens. Are all the power rails working as expected? It’s surprising how often something as simple as measuring the power rails leads to useful discoveries. Sometimes, if you’re lucky, the failure analysis starts and finishes in characterizing the component. Other times, the root cause is more elusive and requires additional effort to identify.
- Recreate the Failure – More often than not, recreation of the failure in the lab becomes the backbone of the analysis. Being able to consistently recreate a failure allows for behavior of the system to be analyzed in the moments leading to a failure, in the moments during a failure, and in the moments immediately following a failure. All that information is invaluable in identifying where the weak spots are and how to address them. Then, once the design is updated to address the weaknesses, it can be exposed to the conditions that originally caused the failure and an assessment can be made on the effectiveness of the corrective measures that were implemented.
There are, unfortunately, destructive failure modes that result in parts being damaged or destroyed. Those types of failures are more difficult to handle, especially in terms of lab recreation. That’s a topic for another day.
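For the software-inclined, here is a minimal sketch of what the "characterize the failed component" step could look like if you kept intake notes in code instead of on a napkin. Everything in it is made up for illustration: the `IntakeRecord` and `RailReading` names, the rail labels, the voltages, and the 5% tolerance are assumptions, not values from any real tool.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RailReading:
    name: str          # hypothetical rail label, e.g. "3V3"
    nominal_v: float   # voltage the rail is supposed to sit at
    measured_v: float  # voltage actually measured on the bench

    def in_tolerance(self, tolerance: float = 0.05) -> bool:
        """True if the measurement is within +/- tolerance of nominal."""
        return abs(self.measured_v - self.nominal_v) <= tolerance * self.nominal_v


@dataclass
class IntakeRecord:
    """State of a returned component, documented before anyone pokes at it."""
    component_id: str
    symptoms: str
    physical_damage: str = "none observed"
    photos: List[str] = field(default_factory=list)
    rails: List[RailReading] = field(default_factory=list)

    def suspect_rails(self, tolerance: float = 0.05) -> List[str]:
        """Names of rails whose measurement falls outside tolerance."""
        return [r.name for r in self.rails if not r.in_tolerance(tolerance)]


# Illustrative data only; the 5% tolerance is an assumed default.
record = IntakeRecord(
    component_id="unit-001",
    symptoms="no response on power-up",
    rails=[
        RailReading("5V0", 5.0, 4.98),
        RailReading("3V3", 3.3, 2.71),
        RailReading("1V8", 1.8, 1.81),
    ],
)
print(record.suspect_rails())  # ['3V3']
```

The point is not the code itself but the habit: write down what you saw on reception, and make the boring checks (like power rails) hard to skip.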
There are a lot of different ways that something can fail and what was presented here is just the tip of the NBA iceberg. The lessons-learned items that I gave above are what I consider to be three of the more important elements of failure analysis, but there is a lot more that could be said when it comes to identifying a problem, determining what caused it, and coming up with a solution to prevent it from happening again.
Performing a failure analysis is a bittersweet experience: On the one hand, you have to face the reality that design oversights were made that resulted in failure and, the more developed a product becomes, the less acceptable it is to have those failures. On the other hand, it’s through the failure analyses that weaknesses are identified and opportunities for design improvements are revisited; the analyses are how the product becomes developed. By keeping a positive attitude and thinking of failures as guided avenues for improvement rather than just “something that needs fixing,” you can develop your product into a real slam dunk! (MJ, I’m serious about those tacos.)
[1] I realize that the ‘very rare steak’ analogy is pretty bad – as far as analogies go, it falls short. Kind of like when I try to alley-oop dunk. But, no joke, when I was in high school, I did hit a half-court shot. It wasn’t during a game or anything like that, it was during one of those PE periods where pretty much the entire class was gone that day because they were doing an organized sport thing and the PE coach told the rest of us, “It’s a free day, spend the period doing something physically active.” So, I sat on the bench for like 5 minutes before he told me that I can’t just sit on the bench; that’s when I went full beast-mode and started shooting one air-ball after another. Eventually, after hundreds of attempts, I two-hand-tomahawked one from half-court and nailed it. First try. Coach witnessed it. He gave me a thumbs up and said, “try out for the team!” We grinned at each other, both of us knowing that there was no way in hell that I was going to try out for the team. That was my freshman year of high school. That was the day I peaked.