I enjoyed a time late in my corporate career when I had the opportunity to develop embedded software for an innovative beverage dispenser unit. This was a considerably different experience from developing web applications and ERP systems. When creating software for those systems, in most cases, I would never see the hardware the software would run on. For the beverage dispenser projects, however, the hardware was being designed at the same time as the software was being developed. Luckily for me, I had the opportunity to work side by side with the engineers developing the digital hardware.
Working with the systems engineers, I had the opportunity to learn how they did their jobs during the design phase. They employed a tool called FMEA, which stands for Failure Mode and Effects Analysis. FMEA is a highly structured technique, developed by reliability engineers in the late 1950s, to help understand the potential outcomes of malfunctions in hardware systems. Using FMEA techniques can improve a system's design by making it better able to tolerate component failure.
While FMEA was created for hardware systems, I believe its tools and techniques have broader applicability. For example, when designing cloud-based software applications, individual components can still fail. Leading cloud platform providers such as Amazon Web Services (AWS) have created flexible components with which to architect highly resilient and fault-tolerant solutions. In the spirit of flexibility, these components are optional: solution architects can choose whether to use them when designing solutions for customers. As with designing hardware systems, not every solution requires the same level of fault tolerance. The options offered by Amazon Web Services enable the solution architect to design a best-fit solution based on the customer's goals and project needs.
Let’s take a quick look at an example FMEA worksheet taken from https://en.wikipedia.org/wiki/Failure_mode_and_effects_analysis.
You can see that this chart represents an analysis of a failure mode of a brake manifold part on an aircraft. Many more details are available on the page from which this chart was taken. Be sure to look them over for a more in-depth understanding.
While this example is about an airplane part, the table is a deceptively simple-looking tool that can be used to analyze almost any kind of system, including a cloud-based software solution.
Let’s get a bit lighter and less sophisticated for a moment.
In the course of developing an FMEA for software subsystems, I had the opportunity to call the team together for working sessions. A large part of the sessions I participated in was spent brainstorming as a team. When it was my turn to lead a session, I sometimes needed to “lighten the mood” a bit and help people feel more relaxed about the process.
I would usually start by defining scope in a simple way with a Scope Definition Question: What is the part and/or subsystem for analysis?
Next, for the more open brainstorming, I would focus the team on just throwing out whatever they could think of in response to a question: “what are the bad things that could happen with this subsystem?” or “what could go wrong here?”
When things got really fun we would name a failure mode the OSC (Oh Shit Condition). This was something that, when it happened, made one think “Oh shit!”. I’m sure you can think of 10 things right away about the systems you work on where you might have said that. This sort of levity leads to creativity, or so I’ve found.
So, now that we know what FMEA is at a very high level, let us walk through an example.
Suppose we are designing a software solution in the cloud; for example, let us consider hosting a WordPress site on Amazon Web Services (AWS). The most basic WordPress solution is made up of a PHP application and a database. For the smallest of sites, we can put the WordPress PHP application and the database together on the same server. For sites with higher expected traffic or high availability requirements, we can extend the solution to additional servers.
So let us consider the web server as the subsystem for analysis and use the example FMEA table. I’ve altered the formatting of the table to make it easier to read.
What we can see from using this technique is a more detailed understanding of potential failures, the probability and impact of those failures and our potential opportunities for mitigating the effects of the failures. While this is an extraordinarily oversimplified example, the intention of this short article is to introduce the technique and offer the opportunity to consider it for application in the cloud solution architecture space as a way to design more resilient solutions.
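As a rough sketch of how a worksheet like this could be captured and sorted in code, the Python below models a few rows for the web server subsystem. Everything in it is an assumption made for illustration: the column names, the failure modes, the 1 to 10 ratings, and the mitigations are hypothetical and are not taken from the table above. The risk priority number (severity times occurrence times detection) is a common FMEA convention for ranking rows; other scoring schemes exist.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    """One row of a simplified FMEA worksheet (illustrative columns only)."""
    subsystem: str
    failure_mode: str
    effect: str
    severity: int    # 1 (negligible) to 10 (catastrophic)
    occurrence: int  # 1 (remote) to 10 (frequent)
    detection: int   # 1 (almost always detected) to 10 (practically undetectable)
    mitigation: str

    def rpn(self) -> int:
        """Risk priority number: severity x occurrence x detection."""
        return self.severity * self.occurrence * self.detection


# Hypothetical rows for the WordPress web server; every rating is invented.
worksheet = [
    FailureMode("web server", "instance becomes unreachable",
                "site is down for all visitors", 8, 3, 2,
                "load balancer with instances in two availability zones"),
    FailureMode("web server", "disk volume fills up",
                "uploads and updates fail", 5, 4, 6,
                "disk usage alarm and volume snapshots"),
    FailureMode("web server", "PHP process crashes under load",
                "intermittent errors for some visitors", 6, 4, 4,
                "health checks with automatic restart"),
]


def rank_by_risk(rows):
    """Return the rows ordered from highest to lowest risk priority number."""
    return sorted(rows, key=lambda row: row.rpn(), reverse=True)


for row in rank_by_risk(worksheet):
    print(f"{row.rpn():>4}  {row.failure_mode}: {row.mitigation}")
```

Sorting by the risk priority number is only a starting point for discussion; a team would normally review high-severity rows even when their computed number is low.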
Luckily for solution architects working on AWS, there is a multitude of opportunities to mitigate system failures caused by single component outages. Some of the solutions offered by AWS include Elastic Load Balancers, scaling across multiple availability zones, cross-region replication, redundant copies of data, volume snapshots and more. One could certainly use FMEA to help design fault-tolerant solutions on AWS, and use the data generated by FMEA to help customers understand the tradeoffs and risks involved in choosing or declining fault-tolerant options.
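To make the tradeoff conversation a little more concrete, here is a minimal sketch of the arithmetic behind redundancy. It assumes that the redundant copies fail independently of one another, which real deployments only approximate, and the 99% figure is a made-up input rather than a number published by AWS. The function names are hypothetical.

```python
HOURS_PER_YEAR = 24 * 365


def combined_availability(single: float, copies: int) -> float:
    """Availability of `copies` redundant components, assuming independent failures.

    The group is unavailable only when every copy is down at the same time.
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1.0 - (1.0 - single) ** copies


def expected_downtime_hours(availability: float) -> float:
    """Expected hours of downtime per year at the given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


# Hypothetical comparison: one web server versus two behind a load balancer.
for copies in (1, 2):
    availability = combined_availability(0.99, copies)
    hours = expected_downtime_hours(availability)
    print(f"{copies} server(s): {availability:.4%} available, about {hours:.1f} h/year down")
```

Numbers like these can sit next to the cost of the extra server when a customer is deciding whether a fault-tolerant option is worth it.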
If this technique sounds interesting to you, I’d recommend learning more by reading the extremely useful Wikipedia article from which the first example table was taken. You can also learn more about the options offered by AWS by checking out the information available at the links below.