Is it just me, or does it seem like things have a tendency to break on the weekend when the parts and service repair companies are closed? This past weekend, as temperatures outside nudged up above 100°F, a strange electrical odor started filling the house. By evening, it was starting to get warm inside. I did a quick check of the air conditioning system. The condenser unit outside was running, but the inside blower was not working. Oh great! Why is the blower not working? Turning “fan only” mode on didn’t fix it. I did some quick investigation on our system and discovered we have an ECM blower motor. ECM, which stands for electronically commutated motor, means that it is a microprocessor-controlled motor that is designed to optimize efficiency. It is very cool! I have to admit, by this point I was distracted by the tech and not focused on the restoration of service. With the warming room and questions from the rest of my family, I was reminded to go back into incident management mode and focus on addressing the outage.
I took apart the unit and began testing the control circuit and all the voltages. The main 110V feed was reaching the motor. The ECM 24V control lines that program the motor speed were also good. The air handler control system seemed solid, so that meant it had to be the motor itself, and very likely the microcontroller that runs the motor. Of course, some quick online searches revealed that this wasn’t going to be an easy replacement with our current supply chain issues and chip shortages. I reached out to several service companies and received several “We will get back with you on Monday” responses.
Now the funny part of the recent adventure is that we just had the entire system inspected and serviced earlier this year. We didn’t want to end up in the heat of the summer and unable to find service or parts to fix any issues. Yet, here we are. This is a great, if not warm, reminder that failures will occur despite all our efforts.
Things fail. Reliability engineering teaches us that it is not “if,” but “when.” Because we know things will fail, we design systems and processes to mitigate those failures, restoring service as fast as possible. Done well, failures are addressed quickly, and sometimes even automatically, helping us continue to meet our service level objectives for the particular system. With complex systems, it can be challenging to plan for all the failure modes that can occur. How do you know what can fail? One way to find out is by probing the weaknesses and safety boundaries of a system through Chaos Engineering.
We often laugh that we don’t need Chaos Engineering because the apps we run inject their own chaos! But Chaos Engineering is not about testing known broken parts of the system. If we already know something is broken (like a motor or an app), we should prioritize and fix it. Chaos Engineering is about discovering otherwise unknown weaknesses and limits of a working system. By introducing various degrees of planned failures (chaos) into our system, we learn new ways our service level objectives (SLOs) can be impacted. This gives us the opportunity to learn more about the system and improve it for faster recovery when it does fail.
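For the hands-on crowd, here is a minimal sketch of that idea in Python. It is a toy simulation, not a real chaos tool: the "service" is a made-up function that fails at whatever rate we inject, the retry count stands in for our mitigation, and the 99.9% target is an assumed SLO picked for illustration. All of the names (call_service, call_with_retries, run_experiment, SLO_TARGET) are hypothetical.

```python
import random

# Assumed availability objective for this illustration only.
SLO_TARGET = 0.999


def call_service(rng, failure_rate):
    """Hypothetical dependency call that fails at the injected rate."""
    if rng.random() < failure_rate:
        raise ConnectionError("injected failure")
    return "ok"


def call_with_retries(rng, failure_rate, retries):
    """Try the call once, then retry up to `retries` more times."""
    for _ in range(retries + 1):
        try:
            return call_service(rng, failure_rate)
        except ConnectionError:
            continue
    return None  # every attempt failed


def run_experiment(failure_rate, retries, requests=10_000, seed=42):
    """Return the fraction of requests that succeeded under injected failure."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    successes = sum(
        call_with_retries(rng, failure_rate, retries) is not None
        for _ in range(requests)
    )
    return successes / requests


if __name__ == "__main__":
    # Turn up the chaos in steps and see where the SLO stops holding.
    for rate in (0.0, 0.05, 0.2, 0.5):
        availability = run_experiment(rate, retries=2)
        verdict = "meets" if availability >= SLO_TARGET else "misses"
        print(f"injected failure rate {rate:.0%}: "
              f"availability {availability:.4f} ({verdict} SLO)")
```

The point of the exercise is the loop at the bottom: we start with a working system, increase the planned failure in steps, and watch for the level at which the mitigation no longer protects the objective. A real experiment would do this against an actual system with a limited blast radius, but the shape is the same.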
Learning is key. Failures, chaos engineering, and experimentation are all teachers that can impart wisdom to all who seek their advice. And as I just experienced this past weekend, life is full of lessons. The only true failure is the failure to learn. We are surrounded by a laboratory of learning that can instruct us and make us better. Let’s make sure we maximize the opportunities that come our way and implement improvements that make the systems we run better. Oh, and depending on your heat tolerance level, you may want to stock a spare ECM blower motor on the shelf.
Jason Cox is the coauthor of the upcoming book Investments Unlimited: A Novel about DevOps, Security, Audit Compliance, and Thriving in the Digital Age.