
September 22, 2022

Failure

By Jason Cox

Is it just me or does it seem like things have a tendency to break on the weekend when the parts and service repair companies are closed? This past weekend, as temperatures outside nudged up above 100°F, a strange electrical odor started filling the house. By evening, it was starting to get warm inside. I did a quick check of the air conditioning system. The condenser unit outside was running, but the inside blower was not working. Oh great! Why is the blower not working? Turning “fan only” mode on didn’t fix it. I did some quick investigation on our system and discovered we have an ECM blower motor. ECM, which stands for electronically commutated motor, means that it is a microprocessor-controlled motor designed to optimize efficiency. It is very cool! I have to admit, by this point I was distracted by the tech and not focused on the restoration of service. With the room warming up and questions coming from the rest of my family, I was reminded to go back into incident management mode and focus on addressing the outage.

I took apart the unit and began testing the control circuit and all the voltages. The main 110V feed was reaching the motor. The ECM 24V control lines that program the motor speed were also good. The air handler control system seemed solid, so that meant it had to be the motor itself, and very likely the microcontroller that runs the motor. Of course, some quick online searches revealed that this wasn’t going to be an easy replacement with our current supply chain issues and chip shortages. I reached out to several service companies and received a string of “We will get back with you on Monday” responses.

Now the funny part of this recent adventure is that we had just had the entire system inspected and serviced earlier this year. We didn’t want to end up in the heat of the summer unable to find service or parts to fix any issues. Yet, here we are. This is a great, if not warm, reminder that failures will occur despite all our efforts.

Things fail. Reliability engineering teaches us that it is not “if,” but “when.” Because we know things will fail, we design systems and processes to mitigate those failures, restoring service as fast as possible. Done well, failures are addressed quickly, and sometimes even automatically, helping us continue to meet our service level objectives for the particular system. With complex systems, it can be challenging to plan for all the failure modes that can occur. How do you know what can fail? One way to find out is by probing the weaknesses and safety boundaries of a system through Chaos Engineering.

We often laugh that we don’t need Chaos Engineering because the apps we run inject their own chaos! But Chaos Engineering is not about testing known broken parts of the system. If we already know something is broken (like a motor or an app), we should prioritize and fix it. Chaos Engineering is about discovering otherwise unknown weaknesses and limits of a working system. By introducing various degrees of planned failures (chaos) into our system, we learn new ways our service level objectives (SLOs) can be impacted. This gives us the opportunity to learn more about the system and improve it for faster recovery when it does fail.
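To make the idea a little more concrete, here is a minimal sketch of a chaos experiment in Python. It is only an illustration, not how any particular team or tool does it: the dependency (`flaky_call`), the retry policy, and the 99% SLO target are all hypothetical. It injects increasing rates of planned failure into a simulated dependency and checks whether a simple retry policy still keeps the service within its objective.

```python
import random

SLO_TARGET = 0.99  # assumed availability objective, for illustration only


def flaky_call(failure_rate, rng):
    """Hypothetical dependency that fails with the injected probability."""
    if rng.random() < failure_rate:
        raise RuntimeError("injected failure")
    return "ok"


def call_with_retries(failure_rate, rng, max_attempts):
    """Try the dependency up to max_attempts times; return None if all fail."""
    for _ in range(max_attempts):
        try:
            return flaky_call(failure_rate, rng)
        except RuntimeError:
            continue
    return None


def run_experiment(failure_rate, max_attempts, requests=10_000, seed=42):
    """Return the fraction of requests that succeeded under injected chaos."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    successes = sum(
        call_with_retries(failure_rate, rng, max_attempts) is not None
        for _ in range(requests)
    )
    return successes / requests


if __name__ == "__main__":
    # Step up the injected failure rate and see where the SLO breaks.
    for rate in (0.0, 0.1, 0.3, 0.5):
        availability = run_experiment(rate, max_attempts=3)
        verdict = "meets" if availability >= SLO_TARGET else "misses"
        print(f"failure rate {rate:.0%}: availability {availability:.2%} ({verdict} SLO)")
```

The interesting part is not any single number but the boundary: the failure rate at which the retry policy stops being enough is exactly the kind of limit you would rather discover in a planned experiment than on a 100°F weekend.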

Learning is key. Failures, chaos engineering, and experimentation are all teachers who can impart wisdom to all who seek their advice. And as I just experienced this past weekend, life is full of lessons. The only true failure is the failure to learn. We are surrounded by a laboratory of learning that can instruct us and make us better. Let’s make sure we maximize the opportunities that come our way and implement improvements that make the systems we run better. Oh, and depending on your heat tolerance level, you may want to stock a spare ECM blower motor on the shelf.


Jason Cox is the coauthor of the upcoming book Investments Unlimited: A Novel about DevOps, Security, Audit Compliance, and Thriving in the Digital Age.

- About The Authors

Jason Cox

Jason Cox is a champion of DevOps practices, promoting new technologies and better ways of working. His goal is to help businesses and organizations deliver more value, inspiration, and experiences to our diverse human family across the globe better, faster, safer, and happier. He currently leads SRE teams at Disney and is the coauthor of the book Investments Unlimited. He resides in Los Angeles with his wife and their children.
