
November 6, 2012

Building Resilient User Experiences Talk By Mike Brittain of Etsy

By Gene Kim

Another of my favorite talks from Velocity Conference 2012 in June was a 12-minute talk by Mike Brittain (@mikebrittain), Director of Engineering and Operations at Etsy. In it, he described how companies should build user experiences that are resilient to failure, rather than ones that greet customers with developer error codes. Haha.

He began with a metaphor of a grocery store. He asked the audience what should happen if the restroom sink flooded. Does the security guard clear out the store, telling everyone they must stop shopping and leave, and that the store will be closed until the problem is fixed?

Obviously, that’s ridiculous. But he suggests that this is exactly what almost all websites and online services do: whether we have a major error in the application or infrastructure, or a minor error in a supporting service (our equivalent of the plumbing), we close the entire store and kick the customer out.

And because every service that our sites depend upon adds surface area for failure, a bewildering number of things can cause errors. Given this, how do we make the user experience more resilient and insulated from these problems?

Brittain showed an example of an error message that he finds deplorable. (He joked, “An error page with an application stack trace with the company logo on it isn’t much better.”)

At Etsy, they began by identifying the services that make up the critical path, distinguishing those that provide “primary value” from those that provide only “ancillary value” (i.e., secondary value). For Etsy, the most important areas of the site are the product listings, product descriptions, photos, and the Add to Cart feature. Everything else is just gravy.

The first design principle for building a resilient user experience is that when an ancillary service fails, instead of displaying an error page or anything else that interrupts the user experience, you simply hide the affected feature (e.g., the “like” button, the user’s message inbox, etc.). Casual users will continue unimpeded, usually without realizing anything is wrong. Only the extremely savvy users, he joked, will actually notice the failure.
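
To make the “just hide it” principle concrete, here is a minimal Python sketch. It assumes a page is assembled from independently rendered fragments, each tagged as primary (critical) or ancillary. The `Component` type, `render_page`, and `CriticalComponentError` are hypothetical names for illustration, not Etsy’s code: a failure in a critical component still fails the page, while a failure in an ancillary one simply drops that fragment.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Component:
    name: str
    render: Callable[[], str]  # returns an HTML fragment
    critical: bool             # True for primary-value features (listing, photos, Add to Cart)


class CriticalComponentError(Exception):
    """Raised when a primary-value component fails and no useful page can be shown."""


def render_page(components: list[Component]) -> str:
    fragments = []
    for component in components:
        try:
            fragments.append(component.render())
        except Exception as exc:
            if component.critical:
                # Without the core shopping features there is nothing worth serving.
                raise CriticalComponentError(component.name) from exc
            # Ancillary feature failed: leave it out instead of showing an error.
            continue
    return "".join(fragments)


def broken_inbox() -> str:
    # Simulates an ancillary service that is down.
    raise TimeoutError("inbox service unavailable")


page = render_page([
    Component("listing", lambda: "<div>Listing</div>", critical=True),
    Component("inbox", broken_inbox, critical=False),
    Component("add_to_cart", lambda: "<button>Add to Cart</button>", critical=True),
])
print(page)  # <div>Listing</div><button>Add to Cart</button>
```

In practice you would likely also log or count the swallowed exception, so the failure stays invisible to customers but not to the team.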

Etsy favors fast page load times over completeness. Etsy Sponsored Ads, for example, must render within 400ms or they are dropped from the page. Other areas of the site use non-blocking Ajax elements to decrease load times (by reducing the load on the browser client). And still other areas use the “circuit breaker” pattern described by Mike Nygard in “Release It! Design and Deploy Production-Ready Software” to stop clients from calling functions that are known to be failing, which would only exacerbate the failure. (Ben Christensen of Netflix describes this further in the article “Fault Tolerance in a High Volume, Distributed System.”)
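
The circuit breaker pattern mentioned above can be sketched in a few lines of Python. This is a minimal, single-threaded illustration under assumed parameters (a failure threshold and a cool-down period), not Etsy’s or Netflix’s implementation, and the class names and defaults are hypothetical. After enough consecutive failures the breaker “opens” and rejects calls immediately; once the cool-down elapses it lets a trial call through (“half-open”), closing again on success and reopening on failure.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised instead of calling a dependency that is known to be failing."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock  # injectable so the example can use a fake clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half-open"  # cool-down over: allow a trial call
        return "open"

    def call(self, fn: Callable[[], T]) -> T:
        if self.state == "open":
            # Fail fast: don't pile more load onto a struggling service.
            raise CircuitOpenError("circuit open; skipping call")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                # Trip (or re-trip) the breaker and restart the cool-down.
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


now = [0.0]
breaker = CircuitBreaker(failure_threshold=2, reset_timeout=10.0, clock=lambda: now[0])


def flaky() -> str:
    raise ConnectionError("favorites service down")


for _ in range(2):
    try:
        breaker.call(flaky)
    except ConnectionError:
        pass
print(breaker.state)  # open
now[0] = 10.0
print(breaker.state)  # half-open
print(breaker.call(lambda: "ok"), breaker.state)  # ok closed
```

A caller that catches `CircuitOpenError` can then hide the ancillary feature, exactly as it would for any other failure.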

By creating a system where failures are hidden from users and where back-end services can fail independently, Etsy built the grocery store that doesn’t kick people out when the sink overflows. Customers can keep doing the most important thing, shopping, even when not every detail makes it onto the page.

“Here’s a very simple hack that everyone should do. Put a beacon on your error page, generate a graph like this, and show it to the business every day. These are the page views where users have gotten off track and will leave the site.”
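
Brittain’s beacon suggestion could look something like the following Python sketch. It assumes the error page embeds a tiny image whose URL (the hypothetical `/beacon/error_page.gif`) is served by a handler that calls `BeaconCounter.record` for each hit; that server wiring is not shown. The counter buckets hits per minute to produce a time series you could graph. All names here are hypothetical.

```python
from collections import Counter
from datetime import datetime, timezone

# Hypothetical error-page markup: the 1x1 image makes the browser request
# the beacon URL once for every error page actually shown to a user.
ERROR_PAGE_HTML = (
    "<h1>Sorry, something went wrong.</h1>"
    '<img src="/beacon/error_page.gif" width="1" height="1" alt="">'
)


class BeaconCounter:
    """Counts error-page beacon hits per minute so they can be graphed."""

    def __init__(self) -> None:
        self.per_minute = Counter()

    def record(self, hit_time: datetime) -> None:
        # Bucket each hit by UTC minute to build a simple time series.
        bucket = hit_time.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M")
        self.per_minute[bucket] += 1

    def series(self) -> list[tuple[str, int]]:
        # Chronological (bucket, count) pairs, ready for a graphing tool.
        return sorted(self.per_minute.items())


counter = BeaconCounter()
for minute, second in [(0, 5), (0, 40), (1, 10)]:
    counter.record(datetime(2024, 1, 1, 14, minute, second, tzinfo=timezone.utc))
print(counter.series())  # [('2024-01-01 14:00', 2), ('2024-01-01 14:01', 1)]
```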

At Etsy, Brittain says, they favor a fast page over a complete page. By focusing on the big picture and keeping customers in mind, they’re building successful UIs that help the business win.
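
One way to express “a fast page over a complete page” is a per-fragment time budget. The Python sketch below assumes each optional fragment is produced by a blocking function; the 400ms figure comes from the Sponsored Ads example above, while the thread pool, `render_with_deadline`, and the sample ad functions are hypothetical. If a fragment misses its deadline or raises, it is dropped and the page renders without it.

```python
import concurrent.futures
import time
from typing import Callable

AD_DEADLINE_SECONDS = 0.4  # the 400ms budget from the Sponsored Ads example

_executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def render_with_deadline(render: Callable[[], str],
                         deadline: float = AD_DEADLINE_SECONDS) -> str:
    future = _executor.submit(render)
    try:
        return future.result(timeout=deadline)
    except concurrent.futures.TimeoutError:
        future.cancel()  # best effort; a thread that is already running keeps going
        return ""        # too slow: drop the fragment rather than delay the page
    except Exception:
        return ""        # the fragment itself failed: drop it as well


def fast_ads() -> str:
    return "<div>Sponsored</div>"


def slow_ads() -> str:
    time.sleep(1.0)  # simulated slow ad service
    return "<div>Sponsored</div>"


print(repr(render_with_deadline(fast_ads)))  # '<div>Sponsored</div>'
print(repr(render_with_deadline(slow_ads)))  # ''
```

Because Python threads can’t be forcibly stopped, the timed-out call still runs to completion in the background; a production version would likely also set a timeout on the underlying network request.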

The biggest takeaway here is that after Brittain and his team identified the critical-path services and separated them from background services, they were able to plan for failure scenarios and ensure their customers weren’t “kicked out” of the store because of a failure entirely unrelated to what they were trying to do: purchase goods.

Mike Brittain, Etsy: “Resilient User Experiences”

  • @mikebrittain: “Google Apps very good at resilient arch & UI; shows retry interval”
  • RT @kerrick/@xthestreams: “wrapping a stack trace with a logo is barely better than showing the stack trace” @mikebrittain
  • @adrianco: Mike from Etsy talking about resilient user interfaces at #velocityconf – similar to circuit breaker pattern https://t.co/xp7jmGSj
  • @jheady: .@mikebrittain Covering user experience, graceful failure of features. Highlight critical and non-critical path features.
  • @TurboDad: Hiding failed features that are not CRITICAL to the user experience is a key pillar in building a resilient user experience.
  • @jelder: RT @pingdom: Shoppers shouldn’t be forced to leave the store because a sink is leaking. @mikebrittain on graceful degradation. #VelocityConf
  • @xthestreams: Design for operations = product managers + devs + ops
  • @allspaw telling story of how news site, on election day, had to turn off logging, in order to save the site. Wow.

About the Author

Gene Kim

Gene Kim is a Wall Street Journal bestselling author, researcher, and multiple award-winning CTO. He has been studying high-performing technology organizations since 1999 and was the founder and CTO of Tripwire for 13 years. He is the author of six books, including The Unicorn Project (2019), and is co-author of the Shingo Publication Award-winning Accelerate (2018), The DevOps Handbook (2016), and The Phoenix Project (2013). Since 2014, he has been the founder and organizer of DevOps Enterprise Summit, studying the technology transformations of large, complex organizations.
