Another of my favorite talks from the Velocity Conference 2012 in June was a 12-minute talk by Mike Brittain (@mikebrittain) of Etsy, where he is Director of Engineering and Operations (LinkedIn). In this talk, he described how companies should build user experiences that are resilient to failure, and that look less like developer error codes when something does go wrong. Haha.
He began with a metaphor of a grocery store. He asked the audience what should happen if the restroom sink flooded. Does the security guard clear out the store, telling everyone they must stop shopping and leave, and that the store will be closed until the problem is fixed?
Obviously, that’s ridiculous. But he suggests that this is what almost all websites and online service properties do. When we have a major error in the application/infrastructure or a minor error in a supporting service (our equivalent of the plumbing), we’ll close the entire store, kicking the customer out.
And because every service that our sites depend upon creates additional surface area for the site to fail, there are a bewildering number of things that can cause errors. Given this, how do we make the user experience more resilient and insulated from these problems?
Here’s an example of an error message that Brittain finds deplorable. (He joked, “An error page with an application stack trace with the company logo on it isn’t much better.”)
At Etsy, they began by identifying the services that make up the critical path, separating those that provide “primary value” from those that provide only “ancillary value” (i.e., secondary value). For Etsy, the most important areas of the site are the Product Listings, Product Descriptions, Photos, and the Add to Cart feature. Everything else is just gravy.
The first design principle for building a resilient user experience is that when an ancillary service fails, instead of displaying an error page or anything else that interrupts the user experience, simply hide the affected feature (e.g., the “like” button, user message inbox, etc.). Casual users will simply continue their experiences unimpeded, usually without realizing anything is wrong. It will only be the extremely savvy users, he joked, who will actually notice the failure.
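To make the "hide it" principle concrete, here is a minimal Python sketch. It is not Etsy's code: the module names, the `CRITICAL` and `ANCILLARY` lists, and the `render_page` function are all hypothetical, chosen only to mirror the examples from the talk. Critical modules are allowed to fail loudly, while a failing ancillary module is simply left off the page.

```python
from typing import Callable, Dict, List

# Hypothetical split of page modules, loosely mirroring the talk's examples.
CRITICAL = ["listing", "add_to_cart"]
ANCILLARY = ["like_button", "inbox"]


def render_page(renderers: Dict[str, Callable[[], str]]) -> str:
    """Render critical modules strictly and ancillary modules best-effort."""
    parts: List[str] = []
    for name in CRITICAL:
        # A failure here propagates: the page has no value without it.
        parts.append(renderers[name]())
    for name in ANCILLARY:
        try:
            parts.append(renderers[name]())
        except Exception:
            # Hide the feature instead of showing an error. A real system
            # would also log or count the failure at this point.
            continue
    return "\n".join(parts)


def broken_inbox() -> str:
    # Stand-in for an ancillary backend that is currently down.
    raise ConnectionError("inbox service unavailable")


page = render_page({
    "listing": lambda: "<div>Handmade mug</div>",
    "add_to_cart": lambda: "<button>Add to Cart</button>",
    "like_button": lambda: "<button>Like</button>",
    "inbox": broken_inbox,
})
print(page)
```

The shopper still gets the listing, the Add to Cart button, and the like button. The inbox is missing, and nothing on the page says so.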
Etsy favors fast page load times over completeness. Etsy Sponsored Ads, for example, must render within 400ms or they get dropped. Other areas of the site use non-blocking Ajax elements to decrease load times (by keeping those elements from holding up the initial render in the browser). And yet other areas of the site use the “circuit breaker” pattern described by Michael Nygard in “Release It! Design and Deploy Production-Ready Software” to prevent clients from calling functions that we know are failing, which would only exacerbate the failure. (Ben Christensen describes this further in the Netflix article “Fault Tolerance in a High Volume, Distributed System”.)
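For readers who have not seen the circuit breaker pattern, the following is a minimal single-threaded Python sketch of the general idea, not the implementation from Nygard's book, Netflix, or Etsy. The class name, its parameters (`max_failures`, `reset_after`, `clock`), and the `failing_ads` function are assumptions made for illustration. After a set number of consecutive failures the breaker "opens" and returns a fallback without calling the dependency at all. Once a cool-down has passed, it lets a single trial call through.

```python
import time
from typing import Callable, List, Optional


class CircuitBreaker:
    """Stop calling a dependency after repeated failures, then retry later."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after  # cool-down in seconds
        self.clock = clock              # injectable so tests can fake time
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, func: Callable[[], str], fallback: str = "") -> str:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Open: skip the call entirely and return the fallback.
                return fallback
            # Cool-down elapsed: allow one trial call ("half-open").
            # A single further failure will reopen the breaker.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback
        self.failures = 0  # any success closes the breaker fully
        return result


calls: List[int] = []


def failing_ads() -> str:
    # Hypothetical dependency that is currently timing out.
    calls.append(1)
    raise TimeoutError("ads service too slow")


breaker = CircuitBreaker(max_failures=2, reset_after=30.0)
results = [breaker.call(failing_ads) for _ in range(5)]
print(results, len(calls))
```

Of the five attempts, only the first two reach the failing service. The remaining three return the empty fallback immediately, which spares both the page and the struggling backend.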
By creating a system where failures are hidden from users and where back end services can fail independently, Etsy created the grocery store that doesn’t kick people out when the sink overflows. It allows customers to continue doing the most important thing, shopping, even when not all of the details make it onto the page.
“Here’s a very simple hack that everyone should do. Put a beacon on your error page, and generate a graph like this, and show it to the business every day. These are the page views where the users have gotten off track, and will leave the site.”
At Etsy, Brittain says, they favor a fast page over a complete page. By focusing on the big picture and keeping customers in mind, they’re building successful UIs that help the business win.
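One way to picture "fast over complete", such as the 400ms cutoff mentioned earlier for Sponsored Ads, is a render call with a time budget. The sketch below uses Python's standard `concurrent.futures` module. The `render_with_budget` helper, the `slow_ad` function, and the specific timings are hypothetical and are not a description of how Etsy implements its cutoff.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable


def render_with_budget(render: Callable[[], str], budget_seconds: float,
                       fallback: str = "") -> str:
    """Return render()'s output, or the fallback if it misses the budget."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(render)
    try:
        return future.result(timeout=budget_seconds)
    except FutureTimeout:
        # Too slow: drop the module rather than hold up the page.
        return fallback
    except Exception:
        # The module failed outright: treat it the same way.
        return fallback
    finally:
        # Do not wait for a slow call to finish. It keeps running in
        # the background; only its output is discarded.
        pool.shutdown(wait=False)


def slow_ad() -> str:
    time.sleep(0.5)  # simulate an ad backend that is running slowly
    return "<div>Sponsored</div>"


fast = render_with_budget(lambda: "<div>Sponsored</div>", 0.4)
dropped = render_with_budget(slow_ad, 0.05)
print(repr(fast), repr(dropped))
```

The fast ad is rendered as usual. The slow one comes back as an empty string, so the page template can simply leave that slot out.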
The biggest takeaway here is that after Brittain and his team identified the critical path services and separated them from background services, they were able to plan for failure scenarios and ensure their customers weren’t “kicked out” of the store because of a failure entirely unrelated to what they were trying to do: purchase goods.
Mike Brittain, Etsy: “Resilient User Experiences”
- @mikebrittain: “Google Apps very good at resilient arch & UI; shows retry interval”
- RT @kerrick/@xthestreams: “wrapping a stack trace with a logo is barely better than showing the stack trace” @mikebrittain
- @adrianco: Mike from Etsy talking about resilient user interfaces at #velocityconf – similar to circuit breaker pattern http://t.co/xp7jmGSj
- @jheady: .@mikebrittain Covering user experience, graceful failure of features. Highlight critical and non-critical path features.
- @TurboDad: Hiding failed features that are not CRITICAL to the user experience is a key pillar in building a resilient user experience.
- RT @TurboDad: Hiding failed features that are not CRITICAL to user exp is key pillar in building a resilient user experience.
- @jelder: RT @pingdom: Shoppers shouldn’t be forced to leave the store because a sink is leaking. @mikebrittain on graceful degradation. #VelocityConf
- @xthestreams: Design for operations = product managers + devs + ops
- @allspaw telling story of how news site, on election day, had to turn off logging, in order to save the site. Wow.