The Relationship Between Developers and Operations at Flickr

Ross Harms who is formerly of Flickr and currently at Etsy, published a memo he sent around Yahoo! in 2009 explaining the relationship between developers and operations at Flickr:

Here is a quote from the post:

I did this in the hope that other Yahoo properties could learn from that team’s process and culture, which we worked really hard at building and keeping. The idea that Development and Operations could: (1) Share responsibility/accountability for availability and performance, (2) Have an equal seat at the table when it came to application and infrastructure design, architecture, and emergency response, (3) Build and maintain a deferential culture to each other when it came to domain expertise, and (4) Cultivate equanimity when it came to emergency response and post-mortem meetings.

My Comment To the Post

Very nice post and all quite obvious to folks with enough experience across multiple real-world situations. Usually when organizations don’t structure their ops / dev relationships as you describe, it is often in an obsessive attempt to “eliminate risk”.

The basic (incorrect) premise is that everything the developers do increases risk and that ops have the job of reducing that risk to zero. Developers are the “problem” and Ops is the “solution”. Or as you say above, Developers are the “Arsonists” and Ops are the “Firefighters”. Casting the relationship in this way leads to ops wanting to limit change and the devs naturally want the product to move forward so the organization can better serve its stakeholders.

Uninformed ops feel the need to do large tests with complete instances of the product and frozen “new versions” and as the product gets more complex, these test phases take longer and longer and so more and more features end up in each release.

Again, ops is trying to eliminate risk – but in reality because each release is larger and larger there is a super-linear likelihood that something will go wrong. And when there are a lot of features in a package upgrade, folks cannot focus on the changes because there are too many – they hope it is all OK or sometimes it is all declared “bad” as a package without looking for the tiny mistake and everyone goes back to the drawing board which further delays the release of functionality and insures that the next release attempt will be even larger and even more likely to fail. It is a vicious circle that your approach nicely avoids.

The gradual approach you describe allows everyone to focus intently on one or a few changes at a time and do it often enough that you avoid the risk of a large change consisting of lots of details.

I like to think of the way you describe as “amortizing risk” – where there is always a small amount of risk that everyone understands but you avoid the buildup of accumulated risk inherent in large package upgrades. Again, thanks for the nice description.