Agenda (July 11, 2016)
- 09:00 AM: Splicing SRE DNA Sequences, Greg Veith, Microsoft
- 09:40 AM: Doorman: Global Distributed Client Side Rate Limiting, Jos Visser, Google
- 11:00 AM: What is SRE? Panel Session
- 11:40 AM: Building and Running SRE Teams, Kurt Andersen, LinkedIn
Splicing SRE DNA Sequences in the Biggest Software Company on the Planet (Greg Veith, Director of Azure Site Reliability Engineering @ Microsoft)
Microsoft Azure has 58 service offerings, runs in 24 regions and 100 data centers, and receives 120K subscriptions per month; the Azure IoT services alone process 3.3 million messages per second. The company has been shifting toward open source, Linux and Apache software.
Deep down I believe that the features that have already shipped and that customers are using are the most important; new features are a nice-to-have. In 2009, six months after launch, Bing suffered a 5-minute outage, which led to a couple of days of postmortem analysis as we tried to learn from it and turn those learnings into action items. In the end, many changes were made to how we built and released software. After this, the executive team decided to invest heavily in building an SRE function.
SRE is not something that you can call yourself; it is something that you earn. I labelled my team as SRE as a matter of intent and direction. There is a cultural engineering problem that needs to be dealt with: I joined the team, was welcomed and immediately received a pager. People usually want an easy button for SRE. However, SRE is not Operations 2.0. There were a couple hundred folks in service operations, and they were a classic, defensive operations team rather than an offensive one. The right folks for SRE are pioneers, and you need both the right models and the right people in place. Production services remain online for many years and they need to be properly looked after. As I built out the SRE function in Azure, I was given three goals: 1) build SRE, 2) don't mess up operations, 3) improve optics into the system.
Symptoms of Success
- Availability and reliability meet SLOs (defend customer trust)
- Eliminate human touches to production (toil elimination)
- Speed up deployments (reduce inventory, ship fast, safely)
All the above areas reinforce measurement which is reliability's foundation.
3 Strategic Pillars
- Start SRE at Microsoft (establish principles)
- Prove the model (apply principles)
- Accelerate and improve (scale the principles)
I laid the groundwork by hiring pioneers - folks who were subject matter experts from Google, EMC, Amazon, etc. We are a team that constantly challenges itself and works to eliminate groupthink. We also built a cross-Microsoft group to define the SRE role and maintain a common bar across the company, not only within Azure.
SRE Engagement Types
- Services at Planetary Scale: Ops Transformation at Scale - SRE develops solutions to close operational gaps, acts as fire suppressant, and iterates toward transformation.
- Newer Service Facing Rapid Growth: Growth and Maturation - SRE attaches to the team, develops targeted improvements to prepare for growth, and gets on call.
- Greenfield Services or Redesign: Design and Architecture - Operability and continuous innovation, design for scale from the beginning.
The IoT services were new. We delivered an opportunities document that touched on reducing build time, idempotency of the builds, real-time metrics to measure the SLO of the services, and refactorings to decouple services. This was a team of seven engineers. Our principle was to leave it better than we found it.
The Azure storage service was gigantic and at planetary scale. The SRE team built a demand-shaping service, balancing capacity across multiple regions. This was a stack-augmentation approach which is important to note because we're fully formed engineers; we're not managing stuff that is thrown over the wall. It required us to traverse more of the system. We eliminated toil and we were also on-call.
Production Virtuous Cycle
Goal: enable this loop to run as fast and often as possible while maintaining SLOs.
Code -> Test -> Deploy -> Monitor, Measure, Alert -> Mitigate, Restore -> Post-Mortem, Learn -> Back to Code
SRE Areas of Focus
- Metrics and Monitoring: instrumentation, SLOs, alarms, insights -> actions
- Infrastructure Engineering: tooling, infra for global optima
- Release Engineering: change management, deployment
- Incident Response: enough said
- Common Infrastructure: integrating existing best in class infrastructure
- Capacity & Fleet Management: build out, decommission, fleet understanding and management
Metrics & Monitoring
Nobody needs more data than SRE. If you don't measure it, it doesn't matter. Measure what matters so you have the capability to quickly identify issues. One of the first things I did was to build the Operational Intelligence team. We defined SLOs, worked on data hygiene and quality, and made sure the metrics were world-class. I needed them to be able to contribute to any code base very quickly.
The most important features are the ones that we've already shipped and customers are using. Therefore, minimize incident time.
Critical Moves, Learnings
- Build and protect the SRE brand: pulling in operational leaders from different parts of the company to co-create this. Interaction with the HRBP was a great advantage. Engaging recruiting has been an important relationship because we hadn't built SRE teams before.
- Manage the change
- Meet teams where they are, not where you want them to be: be happy to jump in, but don't accept that fire alarms are the way to go, and eliminate the toil
- Grab a shovel (and build a backhoe)
- Find the bright spots
Culture is not the problem, it is an outcome of incentives and results. The power to drive this change exists within your organization and it's a multi-year effort.
Doorman: Global Distributed Client Side Rate Limiting (Jos Visser, Senior Staff SRE @ Google)
In the SRE function at YouTube, we deal with a lot of video uploads and a lot of people watching videos simultaneously. There's this magical number that appears next to each video indicating the number of views, with some spam views removed. All of this information is stored in a MySQL database. One of the problems with MySQL is that it doesn't rate limit very well. Therefore, we were left with the problem of figuring out the best way for the tasks that computed views and did spam removal to rate limit themselves.
We wanted a solution in which any number of client tasks would coordinate among themselves so that together they wouldn't send more traffic to a shared resource such as MySQL than it could handle, while still fully utilizing the available capacity. From that came Doorman, a system to do global, distributed, client-side rate limiting:
Apportions available capacity and leases it to the clients globally. Each client gets their fair share, for some pluggable definition of fair. The Doorman server gives out capacity based on a client's requested (wanted) capacity. Clients uphold the limits with the help of a Doorman-provided library.
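As an illustration of one possible "definition of fair", here is a minimal Python sketch of max-min fair sharing. This is an assumption for illustration only, not necessarily what Doorman's FAIR_SHARE algorithm does; the function name `fair_share` and the client ids are hypothetical.

```python
def fair_share(capacity, wants):
    """Max-min fair split of `capacity` among clients (illustrative only).

    `wants` maps a client id to its wanted capacity. Clients asking for
    less than an equal share get what they ask for; the leftover is
    split among the remaining clients.
    """
    gets = {}
    remaining = float(capacity)
    # Serve the smallest requests first so their unused share is freed up.
    pending = sorted(wants.items(), key=lambda item: item[1])
    while pending:
        equal_share = remaining / len(pending)
        client, wanted = pending.pop(0)
        grant = min(wanted, equal_share)
        gets[client] = grant
        remaining -= grant
    return gets


shares = fair_share(1000, {"a": 100, "b": 600, "c": 600})
```

With a capacity of 1000 and wants of 100, 600 and 600, client "a" gets its full 100 and the other two get 450 each, so the total never exceeds the capacity.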
Any client that wants to participate in Doorman will do so using the Doorman protocol. The first step is for it to discover the master. A request is composed of:
- client id + priority
- resource id
- current capacity (has)
- wanted capacity (wants)
The response to any request contains the:
- assigned capacity (gets)
- lease expiration time (in 5 mins)
- refresh interval (every 5 secs)
It should be noted that determining the wanted capacity is hard for a client. For this reason, the client libraries provided by Doorman have magical logic baked in that looks at the pattern of waits when executing RPC calls to figure out how fast the client would like to go. Capacity is a floating point number - the Doorman server is agnostic as to what it means (e.g. queries per second, query cost per second, max inflight transactions).
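The request and response fields listed above could be modelled roughly as follows. This is a hedged sketch: the class names, field names and the `handle` function are hypothetical and do not reflect Doorman's actual wire format; only the field meanings and the 5-minute lease and 5-second refresh values come from the talk.

```python
from dataclasses import dataclass


@dataclass
class CapacityRequest:
    client_id: str
    priority: int
    resource_id: str
    has: float    # capacity the client currently holds
    wants: float  # capacity the client would like


@dataclass
class CapacityResponse:
    gets: float              # assigned capacity
    lease_expires_at: float  # absolute time (seconds) at which the lease ends
    refresh_interval: float  # seconds until the client should ask again


def handle(request, available, now, lease_length=300.0, refresh_interval=5.0):
    """Toy handler: grant what is wanted, capped by what is available."""
    gets = min(request.wants, available)
    return CapacityResponse(gets, now + lease_length, refresh_interval)


request = CapacityRequest("task-1", 1, "mysql", has=0.0, wants=50.0)
response = handle(request, available=30.0, now=1000.0)
```

A real server would also weigh priority and the other clients' wants; the toy handler ignores both to keep the message shapes in focus.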
Doorman Client Library
The client library is currently available for the Go programming language. The Python and C++ libraries were (or will be) open sourced, and there are plans to support Java and Lisp. The client library is an interface for:
- registering a new resource
- declaring the wanted capacity for that resource
- registering a callback for when a new capacity lease comes in
There is a helper within the library for rate limiting with the added bonus, as noted earlier, that it figures out the wanted capacity for the client.
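To make the idea of a rate-limiting helper driven by capacity leases concrete, here is a small Python sketch. It is not the Doorman library's API: the `RateLimiter` class and its methods are hypothetical, capacity is assumed to mean "calls per second", and the automatic estimation of wanted capacity is left out.

```python
import time


class RateLimiter:
    """Spaces out calls so that at most `capacity` happen per second."""

    def __init__(self, capacity, clock=time.monotonic, sleep=time.sleep):
        self._clock = clock
        self._sleep = sleep
        self._next_slot = clock()
        self.set_capacity(capacity)

    def set_capacity(self, capacity):
        """Callback to invoke whenever a new capacity lease comes in."""
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._interval = 1.0 / capacity

    def wait(self):
        """Block until the next call is allowed; return the seconds waited."""
        now = self._clock()
        delay = max(0.0, self._next_slot - now)
        if delay:
            self._sleep(delay)
        # Reserve the slot after this one for the next caller.
        self._next_slot = max(now, self._next_slot) + self._interval
        return delay
```

A client would call `wait()` before each query to the shared resource and pass `set_capacity` as the callback for new leases. The clock and sleep functions are injectable only so the behaviour can be checked without real delays.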
Doorman Server Tree
There could be Doorman servers per country or per data center. The highest-level Doorman servers know all of their clients - how much capacity they want and how much capacity they have. The larger the tree, the slower it converges. A three-node tree will rebalance itself within 30 seconds. If a client on the right-hand side of the tree begins to want more capacity, the tree will rebalance itself by stealing resources from other clients to serve the needs of the heavy client on the right.
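The rebalancing described above can be pictured with a toy model in which each server sums the wants below it and capacity is pushed down the tree in proportion to those sums. This Python sketch rests on assumptions: the dictionary layout, the function names and the proportional split are all hypothetical, and the real servers converge over several refresh rounds rather than in one pass.

```python
def total_wants(node):
    """Sum the wanted capacity of every client below `node`."""
    if "wants" in node:  # leaf: a client
        return node["wants"]
    return sum(total_wants(child) for child in node["children"])


def distribute(node, capacity, out=None):
    """Push `capacity` down the tree in proportion to aggregated wants."""
    out = {} if out is None else out
    if "wants" in node:
        out[node["name"]] = min(capacity, node["wants"])
        return out
    demand = total_wants(node)
    for child in node["children"]:
        child_wants = total_wants(child)
        share = capacity * child_wants / demand if demand else 0.0
        # Never hand a subtree more than it asked for.
        distribute(child, min(share, child_wants), out)
    return out


tree = {"children": [
    {"children": [{"name": "a", "wants": 100}, {"name": "b", "wants": 100}]},
    {"children": [{"name": "c", "wants": 600}]},
]}
allocation = distribute(tree, 400)
```

Here the heavy client "c" on the right ends up with 300 of the 400 units, while "a" and "b" on the left are squeezed to 50 each.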
There is a single master server for each node in the server tree. etcd master election is used to appoint the master from a number of candidates. The master keeps the entire state in RAM: no syncing with the replicas, no background storage. When starting up, the server learns the current state of the world from its clients (learning mode). Clients use an application-specific discovery protocol to find the master for their node.
Doorman Server Configuration
resources:
  - identifier_glob: fair        # glob expressions supported here
    capacity: 1000               # globally available capacity
    safe_capacity: 10            # plays a role when the lease runs out
    description: fair share example
    algorithm:
      kind: FAIR_SHARE
      lease_length: 15
      refresh_interval: 5
  - identifier_glob: "*"
    capacity: 1000
    safe_capacity: 10
    description: default
    algorithm:
      kind: FAIR_SHARE
      lease_length: 60
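One way to read the safe_capacity setting is that a client falls back to it once its lease has run out without being refreshed. The sketch below assumes that reading; the function name `usable_capacity` and the leased value of 250 are hypothetical, while the 15-second lease and the safe capacity of 10 are taken from the first resource in the configuration.

```python
def usable_capacity(leased, lease_expires_at, safe_capacity, now):
    """Capacity a client may use at time `now` (all times in seconds)."""
    if now < lease_expires_at:
        return leased         # lease still valid: use the assigned capacity
    return safe_capacity      # lease ran out: fall back to the safe value


# Lease of 250 units granted at t=0 with lease_length 15.
during_lease = usable_capacity(250.0, 15.0, 10.0, now=10.0)
after_expiry = usable_capacity(250.0, 15.0, 10.0, now=20.0)
```

Under this assumption, a client that loses contact with the server keeps making slow progress at the safe capacity instead of either stopping or running unthrottled.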
Doorman in Action
These graphs show the following:
- A well-behaved client with the wants and the capacity being almost exactly the same. Either the resource was not oversubscribed or the client did not ask for more than its fair share.
- A client with spiky behavior with increased wants - it doesn't receive additional capacity until the whole cluster converges and other clients are starved to serve the needs of this client.
- The global view - even though wanted capacity exceeded max capacity at one point, the system never breached that max capacity.
This graph shows how well the client-side rate limiter determines how fast the client would want to go if it was not delayed by the rate limiter.
To try Doorman locally, have Go 1.6 and Ruby installed on your system (Ruby is required for Foreman), then run:
go get github.com/coreos/etcd
go get github.com/youtube/doorman/go/cmd/doorman_shell
go get github.com/youtube/doorman/go/cmd/doorman
cd $GOPATH/src/github.com/youtube/doorman/doc/simplecluster
foreman start &
../../../../../../bin/doorman_shell --server=localhost:5101
Building and Running SRE Teams (Kurt Andersen, Senior Staff SRE @ LinkedIn)