The DevOps Handbook
greenfield vs brownfield
- testability and deployability stronger indicator of success than project age
system of engagement vs system of record (reject this notion)
- engagement is customer/employee facing
- record is ERP like systems. Correctness of data is paramount.
risk/reward balance of transforming assessing level of resistance
Start innovation with champions
- find early adopter personalities
- build critical mass and silent majority
- identify holdouts
Identify the value streams:
- Find out what’s hurting the productivity and feedback loop. It might not be the complaint.
- customers of value stream:
- product owner
- development team
- QA team
- ops team
- security team
- release managers
- Document the entire process of change
- client request -> translation -> dev team -> testing -> PR -> merge -> rework loops -> … -> client receives it
- Add estimate times
- Focus on huge time lag, manual gates, and places of significant rework
- %C/A: Percent Complete and/or Accurate: “how often do we deliver the ‘correct’ change–the first time–to address a client ask”
- eg PR is ready but fails in CI when it passed in dev
Build a temporary “devops transformation team”. Give them a shared goal. examples:
- reduce unplanned work by X%
- reduce PR open time by X%
- offload all release knowledge into code
- put compliance reqs into CI
Reserve 20% effort for reducing tech debt in a project by addressing:
Important: the 20% is only necessary if your project will run into the tech debt cliff.
- functional organizations:
- centralize expertise. Optimize for expertise, division of labor, or reducing cost.
- split org into functional areas.
- can work if everyone in the value stream views customer and org outcomes as shared goal.
- etsy, google, github, toyota: does this
- requires high trust, rapid response, and self service
- share the pain
- avoid silos and don’t over specialize. Enable healthy generalization
- don’t fund projects: fund services and products
- matrix orgs:
- attempts to combne functional and market but rarely works
- market orgs:
- optimize for responding quickly to customer needs
- high performing devops teams are here
- amazon and netflix
Embedding ops into engineering teams doesn’t always work if there’s not enough ops. Liasons can help.
- business relationship manager
- helps team navigate ops landscape to prioritize and streamline work requests
- dedicated release engineer
- determine what self service tools an ops group should prioritize building
- help build out infra if needed
- shared goals:
- reduce mid project surprises
- effectively prioritize global ops constraints
- self service to enable devs
- embed ops knowledge and engineers into teams when possible, employ liaisons when not
- don’t mandate, but have golden path
- templates w/monitoring, good environments, basic CI, etc.
- help influence tech choices
- influence project architecture
- create new/identify needed capabilities in internal platforms
- generate new operational capabilities
- be a true part of the team
liaison is responsible for understanding:
- what new thing is and why we’re building it
- how it works from an ops, scale, observability perspective. Diagram stuff
- how to monitor and collect metrics, o11y, for progress, success, failure of functionality
- architecture patterns and any justifications for necessary snowflake aspects
- extra infra needs
- delivery progress and plans
- also attends standups if possible
Improvement of daily work is more important than the work itself.
- Automate environments
- Modify definition of “dev done” to include running in production-like environments
- avoid “done when works on my laptop”
- Commit stage:
- automated unit test
- Stateless. Parallel. Test in full isolation (no real DB, etc).
- static code analysis
- test coverage, duplication coverage
- hard less-than-15 minute time limit.
- Acceptance stage:
- Deploy packages into production-like env
- Can run in parallel
- Run automated acceptance tests
- test api
- test that functionality operates as designed
- this includes performance and other “non correctness” requirements
- use virtualized or simulated services for testing (in memory DB, etc)
- Integration tests:
- smoke test in prod
- use real services
- can be brittle
- can take several hours to run
- Manual testing:
- comprehensive tests that validate deployability
- culture of “stop all work and fix tests” when validation fails
- stop all work for unit tests
- don’t stop + revert on acceptance and integraion tests
- add a unit test to catch failed acceptance validation test
- add acceptance test to catch failed integration
- defects found in exploratory/manula testing should result in unit/acceptance test
- small batches on trunk vs long lived feature branches
- (works less well for OSS models)
- Start with zero tests.
- Add tests that are 100% reliable and are performant.
- If a test ever generates a false positive, remove or rewrite it
- Revert code if necessary to get pipeline green again
- “Works across X versions of tooling/language/os/etc” is usually not a functional requirement
- Test this after unit, maybe in parallel with acceptance.
- optimize for individual productivity: feature branches. Team thoroughput is awful
- optimize for team productivity: trunk dev. Merging is easy. Individual productivity can be “lower”
- ironically this increases ability to do refactoring
Successfull deployment requires devs working with ops to ensure processes don’t reinvent the wheel and that as little knowledge and/or handoff is required to do anything as is posible.
Decouple deploy from release. Decouple database changes from app changes. Decouple individual microservice releases.
Architect for low risk release
- monoliths are good. But be able to switch between monolith and miroservice easily.
- make all APIs versioned
- branch by abstraction, exand contract, strangler pattern
- debug logs (development troubleshooting)
- info logs (events)
- warnings: stuff that could become an error (api call taking too long)
- error condtions: you can be woken up on call for this
- fatal: when we have to terminate
- make metrics part of daily work
- areas of observability:
- business level (ab testing results, conditions of satisfaction)
- application level (transaction times, app faults)
- infra level: (traffic, cpu load, disk usage)
- client level: (app errors, user measured times)
- deployment pipeline level: (pipelne status, lead times, deploy frequencies, test env promotions, env status)
- outlier detection
- detect problems with statistics and separate false positives from actual issues
- FFT, linear regression, “smoothing”, throw out anomalies, etc. Most “real” stuff is not gaussian
- find leading indicators and monitor lagging indicators (but only to determine that your leading indicators are effective)
- What to monitor
- delete all alerts
- after every user visible outage, ask what indicators would’ve prevented it: add those
- now you have alerts that prevent outages rather than ones that indicate it already happened
UX: Contextual inquiry. Watch the customer use the application in the natural environment. Watch for compensatory techniques (googling around, copy and paste, fiddling, tweaking). This will not show up in Q4s but is where most pain happens.
Applies to compilers and devops too:
- Analyze code in the wild. Make it elegant. Document that elegance.
- Analyze project infra in the wild. Make it easy. Document that easy. Replace the custom with the easy.
Dev and Ops can be separate if Ops can hand-back and make dev fix the problem from an ops perspective.
- Example: if submodules are too painful, ask devs to not use them before touching the code
- every team that wants to deploy at google gets a launch readiness checklist and hands-off readiness review.
- SRE helps them define this, work with it, make sure it happens, and assist if needed.