The development of the SRE role in DevOps

Cover of the O’Reilly book Site Reliability Engineering

What is Site Reliability Engineering?

According to Google, “SRE is what you get when you treat operations as if it’s a software problem. Our mission is to protect, provide for, and progress the software and systems behind all of Google’s public services (Google Search, Ads, Gmail, Android, YouTube, and App Engine, to name just a few), with an ever-watchful eye on their availability, latency, performance, and capacity.”

What does an SRE do?

An SRE uses software development skills to make the system (platform, infrastructure, environment) more reliable. Here, making a system reliable means taking responsibility for the availability, performance, efficiency, change management, monitoring, emergency response and capacity planning of the service(s). In practice, an SRE will:

  • embrace risk,
  • work with Service Level Objectives (SLOs),
  • eliminate mundane repetitive operational work (Toil),
  • monitor as much as possible,
  • automate as much as possible,
  • make sure releases are consistent,
  • and, lastly, keep it simple!
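To make “work with SLOs” and “embrace risk” a little more concrete, here is a minimal Python sketch of an error budget, assuming an availability SLO expressed as a fraction of a time window. The error budget is the share of the window (1 minus the SLO target) that the service is allowed to be unavailable. The function names and the release rule in `may_release` are hypothetical illustrations, not a standard API or any particular team’s policy.

```python
from datetime import timedelta


def error_budget(slo_target: float, window: timedelta) -> timedelta:
    """Allowed unavailability in `window` for an availability SLO.

    `slo_target` is a fraction, e.g. 0.999 for "three nines".
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    # The budget is the part of the window the SLO does not promise.
    return timedelta(seconds=window.total_seconds() * (1.0 - slo_target))


def budget_remaining(slo_target: float, window: timedelta,
                     downtime: timedelta) -> timedelta:
    """Budget left after `downtime` has been spent (negative if overspent)."""
    return error_budget(slo_target, window) - downtime


def may_release(slo_target: float, window: timedelta,
                downtime: timedelta) -> bool:
    """Hypothetical policy: allow new (risky) releases only while budget remains."""
    return budget_remaining(slo_target, window, downtime) > timedelta(0)


if __name__ == "__main__":
    month = timedelta(days=30)
    print(error_budget(0.999, month))
    print(may_release(0.999, month, timedelta(minutes=20)))
    print(may_release(0.999, month, timedelta(hours=1)))
```

With a 99.9% target over 30 days, the budget is 0.1% of 43,200 minutes, roughly 43 minutes of allowed downtime. Treating that budget as something to spend, rather than aiming for 100% availability, is one way of embracing risk: as long as budget remains, the team can keep shipping changes.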

How do you become an SRE?

It starts with the drive to discover where things can be improved and to take that on as a challenge. You have to want to improve continuously. You are curious about the (technical) possibilities and use them to improve and further develop the current situation.

Richard Hoedeman

Accredited trainer / coach for Lean-IT, DevOps, PRINCE2, Agile PM, PRINCE2 Agile, OBM and Scrum.