From the first alert through active response to post-incident review, great communication is the key to effective incident response.
JJ Tang
Many of the concepts SREs take for granted about incident management originated with efforts to fight fires in California in the 1970s.
SREs face special challenges during the holidays. Here’s how to manage them.
Although every company can benefit from SREs, some need SREs more than others.
A history of Site Reliability Engineering from its origins at Google in 2003 to the present.
An explanation of the meaning of SLA, SLO and SLI, and how SREs should use each concept to manage reliability.
Learn about the key roles within an incident response team, as well as optional incident roles you may not have thought about.
An SRE’s analysis of the October 2021 Facebook outage.
A comprehensive definition of SREs and Site Reliability Engineering, including what SREs do and what makes SREs different from other roles.
Maintaining your incident management practice is as important as creating it. Find out what you can do to keep your engineering organization strong and consistent year over year.
The Four Golden Signals of monitoring and observability get a lot of things right. But they could be even better.
Although the fundamental concepts of site reliability engineering are the same in any environment, SREs must adapt practices to different technologies, like microservices.