Incidents happen, and we can and should always learn from them, to be better prepared for the next time things go wrong.
A great tool for that is the post-mortem: a process designed to recap the incident, learn from mistakes and improve the system as a result.
There are some basic principles that can help achieve a good post-mortem process. They are only guidelines and I recommend adapting them to what works best in your organization.
Blamelessness & Hindsight bias
In hindsight it is easy to say things like "I should have done X" or "I forgot to do Y".
However, it is important not to focus on who did something wrong, but instead to take a system-level look at the incident,
identifying where processes were missing, failed or did not cover the failure mode at hand.
Always remember, we are doing this to be prepared for the next time, when we will be equally unaware of entirely new problems.
Like any process, a post-mortem needs reasonably well-defined desired outcomes. These could be improvements to the code, the incident process, monitoring & alerting or infrastructure.
At the end of any post-mortem we should be able to define action items of different kinds that will help us in future incidents.
They range from direct action, like fixing the root cause of the incident, to more indirect actions that are preventative or analytic in nature and may take a longer time to implement, like creating new metrics and alerts, changing the architecture of the system or accepting failure and handling it better.
It is important to assign an owner to each action item. Whether this is a manager or a contributor may depend on the action item and your organization's structure.
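As a minimal sketch of how action items and their owners could be recorded, assuming a simple in-memory representation: the names `Kind`, `ActionItem` and `unowned` are hypothetical and only illustrate the idea, they are not part of any particular tool.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Kind(Enum):
    """Rough categories of action items (hypothetical naming)."""
    DIRECT = "direct"              # e.g. fixing the root cause
    PREVENTATIVE = "preventative"  # e.g. new alerts, architecture changes
    ANALYTIC = "analytic"          # e.g. new metrics to understand the system


@dataclass
class ActionItem:
    title: str
    kind: Kind
    owner: Optional[str] = None  # stays None until someone takes ownership


def unowned(items: List[ActionItem]) -> List[ActionItem]:
    """Return the action items that still need an owner."""
    return [item for item in items if not item.owner]


items = [
    ActionItem("Fix the failing retry logic", Kind.DIRECT, owner="alice"),
    ActionItem("Add an alert on queue depth", Kind.PREVENTATIVE),
]
for item in unowned(items):
    print(f"Needs an owner: {item.title}")
```

A check like `unowned` can be run at the end of the meeting to make sure no action item leaves the room without an owner.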
A basic timeline of the incident should be created before the post-mortem meeting. This allows the meeting to start more quickly, gets people back into the situation and steers everyone to focus on the time between start and closure of the incident.
Choosing the starting and ending events is really important here and takes some practice. However, it is always possible to add entries from before the chosen start or after the chosen end while running the post-mortem meeting.
In addition the timeline can be pre-filled with events from your monitoring system and the times different people joined the incident resolution process.
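Pre-filling the timeline could look roughly like the following sketch. It assumes that alerts and responder join times are already available as `(timestamp, text)` pairs. How you export them depends on your monitoring and chat tooling, and `TimelineEntry` and `build_timeline` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Tuple


@dataclass(frozen=True)
class TimelineEntry:
    at: datetime
    source: str       # "monitoring" or "responder" in this sketch
    description: str


def build_timeline(
    alerts: List[Tuple[datetime, str]],
    joins: List[Tuple[datetime, str]],
    start: datetime,
    end: datetime,
) -> List[TimelineEntry]:
    """Merge alerts and responder joins, keep entries between the chosen
    start and end events, and sort them chronologically."""
    entries = [TimelineEntry(at, "monitoring", text) for at, text in alerts]
    entries += [
        TimelineEntry(at, "responder", f"{name} joined the incident")
        for at, name in joins
    ]
    in_window = (e for e in entries if start <= e.at <= end)
    return sorted(in_window, key=lambda e: e.at)
```

The start and end of the window stay explicit parameters, so they are easy to widen during the meeting if earlier or later events turn out to matter.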
The post-mortem meeting
In order to run a good post-mortem meeting you should prepare a basic timeline in a scratch-pad style environment where everyone can contribute to it. This could be a physical or virtual whiteboard, a Google doc or anything that allows collaborative editing.
The meeting should be scheduled soon after the incident, so that the details are not forgotten.
Who should join the meeting?
Everyone who was involved in the incident resolution process should be present to give input. There may also be additional people, like the system's architect or owner, who can add relevant insights.
If your post-mortem process is well established, it is also an interesting idea to allow any engineer in your organization to join silently and learn from the process. You may even find that they can add valuable insights or volunteer for engineering work that makes the system more stable in the future.
Depending on the context of the incident and post-mortem, it can be a great idea to designate a devil's advocate, who is tasked with challenging the assumptions and results of the post-mortem.
Talk to the chosen attendee before the meeting to make sure they understand that they should only challenge technical factors and absolutely not other attendees personally, as per the blameless principle.
This is a basic structure to run a post-mortem meeting that worked great for me in the past:
Set the scene:
Remind the group of the blameless principle, hindsight bias and to focus on what instead of who.
Recap the incident:
Show your prepared timeline, give some context around it and maybe suggest an area to focus on to start.
Start by asking an attendee an open-ended question leading towards the incident and let the group develop a flowing conversation about it, steering them to stay on topic and blameless as required.
While the group is discussing the incident, try to formulate some action item ideas, fill out the timeline and ask any clarifying questions to create a better understanding of the incident.
Go over your action item ideas and let the group add or remove any by consensus.
In the end all action items should have an assigned owner from the group that feels responsible for working on them.
A note on implementing this as a process
Post-mortems work best when they are used as a regular tool in your improvement cycles, so after trying them out it is best to implement them as a process for people to follow.
This could include:
- A template for the post-mortem document. This lets people rely on the structure and easily understand new documents
- A way to track resulting action items. I found it best to use labels or tags in your usual ticket system for that
- Guidelines on when to invoke the process, for example: every sev1 incident automatically gets one, lower-severity incidents get one at the discretion of the incident manager
- A pool of moderators for the post-mortem meetings, responsible for preparing and running the meetings
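The example guideline on when to invoke the process can be written down as a small rule, which makes it easy to automate in incident tooling. This is only a sketch: it assumes severities are numbered so that 1 is the most severe, and the function name `postmortem_required` is hypothetical.

```python
def postmortem_required(severity: int, manager_requested: bool = False) -> bool:
    """Decide whether an incident gets a post-mortem.

    Assumption: severity 1 ("sev1") is the most severe level.
    sev1 incidents always get a post-mortem; lower severities only
    when the incident manager asks for one.
    """
    if severity < 1:
        raise ValueError("severity levels start at 1 in this sketch")
    if severity == 1:
        return True
    return manager_requested
```

Your own thresholds may well differ. The point is that the rule is explicit enough that nobody has to debate it during an incident.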