Postmortems

Mason Yu

24 Feb 2025 — 4 min read

I'll admit it; I just couldn't ignore the opportunity to follow up a post about death with a post titled "Postmortems."

Postmortems, in the corporate context, are documents created after the occurrence of an outage or similarly undesired event with sufficient negative impact. The postmortem for that incident details the root cause analysis, as well as proposed (and hopefully deployed) changes to prevent, or at least reduce the likelihood of, a similar event in the future.

Most often, postmortems are kept company-confidential, but sometimes the user-facing impact is so great that a polished, redacted version of the postmortem is made available to the public (cf. CrowdStrike).

Google has (or had when I worked there) a "blameless postmortem culture." Especially in Ads, one might imagine how stressful it is to have caused a production incident that can be measured in dollars lost by Google per second, so it is important to have the reassurance that one will not be held personally responsible for the ramifications of well-intentioned but ultimately outage-causing actions.

So if postmortems don't focus on blaming individuals, what do they focus on? The guiding perspective is that outages happen because of failures in processes, rather than individual mistakes. Let's set up a scenario for illustration. The engineer on call for a major production service is paged by an alert reporting a high error rate for the service in North America. They correctly identify the root cause as underprovisioned servers returning overload errors and, as per the team's playbook, submit an updated replica count configuration to upsize the servers in North America. Unfortunately, the oncall makes a typo in the configuration, which exposes a latent bug in the capacity provisioning control plane and takes down every server hosted in North America.

A "blameful" approach might focus on the typo and attribute the outage to perceived carelessness, but an important realization is: people are going to make mistakes. Expecting production stability to rest on no one who changes configuration ever making a typo is inane.

The blameless approach looks at this situation as a process failure. Can we have a blocking presubmit automatically testing the configuration against a staging environment? Can we add or improve a linter to flag or automatically fix this class of typo in the configuration? Can we fuzz for this kind of misformatted configuration to test that we can gracefully handle more kinds of bad input?
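As a rough illustration of the linter idea, here is a minimal sketch of a presubmit-style check for the replica count scenario above. Everything in it is an assumption made up for this post: the `ReplicaChange` schema, the `lint_replica_change` name, and the rule that a single change may not scale a region by more than 2x in either direction. A real control plane would have its own schema and limits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplicaChange:
    """A proposed replica count change for one region (hypothetical schema)."""
    region: str
    current: int
    proposed: int


def lint_replica_change(change: ReplicaChange, max_ratio: float = 2.0) -> list[str]:
    """Return human-readable problems; an empty list means the change passes.

    max_ratio is an assumed policy knob: the largest factor by which one
    change may scale a region up or down.
    """
    problems = []
    if change.proposed <= 0:
        # A zero or negative count would take the whole region down.
        problems.append(
            f"{change.region}: replica count must be positive, got {change.proposed}"
        )
        return problems
    if change.current > 0:
        ratio = change.proposed / change.current
        if ratio > max_ratio or ratio < 1 / max_ratio:
            # Catches slips such as an extra or missing digit (40 -> 400, 40 -> 4).
            problems.append(
                f"{change.region}: {change.current} -> {change.proposed} "
                f"changes capacity by more than {max_ratio}x; needs explicit approval"
            )
    return problems


if __name__ == "__main__":
    for change in (
        ReplicaChange("north-america", 40, 60),   # plausible upsizing
        ReplicaChange("north-america", 40, 600),  # likely an extra digit
    ):
        print(change, lint_replica_change(change) or "OK")
```

The point of a check like this is not to catch every possible mistake. It makes one class of typo loud before submission, so the process absorbs the error instead of the person.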

There are plenty of things one can add through processes and tools that can make it easier to do the right thing or harder to do the wrong thing. It's probably impossible to eliminate the possibility of outages, but the attitude of the blameless postmortem process ideally makes it less likely that identical outages happen again.

OK, so that's nice. I think this attitude is similarly helpful when we look at self-improvement and lifestyle optimization.

I've tried at various points in my life to introduce healthier habits, like exercising three times a week, or having a more consistent bedtime. These attempts usually sputter out, and after reflecting on the lack of consistency, I conclude harshly, "I must just be {lazy, undisciplined, unmotivated} and should just give up on this."

I think this is similar to a blameful postmortem. "Why did this attempt to change myself fail?" "Well, there must be intrinsic qualities in me which make me unsuitable for that change." Given the challenges of maintaining mindfulness and changing habits, struggling with self-improvement is a very human experience, so the blameful approach isn't helpful.

What might a blameless postmortem look like for the self?

Firstly, any self-directed critiques will tend towards offering more grace. Acknowledging the challenges of self-induced change, we can focus on brainstorming processes that remind us of the new habits we want to create, or that lower the barrier to starting an activity aligned with our intended change, rather than attributing the lack of success to innate qualities.

Atomic Habits by James Clear suggests an approach along these lines: use "habit stacking" to attach the new habits we want to old habits we already have, like lining up "go to the gym" after "having a cup of coffee in the morning," while also keeping your gym bag and athletic shoes close to where you typically have coffee.

I want to take a data-driven approach to this process, to identify preexisting trends or habit chains that are beneficial as well as those that are unhelpful. For example, I might want to ask, "How many hours did I spend browsing Reddit in 2024? Or driving? Or reading books? Is there a pattern or correlation with other activities, or with the amount of sleep I got the night before?" There are a variety of apps to handle the digital side of things, but there's no good universal solution to input, store, and analyze more general data on our behavior and how we spend our time.
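To make those questions concrete, here is a small sketch of what answering them could look like with nothing but the Python standard library. The log format, the sample entries, and the function names are all assumptions invented for illustration, not an existing tool. It also assumes each sleep entry is logged under the date you woke up, so pairing entries by date compares an activity with the sleep from the night before.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical log entries: (ISO date, activity, hours). The data is made up.
LOG = [
    ("2024-03-01", "sleep", 6.0),
    ("2024-03-01", "reddit", 2.5),
    ("2024-03-02", "sleep", 8.0),
    ("2024-03-02", "reddit", 1.0),
    ("2024-03-03", "sleep", 7.0),
    ("2024-03-03", "reddit", 1.5),
    ("2024-03-03", "reading", 1.0),
]


def total_hours(log):
    """Sum the hours spent on each activity across the whole log."""
    totals = defaultdict(float)
    for _date, activity, hours in log:
        totals[activity] += hours
    return dict(totals)


def daily_hours(log, activity):
    """Map each date to the hours spent on one activity that day."""
    per_day = defaultdict(float)
    for date, name, hours in log:
        if name == activity:
            per_day[date] += hours
    return dict(per_day)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / sqrt(var_x * var_y)


def correlate(log, activity_a, activity_b):
    """Correlate two activities over the days on which both were logged."""
    days_a = daily_hours(log, activity_a)
    days_b = daily_hours(log, activity_b)
    shared = sorted(days_a.keys() & days_b.keys())
    return pearson([days_a[d] for d in shared], [days_b[d] for d in shared])


if __name__ == "__main__":
    print(total_hours(LOG))
    print(correlate(LOG, "sleep", "reddit"))
```

Three days of invented data prove nothing, of course, and a correlation would not say which way the causation runs. The hard part is the one described above: getting the data into a log like this in the first place.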

I'm in the early stages of trying to come up with a solution to this. Various productivity gurus across platforms have covered things like bullet journaling, time blocking, habit tracking, etc., and I think there's a solution to be distilled from the combination of these approaches to better inform people as they embark on their journeys of self-improvement.

A commonality I see amongst these approaches is a mechanism for self-accountability. Freehand journaling is at the very least an opportunity to reflect on how the day or the week went, even if that particular entry is never read again. Bullet journaling and habit tracking provide a more structured set of goals to reach or habits to maintain, which enables more long-term accountability for how often the things we want to do get done. Time blocking (if one sticks to the proposed time blocks) provides a very sharp image of exactly how we spend time in a week, which is the most specific kind of behavioral accountability one can enforce: not only what we do, but when we do it.

I suppose my ideal "blameless postmortem culture" for self-improvement would be to use some sort of tool periodically to assess how time was spent, and identify opportunities for improvement through the introduction or expansion of certain systems or processes.
