If you are a regular listener of the podcast or an e-commerce watcher, you know “the season” is very important to us. It is a recurring theme in the podcast every year. You are probably also aware that the uptime and responsiveness of our app and website are crucial. And you might have noticed that enabling our software engineers to perform at their peak matters a great deal to us. Enabling teams and engineers is how we build a great place to engineer.
And sometimes things just go sour. A perfect storm occurs that is definitely not a tailwind…
As our CTO likes to say, “never waste a good crisis”. We have to learn from what happened. Let’s explore one of those incidents. We go back to the start of the 2019 season. Just before the Friday Afternoon Drinks began, a huge incident started in our Android app. This triggered downtime in other areas of the platform as well. And, much like when investigating a plane crash, there was not just one thing that was off: a series of unlikely things happened in a short span of time. Let’s dive into this.
What the episode covers
- Why is learning from failures an important topic to share?
- Some context: what part of the landscape are we talking about in this episode?
- What was your perspective? What were you doing and what happened?
- Taking a few steps back: What was the incident management process, and how did we fix the issue step by step?
- When the dust settled: What did we learn? What did we improve?
- Julius van Dis – Full-stack engineer at Flock. He was responsible for the app, specifically its direct backend. His projects include making the app and service landscape multilingual, migrating and integrating a new gateway, creating a basket API, and improving app updates.
Peter Paul van de Beek