Handle problems and backsliding with incident and sprint retrospectives, giving the team more control.
- Making changes in a large organization is an incredibly difficult task. Ernest and I had read The Phoenix Project and Leading the Transformation, and we realized there'd be roadblocks along the way. - Some roadblocks came while implementing new ideas and perspectives, while others came from inertia as we tried to change the way we operated day to day. - To make a change, sometimes you have to unlearn what you've already learned. Let it go and then try something new. - One of the earliest issues we uncovered came from including both Devs and Ops in design reviews.
- Typically, design reviews happen before the implementation phase, and they're one of the earliest things done on the development side. - In our traditional enterprise, operations had never been included in any of these activities. And mixing the teams, at least initially, didn't go well. - Operations and test engineers brought up issues with respect to scaling, automated testing, load balancing, and redundancy at design time. - While these were legitimate concerns, they were brought up too early in the life cycle for the comfort of developers.
Some of them said, "But I haven't built anything yet. Why should I worry about scaling issues?" or "I don't even know why Ops is in this meeting right now." - The developers and ops engineers didn't even understand each other's terminology around the system. One of the devs suggested that we have separate design reviews: one for Dev and one for Ops. - We had to put a stop to that. Carthic and I agreed that we wanted to build a scalable system and that everyone's feedback was important. It was also important to us that everyone understood the core system architecture up front, and participating in the design reviews together was crucial for that.
In this situation, we made everyone power through the initial discomfort. - This eventually led the team to a common language to communicate and rally around, and it was beneficial in the long run. - Iteration demos were also critical. During each sprint, different members of the team would work on different pieces of the software: writing a feature, writing automated tests, or building the deployment pipeline. - Big changes would often be introduced in an iteration, and not everyone was familiar with them because the whole team was heads down in their own tasks.
- [Instructor] The demos really brought things together. The entire team demoing their individual work to each other at the end of the iteration helped them all, Dev, Test, Ops, product managers, understand different portions of the system and gave them a chance to ask questions and have a shared understanding of the entire product. - Our biggest issues actually occurred when things went wrong. An accidental configuration change was checked in and deployed at the end of the iteration. This ended up breaking all of our download links for our service and made it unusable for a whole day.
- We had inbound customer complaints and we even had to have a conversation with the CEO about why the service wasn't working. - When the team investigated the root cause, almost immediately everyone blamed the one Ops engineer who had committed the bad code. - We realized that this was also a habit we had to break and we used this scenario to introduce blameless postmortems into our culture. - In all of our previous situations, engineers would get blamed and punished for incidents like this.
It would force the engineer into silence and lead to the practice of cover-your-butt engineering out of fear of punishment. - [Instructor] Instead, we implemented blameless postmortems, where those who had participated in an incident can give a detailed account of what actions they took at what time, what effects they observed, expectations they had, assumptions they made, and their understanding of the timeline of events as they occurred. - [Instructor] And most importantly, they can give this detailed account without fear of punishment or retribution.
- [Instructor] We adopted the theory that failures will always happen. However, to understand how they happen, we first have to understand our reactions to failure. - One reaction can be to blame the person who's responsible. - Another would be to take a hard look at how the accident actually happened, treat the engineers involved with respect, and learn from the event. - The postmortems stopped focusing on blame and started focusing on how to actually fix the underlying causes behind the failures and how to improve our systems to better detect and recover from those failures.
- These were some of our major setbacks in the middle of the project, but we were able to get past them using new techniques and shared understanding. Little did we realize that we were soon to face even bigger challenges.