From the course: Cybersecurity with Cloud Computing

Unlock the full course today

Join today to access over 22,600 courses taught by industry experts or purchase this course individually.

Anatomy of a service failure

Anatomy of a service failure

From the course: Cybersecurity with Cloud Computing

Start my 1-month free trial

Anatomy of a service failure

- [Instructor] In September 2019, Google's VP of engineering apologized to customers for taking far longer than the company expected to recover from a configuration mishap which impacted 1% of more than 1 billion Gmail users and disrupted YouTube and Google cloud storage traffic. Google's cluster management software is used to automate the response to data center maintenance activities, such as power infrastructure changes or adding additional network equipment. Cluster management jobs are labeled with an indication of how they should behave in the face of such an event. They can be moved to a machine which is not under maintenance or stopped and rescheduled after the event. The incident occurred when two normally benign misconfigurations were made in the presence of a specific software bug. Network control plane jobs were configured to be stopped if maintenance took place. All the instances of cluster management…

Contents