From the course: DevOps Foundations: Distributed Tracing

Why do we need distributed tracing now?

- [Instructor] Every minute of performance degradation or downtime in your applications could cost your business millions of dollars. Weaving the right diagnostic capabilities into your apps is key to allowing rapid triage of problems. Let's discuss what distributed tracing is and why it's a critical part of system observability.

Distributed tracing is a relatively new technique compared to the canonical logs and application KPI metrics that most people use for monitoring. It is becoming a necessity nowadays with the increasing complexity of apps built on top of middleware and microservices. Here's a rendition of a cloud native app my daughter Sylvia drew for me. Does it remind you of your apps? It is humanly impossible to pinpoint bottlenecks in a complex app transaction touching hundreds of systems by just looking at log data and traditional KPIs.

Distributed traces capture each discrete unit of work for the app and its path through a distributed context. For a claims processing app, a trace could record how a single claim is processed throughout the tiers of microservices handling different parts of a transaction. Each part could occur in a local or remote system, and is represented as a trace span. A distributed trace is a collection of sequenced spans, with timing information about each. It lets you see the path and timings of each step of a transaction as it traverses your system.

You may be thinking, "Great, now I need to worry about one more thing to manage and use." But let's take a look at a few problem examples that may convince you it will be worth the effort. We can group these problems into three categories: logic flaws, dependency failures, and workload hotspots.

The first category is app logic flaws. For example, an untested scenario may surface in production, with a null pointer condition creating performance problems. Logging the null pointer exception will give us a clue that we have a problem.
But with a distributed trace, we could pinpoint what the user was trying to do in a tier above the one where the exception happened.

The second most common category is dependency failures. For example, a DNS lookup may be slower than expected, or a microservice call could be causing the app to slow down. This is where distributed tracing is extremely valuable. In a distributed environment, there is always a chance that some tier downstream is unresponsive, or that the application is accessing the wrong service. Knowing how a dependency performs, or whether it's failing, helps us narrow down the issue very quickly to the exact place in the trace. The alternative is to spend days sifting through hundreds of logs, or to hope we're capturing the exact metrics we need to find out.

The third category is workload hotspots causing downstream overload. Production systems may have different workloads than the ones that were tested for, resulting in the CPU not being able to meet your app's demand. Or perhaps marketing just sent an email blast that the systems aren't sized for. Although logs and system and application metrics give us some idea about the increased volume, the distributed trace will give us context on how front-tier requests multiply the workload downstream. For example, a single request of a certain type can generate hundreds of database calls downstream and overload the database. But the database isn't the source of the problem, so logs and metrics from the database alone don't show the true cause of the issue.

Distributed tracing will help you gain your sanity back when you're diagnosing complex app problems, and help you improve your mean time to resolution.
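
To make the idea of a trace as "a collection of spans with timing information" more concrete, here is a minimal Python sketch. It is not taken from the course and does not use any real tracing library: the `Span` class, its field names, the service names, and all timings are hypothetical, made up to mirror the claims-processing example. It shows how parent links between spans let you find a slow dependency, count how many downstream calls one front-tier request generated, and walk back up to what the user was doing.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    """One unit of work in a trace (hypothetical, simplified shape)."""
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    service: str
    operation: str
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def build_claim_trace(db_calls: int = 200) -> List[Span]:
    """Build made-up spans for a single claim submission."""
    spans = [
        Span("a", None, "frontend", "POST /claims", 0, 950),
        Span("b", "a", "claims-service", "process_claim", 10, 940),
        Span("c", "b", "dns", "lookup", 12, 312),  # a slow dependency
    ]
    # One front-tier request fans out into many small database calls.
    for i in range(db_calls):
        start = 320 + i * 3
        spans.append(
            Span(f"db{i}", "b", "database", "SELECT policy", start, start + 2)
        )
    return spans


def slowest_leaf(spans: List[Span]) -> Span:
    """Return the slowest span that has no children (a likely bottleneck)."""
    parent_ids = {s.parent_id for s in spans}
    leaves = [s for s in spans if s.span_id not in parent_ids]
    return max(leaves, key=lambda s: s.duration_ms)


def calls_per_service(spans: List[Span]) -> Counter:
    """Count spans per service to expose downstream amplification."""
    return Counter(s.service for s in spans)


def path_to_root(spans: List[Span], span_id: str) -> List[str]:
    """Follow parent links upward to see which request led to this span."""
    by_id = {s.span_id: s for s in spans}
    path = []
    current = by_id.get(span_id)
    while current is not None:
        path.append(f"{current.service}:{current.operation}")
        current = by_id.get(current.parent_id) if current.parent_id else None
    return list(reversed(path))


if __name__ == "__main__":
    trace = build_claim_trace()
    worst = slowest_leaf(trace)
    print("slowest leaf:", worst.service, worst.operation, worst.duration_ms, "ms")
    print("spans per service:", dict(calls_per_service(trace)))
    print("path to db0:", " -> ".join(path_to_root(trace, "db0")))
```

In this toy data the database spans are each fast, so database metrics alone would look healthy per call; it is the per-service count and the parent links back to the single front-tier request that reveal the amplification. A real system would normally rely on a tracing library to record and propagate spans rather than hand-built structures like these.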
