Discover that Node applications consist of a multitude of layers, including services, modules, and custom code. Learn how to isolate the problem to find the solution.
- [Instructor] Everybody wants a fast and reliable site. Awesome, static HTML for everyone. No moving pieces means almost nothing can break. However, we also want sites to actually do things. So we turn to dynamic sites with scripting languages like Node.js. This way, we can build wonderful, interactive sites. Unfortunately, one of the risks of a dynamic site is that more moving pieces mean more chances for error.
According to Steve McConnell in Code Complete: A Practical Handbook of Software Construction, the industry average experience is about one to 25 errors per 1000 lines of code for delivered software. Errors happen, and that's normal. The question is, what do we do about errors? We can't just ignore the problem and hope it goes away. This course's goal is to empower you to diagnose and fix quality and performance issues in Node.js sites.
We'll do this with a combination of theory, tools, and techniques. In the first chapter, we're going to build a constructive troubleshooting mindset to use when dealing with problems that affect sites. This way, we'll have a common theoretical background to practically apply throughout the rest of this course. We'll start with the art of finding what went wrong, which includes how to frame your investigation. Then, we'll look at how to effectively measure performance beyond just saying that the site's slow.
We'll compare different ways of documenting bugs, so we can communicate problems with others. Finally, we'll learn what it takes to effectively resolve problems and why that's important. When something goes wrong, our instinct is to blame. Blame, when used as a verb, means to assign responsibility for a fault or wrong. In the Oxford English Dictionary, the example even pins blame on the engineer for an accident. Oh no, the site's slow, whose fault is it? Let's find out.
It's the marketing team's fault, they wanted the analytics and they're slowing down the site. It's the engineering team's fault, they didn't architect the site correctly. It's the users' fault, because they're abusing the services. Suddenly, the site's down. Oh my, I know it wasn't me, so whose fault is it? It's operations' fault, they didn't put the site in multiple data centers. It's the host's fault, they didn't have backup power generators. It's nature's fault, that thunder, snow, hurricane, flaming meteor wasn't on the shared team calendar.
Technically, all those statements could be true, but reality is more nuanced. Marketing needs business intelligence, and they use industry-standard tools that they're familiar with. Engineering structured the site based on the incomplete requirements they received. There wasn't any mechanism for detecting user abuse, rate limiting usage, or banning offenders. The site isn't in multiple data centers due to geopolitical differences around user privacy. The data center did have backup generators, but they hadn't been maintained and half of them failed.
And sometimes, a giant flaming meteor happens and it's not on the product roadmap. The act of blaming is cathartic. Directing or redirecting frustration feels good. See, it's not my problem, it's theirs. However, blame is harmful because everybody is at fault and it's a group responsibility. Finger-pointing doesn't help. Instead of blame, determine what went wrong as a constructive step when responding to a problem.
Unlike assigning blame, the goal is to resolve the causes of the problem so they don't happen again. This is a shared responsibility. Marketing should review how they're using their tools. Engineering should find a more flexible architecture for serving content. User activity should be more appropriately monitored for abuse, and a mechanism should be built to handle it. The legal and business teams need to find common ground on user privacy in order to comply with local regulations. The data center needs to maintain its backup systems.
And giant meteors should be planned for, within reason of course. In order to investigate effectively, you need reporting and quantifiable data. Where do you go for this? Well, what did the application log say? Do you have logs? Do the log messages give context about what's happening and how? Can you find and trace the exact request that failed? Finally, can you reproduce the problem with what you know? Debugging doesn't require logging, but having a system act or fail silently doesn't help.
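To make that concrete, here's a minimal sketch of request-scoped logging in a plain Node.js HTTP server. It's an illustration under my own assumptions, not code from this course; the log helper and its field names are hypothetical.

```js
// A minimal sketch of request-scoped logging in plain Node.js.
// Tagging every log line with a request ID lets you find and trace
// the exact request that failed.
const http = require('http');
const crypto = require('crypto');

// Emit structured JSON lines: easy to search, filter, and correlate.
function log(level, requestId, message, extra = {}) {
  console.log(JSON.stringify({
    time: new Date().toISOString(),
    level,
    requestId,
    message,
    ...extra,
  }));
}

http.createServer((req, res) => {
  const requestId = crypto.randomBytes(8).toString('hex');
  log('info', requestId, 'request received', {
    method: req.method,
    url: req.url,
  });
  try {
    // ...application logic would go here...
    res.end('ok');
    log('info', requestId, 'request completed');
  } catch (err) {
    // A logged failure with context beats a silent one.
    log('error', requestId, 'request failed', { error: err.message });
    res.statusCode = 500;
    res.end();
  }
}).listen(3000);
```

Every line a given request produces shares one requestId, so when a user reports a failure, you can pull up that request's whole story and try to reproduce it.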
The problem can exist in multiple places and some solutions require changing multiple things. That's okay and it's normal. The goal is to discover what went wrong and where to fix it. All right, so what do I do when the site's working but it's slow?