Join Brandon Neill for an in-depth discussion in this video, Troubleshooting introduction, part of VMware vSphere: Network Troubleshooting.
- [Instructor] The goal of troubleshooting is to resolve a problem. Sometimes you can tell immediately what the problem is and quickly apply a fix, but much of the time troubleshooting isn't that simple. It's helpful to have a defined process for troubleshooting, so that you avoid haphazardly applying fixes until something sticks. That approach relies largely on luck to get a rapid resolution, and when the luck runs out, getting a resolution can be slow. It's also easy to get stuck. In addition, having a process makes documenting our results easier. Over time, everyone develops their own methodology for troubleshooting, but those methodologies usually have five elements in common.
The first one is always defining the problem: we need to define the problem as clearly and completely as possible. The second is relying on expertise or research, which is necessary to understand the environment and the processes involved. Next we have scoping, or using testing to further refine the definition of the problem, and then investigating, or digging into the components to understand why they are having a problem. These three elements, expertise and research, scoping, and investigating, are not always linear, but form more of a loop as we go through the troubleshooting process.
The goal of this loop is to eliminate the guesswork. By the time I'm ready to exit the loop, I should have refined the problem to the point where the problem component and the solution jump out at me. And finally, as the fifth element, we need to resolve the problem. Defining the problem might seem easy, but it's important that I be as specific as possible, so that I have a good starting point. I start out by defining what my expectation is. For instance, this could be that I expect a webpage to load in two seconds, or that I expect Computer A to be able to talk to Computer B on port 80. Then I have my actual results. Where my actual results don't match my expectation, this is referred to as a deviation from the expected, and that's what I need to troubleshoot.
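To make the idea of expectation versus actual result concrete, here is a minimal Python sketch. It is only an illustration, not part of the course material: the names `Observation`, `check_tcp`, and `describe_deviation` are hypothetical, and it assumes the expectation is "a TCP connection succeeds within a given number of seconds."

```python
import socket
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class Observation:
    """Actual result of one connectivity test (hypothetical helper type)."""
    connected: bool
    seconds: Optional[float]  # time taken to connect; None if it failed


def check_tcp(host: str, port: int, timeout: float = 2.0) -> Observation:
    """Attempt a TCP connection and record whether it worked and how long it took."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return Observation(True, time.monotonic() - start)
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return Observation(False, None)


def describe_deviation(expected_seconds: float, actual: Observation) -> str:
    """Compare the actual result with the expectation and state the deviation."""
    if not actual.connected:
        return "deviation: no connection"
    if actual.seconds > expected_seconds:
        slower = actual.seconds - expected_seconds
        return f"deviation: {slower:.3f}s slower than expected"
    return "no deviation"
```

A call such as `describe_deviation(2.0, check_tcp("computer-b.example", 80))` would then state the deviation in measurable terms; the host name here is a placeholder, not one from the course.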
As part of defining the deviation, I need to define what I know initially in terms of what, where, and when the deviation is occurring. I also need to define the extent of the deviation, or how far off from my expectation I am. To do this, it is helpful to use measurable metrics whenever possible. In networking, those will often be metrics like latency, bandwidth, and connectivity. When defining the problem, I want to make sure that I list out everything I know, even if it doesn't seem relevant yet. One final word on defining the problem: it's important to note that we're looking at the difference between our expectation and our result. While we are often focused on bringing the result up to the expectation, sometimes we may actually need to adjust our expectations.
For instance, if I'm expecting to get 10 gigabits per second of throughput from a virtual machine that is using a single-threaded process, my research is going to indicate that that's not a reasonable expectation, and so I need to adjust my expectation, not my result. As an example, my initial problem definition could be that Computer A can't connect to Computer B on port 80. That's all I know at this point, so that's all I need to write down. As I mentioned earlier, the elements of expertise or research, scoping, and investigating are the core of troubleshooting, and they form a loop. We can move back and forth between any of these elements as needed during the troubleshooting process.
The first one I'll look at, and the usual starting place after defining the problem, is expertise and research. What I'm really talking about here is knowledge. Our first level of knowledge is our own expertise: what we know about the environment and all of the elements and processes that make it work. Once we've exhausted our own expertise, we can rely on the expertise of others by consulting with coworkers or other experts, reading documentation, or doing online research. The first major goal of research is to document or diagram the environment, so that I understand all of the components that make up the environment in question. In networking, this will often mean drawing a network diagram showing how everything is connected.
Secondly, I want to understand all of the processes involved. In my case, this would include understanding how network protocols work, how switches and routers function, and how traffic is passed inside a virtual machine and inside the VMkernel. One of the goals of this course is to increase your knowledge of how VMware networking works, so that you can troubleshoot more effectively without having to use additional resources. Finally, using that knowledge, I want to make an initial list of possible causes of the problem. At this point, I don't want to do anything with the list until I've done some scoping.
Going back to my example, I can diagram all of the components and how they're interconnected, and then I can list out my possible causes, including VM misconfiguration, switch misconfiguration, or router misconfiguration. Once I have the environment diagrammed, one of the more important yet often skipped steps is scoping. Scoping allows me to determine which components are having an issue, and thus reduce the amount of investigating that I have to do. I do this by testing the environment and determining which components are working and which ones are not, which allows me to refine my initial what, where, and when statements.
My goal is to narrow the scope of the problem by looking for patterns and commonalities among the components that are working versus the ones that are not. For example, all of the machines in question might be connected to the same physical switch, bound to the same physical NIC, or connected to the same vSwitch. Anything like that allows me to see a pattern across all of the failed systems. Be careful not to forget the time component: knowing when the problem started occurring, and whether it occurs all the time or only at certain times, can be helpful in looking for those patterns. When troubleshooting complex problems, I will usually draw on my network diagram to make it easier to see the patterns, and then off to the side I'll write down brief notes about the problem, the environment, and any scoping that I've done.
Going back to the example, we start off with our initial network diagram, and we know that A and B can't talk to each other. I can do additional testing to determine which other components are having a problem: I can see if A can talk to D, if A and C can talk to each other, if C can talk to B, then C to D, and then B to D. By running all of these tests and noting which ones succeeded and which ones failed, I can see what component all of the failures have in common. In this example, that would be the router. That tells me which component I need to investigate.
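The pattern matching in this scoping step can be sketched in a few lines of Python. The topology and test results below are assumptions made for illustration (A and C behind one switch, B and D behind another, with the router in between); the course's actual diagram may differ, and the function name `suspects` is hypothetical.

```python
# Assumed topology: the set of components each test path crosses.
PATHS = {
    ("A", "B"): {"switch1", "router", "switch2"},
    ("A", "D"): {"switch1", "router", "switch2"},
    ("C", "B"): {"switch1", "router", "switch2"},
    ("C", "D"): {"switch1", "router", "switch2"},
    ("A", "C"): {"switch1"},
    ("B", "D"): {"switch2"},
}

# Assumed test results: True means the two machines could talk.
RESULTS = {
    ("A", "B"): False,
    ("A", "D"): False,
    ("C", "B"): False,
    ("C", "D"): False,
    ("A", "C"): True,
    ("B", "D"): True,
}


def suspects(paths, results):
    """Return components shared by every failed path and absent from every working path."""
    failed = [paths[pair] for pair, ok in results.items() if not ok]
    passed = [paths[pair] for pair, ok in results.items() if ok]
    if not failed:
        return set()
    common_to_failures = set.intersection(*failed)
    seen_working = set().union(*passed)
    return common_to_failures - seen_working


print(suspects(PATHS, RESULTS))  # {'router'}
```

Treat the result as a starting point for investigating rather than proof: a component on a working path can still be partly broken, for example a switch with one bad port.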
Once I've narrowed the problem down to a single component or a small group of components, I can begin investigating and digging in deeper on that component. If I start investigating too early, before I've done sufficient scoping, I may have too many components to look at, and get lost in the sheer volume of information. Doing scoping first is important, so that we know which components we need to look at in more detail. Most individual components have multiple subcomponents, so I may need to draw a new diagram. For instance, a router has multiple interfaces, a routing table, and possibly dynamic routing protocols that are all necessary for it to pass packets.
Another good way to start your investigation is a technique called questioning to infinity. Anyone with a five-year-old understands the idea of questioning to infinity: it's continually asking why. Packets aren't getting from subnet A to subnet B. Why? The router isn't passing traffic. Well, why isn't the router passing traffic? It has no entry for subnet A. Well, why doesn't it have an entry for the subnet? Was it supposed to get that entry dynamically, or was it supposed to be a static entry? You keep asking questions until you get to the root cause. If you run into a dead end, back up and try a different path.
However, if you've done sufficient scoping, you'll usually start out on the right path. Investigating can involve looking at configuration and log files, or doing individual packet captures if necessary. Oftentimes we'll have to go back and do more research or scoping as part of our investigating process. Going back to our earlier example, we've narrowed it down to the router, and we can now draw a new diagram focusing just on the subcomponents of the router. We can take a look at the routing table, and we can again refine our possible causes based on our own expertise or the research that we've done, looking at misconfigured interfaces and missing routing entries. Then we can look at the actual configuration and ask: is the interface configuration correct? Are the routing entries correct? If we find one that is missing, now we know exactly what the problem is.
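Checking whether a routing table has an entry that covers a destination can be sketched with Python's standard `ipaddress` module. The subnets and the helper name `find_route` are assumptions for illustration, and a real router's lookup involves more than this (metrics, default routes, dynamic protocols).

```python
import ipaddress

# Assumed addressing: subnet A is 10.0.1.0/24, subnet B is 10.0.2.0/24.
# This hypothetical routing table is missing the entry for subnet A.
ROUTING_TABLE = ["10.0.2.0/24"]


def find_route(routing_table, destination):
    """Return the most specific route that covers destination, or None if there is none."""
    dest = ipaddress.ip_address(destination)
    networks = [ipaddress.ip_network(entry) for entry in routing_table]
    matches = [net for net in networks if dest in net]
    # Longest prefix wins, as in a normal routing lookup.
    return max(matches, key=lambda net: net.prefixlen, default=None)


print(find_route(ROUTING_TABLE, "10.0.2.10"))  # 10.0.2.0/24
print(find_route(ROUTING_TABLE, "10.0.1.10"))  # None
```

A result of `None` for a host in subnet A corresponds to the "no entry for subnet A" answer in the questioning chain, which then raises the next question of whether that entry should have been static or learned dynamically.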
And that brings us to our last step, which is resolving the problem. If we've properly scoped and investigated the issue, then resolving the problem should be relatively straightforward. I will caution, however, that it is important to examine the impact of any changes prior to committing them. For instance, I've often seen what is a relatively minor change to VMkernel networking cause the host to drop all iSCSI traffic, because that traffic was going over an unexpected interface due to a mistake in configuration. So look around at everything else in your environment prior to committing a change, and make sure that it's not going to cause a bigger problem.
Going back to our example, we've noticed as part of our investigating that there's no routing entry for subnet A, so the resolution is simply to add a routing entry for subnet A. Next up, I'll walk through a generic troubleshooting example.
- The troubleshooting process
- Analyzing statistics
- Testing connectivity
- Capturing packets
- Searching log files
- Troubleshooting VMkernel
- Troubleshooting VM networking