From the course: Kubernetes Essential Training: Application Development

Ensuring availability with liveness and readiness probes - Kubernetes Tutorial


- [Instructor] One of the main reasons to use Kubernetes is that it automates many traditional operations tasks. We've already seen that with just a little information, really just a container image, Kubernetes can already do a lot: pick a worker node, start a container, configure a load balancer. Lots of the little things that keep us busy day to day. One of the other really useful things it can do is simply keep our software running, however much we've moved fast and broken things. Now, I realize I've asserted a lot that Kubernetes will restart a pod if it crashes, but I've not actually shown it to you. So, just so you believe me, let's go through that now. What I'm going to do is make a deployment of one pod, and this is running a new image, a little utility I wrote called envbin. I've still got the same Minikube Ingress setup that we saw in an earlier video, so I can also apply a definition of a service and a definition of an Ingress. With those in place, we should be able to come up into our web browser and address envbin.example.com. And here we go, we've got a little service. It prints out a bunch of information about itself, but what we want to look at is that every time this process starts, it gives itself a session name, so we can tell whether we're talking to the same instance as before or not. This one's called itself clever_nobel, and it's had one request, and that's the request that we just sent it. Now, actually, let's reload that. Okay, we can see we're still talking to clever_nobel. The program hasn't restarted, but it's seen the second request. Now, if I come down to the bottom, I can tell this process to quit itself. So pick an exit code, maybe four, and hit Exit. And it's saying it exited with code four. If we go back and try to reload, we now see practical_almeida. So this is a new instance of this piece of software; it's been restarted.
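A minimal sketch of the kind of manifests applied here. The image reference, hostname, and labels are assumptions for illustration, not the course's exact files:

```yaml
# Deployment: run one envbin pod (image reference is hypothetical)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: envbin
spec:
  replicas: 1
  selector:
    matchLabels:
      app: envbin
  template:
    metadata:
      labels:
        app: envbin
    spec:
      containers:
        - name: envbin
          image: example/envbin:latest   # assumed image name
          ports:
            - containerPort: 8080
---
# Service: routes traffic to pods matching the label selector
apiVersion: v1
kind: Service
metadata:
  name: envbin
spec:
  selector:
    app: envbin
  ports:
    - port: 80
      targetPort: 8080
---
# Ingress: exposes the Service at envbin.example.com
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: envbin
spec:
  rules:
    - host: envbin.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: envbin
                port:
                  number: 80
```

Applied with `kubectl apply -f`, these three resources give the browser a route from envbin.example.com through the Ingress and Service to the single pod.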
And it's already seen two requests, because I got a bit eager and pressed the refresh button twice there. So if we go back to the terminal, we can look at the pod directly. Here's the pod that was made, and it's indeed had one restart. So it quit; it effectively crashed, because it exited with a nonzero return code. Kubernetes noticed, stepped in, and started it again as fast as it could. I wasn't even able to get in there and press F5 while it was restarting, so that really did happen very quickly. It's the same pod; the pod name hasn't changed. So this metadata wrapper around the container has stayed the same. We still have an intent to run this pod, the same intent that we always did, but in order to actually keep it giving service, it had to be restarted in place once. So, it's obvious when a program crashes; Kubernetes can notice that by itself, as we've seen. But there's really only so far it can get on its own. In order to really help you out, it needs to know more about your services. Like any site reliability engineer, it needs to understand what it's operating. Throughout this whole chapter, I'm going to look at this topic of teaching Kubernetes about your service: telling Kubernetes about the nuances of your programs so that it can do a better job of running them. For example, a pod might have lost its connection to the database it relies on. Or it could have corrupt internal data and only be able to return errors. Or it could be deadlocked completely and not responding at all. In all these circumstances, the pod isn't providing its clients with any kind of useful service, and Kubernetes needs to come and fix it. And very often the best fix is to, well, turn it off and on again. That might sound facetious, but there are actually a lot of good reasons for treating cloud native software like that, and for designing it to be treated like that. If you want to learn more about that topic, I suggest you check out the Kubernetes microservices course.
But enough talking. What I'm going to do is deploy an updated version of the deployment for that pod. This will change it in place, overwriting over the top. And if we look at pods, we can see, if you're eagle-eyed, that there's a new suffix on there. So this is a new pod. We've changed the deployment in place, but the deployment now has a new pod template, and what it's done is remove the old pod based on the old template and make a new one based on the new template. So this one is running afresh: it's only nine seconds old and it's had no restarts. That count has been reset to zero because this is a new pod running a new instance of that container. So let's go back to envbin and reload it. Notice that again there's a new session name, because the new pod started a new copy, and it's had only one request. And let's tell it to play dead: liveness check set to false. Let's try to have a look again. Ah. Briefly there was a bad gateway, and now we've got yet another new version of it, which has only had the one request. So what happened there? Let's have a look in the terminal. It's the same pod as before, from the new definition we applied up here, but with one restart. To understand this, let's have a look at the YAML that I applied. The change that we've got here is that we've declared a liveness probe. I've told Kubernetes how to check if this pod is okay, and I then told the pod to say that it wasn't, so it got restarted. What the liveness probe does is try an HTTP endpoint that we tell it about, in this case port 8080, HTTP path /health. If it gets no response, or it gets an HTTP error code, then Kubernetes is going to assume that the pod isn't okay and it'll restart it. And I told envbin to return an error and say that it's not okay. Now, there's a second type of probe, called a readiness probe. This is how Kubernetes probes your service to see if it's in a state to accept user requests right now.
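The liveness probe described here, port 8080 and path /health, is declared on the container in the pod template. A sketch of that fragment (the image reference is an assumption; the timing fields show the Kubernetes defaults for clarity):

```yaml
# Pod template fragment: liveness probe on the envbin container
containers:
  - name: envbin
    image: example/envbin:latest   # assumed image name
    ports:
      - containerPort: 8080
    livenessProbe:
      httpGet:
        path: /health      # probe this path over HTTP
        port: 8080
      periodSeconds: 10    # default: probe every 10 seconds
      failureThreshold: 3  # default: restart after 3 consecutive failures
```

With this in place, any non-2xx/3xx response, or no response at all, counts as a failure, and enough consecutive failures cause the kubelet to restart the container in place.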
This is usually used to detect when the service itself is okay, but for some reason it's unhappy with its environment. Maybe momentarily it can't talk to its database, so there's no point in it receiving user requests, but it itself isn't broken. Let's have a look at that in action. The first thing we need to understand is a new kind of resource called the endpoint: kubectl get endpoints. So, like deployments make pods for us automatically so we don't have to, services make endpoints. There is an endpoint object that represents every pod that matches a service's label selector. So there's a service for envbin, with the label selector we've seen, that matches the labels on, in this case, the one envbin pod. So the service has found the one pod that it's going to send traffic to that matches its label selector, and it's made an endpoint resource for that. And this is the IP address of that pod, and the port that that pod is listening on. This means that all requests to the envbin service are going to go to this one pod, and this is how we find it. If there were more than one copy of this pod, then traffic would be spread between them. So bear that in mind while I update this deployment one more time. I'm going to apply another file, and you can see it's configured in place. This time the pod has a readiness probe defined. So I can come back to envbin. We've got another new session because we changed the deployment: the old pod was deleted, a new pod was made, a new container was started. And this time we can lower its readiness probe. If we try to get to it now, it's the same error. So there is a gateway sat in the way. This is actually the Ingress box that I was talking about back in chapter one, and this Ingress box is saying 503: I'm trying to service your request, but I don't know what to do with it; I've got nothing to talk to. And this is going to persist forever, unlike the brief outage we saw when we lowered the liveness probe.
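A readiness probe is declared in the same way as a liveness probe, just under a different key; the path here is an assumption, since the transcript doesn't name the one envbin serves:

```yaml
# Pod template fragment: readiness probe on the envbin container
containers:
  - name: envbin
    image: example/envbin:latest   # assumed image name
    ports:
      - containerPort: 8080
    readinessProbe:
      httpGet:
        path: /ready   # hypothetical path; envbin's actual path isn't shown
        port: 8080
```

The key behavioral difference: a failing readiness probe does not restart the container. Instead, the pod's address is removed from the Service's endpoints, so it stops receiving traffic until the probe passes again.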
Kubernetes quickly stepped in and said, ah, it's completely dead, I know what to do with this, I'm going to restart it. This one is going to live forever, or at least until we do something. And let's see why. If we get those endpoints again, we'll see that there are now no endpoints for envbin. Because what endpoints actually are is this: there's an endpoint for every pod that matches the service's label selector and is ready, that is, currently able to serve traffic. Otherwise, there's no point sending traffic to it. We only had one envbin pod, and it was our singular endpoint. And now it's saying it's not ready; it's lowered its readiness probe. So Kubernetes is probing it like we've told it to, and it's saying it's not ready, it doesn't want any traffic. So Kubernetes says, well, this service simply has no compute, no pods behind it. And that's exactly what that Ingress box is finding out. It's trying to talk to the service, and the service is saying, well, there's nothing I can do for you; I have no ready, applicable pods. So we're getting a gateway error from the Ingress. Now, this is an extreme example, because you should of course be using your deployment to run more than one copy of the pod, precisely for redundancy, precisely for this scenario. Then if one pod like this becomes unready, there are others left to deal with the requests. It's worth saying that the other major use of readiness probes is to indicate that a pod is still starting up. So when a pod first comes into existence, if it needs a while to preload a bunch of data or to precalculate some results, it can start with its readiness probe down and leave it down until it's ready to go. And that can take a minute or so, depending on how complicated those calculations are. So, to finish this off, just a few hints and tips about probes, because they are quite subtle in some cases. You're best to try to test a realistic endpoint.
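Both points above, redundancy and slow startup, show up directly in the manifest. A sketch, with assumed values, of how they might be configured:

```yaml
# Deployment spec fragment: redundancy plus startup tolerance
spec:
  replicas: 3   # if one pod goes unready, two endpoints remain to serve traffic
  selector:
    matchLabels:
      app: envbin
  template:
    metadata:
      labels:
        app: envbin
    spec:
      containers:
        - name: envbin
          image: example/envbin:latest     # assumed image name
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /ready                 # hypothetical path
              port: 8080
            initialDelaySeconds: 60        # allow time to preload data
            periodSeconds: 5               # then check readiness frequently
```

With replicas set above one, the Service keeps an endpoint per ready pod, so a single unready pod degrades capacity rather than causing the 503 we just saw.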
So if it's a web app, then check one of its main pages. If it's an HTTP JSON API, check an important path. It'll have to be an unauthenticated one, but an important path. Because remember, what you're doing here is describing your application to Kubernetes. By writing the readiness and the liveness probes, you're telling Kubernetes how to probe your app. If your app is a black box and it crashes, Kubernetes can tell, because every app looks the same when it crashes. What it doesn't know is what the paths are to the pages in your website, or the paths to the API endpoints in your service. You have to tell it that in order for it to be able to test them, and probes are the way that you do that. Really, what we're trying to avoid is having some background thread or goroutine that's completely decoupled from the rest of the code, and that just says "okay" while everything around it burns. A missing config file might leave your service completely unable to do anything except say "sure, everything's okay" to a liveness probe, because that liveness probe handler is the one part of the code that doesn't depend on any config values. So it's, ironically, the one part that's working. So try to tell Kubernetes to make the same kind of request that a user would, so you know whether the service is okay from their point of view. You can have a separate health check endpoint if you need to, like I did in this artificial example, but the code behind it should actually go and do some stuff. It should check if the config loaded correctly, it should check if the app was able to listen on all the ports it wanted to, et cetera, et cetera. Whatever healthy and okay looks like for your app, and only you can know that. So you write the code to ascertain that, and then you describe that to Kubernetes. One more point: you should only be returning the status for your own pod.
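The "probe a realistic endpoint" advice can be as simple as pointing the probe at a real user-facing path instead of a trivial handler. A sketch, with assumed paths:

```yaml
# Probe fragments: realistic targets rather than a decoupled "always OK" handler
livenessProbe:
  httpGet:
    path: /           # a web app: probe the main page a user would actually fetch
    port: 8080
readinessProbe:
  httpGet:
    path: /api/v1/status   # hypothetical path; should exercise config, ports, etc.
    port: 8080
```

The path names here are illustrative; the point is that whatever handler answers the probe should exercise the same code paths a real request would, not bypass them.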
If you can't talk to an upstream service, then you should leave your liveness probe as it is, because there's no point in Kubernetes restarting you; you'd just get a massive cascade of restarts if everybody worked like that. It needs to restart the pod where the actual problem is. Everything else, in the meantime, can just lower its readiness probe and say, well, there's no point in talking to me because I can't talk to the upstream guy; but I don't need to be restarted, I haven't done anything wrong, I'm just waiting to talk to one of my servers.