Join Michael Lester for an in-depth discussion in this video Information systems operations, part of CISA Cert Prep: 4 IT Operations, Maintenance, and Service Delivery for IS Auditors.
- [Instructor] All right, let's talk about IT operations or IS operations. So whenever I say something is operational or it's an operational responsibility, that means it's essentially a day-to-day responsibility. So in IT, if you're an IT person, you have these day-to-day, routine, mundane operations, like performing backups, patching your servers, maintaining things, making sure things are up and running, checking the security on something. That's what operational stuff is. So in this section, everything we deal with is a day-to-day, sort of routine thing.
So as I say, they are the routine activities that keep things up and running, keep the production environment up and running. That means provisioning new stuff. We have to put a new server into the environment, a workstation, a router, a switch. Applying patches and hotfixes, performing backups, doing test recoveries, doing restores. Maintaining all the different media in the environment, whether it's CDs and licenses or backup tapes, stuff like that, hard drives. Doing configuration management, which we'll talk about. That's essentially keeping everything recorded, diagrammed, documented.
And then performing incident handling when things pop up. If something actually turns into an incident, how do you handle it, how do you escalate it? So let's talk about asset identification and management first, keeping track of what you have. What assets are actually in your environment? Things like all the hardware, all the firmware that you may have, operating systems, the environments that you run stuff in. Like what Java runtime environments do we have? What applications do we have running in our environment? What licenses do we need to have for those applications? What libraries may we have? If we have individual binary libraries like DLLs, etc., what assets of those do we have? And typically you use some kind of asset management tool, some automated system to scan, record, and just like a database, store all of your assets so that you have some way to tell when new things pop up, and when you wanna retire something from the environment, you can track it that way.
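To make the asset tracking idea concrete, here is a minimal sketch in Python. It assumes the asset management tool can hand you two sets of records: what is stored in the inventory and what the latest scan found. The `Asset` fields and the `diff_inventory` helper are hypothetical names made up for this illustration, not part of any real product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Asset:
    """One inventory record; the fields are illustrative, not a standard schema."""
    name: str
    kind: str      # e.g. "hardware", "runtime", "application", "library"
    version: str


def diff_inventory(recorded: set[Asset], scanned: set[Asset]) -> dict[str, list[Asset]]:
    """Compare the stored inventory with the latest scan."""
    return {
        # Found by the scan but never recorded: something new popped up.
        "new": sorted(scanned - recorded, key=lambda a: a.name),
        # Recorded but no longer seen: retired, offline, or changed version.
        "missing": sorted(recorded - scanned, key=lambda a: a.name),
    }


recorded = {
    Asset("web01", "hardware", "rev-b"),
    Asset("java-runtime", "runtime", "17"),
}
scanned = {
    Asset("web01", "hardware", "rev-b"),
    Asset("java-runtime", "runtime", "21"),
    Asset("switch07", "hardware", "rev-a"),
}

changes = diff_inventory(recorded, scanned)
for status, assets in changes.items():
    for asset in assets:
        print(status, asset.name, asset.kind, asset.version)
```

A version change shows up twice here, once as "new" and once as "missing", which is a simplification. A real tool would likely match records on a stable identifier and report the upgrade as a single change.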
That's a day-to-day operational responsibility. Patching is a big deal. Applying patches is critical these days and it's a very difficult arms race between the things that come to exploit your unpatched system and getting the actual patches to the environment. And it's a critical thing. You've got to patch, and no one actually patches fast enough, but we gotta do our best. It's important when you're patching to make sure that you patch all of the different components of a given system. That's all the way from the bottom, at the hardware and firmware level, to the operating system and its patches, to then the environments that run on those operating systems, like a Java environment, and then the applications that run, perhaps in those environments or on the operating system, and then the components of those applications, and on and on, databases, database management systems.
You have to patch all the different levels that need to be patched, because there's vulnerabilities that can open up. There's problems, there's glitches. And it's a never ending arms race, as I say. Now when you're patching, it's important to have a good process in place to actually get the patches in. Now that typically means you have some kind of infrastructure, that means a team or some asset identification tool that you use to scan the environment and make sure that you're getting everything that needs to be patched, all the components, all the different versions, all the assets that are out there.
Then you perform some kind of research to figure out what patches need to be employed. Are there any new patches? And the most important part is testing, making sure you test what the results of this patch will do and whether it's gonna break more things than it fixes. And importantly, before you actually push the go button and deploy the patch, you come up with some kind of mitigation, some kind of roll back process. So we don't do the, what we often refer to as the Big Bang approach, where you just get the patch, install it, and hope it works. You do a test first, see what the results are and you have the way to get back to normal if that patch goes south.
Then you go and deploy, and importantly, you deploy to the least sensitive assets first. You don't go straight to the most important asset, the most critical system, and patch it first. You patch the least important stuff, and you work your way to the most important stuff. And then finally, once it's patched, you perhaps wanna scan to see if the actual patch has been deployed. Maybe a vulnerability scan, where you can see, oh, that issue has now been patched, and then you log and you make sure that it's actually recorded. So now, you know, a week from now, if something goes wrong, it might have been that patch and you can check your log and see.
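The deployment order described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Host` record, the 1-to-5 sensitivity scale, and the `apply_patch`, `verify`, and `roll_back` callables are all hypothetical stand-ins for whatever patching and scanning tools you actually use.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    sensitivity: int      # assumed scale: 1 = least sensitive, 5 = most critical
    patched: bool = False


def rollout(hosts, apply_patch, verify, roll_back):
    """Patch least-sensitive hosts first; roll back and stop on the first failure."""
    log = []
    for host in sorted(hosts, key=lambda h: h.sensitivity):
        apply_patch(host)
        if verify(host):
            log.append((host.name, "patched"))
        else:
            roll_back(host)
            log.append((host.name, "rolled back"))
            break  # do not continue to the remaining hosts
    return log


# Stand-in callables for illustration only.
def apply_patch(host):
    host.patched = True


def verify(host):
    # Pretend the post-patch check fails on the database server.
    return host.name != "db01"


def roll_back(host):
    host.patched = False


hosts = [Host("db01", 5), Host("kiosk03", 1), Host("intranet02", 3), Host("payroll01", 5)]
patch_log = rollout(hosts, apply_patch, verify, roll_back)
for entry in patch_log:
    print(entry)
```

The returned log is the record you would check a week later if something goes wrong. Stopping at the first failure is one possible design choice; some teams would instead skip the failed host and keep going.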
And you also have some record to see how well you're doing your patching. That's essentially a good process for getting patches done. Now, issues with patching. Well, of course, they can open more vulnerabilities than they fix, oftentimes. You get production interruptions, because what do you gotta do every time you install a patch? You gotta reboot half the time. If you haven't done your proper testing and you don't have a good rollback procedure in place, well, you're gonna get what you get. If you take the Big Bang approach, you may get stung by it. If you don't have an accurate asset management database, if you don't know where all of your Microsoft IIS servers are, for example, when you need to patch them, you're gonna miss a couple, and believe me, the bad guys are gonna find those one or two servers you didn't patch and they're gonna use them as the sort of launching pad to exploit your entire environment.
And then finally, patching is that never-ending arms race. The typical IT admin's workload is strenuous and patches are time sensitive. You've got to get them in now, before that zero-day exploit comes along and pops your systems, so no one ever has enough time to patch well enough, but you gotta do your best. Configuration management. So configuration management is a term that many people use synonymously, unfortunately, with change control or change management. There is a difference and it's an older school term for what configuration management means.
Configuration management you can think of as management of the logical description of the IT environment. The logical description being all of those diagrams, the documentation, perhaps the configuration files of everything in the environment, the hardware, the software, the settings of the hardware and the software, the settings of the appliance. It might even include source code. Some organizations even consider the policies and procedures and standards and all those other documents part of their configuration. This is the configuration of the IT environment you would say.
Ideally you store the configuration, all of this documentation and config files and whatnot, in a configuration management database, a CMDB. And if you've studied ITIL, for example, it's all about putting stuff into the CMDB. In a well-run environment, there really shouldn't be anything in the environment that isn't recorded in the CMDB, somewhere, in some document, some diagram, some saved configuration setting somewhere. It should all be recorded in there. There shouldn't be anything in that environment that isn't recorded.
That's configuration management. Now change control or change management is all about managing changes to your configuration. So you establish a baseline by looking at your configuration management documentation, all those diagrams and documents that we just talked about in the last slide, and then any changes to that configuration, must go through some formal process. Any change to the baseline, now goes through a process, there's an approval, there's a recording of the change. You decide whether it's something we wanna do, whether it has security ramifications, what's the cost, etc., and somebody votes, or maybe it's an authorization and then you have the change put in and recorded in a very structured way.
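As a rough illustration of that flow, here is a small Python sketch. It assumes the baseline can be reduced to a dictionary of settings, and the `ChangeRequest` and `Baseline` classes are hypothetical names invented for this example. The point is only that an unapproved change is refused and an approved one is applied and recorded.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeRequest:
    setting: str
    new_value: str
    approved: bool = False   # set by whoever authorizes or votes on the change


@dataclass
class Baseline:
    settings: dict[str, str]
    history: list[tuple[str, str, str]] = field(default_factory=list)

    def apply(self, change: ChangeRequest) -> None:
        """Apply a change to the baseline only if it has been approved."""
        if not change.approved:
            raise PermissionError(f"change to {change.setting!r} has not been approved")
        old = self.settings.get(change.setting, "<unset>")
        self.settings[change.setting] = change.new_value
        # Record what changed so the baseline documentation stays accurate.
        self.history.append((change.setting, old, change.new_value))


baseline = Baseline({"ssh_port": "22"})
request = ChangeRequest("ssh_port", "2222")

try:
    baseline.apply(request)          # refused: nobody has approved it yet
except PermissionError as err:
    print("rejected:", err)

request.approved = True              # the formal approval step
baseline.apply(request)
print(baseline.settings, baseline.history)
```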
That's change management or change control. Release management's very similar. It's the idea that you have a change control for your production software, so that all the different versions in the environment are recorded and approved. It's a process to ensure that only the authorized versions of software actually get released into that production environment. That's release management. It's another term that's very similar to the term change control or change management. Enterprise monitoring is another IT operations responsibility.
And so in your network or security operations centers, we typically record all of our event logs. Event logs are the things that we record for the purpose of troubleshooting things, perhaps doing a forensic investigation, perhaps performance tuning. And so we log all of our events from servers, from routers, switches, firewalls, etc., for the purposes of figuring things out later, troubleshooting, maybe figuring out, that oh look, this hard drive's about to fill up. And so by a log file being recorded on how much disk space is left or who accessed this file, we can figure things out later.
We can put the pieces back together. Traffic monitoring the same thing, figuring out how much traffic is going to this destination versus that destination, what type of traffic it is. Security monitoring, scanning for things like vulnerabilities, perhaps doing some probing to see what we can detect. Can we enumerate user names? And can we enumerate who's logged into what? What versions of software we've got running? That's security monitoring. Looking for suspicious activity or sources of intrusion. Those are all wrapped up in what we refer to as enterprise monitoring.
The logging, the traffic monitoring, the security monitoring, and they're typically done in those NOCs or SOCs, the network operations centers or the security operations centers. Problem management is all about dealing with problems in the IT environment. So whenever a problem pops up, whether it's software or hardware or what, it's all about tracking it back to what caused this problem and then figuring out how to prevent this problem from happening again. We have a problem popping up, what do we do with it? Who's gonna handle this problem? Who are we gonna assign it to? And how are we going to make sure that this is communicated to everybody else? Here's the problem, we've fixed it, we've recorded the fix, and we've stopped it from happening again.
That's all problem management. Now alongside problem management, we talk about root cause analysis. Root cause analysis is all about trying to get to the underlying cause, the root cause of the problems that we manage. It's kind of the difference between just putting a Band-Aid on the problem and actually fixing the problem. Well, in the sort of science of root cause analysis, we have things called causal factors and then we have the root cause. A causal factor is just anything that might cause a symptom that's undesirable.
You might say, we were able to reboot the server and get it back up and running, but that didn't tell us why the server crashed in the first place. The server stalled, it was just sitting there, and it's not working. I can bounce the box and just get it up and running, but to figure out what caused the crash, that's tracking it back to root cause, and once you track it back to root cause, you figure out how to prevent this from happening again, hopefully, or you determine that you can't in some way, and then you have to take some other kind of action. Incident handling is all about dealing with incidents.
You should have a well-documented policy and procedure for handling incidents as they pop up. Now it's important to understand what an incident is. An incident is one or more events that you can track that turn into a bad thing. An event is just something you can monitor or track, something happening. If one or more of those events turns out to be something bad, that's an incident. So how do you handle incidents when they come up? What's your escalation process? You should have some way to handle the bad thing when it happens and make sure that you give it all the attention it needs.
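One way to picture the difference between events and an incident is the short Python sketch below. It assumes a simplified event format (dictionaries with `type` and `user` keys) and an arbitrary threshold of three failed logins. Both are made up for this illustration; what counts as "bad" in a real environment comes from your own policy.

```python
from collections import Counter

FAILED_LOGIN_THRESHOLD = 3  # assumed cutoff, chosen only for this example


def find_incidents(events):
    """Return users whose failed-login events add up to a possible incident."""
    failures = Counter(e["user"] for e in events if e["type"] == "login_failed")
    return sorted(user for user, count in failures.items()
                  if count >= FAILED_LOGIN_THRESHOLD)


# Each entry is just an event: something we can monitor or track.
events = [
    {"type": "login_failed", "user": "bob"},
    {"type": "login_ok", "user": "alice"},
    {"type": "login_failed", "user": "bob"},
    {"type": "login_failed", "user": "carol"},
    {"type": "login_failed", "user": "bob"},
]

incidents = find_incidents(events)
print(incidents)
```

A single failed login is only an event. Several of them for the same account is what gets flagged here as something to escalate.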
You wanna make sure that you get back up and running and back to normal operations in the least affected way. So let's take a look at an incident process evolution. Typically there's some kind of notification that something happened. You determine that one or more events is an actual incident. Okay, so you perform some initial investigation to figure out, okay, we've got an incident on our hands. Then you typically try and contain the situation. Now before you do your containment, you have to know what you have on your hands and that's why you have this little bit of investigation first.
I'll give you an example. If it's just Bob down the hallway, who's downloaded some little hacking tool and he's causing some mischief, that's one thing. If it's some ex-KGB hacker guy, then you may have to contain the situation totally differently. You can't just pull the plug on the network, you're gonna have to handle that situation differently. So your containment will depend on who you have on your hands. Then you perform some analysis to figure out what's going on. What happened on this system? How did this system get compromised? Is it running some software? Is there some rogue content running? Then you track it back to how the bad guy actually got in in the first place.
Okay, you found it on this system over here that's infected with this malware, but he must have gone in through this hanging modem over on the other location. So you track him back, then you do your repair and recovery. You fix things, you put the bandaids in, and you prevent this from ever happening again. And you feed back into the lessons learned and you record that hey, we've solved the incident, we've handled everything, etc., etc. That's a good incident handling process. That's the evolution of an incident. Help desk is another IT operational responsibility.
It's all about handling those end user or operational problems and typically it involves a tiered approach. Most help desks have sort of a lower skilled, tier one, where typically the people are just reading a script. You call the tier one help desk or support and they're gonna be reading a script. Hey did you push the power button? Is it plugged in? Is the monitor on? That kinda stuff. If your problem is serious, then you get referred to maybe a higher tier, tier two, and that's where you're typically dealing with someone much more skilled, who can actually go in and log into your machine and try and figure something out.
If that doesn't get you there, then you'll probably be referred to the highest tier and that's usually where you're talking to an actual developer of a system in some way. Maybe it's a programmer, maybe it's a guy who writes the actual firmware for some router or something. And that's how you sort of stagger the approach, you have tiers. The lower skilled tier may be outsourced. Then you get into the higher tiers, where it's higher and higher skilled, more expensive support folks. IT service management is a different approach to doing IT. It's essentially where you take all of the IT operations and you sort of make all the other departments in the organization customers of the IT shop, which is the service provider, providing services to those departments.
And they buy service as they would buy service from any other service provider. From the IT service management perspective, it's those other departments that are the customers and the IT service provider is just the IT department. Now IT service management provides two things. One is service delivery and the other is service support. So IT service management, service delivery is all about providing stuff to the customers, the other departments. Like providing an email system for them to use, providing an HR system, providing network services that they can use and bandwidth.
IT service support is all about providing the support for those systems or providing a help desk that allows people to call in and get their needs serviced that way. That's IT service management and it's just a different approach to doing an IT department. There are some frameworks that you can use, some industry accepted best practices for doing IT in this way, this IT service management approach. The biggest one, of course, is ITIL, the Information Technology Infrastructure Library, which is a British standard originally.
It's all about delivering that service in some best practice fashion. It's broken down into five different volumes. You don't need to know that for the CISA exam, but you do need to know that it is one of the key best practices for doing IT service management. Another one is from the ISO, the International Organization for Standardization. That's ISO/IEC 20000, which was first published in 2005 and revised in 2011, and it's another framework or a set of best practices for doing IT service management and it's done in sort of a PDCA, plan-do-check-act methodology.
Sometimes that's referred to as a Deming or a Shewhart cycle. It's a circular approach. You plan, then you do something, then you check, then you act, then you plan again, then you do something, then you check, and you do this with a circular approach. So ISO 20000 is a circular PDCA approach to doing IT service management. Either way, it's important to know that there are frameworks out there you can employ if you want to make your IT organization work in this IT service management way.