Released 11/21/2011
 Creating flexible patterns using character sets
 Achieving efficiency when using repetition
 Understanding different types of search strategies
 Writing logical and efficient alternations
 Capturing groups and reusing them with backreferences
 Developing complex patterns with lookaround assertions
 Working with Unicode and multibyte characters
 Matching email addresses, URLs, dates, HTML tags, and credit card numbers
 Using search and replace to format a document
Skill Level Intermediate

In this movie, we'll write a regular expression to match IP addresses, and in the process, we'll explore some important points about how regular expressions work with matching numbers. Let's take a look at how an IP address typically looks. It's made up of four parts; each one is a number, the parts are separated by a decimal point, and each number can range from 0 to 255. So the lowest address we can have would be 0.0.0.0, and the highest would be 255.255.255.255. Anything else falls somewhere in between.
Typically, it might look something like this: 67.52.159.38. You can also have leading zeros in front of each part. It's perfectly acceptable to have 067.052.159.038. Often they're omitted, but let's not forget about this case. So that's pretty simple, right? Each of those four parts can be a number between 0 and 255. Let's take a stab at writing an expression for it, and see what problems we run into. So to begin with, let's turn on multiline anchors, since we're using multiple lines, and we'll just match a whole line by using our beginning and end of line anchors.
The simplest thing that we might try to begin with is just to say, well, let's use a digit one or more times, followed by a literal decimal point. Make sure that you escape it so that you get the literal decimal point. Then I'll just copy that, and I'll paste that four times, and remove the last decimal point there. So now we've got four groups of digits with dots in between. Now, that definitely does match, but that's not so great, because it's pretty unlimited. I mean, it matches, for example, 99999, and a whole bunch of things like that still get matched as well. Now, you could come back, and you could limit these, and say, okay, it's going to be only three characters max, but we already know that there is the possibility it might be one to three characters.
So let's put that in there. That's a slight improvement. Now it doesn't match those, but it does still match parts like 999. So that doesn't work for us. What we need is a way that we can actually tell it the number is from 0 to 255 each time. Now, you might be tempted to just take this digit here and say, well, instead, let's put 0 to 255 inside the square brackets. That doesn't match most of our examples, and the reason is that it's a character set.
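The first two attempts described so far can be sketched in Python as follows. This is only an illustration: the `re.MULTILINE` flag stands in for the "multiline anchors" option, the two addresses with oversized parts are made-up test lines, and the variable names are assumptions rather than anything shown in the movie.

```python
import re

# Two addresses from the transcript plus two made-up lines that should
# not count as valid IP addresses.
samples = (
    "67.52.159.38\n"
    "067.052.159.038\n"
    "99999.99999.99999.99999\n"
    "999.999.999.999"
)

# First attempt: one or more digits in each of the four parts.
loose = re.compile(r"^\d+\.\d+\.\d+\.\d+$", re.MULTILINE)

# Second attempt: one to three digits in each part.
bounded = re.compile(r"^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$", re.MULTILINE)

loose_matches = loose.findall(samples)
bounded_matches = bounded.findall(samples)

print(loose_matches)    # every sample line matches
print(bounded_matches)  # the five-digit line is gone, 999.999.999.999 remains
```

Limiting the length rules out the five-digit parts, but a length limit alone cannot express the upper bound of 255.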
Remember, the range that's in there is telling it to make a range between 0 and 2. So that's the literal characters 0, 1, and 2, followed by the literal character 5, and then 5 again. So the possibilities for that first digit now are 0, 1, 2 and 5; that's it. It's not a number range. It's not from 0 up to 255. Do you see the problem here? This is what I refer to as a number as string problem. Regular expressions treat numbers as strings. They don't have a concept about what numbers come next in a series.
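A quick way to see the character set behavior described above is to test every single digit against it. The snippet below is a small sketch in Python; the variable names are illustrative assumptions.

```python
import re

# [0-255] is a character set: the range 0-2, plus the character 5 twice.
char_set = re.compile(r"[0-255]")

# Which single digits does the set accept?
accepted = [d for d in "0123456789" if char_set.fullmatch(d)]
print(accepted)  # ['0', '1', '2', '5']

# The set matches exactly one character, so it cannot match "255" as a whole.
whole_number = char_set.fullmatch("255")
print(whole_number)  # None
```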
It's just looking at it as if it was the letter Q, or the letter Z. So instead, what we have to do is find a way to write a regular expression that will match the text version of 255, and all possibilities that are below that. So let's just try it with the first one here, and then whatever we come up with there, we know we can repeat three other times. So for this first expression here, it's possible that first digit could be a 0, or 1, or 2. That's what's possible for that first digit. After that, the second digit could be a 0, could be a 5; it could be a 9, though.
After all, our number could be 199. So we need to allow all digits 0 to 9 here, followed by all digits 0 to 9 again. So we did match a few more things in that process. However, there's also a possibility that the leading digits get omitted. So in that case, we need to make both of those optional. So now we've matched our four IP addresses, but we still haven't solved our problem, and here's why. What if we instead changed this to 299.299.299.299? That matches as well, even though it's above our 255 limit.
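The on-screen pattern is not part of the transcript, so the sketch below assumes that a single part at this stage looks like `[0-2]?[0-9]?[0-9]`, with the first two character sets made optional. Under that assumption, it shows why 299 still gets through.

```python
import re

# Assumed reconstruction of one part: an optional 0-2, an optional digit,
# then a required digit.
part = re.compile(r"[0-2]?[0-9]?[0-9]")

def accepts(text):
    """Return True when the whole string matches the single-part pattern."""
    return part.fullmatch(text) is not None

print(accepts("7"))    # True
print(accepts("67"))   # True
print(accepts("199"))  # True
print(accepts("299"))  # True, although 299 is above 255
print(accepts("300"))  # False, only the first digit is restricted
```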
We've only solved the first digit. We haven't solved the rest of it. Now, rather than stumble through it until we arrive at the right solution, let's stop and take a methodical approach to how we can solve this. What we need to do is break down 255 into each of its parts. So, for example, we have the possibility that it's a number between 250 and 255. In that case, we know that the regular expression for it will be a 2 and a 5, and the last digit would be 0 to 5; not 0 to 9. We have a very limiting factor there, because we don't want to accidentally get 256 in there.
So we want to make sure that we isolate only the numbers from 250 to 255. We can also then break down 200 to 249. Now, we've already taken care of 250 to 255, but we want to take care of everything up until then. In those cases, the second digit can only be 0 to 4, but we do want to allow 0 to 9 as that last digit. So it's basically the same thing we had before, but now that last digit can be 0 to 9. And let's keep working our way down; if we go from 100 to 199, that would be 1 in the first place, followed by 0 to 9, followed by 0 to 9. And then, of course, we drop down to the range that starts at 000.
There we have some optional parts. It could have one or two zeros in front of it, or it could just be a single number by itself. This will allow us to get all the way up to 99. Now, there is one point that I want to make here, which is that it is a slight optimization if we move that optional question mark from the second character set in that expression to the last one. The reason why that's true is because of what we saw earlier about how greediness works. It's faster if we allow the regular expression engine to find an actual number first, and make anything that's optional come after that.
It reduces the amount of backtracking that it has to do. It's a minor technical point, but it is worthwhile to consider. Now, we can actually combine these last two lines. Everything from 000 up to 199 can be written as one single expression, and that's because those last digits on all of them are allowed to go all the way up to 9. So we can just combine them into the shorter version that says 0 or 1, which is optional (it may or may not exist), and then the next portion after that is really the numbers 0 to 99. So now we've broken it down into these parts, and figured out how to match each portion as a string. All we have to do is assemble this together by using alternation.
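The breakdown above can be sketched for a single part as three alternatives joined with `|`. This is a reconstruction from the spoken description, not a copy of the on-screen expression; the names `high`, `mid`, `low`, and `is_octet` are illustrative, and the non-capturing group is added here so the alternation stays contained.

```python
import re

high = r"25[0-5]"          # 250 to 255
mid = r"2[0-4][0-9]"       # 200 to 249
low = r"[01]?[0-9][0-9]?"  # 0 to 199, with or without leading zeros

# Alternation: try each range in turn.
octet = re.compile("(?:" + "|".join([high, mid, low]) + ")")

def is_octet(text):
    """Return True when the whole string is a number from 0 to 255."""
    return octet.fullmatch(text) is not None

print(is_octet("255"))  # True
print(is_octet("067"))  # True
print(is_octet("256"))  # False
print(is_octet("299"))  # False
```

Note that `low` puts the optional digit last, which is the ordering the transcript recommends to reduce backtracking.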
So here as our first element, instead of what we had before, let's just paste in our new combined string. So you can see I've got 2, 5, and 0 to 5, or it's 200 to 249, or it's anything from 000 up to 199. Now it no longer matches that 299, and of course, we want to just repeat this again. Let's just copy it, and we'll repeat it for each of the other parts. It makes for a very long expression, but it does work.
So now it matches 255.255.255.255, but it does not match even 256.256.256.256. That's the upper limit. 255 is as high as our expression goes now. So we learned not only to match IP addresses, but we also learned a valuable technique for matching numbers as strings using regular expressions. Keep this technique in mind, because we're going to be using it in the next couple of movies as we look at matching dates and times.
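Putting it all together, the finished expression might look like the sketch below. It is reconstructed from the description rather than copied from the screen: the single-part pattern is written once and joined four times with escaped dots instead of being pasted by hand, the non-capturing group is an addition to keep each alternation contained, and `re.MULTILINE` stands in for the multiline anchors option.

```python
import re

# One part: 250-255, or 200-249, or 0-199 with optional leading zeros.
octet = r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"

# Four parts separated by literal dots, anchored to a whole line.
ip_line = re.compile(r"^" + r"\.".join([octet] * 4) + r"$", re.MULTILINE)

samples = (
    "0.0.0.0\n"
    "255.255.255.255\n"
    "67.52.159.38\n"
    "067.052.159.038\n"
    "299.299.299.299\n"
    "256.256.256.256"
)

matches = ip_line.findall(samples)
print(matches)
```

Only the first four sample lines are returned; the 299 and 256 lines are rejected.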

Video: Matching IP addresses