Learn how to find and manipulate text quickly and easily using regular expressions. Author Kevin Skoglund covers the basic syntax of regular expressions, shows how to create flexible matching patterns, and demonstrates how the regular expression engine parses text to find matches. The course also covers referring back to previous matches with backreferences and creating complex matching patterns with lookaround assertions, and explores the most common applications of regular expressions.
In the previous movie, we learned about the basics of positive lookahead assertions. In this movie, I'd like us to revisit some of the examples that we just saw. So, we saw that we could have a lookahead assertion for seashore, followed immediately by another expression: S, E, A, and that would match sea in seashore, but not in seaside. Then we saw that that's the same thing as if we had S, E, A as an expression, followed by a lookahead assertion for shore. So why use one over the other? Both of these would match the exact same text.
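The two forms described above can be compared side by side. This is a minimal sketch using Python's `re` module, which is an assumption here since the movie uses its own testing tool; the sample string is made up for illustration.

```python
import re

# Form 1: the lookahead assertion comes first, then the main expression.
lookahead_first = re.compile(r"(?=seashore)sea")
# Form 2: the main expression comes first, then the lookahead assertion.
lookahead_last = re.compile(r"sea(?=shore)")

text = "seashore seaside"  # hypothetical sample text

first = [(m.span(), m.group()) for m in lookahead_first.finditer(text)]
last = [(m.span(), m.group()) for m in lookahead_last.finditer(text)]

print(first)  # [((0, 3), 'sea')]
print(last)   # [((0, 3), 'sea')]
```

Both patterns report only the `sea` at the start of `seashore`, and the match spans three characters even though the assertion looked further ahead.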
However, there are two important differences that I want us to notice. The first is the order in which the expressions are executed. In the first example, the engine attempts our assertion before it attempts to match S, E, A. In the second example, it tries to match S, E, A before it tests our assertion. That order can make a big difference if you are trying to optimize your regular expressions for speed and efficiency, so be mindful of which one is executing first. More importantly, the second difference is about the position where it starts looking for each expression.
In the first example, it will start at the beginning of the string seashore, and test for our lookahead assertion. If that assertion is true, then it rewinds back to the starting position, and begins testing the main expression for a match. That's why the assertion is called zero-width: because of that rewinding, it doesn't consume any characters. In the process, the regex engine travels across the same territory, the S, E, A in our string, two different times. By contrast, in the second example, it matches the first expression: S, E, A, and then without doing any rewinding, it checks for our lookahead expression.
Then once that's done, it rewinds back to the position right after the A that it matched. The territory, though, of S, E, A is only traveled over one time. So why does that matter? Well, when the assertion comes first, we go over that same territory more than once, and that allows us to test the same text against more than one pattern. Let me give you an example. Let's say that we have a regular expression to just match a simple 10-digit phone number. This is the way phone numbers are formatted in the United States: three digits, dash, three digits, dash, four digits.
We know how to do that. Let's imagine, now, that we want to write a different regular expression that says from the beginning to the end of the phone number, we should only have the digits zero through five, and hyphen. No digits larger than five are allowed. What we can do with lookahead assertions is we can actually put both of those tests together, and test both of them to be true. So now I have a combination expression that says, alright, look ahead and see whether or not you have only the digits zero to five, and hyphens, and then if that's true, now match the format, and make sure that the format matches.
Now, of course, you could write the first expression by defining our backslash D as being just a character set, zero to five, but that's not the point. The point is that the lookahead assertions allow us to run two different regular expression tests on the same string before it returns a successful match, and you aren't limited to just two. Because it rewinds each time, we can continue stacking these assertions. So, for example, let's say that we check that it's zero to five, but then we also check that the string somewhere has the digits four, three, two, and one in it. So now we have three regular expressions that are all being run on the same string.
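Stacking the assertions described above might look like the following sketch. It assumes Python's `re` module, and the exact pattern text is a reconstruction from the spoken description rather than the expression shown on screen; the phone numbers are hypothetical.

```python
import re

# The three tests, written as separate pieces so the stacking is visible.
LOW_DIGITS = r"(?=[0-5-]+$)"   # assertion: only digits 0-5 and hyphens to the end
HAS_4321 = r"(?=.*4321)"       # assertion: the digits 4321 appear somewhere
FORMAT = r"\d{3}-\d{3}-\d{4}"  # main expression: the phone number format

two_tests = re.compile("^" + LOW_DIGITS + FORMAT + "$")
three_tests = re.compile("^" + LOW_DIGITS + HAS_4321 + FORMAT + "$")

# Hypothetical numbers chosen to pass or fail each test.
print(bool(two_tests.match("555-123-4512")))    # True: low digits, good format
print(bool(two_tests.match("555-127-4321")))    # False: contains a 7
print(bool(three_tests.match("555-123-4321")))  # True: passes all three tests
print(bool(three_tests.match("555-123-4512")))  # False: no 4321 in it
```

Because each assertion is zero-width, every test starts from the same position at the beginning of the string.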
The first two are assertions, and they both have to match, or it won't try the third matching expression. This is powerful stuff that lets us write expressions that we wouldn't be able to write otherwise. Let's try it for ourselves. So let's just put in some test data here. I am going to put in three different phone numbers. I have broken them onto separate lines. So I am going to use multi-line anchors, because I know I am going to be using some anchors as well, and then let's start writing an expression. So I can have backslash, D, times three, dash, backslash, D, three times, dash, backslash, D, four times. So now we've matched all of those phone numbers.
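The effect of multi-line anchors on this kind of test data can be sketched as follows. Python's `re.MULTILINE` flag stands in for the tool's multi-line option, the three phone numbers are hypothetical since the transcript does not list the actual data, and the anchored form of the pattern is an assumption about where the example is heading.

```python
import re

# Hypothetical test data: three phone numbers, one per line.
TEST_DATA = "555-123-4321\n555-127-4321\n555-123-4512"

PATTERN = r"^\d{3}-\d{3}-\d{4}$"

# With multi-line anchors, ^ and $ match at the start and end of every line.
multi_line = re.findall(PATTERN, TEST_DATA, flags=re.MULTILINE)
# Without them, ^ and $ only match at the start and end of the whole string.
single_line = re.findall(PATTERN, TEST_DATA)

print(multi_line)   # ['555-123-4321', '555-127-4321', '555-123-4512']
print(single_line)  # []
```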
What we are going to do now is put a lookahead assertion at the beginning that's going to say that from the beginning to the end we should have only the characters zero through five, and also the dash, repeated. So there it is. You see? We only found the two phone numbers whose digits are all in the range zero to five. That middle one got excluded. Now let's put in our next one, and in this one, let's put an equals, and in this expression, we are going to say that there is a wildcard that can occur zero or more times, and then four, three, two, one.
We want to find the digits four, three, two, one in there somewhere. So now notice what it's doing: it's going over that same territory three times. So when it gets that first phone number, it first goes over it and makes sure that it has digits zero through five and hyphens, then it goes through and makes sure it has four, three, two, one, and then it checks to make sure that it's properly formatted. On the second phone number, it tries the first assertion, and it fails. As soon as it gets to that seven, it says oops, nope, failed the assertion, move on; don't try the other two. Then it tries the third phone number.
It makes sure that the first assertion passes. So it has no digits greater than five. Then it goes and looks and when it realizes that there is no four, three, two, one in there, then it stops, and it never attempts the third expression, the one that actually matches the format. Let's revisit our words that are followed by commas example. I am going to open up the Self-Reliance text that we had before. I'll just copy that, and we'll just paste that in down here. Now I am just going to paste back in the expression that we had before that finds all words, and we are looking ahead to see if there's a comma after it. Now, in addition to that, right after the word boundary, so we first make sure we are at the start of a word, then let's put another lookahead assertion.
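The walk-through of the three test numbers can be approximated by running each stage separately and reporting where a number first fails. This is only a sketch of the order of operations, not how a regex engine is implemented; it assumes Python's `re` module, and the numbers and the helper name `first_failing_stage` are hypothetical.

```python
import re

# Hypothetical test numbers; the transcript does not show the actual data.
NUMBERS = ["555-123-4321", "555-127-4321", "555-123-4512"]

# The stages in the order the combined expression tries them.
STAGES = [
    ("only 0-5 and hyphens", re.compile(r"(?=[0-5-]+$)")),
    ("contains 4321", re.compile(r"(?=.*4321)")),
    ("phone number format", re.compile(r"\d{3}-\d{3}-\d{4}$")),
]

def first_failing_stage(number):
    """Return the label of the first stage that fails, or None if all pass."""
    for label, pattern in STAGES:
        # Every stage is tried from position 0, mirroring the rewinding.
        if pattern.match(number) is None:
            return label
    return None

for number in NUMBERS:
    print(number, "->", first_failing_stage(number))
```

With these numbers, the first passes every stage, the second stops at the first assertion because of the 7, and the third stops at the second assertion because it has no 4321.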
So right here, we are going to do a lookahead assertion, and let's find all words that contain a G, H in them. So words that contain a G, H would be some word characters, we don't know how many, there might be none, followed by the letters G and H. Do you see how that works? We are now using an assertion to make sure that it has a G and H in it, and if the word has a G and H in it, then we make sure that it has the letters A to Z and apostrophe. Then if that's true, then the last thing we do is we check to see, if we kept going, would the next character be a comma.
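The word example above could be written as the sketch below. The pattern is reconstructed from the spoken description, Python's `re` module is assumed, and the sample sentence is invented because the Self-Reliance excerpt is not reproduced in the transcript.

```python
import re

# Hypothetical sample sentence standing in for the Self-Reliance text.
text = "He thought, though weary, that the light, bright and high, would guide him through."

# Words followed by a comma (the expression from the previous movie, as described).
followed_by_comma = re.compile(r"\b[A-Za-z']+\b(?=,)")

# Same expression with a second lookahead right after the word boundary:
# the word must also contain "gh" somewhere.
gh_followed_by_comma = re.compile(r"\b(?=\w*gh)[A-Za-z']+\b(?=,)")

all_words = followed_by_comma.findall(text)
gh_words = gh_followed_by_comma.findall(text)

print(all_words)  # ['thought', 'weary', 'light', 'high']
print(gh_words)   # ['thought', 'light', 'high']
```

Words such as `though` and `bright` contain `gh` but are skipped because no comma follows them, and `weary` is skipped because it has no `gh`.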
This expression would be difficult to write without using lookahead assertions. And even more than that, it makes it clear what our intention is. It's much easier to read and understand what we're going for, and what our requirements are. Let's take a look at another simple example. Let's say that we want to have a password, and we want to make sure that the password, from start to end, matches any characters, and we'll make them eight to fifteen characters long for our password. And in addition, we want to check and make sure that the password has a digit in it. Well, we know how to do that now.
It can have zero or more wildcard characters, followed by a digit. So if we have swordfish, it doesn't match. If I put in sword42fish, now it does match. It requires not only that the password be eight to fifteen characters long, but also that there's at least one digit in it. If we want to make sure that it has an uppercase letter as well, well we can just add another one. We can have a character set, A to Z, and then let's put in some indeterminate characters before it. There we go.
Now there is no capital letter, so it fails. As soon as we put in a capital letter, now it matches. The ability to use lookahead assertions to double-check with multiple expressions is a powerful tool. So far we've only been working with positive lookahead assertions, though. In the next movie, let's look at negative lookahead assertions.
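The password check from this movie can be sketched as follows, assuming Python's `re` module. The patterns are reconstructed from the spoken description, and `Sword42fish` is a hypothetical stand-in for the capitalized password, which the transcript does not spell out.

```python
import re

# Eight to fifteen characters, with at least one digit somewhere.
LENGTH_AND_DIGIT = re.compile(r"^(?=.*\d).{8,15}$")
# Same, plus a second assertion requiring at least one uppercase letter.
WITH_UPPERCASE = re.compile(r"^(?=.*\d)(?=.*[A-Z]).{8,15}$")

print(bool(LENGTH_AND_DIGIT.match("swordfish")))    # False: no digit
print(bool(LENGTH_AND_DIGIT.match("sword42fish")))  # True: length and digit are fine
print(bool(WITH_UPPERCASE.match("sword42fish")))    # False: no capital letter
print(bool(WITH_UPPERCASE.match("Sword42fish")))    # True: passes every test
```

Each requirement lives in its own assertion, so adding another rule means adding another lookahead rather than rewriting the main expression.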