Learn how to find and manipulate text quickly and easily using regular expressions. Author Kevin Skoglund covers the basic syntax of regular expressions, shows how to create flexible matching patterns, and demonstrates how the regular expression engine parses text to find matches. The course also covers referring back to previous matches with backreferences and creating complex matching patterns with lookaround assertions, and explores the most common applications of regular expressions.
In this movie, we won't be introducing another metacharacter. Instead we're going to be talking about an important principle of how regular expressions work. It's called greedy expressions. Often the regular expression engine has to make a choice about what it's going to return as a match. That becomes especially true now that we are using repetition expressions, because now our strings are of an indeterminate length. Greediness is really just a term to describe how the regular expression engine makes that choice by default. Let's look at some examples so we understand what the problem is. Let's say that we have an Excel file called 01_FY_07_report_99.xls, and we just have a real simple regular expression that says match some digits, followed by some word characters, followed by some digits. And don't forget, word characters can be letters, numbers, or underscores.
So the question becomes, does the engine look at this and say, ah, I see a match--it's 01_FY_07--or does it match the entire thing, all the way up to the .xls? Let's take a look at a second example. Let's imagine that we have a comma-delimited text file that has people's first names, their last names, and their companies. So first name in quotes with a comma space, last name in quotes comma space, company name in quotes at the end. Well, what if we had a regular expression that looked for any set of characters inside quotes, comma space, any set of characters inside quotes?
Would the regular expression engine return to us the first name and the last name? That might be what we're expecting. Or would it return the last name and the company, or would it return all of it, in the case that maybe one of those wildcards with the plus after it actually could include the quote, the comma, the space, and the quote in between those two? I think you can see the problem, and remember, these are not complex regular expressions, and we can already see the choices that it has to make. Imagine what happens when our expressions become complex. Well, the answer to the question about what it's going to match is that standard repetition quantifiers are greedy.
That means that the expression tries to match the longest possible string. And when I say the expression, I don't mean the entire expression; I mean the repetition-quantified expression, all right, so that one part of the expression tries to match as much as it can. Of course, it's still going to defer to achieving an overall match. So for example, if we had a filename.jpg and we were searching that for some wildcard characters followed by .jpg, it wouldn't do us any good if the wildcard character was so greedy that it said all right, this entire file name matches me, I match an F, I match an I, all the way down till it gets to the G, and then it says oops, but I didn't make a match overall because of the .jpg--that wouldn't do any good.
So the plus is greedy, but it gives back the .jpg at the end to make the match. You can think of it as rewinding or backtracking to make sure that it gets the match. So in this case, the wildcard that gets repeated would match filename and then it would move to the next part of the expression to match the .jpg. Now, even though it does give back that portion, it's still greedy. It gives back as little as possible. So for example, let's say we had a string Page 266, and we had a wildcard that was repeated, followed by some digits that were repeated.
It doesn't say oh, it would actually be really nice of me to include all the digits in this digit expression, right; it doesn't make that distinction. It doesn't say that the digits somehow deserve to be grouped together in one group more than these other wildcard elements do. It doesn't do that. It parses through it item by item. The wildcard character matches Page 266 and then it gives back only what it has to in order to make the match, which is that final 6. Let's actually look at the way it parses it, and I think that'll become clear. So let's say we have that exact expression and the string is Page 266.
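As a rough illustration in Python: the transcript does not spell out the patterns, so `.+\.jpg` and `.+[0-9]+` are assumed readings of the expressions described, and the capture groups are added here only to show which part of the pattern ended up with which text.

```python
import re

# Assumed pattern: a repeated wildcard followed by a literal ".jpg".
jpg = re.search(r"(.+)\.jpg", "filename.jpg")
wildcard_part = jpg.group(1)  # what the wildcard kept after giving back ".jpg"

# Assumed pattern: a repeated wildcard followed by repeated digits.
page = re.search(r"(.+)([0-9]+)", "Page 266")
page_wildcard, page_digits = page.groups()

print(repr(wildcard_part))                     # 'filename'
print(repr(page_wildcard), repr(page_digits))  # 'Page 26' '6'
```

The digit part ends up with only the final 6, because the wildcard gives back no more than it has to.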
The regular expression engine starts at the P and says ah! Does this match my wildcard character? It does. Great! Well, it's a repeated wildcard character, so let's see if the next one matches too. It goes to the a, and says yep, that matches my wildcard too. It goes to the g. That matches the wildcard and then the e, and the space and the 2 and the 6 and even the last 6. And it says ah! These all still match the wildcard. Boy, I'm doing great here. And then it gets to the end and says, oops, I got to the end--I didn't get a match, so I probably shouldn't have been quite so greedy.
It knows that it had success with that first part of the expression, but it still didn't make an overall match. So it says, what if I was a little bit less greedy? What if I were to just go back one character and give that one to the next part of the expression and see if that makes it match? So now the wildcard is matching just Page 26, the 6 then goes to the second part of the expression, 0-9, and it says yup, that works. Now I have a match and I'm completely done. Now if it hadn't made a match there, it's essentially the same thing as if it had been a wildcard with a 6 at the end.
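The walkthrough above can be imitated by hand in a few lines of Python. This is only a simplified sketch, not how any real engine is implemented: the helper name `greedy_split` is hypothetical, the pattern is assumed to be `.+[0-9]+`, matching is assumed to start at the first character, and the text is assumed to contain no newlines (which `.` would not match).

```python
import re

def greedy_split(text):
    """Imitate `.+[0-9]+` from the start of text, longest wildcard first."""
    attempts = []
    # The wildcard first takes everything, then gives back one character at a time.
    for take in range(len(text), 0, -1):
        head, rest = text[:take], text[take:]
        digits = re.match(r"[0-9]+", rest)  # can the digit part match what is left?
        attempts.append((head, digits is not None))
        if digits:
            return head, digits.group(0), attempts
    return None, None, attempts

head, digits, attempts = greedy_split("Page 266")
for tried, ok in attempts:
    print(repr(tried), "-> match" if ok else "-> no match, give one back")
# 'Page 266' -> no match, give one back
# 'Page 26' -> match
```

Here the loop stops after giving back a single character, which agrees with what Python's own engine reports for `re.match(r"(.+)([0-9]+)", "Page 266").groups()`.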
Now if it hadn't found a match there, then guess what it would have done next. It would have backtracked one more step, and then it would have backtracked one more step, and kept scaling back its greediness to see if being less greedy would allow the rest of the expression to still match. So to go back to our original examples and take a look at those, the answer is that it would match the entire thing, and that especially can throw you off. Especially in that second example, that catches a lot of people, because they think oh, I'm just looking for the first and the last thing, but the thing is that your wildcard is so broad that it is able to match so many things, and that greediness kicks in and it just keeps consuming parts of the string, so that the first wildcard matches the first name and the last name, including the quote, comma, space, and quote between them, and then that second wildcard matches the company name.
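Both original examples can be checked with a short Python sketch. The patterns `\d+\w+\d+` and `"(.+)", "(.+)"` are assumed readings of the expressions described earlier (the transcript does not show them), and the names in the sample row are made up for illustration.

```python
import re

# Example 1: digits, then word characters, then digits (assumed pattern).
filename = "01_FY_07_report_99.xls"
file_match = re.search(r"\d+\w+\d+", filename).group(0)
print(file_match)  # 01_FY_07_report_99

# Example 2: two quoted fields separated by comma space (assumed pattern).
row = '"John", "Smith", "Acme"'  # hypothetical sample row
fields = re.search(r'"(.+)", "(.+)"', row)
first_wildcard, second_wildcard = fields.groups()
print(first_wildcard)   # John", "Smith
print(second_wildcard)  # Acme
```

The first wildcard swallows the first name, the separator, and the last name, so the overall match covers the whole row rather than just the first two fields.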
So we've already seen one important principle about regular expressions, and that is that regular expressions are eager. Now we have seen the second one, which is that regular expressions are greedy, and it makes sense that the two of these go hand in hand. It's eager to give you a result, so what it does is it tries to just keep letting that first one do all the work. While we're already in the middle of it, let's keep going, get to the end of the string and then when it doesn't work out, then it will backtrack and try another one. It doesn't backtrack back to the beginning; it doesn't try all sorts of other combinations. It's still eager to get you a result, so it says, what if I just gave back one? Would that allow me to give a result back? If it does, great, it's done. It's able to just finish there.
It doesn't have to keep backtracking further in the string, looking for some kind of a better match or a match that's further along. So that's what the concept of greediness is. So don't forget, by default, regular expressions are eager and they are greedy.