Learn how to find and manipulate text quickly and easily using regular expressions. Author Kevin Skoglund covers the basic syntax of regular expressions, shows how to create flexible matching patterns, and demonstrates how the regular expression engine parses text to find matches. The course also covers referring back to previous matches with backreferences and creating complex matching patterns with lookaround assertions, and explores the most common applications of regular expressions.
In this movie, we'll work on developing regular expressions that can match HTML tags. HTML tags typically look something like this. We've got an opening tag, which is made up of a less than sign and a greater than sign on either side of a word. You can also think of those as being angle brackets, and that's the opening tag. Then there is some text that's in the middle, and then after that is the closing tag, which is just like the opening tag, but with a forward slash right before the word, and that's what an HTML tag typically looks like. The word inside that tag can vary, depending on its purpose, so we can have a strong tag; we can also have an em tag to emphasize something.
We could have b for bold, or i for italics, and just have a single character there. Now incidentally, bold and italics have been discouraged in favor of using strong and em in their place. So it's a very common task with regular expressions to scan across an entire Web site, and replace all of the bold tags with strong tags, or all of the i tags with em tags. So it's a perfect case where you might find yourself using regular expressions to do it. Now, a single word is not the only thing that can go in that first tag. We can also have attributes, and the way we have an attribute is we have a space, and then we have the attribute name, an equal sign, and then in quotation marks the value for that attribute (you may see it without quotation marks if it's not properly formatted), and then a space, and the next attribute-value pair. And there can be many of those: id, class, style; there are lots and lots of different possibilities.
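The b-to-strong and i-to-em replacement described above can be sketched in Python with `re.sub`. This is only an illustration, not something shown in the movie: the function name `modernize_tags` and the sample string are made up here, and the sketch assumes lowercase tag names.

```python
import re

# Hypothetical mapping of old tag names to their replacements.
REPLACEMENTS = {"b": "strong", "i": "em"}

# (/?)     optionally captures the slash of a closing tag
# (b|i)    captures the tag name we want to replace
# \b       keeps "b" from matching the start of longer names such as "br"
# ([^>]*)  captures any attributes so they can be carried over unchanged
OLD_TAG = re.compile(r"<(/?)(b|i)\b([^>]*)>")

def modernize_tags(html):
    """Return html with b/i tags renamed to strong/em (illustrative helper)."""
    def rename(match):
        slash, name, attrs = match.groups()
        return f"<{slash}{REPLACEMENTS[name]}{attrs}>"
    return OLD_TAG.sub(rename, html)

sample = '<p>This is <b>bold</b> and <i class="x">italic</i>.</p><br>'
print(modernize_tags(sample))
```

Note that the `<br>` tag in the sample is left alone, because the word boundary after `b` fails when another letter follows.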
And last of all, let's not forget that there are some tags that can be self-closing tags; that is, they don't have any content in the middle, and they don't actually have a closing tag. They close themselves by just having a space, followed by the forward slash, and then that angle bracket at the end. HR, for horizontal rule, is a good example of that. Let's try writing some regular expressions that'll match these. So the first thing we want to do is turn on multi-line anchors, like we have been doing in the other movies, just to make sure that we match each line individually. Now, let's just try matching that first one. A lot of times I find that the easiest way to start is just to copy that value, and paste it up there.
Obviously, that's going to match, because it's a literal, word-for-word string, and then we can start playing with it, and see if we can adapt that a bit. So for example, here in the middle, we really don't care what this text is, so we can just say that that's going to be any character, followed by an asterisk, and we'll make it not greedy. Now, I'm going to put parentheses around it, just to sort of keep it separate from everything else that we're doing, but you don't have to. Okay, so that still matches our example here, but it doesn't match any of our other ones. We need to add more flexibility in here, so instead of just having strong here, let's put parentheses around this, and let's make it into an alternation. Let's say it can also be em, for example. We can do the same thing here at the end: we'll put parentheses around it and add em there as well.
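The first three steps (literal string, lazy wildcard in the middle, alternation on the tag name) might look like this in Python. The practice text is a made-up stand-in, since the movie's exact sample lines are not in the transcript, and the `^`/`$` anchors with `re.MULTILINE` are an assumption about how the multi-line anchors are being used.

```python
import re

# Hypothetical practice text, one tag per line.
text = "<strong>Bold text</strong>\n<em>Emphasized text</em>\n<b>Old bold</b>"

patterns = {
    # Step 1: the pasted, literal string.
    "literal": r"^<strong>Bold text</strong>$",
    # Step 2: any characters in the middle, matched lazily.
    "lazy wildcard": r"^<strong>(.*?)</strong>$",
    # Step 3: allow either strong or em in both the opening and closing tag.
    "alternation": r"^<(strong|em)>(.*?)</(strong|em)>$",
}

def matched_lines(pattern, text):
    """Return every line of text that the pattern matches in full."""
    return [m.group(0) for m in re.finditer(pattern, text, re.MULTILINE)]

for name, pattern in patterns.items():
    print(name, "->", matched_lines(pattern, text))
```

With this sample, the first two patterns should find only the strong line, and the alternation should find the strong and em lines but not the b line.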
So now we've matched the first two tags, but we've actually introduced a problem that we may not have realized, which is that we could also have strong, mistake, and then em, and that's treated as a perfectly valid match. Our tags are no longer balanced, and we're still getting a match. We want to make sure that we do have balanced tags, especially because it is possible to nest one tag inside another. We don't want to mistakenly grab the wrong tag. That's also why we used the lazy operator here after our star: to make sure that we didn't consume too much, and that we grabbed the next tag that matches.
So one way we can do that is by using backreferences; we talked about those before. We are already capturing this group here. We've got parentheses around it, so it's being captured for us. Backslash 1 will now grab whatever value is found there, and reuse it again, as if we were copying it and pasting it into this spot. So now it matches strong and strong, and em and em, but not the weird combination where we had strong, and em after it. Okay, so what about our other ones down here? Bold, italics; we could just keep going. We could just keep listing all the tags that we wanted to define here, and certainly if we were trying to find certain tags, then that would make sense. We would want to itemize the tags. But there are a lot of HTML tags out there.
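To make the difference concrete, here is a small Python sketch comparing the two-alternation pattern with the backreference version. The sample strings are invented for illustration, and `re.fullmatch` is used as a stand-in for anchoring each line.

```python
import re

# Closing tag is a second, independent alternation: tags need not balance.
unbalanced_ok = re.compile(r"<(strong|em)>(.*?)</(strong|em)>")
# Closing tag must repeat whatever group 1 captured.
balanced = re.compile(r"<(strong|em)>(.*?)</\1>")

# Hypothetical sample lines; the last one mixes strong and em.
samples = [
    "<strong>Bold text</strong>",
    "<em>Emphasized text</em>",
    "<strong>mistake</em>",
]

for line in samples:
    loose = unbalanced_ok.fullmatch(line) is not None
    strict = balanced.fullmatch(line) is not None
    print(f"{line!r}: alternation={loose}, backreference={strict}")
```

Only the backreference pattern rejects the mismatched strong/em line; both accept the two balanced lines.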
What I want to write is something that's more general; something that will match any HTML tag. So we know we could do that by just changing this to be a wildcard, make it plus, and we could even make it lazy, just like we had before, and that'll make sure it doesn't grab too much code, and that works. That actually does match our first four. But there's an even better way that we can approach this. This wildcard always feels a little bit sloppy to me; you always want to be careful with that wildcard. So instead, I'm going to change this definition here to say, well, you know what? Actually, this can be any character, but it can't be that closing angle bracket.
As long as we haven't hit a closing angle bracket yet, go ahead and keep using it. It's a little bit faster, because it just checks every time: have we got to the closing angle bracket yet? No. Have we got to the closing angle bracket yet? No. And then when we finally do hit one, then it says, okay, we finally are at the closing angle bracket; we're ready to move on to the next match. And I don't need this to be a lazy quantifier any more, because I've been a little bit more specific here about what I'm looking for. Now, notice that it's not matching the next one. Why is that? Well, it's because when we have these attributes in there, they break our backreference.
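A rough Python sketch of this step, using invented sample lines: the lazy wildcard and the negated character class both match the four plain tags, and both fail on a tag with attributes, because the attributes end up inside capture group 1.

```python
import re

# Lazy wildcard for the tag name versus "anything except a closing bracket".
wildcard = re.compile(r"<(.+?)>(.*?)</\1>")
negated = re.compile(r"<([^>]+)>(.*?)</\1>")

# Hypothetical sample lines; the last one carries attributes.
samples = [
    "<strong>Bold text</strong>",
    "<em>Emphasized text</em>",
    "<b>Old bold</b>",
    "<i>Old italic</i>",
    '<span id="intro" class="lead">Styled text</span>',
]

wildcard_hits = [s for s in samples if wildcard.fullmatch(s)]
negated_hits = [s for s in samples if negated.fullmatch(s)]
print(len(wildcard_hits), len(negated_hits))

# What group 1 grabs from the opening tag of the attribute example:
captured = re.match(r"<([^>]+)>", samples[4]).group(1)
print(repr(captured))
```

The last line shows why the backreference fails: the closing tag would have to repeat the attributes as well as the tag name.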
Now this whole bit right here is being captured, and being used in the backreference. So if the closing tag also had id and class after it, well then we might get a match, but it doesn't. What we need to do, then, is separate this first word from any attributes that might appear after it. So one good way to do that is to say, alright, what could this first word be? Let's go ahead and move our capture to that. Let's not try to capture this. We're still going to use this for anything that might come after it, but for this first word (the part that we actually want to capture) we're going to instead say, well, that can be any letter, A to Z, and that does indeed have to be a letter, and then followed by a second character, which can be A to Z, or lowercase a to z, or 0 to 9. It's essentially any word character without the underscore. That's because there is a tag called H2.
That second one can occur zero or more times, and then after that, just to make sure that we've actually ended that word, let's say that there should be a word boundary there as well. So once we get to the word boundary, we'll know we've got the first word, and we can capture it, and then we've got our second part here that takes care of all of those attributes, and then our center section, followed by our tag at the end. So now you can see that our backreference is working, but in the process, we broke it for all of these up here. That's because they don't have these attributes up here, so instead of one or more, we have to say it's zero or more times.
It's a possibility that there are no characters after the word boundary. The very next thing after the word boundary might be this angle bracket. So finally now, we've got it matching all five of these. Alright, last of all, let's look at this possibility for a self-closing tag. Now, you certainly could just search for that separately, or you might want to put in an alternation. Let's put it right here; we'll make that the alternation. So this one could be hr, slash, with a closing angle bracket, and now it matches as well.
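The steps just described could be written out in Python roughly as follows. The sample lines are invented, the name `TAG_NAME` is just a label used here, and placing the hr alternative at the end of the whole expression is an assumption, since the transcript does not show exactly where it was typed.

```python
import re

# Hypothetical sample lines: four plain tags, one with attributes, one self-closing.
samples = [
    "<strong>Bold text</strong>",
    "<em>Emphasized text</em>",
    "<b>Old bold</b>",
    "<i>Old italic</i>",
    '<span id="intro" class="lead">Styled text</span>',
    "<hr />",
]

# A letter, then zero or more letters or digits (so h2 works), captured alone.
TAG_NAME = r"[A-Za-z][A-Za-z0-9]*"

# Attributes required (one or more characters after the word boundary).
requires_attrs = rf"<({TAG_NAME})\b[^>]+>(.*?)</\1>"
# Attributes optional (zero or more characters after the word boundary).
optional_attrs = rf"<({TAG_NAME})\b[^>]*>(.*?)</\1>"
# Same, plus a literal self-closing hr as an alternative.
with_hr = optional_attrs + r"|<hr />"

def hits(pattern):
    """Return the sample lines that the pattern matches in full."""
    return [s for s in samples if re.fullmatch(pattern, s)]

print(len(hits(requires_attrs)), len(hits(optional_attrs)), len(hits(with_hr)))
```

With these samples the counts should come out as 1, 5, and 6: only the attribute example, then all five paired tags, then the hr line as well.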
Now, let's revise this a little bit. Let's not be so specific about hr, because there are other possibilities that can be there as well. What we can actually do is just grab all of this here from this beginning section, and let's copy it. There we are, and then it actually can have attributes too. We don't need to capture this time; we don't need to do a backreference, so let's just grab this bit here that may or may not exist before we get to the end. Okay, so now we have the possibility of having attributes on our hr tag. But what happened to our other matches? Why did they stop working in this process? There's a simple alternation here; why did this suddenly stop working? And if you want to try it, take this back out, just this alternation, and you'll see that it does work, and then when you put it back in, it stops working again. Can you guess why? It's because we have this backreference here referring to capture number one.
What is capture number one? What's the first thing being captured? It's these parentheses that are grabbing the entire match. That's what's being referenced. So what we need to do is put a non-capturing group here at the beginning. Now it says, alright, don't capture this one, this is just an alternation. This is my first capture again, and so that's what the backreference now refers to. Be careful about that when you start adding in these parentheses. In fact, it's not a bad idea to go ahead and make these non-capturing as well, unless we're actually trying to capture them.
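Here is a hedged Python sketch of the capture-numbering problem and its fix. The pattern pieces and sample lines are assumptions reconstructed from the description, not the exact expression from the movie. One difference worth knowing: Python's `re` module is expected to reject a backreference to a group that is still open by raising `re.error`, rather than silently failing to match, so the helper below treats that error as "no match".

```python
import re

TAG_NAME = r"[A-Za-z][A-Za-z0-9]*"
PAIRED = rf"<({TAG_NAME})\b[^>]*>(.*?)</\1>"
SELF_CLOSING = rf"<{TAG_NAME}\b[^>]* />"

# Capturing parentheses around the alternation become group 1,
# so \1 no longer refers to the tag name.
broken = rf"({PAIRED}|{SELF_CLOSING})"
# Non-capturing parentheses leave the tag name as group 1.
fixed = rf"(?:{PAIRED}|{SELF_CLOSING})"

def safe_match(pattern, line):
    """Full-line match; an invalid pattern counts as no match."""
    try:
        return re.fullmatch(pattern, line)
    except re.error:
        # Raised when \1 points at a group that has not been closed yet.
        return None

# Hypothetical sample lines.
paired_samples = [
    "<strong>Bold text</strong>",
    '<span id="intro" class="lead">Styled text</span>',
]
self_closing_samples = ["<hr />", '<hr id="rule" />']

for line in paired_samples + self_closing_samples:
    match = safe_match(fixed, line)
    # group(1) is the tag name for paired tags, None for self-closing ones.
    print(line, "->", match.group(1) if match else "no match")
```

With the fixed pattern, group 1 holds the tag name for paired tags and is `None` for self-closing ones, since that alternative has no capture of its own.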
So now we've developed one regular expression that will match all of our HTML tags.