Learn how to find and manipulate text quickly and easily using regular expressions. Author Kevin Skoglund covers the basic syntax of regular expressions, shows how to create flexible matching patterns, and demonstrates how the regular expression engine parses text to find matches. The course also covers referring back to previous matches with backreferences and creating complex matching patterns with lookaround assertions, and explores the most common applications of regular expressions.
In this movie, we are going to learn about another kind of anchored expression, and that is using word boundaries. The metacharacters we are going to use are the lowercase b and the uppercase B, with a backslash in front of them (\b and \B). Lowercase b is a word boundary; that is the start or the end of a word. The uppercase B is not a word boundary. And that's the same pattern we have seen before, like with the word character shorthand. Lowercase w was a word character; the uppercase W was not a word character. Just like the other anchored expressions we have seen, they reference a position, not an actual character.
Whether or not a position is a boundary is based on a couple of conditions. There is a boundary right before the first word character in the entire string, so that's the first boundary you are going to have, and right after the very last word character in the string there is going to be a boundary as well. In between those two, every single time the text shifts between a word character and a non-word character, we're going to have another boundary. Remember, word characters are the capital letters A to Z, lowercase a to z, 0 to 9, and the underscore.
Same thing as that shorthand character class, backslash w. So anytime we switch between one of these things and something that's not one of these things, we have got another boundary. Support for these metacharacters is going to be in most regular expression engines, but not in the really early UNIX tools; the BREs. Essentially what that means is you can use them in egrep, but not in grep. Let's take a look at some examples. Just finding a simple word, we might say, alright, we have a boundary, followed by some word characters, followed by another boundary (\b\w+\b).
And when we apply that to a sentence, it will find four matches in the string This is a test: This is one of them, then is, a, and test. No spaces, no punctuation gets matched; just the word characters with the boundaries on either side. If we had abc_123, well, that would match the whole thing, because remember, underscore, and 1, 2, 3 are word characters. But in top notch, well, in that case there's actually two words, and four boundaries. There is a boundary before the T, there is a boundary after the P, a boundary before the N, and a boundary after the H. So we come up with two words: top, and notch.
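To try these examples outside of a browser tool, here is a minimal sketch using Python's re module. Python is an assumption on my part, since the movie itself works in RegexPal, and the helper names words and boundary_positions are made up for illustration.

```python
import re

# A whole word: boundary, one or more word characters, boundary.
WORD = re.compile(r"\b\w+\b")


def words(text):
    """Return every run of word characters that has a boundary on both sides."""
    return WORD.findall(text)


def boundary_positions(text):
    """Return the string offsets where \\b matches (positions, not characters)."""
    return [m.start() for m in re.finditer(r"\b", text)]


print(words("This is a test"))          # ['This', 'is', 'a', 'test']
print(words("abc_123"))                 # ['abc_123']
print(words("top notch"))               # ['top', 'notch']
print(boundary_positions("top notch"))  # [0, 3, 4, 9]
```

The last line shows the four boundaries in top notch as offsets: before the T, after the P, before the N, and after the H.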
Now, those are boundary examples. You can also have not a boundary. These aren't necessarily quite as useful as the boundary ones, but sometimes they can be. If we wanted to find capital This when it was not at a boundary for some reason, because there was something in front of it, then we would put the capital B in front of it, and we would not have a match in the case of just This is a test, because the first character in the string is counted as a boundary. It would find two matches if we used that same backslash w with the plus sign, with the capital B on either side of it instead (\B\w+\B). Those two matches would be the H and I inside This, and the E and S inside test, because none of those characters has a boundary on either side of it.
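The not-a-boundary case can be checked the same way. This is a small sketch in Python (my choice of language, not the movie's); the helper name inner_runs and the sample string xThis is a test are invented for illustration.

```python
import re

# Word characters with NO boundary on either side.
INNER = re.compile(r"\B\w+\B")


def inner_runs(text):
    """Return runs of word characters that sit strictly inside a word."""
    return INNER.findall(text)


print(inner_runs("This is a test"))  # ['hi', 'es']

# \BThis needs something word-like in front of the T.
at_start = re.search(r"\BThis", "This is a test")
after_x = re.search(r"\BThis", "xThis is a test")
print(at_start)         # None
print(after_x.group())  # This
```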
Every other character does have a boundary on one side or the other. The same reasoning gives us the E and S in test as a match. Let's take a look in RegexPal. I'm going to paste in a Shakespeare sonnet here, just so we have some text to work with. This is in the exercise files. And then, for a word, let's just try that simple one. We are looking for any word character repeated, with boundaries on either side. So you can see what it picked out as being the words that have the boundaries there. Each of the words is matched, but not the spaces, not the punctuation; those are not counted.
Now, if we wanted those, we could just put square brackets around the backslash w to make it a character set, and add to it. Let's say, for example, we wanted summer's to be included, so we put an apostrophe in there. It is including it in our match. That still doesn't mean there's not a word boundary there. There is still a word boundary after the word summer, and then another one after the apostrophe, before the letter s. How do we know that? Well, let's make it not greedy. If we put the question mark after the plus to make it lazy, you can see now it no longer matches summer's as one piece. When it's greedy it said ah! I'll go ahead and just keep consuming things, and ignore word boundaries, and keep consuming characters as a match to my pattern until I run out of things that are in my character set, and then I'll check to see if there's a word boundary.
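Here is a rough sketch of the greedy versus lazy comparison in Python. The sample text a summer's day is an assumption standing in for the sonnet from the exercise files, and the pattern names are mine.

```python
import re

sample = "a summer's day"  # stand-in text, not the actual exercise file

# Greedy: consume as many set characters as possible, then check the boundary.
GREEDY = re.compile(r"\b[\w']+\b")
# Lazy: check for the closing boundary after every single character consumed.
LAZY = re.compile(r"\b[\w']+?\b")

greedy_matches = GREEDY.findall(sample)
lazy_matches = LAZY.findall(sample)

print(greedy_matches)  # ['a', "summer's", 'day']
print(lazy_matches)    # ['a', 'summer', "'", 's', 'day']
```

The lazy version stops at the first boundary it can find, which is how we can see that there really is a boundary on each side of the apostrophe.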
This way it keeps checking constantly; every single time it consumes another character, that lazy expression makes that check again. Let's just try another example that we've seen before. Let's take this back out. For starters, let's just do backslash w, plus, with an s after it (\w+s), and let's do We picked apples. We did that before, and it matches all plural words; everything that's word characters, followed by a literal S. Before, we talked about the efficiency of that, and the efficiency of using repetition. One way we can really improve the efficiency of that is to put the boundaries on either side of it (\b\w+s\b). That says we are looking for a whole word; don't look for things that are partial words, don't waste your time with those, only zero in straight on the whole word, and that does give us quite a bit of speed improvement.
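As a quick check of what the boundaries change, this Python sketch runs both versions of the pattern. The second sample string, a test, is my own addition to show a partial-word match; it is not from the movie.

```python
import re

LOOSE = re.compile(r"\w+s")        # word characters followed by a literal s
WHOLE = re.compile(r"\b\w+s\b")    # same, but only as a whole word

sentence = "We picked apples"
print(LOOSE.findall(sentence))  # ['apples']
print(WHOLE.findall(sentence))  # ['apples']

# Assumed extra sample: without boundaries, a partial word can match.
partial = "a test"
print(LOOSE.findall(partial))   # ['tes']
print(WHOLE.findall(partial))   # []
```

On We picked apples both patterns find the same match; the difference is in how much work the engine does to get there, and in whether partial words like tes can slip through.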
Let's look at it. So, if the parser is going through the sentence, We picked apples, it starts with the W, and it says alright, I've got my word boundary. That condition's met. Do I have my second condition? The second expression, which is this repeated word character: I do. Now it goes to the E; that's a repeated word character. Now it goes to the space, and says alright, that is not a word character, and it's not an S, so it fails to match, and it backtracks. At the E, it says alright, do we have a word boundary here? And actually, we don't really have one there. The word boundary really occurs right after that; there's an ending word boundary, so it actually waits until it gets to the next position. Right there it says alright, we have got a word boundary, but this is not a word character, so it keeps moving until finally it finds the condition again where it's a word boundary, followed by a word character; picked.
It works its way along, just like we saw before. This time, when it gets to the space it says, this is not a word character; this is not an S. So it does backtrack to the I, just like it did before. No difference here, but here is where it changes. It's not trying to match a word character; it's looking for a word boundary. No word boundary, no word boundary, no word boundary, no word boundary, no word boundary, ah! I have a word boundary, but I don't have a word character. Now I have a boundary, followed by a word character again.
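To get a rough feel for how many starting points the boundary rules out, here is a hedged Python sketch. It is not a trace of the engine; it only lists the positions where the first part of each pattern could succeed. It leans on a lookahead, (?=\w), purely as a counting device, and the function names are hypothetical.

```python
import re

sentence = "We picked apples"


def starts_with_boundary(text):
    """Offsets where a boundary is immediately followed by a word character."""
    return [m.start() for m in re.finditer(r"\b(?=\w)", text)]


def starts_without_boundary(text):
    """Offsets where any word character sits, i.e. where \\w+ could begin."""
    return [m.start() for m in re.finditer(r"(?=\w)", text)]


print(starts_with_boundary(sentence))          # [0, 3, 10]
print(len(starts_without_boundary(sentence)))  # 14
```

With the boundary in front, only the start of We, picked, and apples are worth trying; without it, every one of the 14 word characters is a candidate starting point.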
Do you see all that backtracking that it skipped? It no longer backtracked and tried icked, cked, ked, ed, d; it left all that out, and just went 'til it found the start of the next word. Much more efficient. And then, of course, it works its way along until it matches apples.
Now, there is one important word of caution that I want to give you, and that is that a space is not a word boundary. In regular grammar, the purpose of having a space in a sentence is to denote the boundary between words so that all the words don't just run together.
But in regular expressions, that's not the way it is. A word boundary references a position, not an actual character; it doesn't represent that space. So, for example, if we had the string apples, space, and, space, oranges, then the pattern apples, boundary, and, boundary, oranges (apples\band\boranges) is not a match. The way it would match would be if you had apples, boundary, space, boundary, and, boundary, space, boundary, oranges (apples\b \band\b \boranges).
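The apples and oranges example can be checked with a couple of lines of Python (again, my choice of tool, not the movie's). The variable names are just for illustration.

```python
import re

text = "apples and oranges"

# Boundaries alone do not stand in for the spaces, so this cannot match.
without_spaces = re.search(r"apples\band\boranges", text)

# The spaces have to be matched as real characters between the boundaries.
with_spaces = re.search(r"apples\b \band\b \boranges", text)

print(without_spaces)       # None
print(with_spaces.group())  # apples and oranges
```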
It's easy for us to think of spaces as being boundaries, but a word boundary is something different in regular expressions. It really is the point at which the text switches from a non-word character to a word character, or the other way around.