In this movie, we are going to use regular expressions to help us identify words when they're in close proximity to other words inside a text. This is more powerful than a typical Find & Replace, where we can simply say find this word when it's in front of this word. Now we can say find it, and there may be some space in between. There may be a few other words thrown in there. We want to know if it's nearby. So to start with, let's get a text to work with. In the exercise files I have given you Ralph Waldo Emerson's Self-Reliance text. I'm going to put it in RegexPal, and we are not going to use our multi-line anchors this time like we've been doing for all the other ones, because we are going to be checking for things that are not anchored.
So you can check for any two words inside this text. The words that I am going to look for are occurrences of A -- just the simple letter A as a single word -- and man. So any time we have a man. So we want it to find a man, but we also want it to find here where we have a perfect man, and we have the word perfect in between. So let's write a regular expression to do that. Let's put in A, and I am going to put in a character set here, because it can be capital or lowercase. And then after that, wildcard. Wildcard with zero or more characters. And then man can also be capitalized or not, and now you can see that it made a bunch of matches for me.
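As a rough sketch of the expression described so far, here it is in Python's `re` module rather than RegexPal. The sample text is made up for illustration and is not from the Emerson essay.

```python
import re

# Hypothetical sample text (not taken from the exercise files).
text = "It is a man.\nHe was a perfect man.\nNo match here."

# [Aa]   : the letter a in either case
# .*     : wildcard, zero or more characters (stays on one line by default)
# [Mm]an : "man" capitalized or not
pattern = re.compile(r"[Aa].*[Mm]an")

matches = pattern.findall(text)
print(matches)
```

Notice that the second match starts at the "a" inside "was", since nothing yet forces the A to be a whole word.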
You can scroll through these, and see all of the matches that it made. Why did I use the star here? You don't have to; you could use the plus sign. If we were looking for two words that could be right next to each other, like peanut butter, and we were looking for peanut, and butter, and we cared that there could be nothing in between them, well then you'd want to use the star. I think most times for what we were doing here, though, we're not looking for A, M, A, N, when it's all run together, we can use the plus sign. We can also make it a little more readable by using grouping, just to separate those words out. And if you do that, you also, of course, want to make this a non-capturing group, because that's going to be a little more efficient.
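To make the star-versus-plus choice concrete, here is a small sketch in Python's `re` module. The exact placement of the non-capturing groups is an assumption, since the transcript does not spell it out.

```python
import re

# Star allows nothing in between; plus requires at least one character.
star = re.compile(r"peanut.*butter")
plus = re.compile(r"peanut.+butter")

star_hit = star.search("peanutbutter")      # matches: zero characters in between
plus_miss = plus.search("peanutbutter")     # None: plus needs at least one character
plus_hit = plus.search("peanut butter")     # matches: the space is in between

# Non-capturing groups (?:...) only separate the parts visually;
# they do not create numbered captures.
grouped = re.compile(r"(?:[Aa])(?:.+)(?:[Mm]an)")
grouped_hit = grouped.search("a perfect man")
grouped_miss = grouped.search("Aman")       # None: nothing between A and man
```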
We could use the Dot matches all mode. Here we have got this dot repeating. There is Dot matches all, which we can just click this checkbox for; it's the S modifier. And if you remember, that wraps across line breaks as well. So that means that the dot can now match a line return, otherwise it typically doesn't. Now once we turn that on, we see another problem, which you may have noticed before, which is that it's actually matching from the first A that it finds, all the way down to the last man that it finds. Here it is, because it's being greedy. We're seeing greediness in action here.
So we need to make this not greedy, and that will make it find the next occurrence of that second word -- in our case man -- that it can find. It's also finding A when it's not just by itself. We can use word boundaries to further improve it. Backslash, B, and backslash, B here, and backslash, B, man, and backslash, B at the end. Now it's finding just when it's a whole word. Now, obviously that wouldn't find peanut butter anymore, but we had already made that choice, and decided that we didn't care about finding those when they were right next to each other.
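The greedy and not-greedy behavior can be compared side by side. This is a sketch in Python, where `re.DOTALL` plays the role of the Dot matches all checkbox. The sample text is invented.

```python
import re

# Hypothetical sample text with a line break between "a" and "divine".
text = "He was a perfect man.\nThen came a\ndivine man."

# Greedy with dot-matches-all: runs from the first a to the last man.
greedy = re.compile(r"[Aa].*[Mm]an", re.DOTALL)
greedy_matches = greedy.findall(text)

# Lazy quantifier (+?) plus word boundaries (\b) around both words:
# stops at the next whole-word "man" after a whole-word "a".
lazy = re.compile(r"\b[Aa]\b.+?\b[Mm]an\b", re.DOTALL)
lazy_matches = lazy.findall(text)

print(greedy_matches)
print(lazy_matches)
```

The greedy version gives one long match, while the lazy version with boundaries gives "a perfect man" and the match that wraps across the line break.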
If you really did need that behavior, you could always put an alternation in here, and say that it's also possible to find the two words when they are immediately side by side. So that does it. Let's scroll down here, and let's look at our list. We have got a perfect man, down here is a man, a certain alienated majesty blah, blah, blah. Boy! That's a long one until we finally get down here for man. I probably don't actually want this Dot matches all, so let's take that away. That at least makes it a little better. Now it found this a man, and this one here. Scroll down; a divine man, a dinner, and -- boy, that's a lot of stuff before we finally get down here to this man.
Let's further improve it, so that we don't get that long thing. We could say that this could be any character except something like a period, or comma, or semicolon, and so on, and now it no longer finds that match. It does still find our other matches up here. Another improvement we can make to it, is to say, well we want to find any time it's within a certain number of characters. Right now we said we don't care how many characters are in between. We could put a quantifier on this, and say, well it actually can be between 1 and 20 characters long.
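Here is a sketch of both improvements in Python: excluding punctuation from what sits in between, and capping the number of characters. The sample sentence and the specific limits are only illustrative.

```python
import re

# Hypothetical sample text.
text = "He ate a dinner, and later spoke to the man. She saw a truly divine man."

# Anything in between: the first match runs across the comma.
anything = re.compile(r"\b[Aa]\b.+?\b[Mm]an\b")
anything_matches = anything.findall(text)

# Any character except period, comma, or semicolon in between.
# (A negated set like this can still match a line break.)
no_punct = re.compile(r"\b[Aa]\b[^.,;]+?\b[Mm]an\b")
no_punct_matches = no_punct.findall(text)

# Between 1 and 20 characters in between, then dialed down to 10.
within_20 = re.compile(r"\b[Aa]\b[^.,;]{1,20}?\b[Mm]an\b")
within_10 = re.compile(r"\b[Aa]\b[^.,;]{1,10}?\b[Mm]an\b")
within_20_matches = within_20.findall(text)
within_10_matches = within_10.findall(text)
```

With the limit at 10, "a truly divine man" drops out, because there are 14 characters between the two words.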
So any time it's between 1 and 20 characters, we'll find it. Or you could adjust it then, and say, oh you know what? Actually let's extend to 30 characters, or let's dial it back down to 10. That can control how much space is actually between these two words. Probably, though, we don't care about characters as much as we do, maybe, how many words are in between. So we could modify what's in between here, so that it would actually use words instead. So let's take all of this out here, and let's rethink this for a second. It's still going to be a non-capturing group, but the way that we identify a word is that it's a space, followed by word characters, followed by either a hyphen, or a space.
Now, you could include punctuation in there if you wanted to allow it to cross those punctuation boundaries. So now we've defined a word; now what we want is to repeat it. I am actually going to take the space out of the front, and put the space here, so it's going to be A, space, and then a word with a space or a hyphen, and then that will be repeated, and we can make it repeated, let's say it could be 0 to 5 times. So there can be up to 5 words, now, in between our two target words. And again, you can adjust this, and dial it down, and say, alright; I only want to allow one word in between, or I only want to allow three words in between.
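The word-counting version might look like this in Python. The helper name `a_near_man` and its `max_words` parameter are hypothetical, added only so the limit is easy to adjust.

```python
import re

def a_near_man(text, max_words):
    """Find "a" followed by "man" with at most max_words words in between."""
    # A, a space, then a repeated group: word characters followed by
    # a hyphen or a space, allowed 0 to max_words times, then man.
    pattern = re.compile(
        rf"\b[Aa]\b (?:\w+[- ]){{0,{max_words}}}[Mm]an\b"
    )
    return pattern.findall(text)

# Hypothetical sample text.
text = "a man, a perfect man, a perfect old man"

none_between = a_near_man(text, 0)
one_between = a_near_man(text, 1)
two_between = a_near_man(text, 2)
```

Because a hyphen also ends a word here, something like "a well-known man" counts as two words in between.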
Let's just try a perfect old man, and then we can see the difference. We do 2, and then we do 1; now suddenly it's not allowed. We have to have 2 inside there. And then last of all, remember that we can use lookahead assertions to control what actually gets matched. So, for example, if we are interested in matching the A here, and that's what we really want to focus in on, then everything that comes after it, we can just wrap all of this inside a lookahead assertion. Question mark, equals; there we go, and now it matches just the A. You can also use captures if you want to capture just certain parts of it to be able to work with it during a Find & Replace.
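A sketch of the lookahead version in Python follows. Everything after the A moves inside `(?=...)`, so only the A itself is part of the match. The sample text is invented.

```python
import re

# Hypothetical sample text: the second "a" is not followed by "man".
text = "a perfect man and a dog"

# The lookahead checks for up to 5 words and then man,
# but does not consume any of it.
pattern = re.compile(r"\b[Aa]\b(?= (?:\w+[- ]){0,5}[Mm]an\b)")

found = list(pattern.finditer(text))
spans = [m.span() for m in found]
matched_text = [m.group(0) for m in found]
```

Here only the first "a" is matched, at positions 0 to 1, because the lookahead fails for "a dog".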
Now here is the thing; this checks for A when it comes before man, but if you are looking for the opposite; if we wanted to find man when it came before A, or word two in front of word one, you'd want to flip the order around and search again, because it's very hard, almost impossible, to use lookbehind assertions in this case, because the length of the words in between there is indeterminate. And one of the restrictions on using lookbehind assertions in most regular expression engines is that they can't use variable lengths. So because it's an indeterminate length, we won't be able to use lookbehind assertions.
So if you wanted to match the other way around, you just have to flip the words, and search a second time. I think this can be a very powerful and useful technique. Don't think that it's just for essays either; it can also apply to searching inside code as well.
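Flipping the words and searching a second time could be sketched like this in Python. The helper `near` and its parameters are hypothetical, and it assumes both words are plain words made of word characters.

```python
import re

def near(first, second, text, max_words=5):
    """Find `first` followed by `second` with at most max_words words between."""
    pattern = re.compile(
        rf"\b{re.escape(first)}\b (?:\w+[- ]){{0,{max_words}}}{re.escape(second)}\b",
        re.IGNORECASE,  # stands in for the [Aa] and [Mm] character sets
    )
    return [m.group(0) for m in pattern.finditer(text)]

# Hypothetical sample text.
text = "A wise man once built a house."

forward = near("a", "man", text)    # "a" before "man"
backward = near("man", "a", text)   # flipped: "man" before "a"
```

The same helper would work on code as well as essays, for example looking for one identifier within a few words of another.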