Join Kevin Skoglund for an in-depth discussion in this video The wildcard metacharacter, part of Using Regular Expressions.
We are ready to take a look at our first metacharacter, which is the wildcard metacharacter. I'm sure at one point you've played a card game where one of the numbers in the deck was wild. So let's say that the 2s are wild. Well, then if you have three aces and a 2, it's the same thing as having four aces, right? The 2 can match anything. That's the idea here behind the wildcard metacharacter. The metacharacter is the dot, the period by itself, and that matches any character except for a newline. Now you may be wondering, well, what's up with this newline thing? Well, it's because the original UNIX regex tools were line-based, so we're going to have to talk a little bit more about how to deal with newlines.
But most of the time the dot is going to match any character except for the newline character. So for example if we had h.t, that's the same thing as h wildcard t, and it will match hat, hot, and hit, but it will not match heat. It's only one single character. The same way as if we were playing cards and we had a 2, the 2 can't be a stand-in for three different aces; it's a stand-in for one ace. That's it.
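As a quick check of the h.t example, here is a minimal sketch using Python's built-in `re` module. The choice of Python and of `fullmatch` (which requires the whole word to match) is an assumption made for illustration; the video uses its own regex testing tool.

```python
import re

# The dot stands in for exactly one character between h and t.
pattern = re.compile(r"h.t")

words = ["hat", "hot", "hit", "heat"]

# fullmatch succeeds only if the entire word fits the pattern.
results = {word: bool(pattern.fullmatch(word)) for word in words}

for word, matched in results.items():
    print(word, "matches" if matched else "does not match")
```

Here hat, hot, and hit match, while heat does not, because the single dot cannot cover both the e and the a.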
It matches one single thing. This is the broadest match possible. It matches just about everything that it could be. It is probably the most common metacharacter used, and it's also the most common mistake that people make. You might have a regular expression 9.00, and you're thinking that's going to match $9.00, that it's going to match 9 with two decimal places after it. Well, it does. But it also matches 9500 and 9-00. Do you see how that works? It matches not only the period but these other things, because it's a wildcard.
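The 9.00 mistake is easy to reproduce. The sketch below again assumes Python's `re` module, and the list of candidate strings is made up for illustration; it only demonstrates the problem, not the fix.

```python
import re

# The dot here is a wildcard, not a literal period.
loose = re.compile(r"9.00")

candidates = ["9.00", "9500", "9-00", "900"]

# Collect every candidate that the pattern accepts in full.
loose_hits = [s for s in candidates if loose.fullmatch(s)]

print(loose_hits)
```

Everything except 900 gets through: 900 has only three characters, and the dot still needs one character of its own to match.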
It can match absolutely anything. Now we'll learn how to fix this mistake in the next movie, but it illustrates an important truth about regular expressions, and that is that the challenge of regular expressions is both in matching what you want and in matching only what you want. You don't want to be overly permissive about what you let through. You want to find the thing you're looking for, but only that thing; you don't want any false positives. Now let's try some of these. Let's try h.t to begin with. We've had that one before and for our test data, let's try hat, hot, hit, heat, hzt, h t. See, it matches all of those things.
It even matches the space. It didn't match heat. It said no, no match, because the regex engine went through and it said, h is a match, e, that's a match, but then the next character should have been a t, and when it wasn't, that's when it said, ah, that can't be right. And we could have h#t, h:t, any of those things will match. It matches absolutely anything: punctuation, symbols. Before we go on, let me just show you the line return thing. If I type h and then hit Return, followed by a t, notice here I have Dot matches all. It's the s option.
Dot matches all does match it. That takes away that restriction on the line return. That's a feature that was added later. So now instead of saying the dot matches any character except the line return, now with the s mode, it's dot matches any character including the line return. Most of the time you're not going to use that, but I just wanted you to see what it was. Let's try some other examples. Let's try this regular expression, .a.a.a, just three of them. There we go. Three dots, each followed by an a. And now let's try banana; it matches.
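Going back to the Dot matches all option for a moment, here is a minimal sketch of the same behavior in Python, where the s mode is exposed as the `re.DOTALL` flag. Using Python here is an assumption; the tool in the video presents it as a checkbox.

```python
import re

# An h and a t separated by a line return.
text = "h\nt"

# By default the dot refuses to match the newline.
default_match = re.search(r"h.t", text)

# With DOTALL (the "s" mode), the dot matches the newline too.
dotall_match = re.search(r"h.t", text, flags=re.DOTALL)

print(default_match is None)
print(dotall_match is not None)
```

Both lines print True: the default search finds nothing, and the DOTALL search finds the three-character match.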
Let's try papaya; it matches. Do you see that? So notice here we've now written a regular expression that matches both banana and papaya, because what do these two things have in common? The one common trait is that each has some character followed by an a, three times over. It wouldn't matter if those characters were something crazy. Let's say it's #a$a#a, it still matches that as well. We're starting to write expressions that start to zero in on common traits in text.
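The banana and papaya test can be sketched the same way in Python. The word bandana is not from the video; it is added here only as an assumed contrast case that lacks the character-then-a rhythm.

```python
import re

# Three pairs of "any character, then a".
pattern = re.compile(r".a.a.a")

samples = ["banana", "papaya", "#a$a#a", "bandana"]

# search looks for the pattern anywhere inside each sample.
matched = [s for s in samples if pattern.search(s)]

print(matched)
```

The first three samples match, and bandana does not, because the d breaks the alternating pattern.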
It does not, though, match abacab, for example. Actually, it does match, because it picks up the space in front of it. Notice that. It does not match abab. Notice where the space is. This is not the same thing as if we had a.a.a. Now it matches something different. So as an experiment, on your own, try writing something that will match silver, sliver, and slider. See if you can use wildcards to come up with a regular expression that will match all three of those.
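The difference between .a.a.a and a.a.a on abacab can be seen in a short Python sketch. The test string below is an assumption about what the on-screen test data roughly looks like, with abab and abacab separated by a space.

```python
import re

# abab and abacab, separated by a single space.
text = "abab abacab"

# The leading dot grabs the space in front of abacab.
with_leading_dot = re.findall(r".a.a.a", text)

# Without the leading dot, the match starts at the first a of abacab.
without_leading_dot = re.findall(r"a.a.a", text)

print(with_leading_dot)
print(without_leading_dot)
```

The first pattern returns a single match that begins with the space, while the second returns abaca with no space. Neither pattern matches abab on its own.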
- Creating flexible patterns using character sets
- Achieving efficiency when using repetition
- Understanding different types of search strategies
- Writing logical and efficient alternations
- Capturing groups and reusing them with backreferences
- Developing complex patterns with lookaround assertions
- Working with Unicode and multibyte characters
- Matching email addresses, URLs, dates, HTML tags, and credit card numbers
- Using search and replace to format a document