Join Kevin Skoglund for an in-depth discussion in this video Defining a character set, part of Using Regular Expressions.
In the last chapter we learned about matching single characters, and we also saw our first metacharacter, the wildcard. In this chapter, we'll talk about character sets. In a way, the wildcard is a character set too; it's just a character set of all characters. As we saw, that results in really broad matches. What we want to do instead is narrow our expression so that it matches less. Remember, the two tricks to regular expressions are matching what you want and matching only what you want. So we need the ability to be more specific about what should match, so that it doesn't just match everything indiscriminately. The way we're going to do that is by learning a couple more metacharacters that help us define a character set: the opening and closing square brackets, [ and ].
These square brackets indicate a character set, which will match any one of several characters: the characters that are inside the set. But this is very important: it will match only one character. The order of the characters inside the set does not matter; the set is just a list of the items that can match. So for example, if we have [aeiou], that will match any one vowel. That's it. Let's say we have it inside a word, like gr, then e and a in square brackets, then y: gr[ea]y. That will match a literal g and r and a literal y, with exactly one letter in between. What can that letter be? Our character set tells us it's either an e or an a, so that will match grey with an e or gray with an a.
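The two sets just described can be tried out in a few lines. This is only a minimal sketch using Python's re module (the course itself does not assume any particular language), and the extra test words groy and greay are made up for illustration.

```python
import re

# [aeiou] matches exactly one lowercase vowel per match.
vowel = re.compile(r"[aeiou]")
one_vowel = vowel.fullmatch("e") is not None    # a single vowel fits
two_vowels = vowel.fullmatch("ea") is not None  # two characters do not

# gr[ea]y: literal "gr", one character from the set, literal "y".
gray = re.compile(r"gr[ea]y")
candidates = ["grey", "gray", "groy", "greay"]  # last two are made-up non-matches
spellings = [w for w in candidates if gray.fullmatch(w)]

print(one_vowel, two_vowels, spellings)
```

Only grey and gray survive, because the set stands for exactly one character and that character has to be an e or an a.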
Now notice that gr[ea]t does not match the word great. Don't be thrown off by that; the set stands for a single character. This pattern will still only match four letters: gr, then something, followed by a t, and that something has to be either an e or an a. Now these brackets are going to be a really big source of power for regular expressions, because we can be very specific about what should be allowed in that spot instead of just having that big open wildcard. Let's take a quick look at how the regular expression engine parses this kind of regular expression.
So once again, we have our sentence "The cow, camel and cat communicated," but this time our regular expression is not C-A-T; it's C followed by a character set that can be A, E, I, O, or U, and then a literal T after it. So of course it starts at the beginning, and you can assume it will move along character by character through there (we've already talked about how that works) until it finally gets to this C. When it gets to this C in camel, it matches, and says okay, I've got a literal C there. Let's move to the next character. Is this character in that character set? Is it one of the characters that's been defined? Yes, it is, so now it moves forward to the next character and says, is that the literal T that comes after the character set? It's not, so then it backtracks to the A and now it says, all right, is that the C that I'm looking for? It's not, so it keeps moving along, and it works its way down till it finally gets to the word cat, and then it finally makes the match. It says ah! Here is an A that's inside the set. Here's the literal T. Now I have a match.
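The scan just described can be imitated by hand. The sketch below is a deliberately simplified stand-in for a real engine, written in Python as an assumption: the helper name scan_c_vowel_t is hypothetical, and it only handles this one pattern. It tries every starting position, checks for a literal c, then a character from the set, then a literal t, and gives up on that position as soon as one check fails.

```python
import re

SENTENCE = "The cow, camel and cat communicated"

def scan_c_vowel_t(text):
    """Hypothetical helper: hand-rolled check for c, one vowel, then t."""
    vowels = set("aeiou")
    starts = []
    for i in range(len(text) - 2):
        if text[i] != "c":             # need a literal c first
            continue
        if text[i + 1] not in vowels:  # next character must be in the set
            continue
        if text[i + 2] != "t":         # then a literal t, else abandon this start
            continue
        starts.append(i)
    return starts

manual = scan_c_vowel_t(SENTENCE)
engine = [m.start() for m in re.finditer(r"c[aeiou]t", SENTENCE)]
print(manual, engine)
```

Both lists hold the same two positions: the word cat, and the cat inside communicated. The attempts at cow and camel are abandoned at the third character, just as in the walkthrough.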
Then of course it does the same thing as it moves through communicated. So the process of the match is still the same thing; it's just that now that we have this character set, it's going to use the set to see if something should match instead of a literal character. Let's try a few out. So let's just try our examples there. Let's say we have A, E, I, O, and U in a set, so we're going to match any one of those characters. And I'm going to try bananas and peaches and apples. Now notice here that it matched the A, the A, and the A. That's because I have Global turned on, right, so it matches all of them. And then in Peaches, notice the EA: it matched two times.
The colors let you know that it's actually two different matches. The E matched and the A matched. It's not matching E and A together; it's only one character. Notice also that this A here is not matched: the capital A in Apples. This is case sensitive unless we check the case-insensitive option. The same thing is true inside character sets. And now it suddenly does match; A, E, I, O, U will match regardless of whether it's uppercase or lowercase. All right, let's try with gray, G-R-E-Y and G-R-A-Y, and let's change our match here so that now we're going to match gr and y on the outside, and in here we're going to look for E or A, right.
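The vowel test on the fruit names can be reproduced outside the demo tool. This is a sketch in Python, with the sample text "Bananas Peaches Apples" assumed from the narration: re.search stands in for a non-global match, re.findall for Global turned on, and the re.IGNORECASE flag for the case-insensitive checkbox.

```python
import re

text = "Bananas Peaches Apples"  # assumed sample text, based on the narration

# Without "global": only the first match is reported.
first_only = re.search(r"[aeiou]", text).group()

# With "global": every match, one character each. The e and a in
# Peaches are two separate matches, and the capital A is skipped.
all_matches = re.findall(r"[aeiou]", text)

# Case-insensitive: the capital A in Apples now matches as well.
any_case = re.findall(r"[aeiou]", text, flags=re.IGNORECASE)

print(first_only)
print(all_matches)
print(any_case)
```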
So now I can match anything that is E or A. It doesn't matter if we have more things in the set: B, C, D, right, it doesn't make any difference. Notice it also doesn't make a difference what order they are in. If I have A then E, that doesn't make any difference either. Another one: let's try with great. That was the other example we had. So if we have the pattern ending in a t, and we put great in the text here, notice that it does not match. If we want it to match, we would have to have another character set here. We could do it that way. Now it does match: GR followed by one character which is an E or an A, followed by another character that's an E or an A, followed by a T. See how that works? Be careful though. What this also matches, of course, is graet, greet, and graat, right? See why that's true? Those are the different combinations that we can come up with by doing that.
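Those combinations are easy to check mechanically. Here is a small sketch, again in Python as an assumption; the word grit is an extra made-up non-match.

```python
import re

words = ["great", "graet", "greet", "graat", "grit"]  # "grit" is a made-up extra

# One set: exactly one character between "gr" and "t", so "great" fails.
one_set = [w for w in words if re.fullmatch(r"gr[ea]t", w)]

# Two sets: every e/a combination in the two middle positions matches.
two_sets = [w for w in words if re.fullmatch(r"gr[ea][ea]t", w)]

# Order inside the set, and extra members, make no difference for grey/gray.
order_irrelevant = all(
    re.fullmatch(p, w)
    for p in (r"gr[ea]y", r"gr[ae]y", r"gr[eabcd]y")
    for w in ("grey", "gray")
)

print(one_set, two_sets, order_irrelevant)
```

The second list includes graet, greet, and graat along with great, which is exactly the over-matching to watch out for.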
So just be careful when you build this to think about the other things that it might also match, the different combinations that you might be able to come up with. Let's try one last one. Let's just use a string. We'll just say Hello, and then up here let's type in something that will match that. To match the capital letter that might be at the beginning of the word, we're going to use a set with ABCDEFGHIJKLMNOPQRSTUVWXYZ (wow, there we go) followed by the rest of the word, so that it matches Hello. Wow! That was a lot of typing to get all of these uppercase letters, right? If I wanted to do the same thing for the next letter, I'd have to do it all over again with lowercase letters.
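For reference, the long-hand set looks like this in code. This is only a sketch: building the set from Python's string.ascii_uppercase constant is a convenience assumed here to avoid typing all 26 letters, and it is not what the demo does.

```python
import re
import string

# Same set as typing [ABCDEFGHIJKLMNOPQRSTUVWXYZ] by hand.
upper_set = "[" + string.ascii_uppercase + "]"
pattern = re.compile(upper_set + "ello")

capitalized = pattern.fullmatch("Hello") is not None  # H is in the set
lowercase = pattern.fullmatch("hello") is not None    # h is not in the set

print(upper_set, capitalized, lowercase)
```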
Fortunately, there's a much simpler way to do that: we can use character ranges inside our character sets. We'll take a look at how to do that in the next movie.
- Creating flexible patterns using character sets
- Achieving efficiency when using repetition
- Understanding different types of search strategies
- Writing logical and efficient alternations
- Capturing groups and reusing them with backreferences
- Developing complex patterns with lookaround assertions
- Working with Unicode and multibyte characters
- Matching email addresses, URLs, dates, HTML tags, and credit card numbers
- Using search and replace to format a document