In the last movie we saw how Unicode allows us to represent multibyte characters, which is useful when we need characters that are outside of the Roman alphabet. But while Unicode is a great way of representing characters, it does create complications for regular expressions. First of all, there's the simple fact that words can be spelled in multiple ways. You could have cafe with or without the accent: cafe or café. So if we were searching for only one of those, we would miss the other spelling. And then, more importantly, the same spelling can be encoded in multiple ways. Café can be encoded as either four or five characters: with é as a single character, or as a plain e followed by a separate accent character.
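To make the four-versus-five point concrete, here is a small Python sketch. The variable names are just for illustration, and the escapes are written out so the two spellings can't be confused on screen.

```python
# Two encodings of the same visible word.
precomposed = "caf\u00e9"   # accented e as the single code point U+00E9
decomposed = "cafe\u0301"   # plain e (U+0065) followed by combining acute accent U+0301

print(len(precomposed))            # 4
print(len(decomposed))             # 5
print(precomposed == decomposed)   # False: they look alike but are different strings

# List the code points of the five-character spelling.
code_points = [f"U+{ord(ch):04X}" for ch in decomposed]
print(code_points)  # ['U+0063', 'U+0061', 'U+0066', 'U+0065', 'U+0301']
```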
We need to be able to look for both of those character encodings. And that creates problems for wildcard matching. Remember that when we did our wildcard, we said it matches any one character. Well, what happens if something is encoded as two characters? What is our wildcard supposed to do then? It also creates some issues for backtracking. If we're moving through the string café and we get to the é and it's not a match, does the engine back up one character or two characters? That's something that we may not think about, but the regular expression engine has to deal with it. And then, perhaps most importantly, there's just the fact that Unicode is relatively new.
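Here is a rough sketch of the wildcard problem, using Python's re module as a stand-in engine (other engines may behave differently). In Python the dot matches one code point, so it covers the single-character é but leaves the combining accent of the two-character form unmatched.

```python
import re

precomposed = "caf\u00e9"   # four characters
decomposed = "cafe\u0301"   # five characters: plain e plus combining accent

pattern = re.compile(r"caf.")   # '.' stands for exactly one character

# fullmatch requires the pattern to cover the whole string.
wild_precomposed = pattern.fullmatch(precomposed)
wild_decomposed = pattern.fullmatch(decomposed)
print(wild_precomposed is not None)   # True
print(wild_decomposed is not None)    # False: the accent is left over

# A plain search still succeeds, but it stops after the bare e.
partial = pattern.search(decomposed)
print(len(partial.group()))           # 4
```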
It was a pretty recent solution to a problem that grew over time. Regular expressions go back to the late 1960s and the birth of UNIX, so they've been around a lot longer. So how do we handle these characters in regular expressions? Well, there's a Unicode indicator, the backslash u, that indicates that we're about to work with a Unicode character. So we have \u followed by a four-digit hexadecimal number. That's 0000 through FFFF; those are the possibilities.
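As a quick illustration of the notation, Python's re module happens to accept the same \u plus four hex digits form inside a pattern. This is only a sketch in one engine; the name e_acute is made up for the example.

```python
import re

# The four hex digits are simply the character's number written in base 16.
code_point = int("00E9", 16)
print(code_point)                      # 233
print(chr(code_point) == "\u00e9")     # True: both are the accented e

# A raw-string pattern containing \u00E9 matches that one character.
e_acute = re.compile(r"\u00E9")
found = e_acute.search("caf\u00e9")
print(found.start())                   # 3: the fourth character of the word
```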
Perl and PHP support it, but they use the lowercase x instead of the u. So you can do the same thing with a different character; instead of u for Unicode, they use x (in those languages the four-digit form is typically written with braces, as in \x{00E9}). It's not supported in older UNIX tools. Any of those old tools from before Unicode came about are not going to support it, for the most part. Let's try it out; I think it'll make more sense to you. Inside regexpal, let's start by just entering a regular expression. Let's just look for cafe, like this. Then we're going to type in cafe, and then cafe again, and this time I want the accented e, which is going to be the Option key plus e, followed by e again.
So that tells it, hey, I'm about to do an accent, and then you actually type the letter that gets the accent. And then after that, let's type it one last time, but this time I want to type the é as two characters: cafe, and then right after that I want to type the accent. Now this is a little bit tricky, but I'm going to show you how to do it on a Mac. On the Mac, under the Edit menu, they give you Special Characters as an option. If you choose that, it pops up all the possible characters that you could type. Now I've got it set to Code Tables here, so that it shows me the character numbers.
If you scroll down till you get to the 0300 row, you'll see that here's that accent. That's Unicode 0301. Double-click on it and you'll see that it added that accent over the e. Now it looks the same, but it's actually two different encodings. And you can see the problem that it created for the regexpal engine, for its code coloring. It told me that this is a match even though it's not really the whole word. Why is it a match? Well, it matches those first four characters, even though it doesn't match the fifth one. You see how that works? So now in the regular expression, let's change this, and let's instead say that we're looking for \u followed by 00E9.
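The partial match is easy to reproduce outside regexpal. This sketch uses Python's re module, so the highlighting details differ, but the idea is the same: a plain cafe pattern finds the first four characters of the five-character spelling and skips the single-character é spelling entirely.

```python
import re

samples = [
    "cafe",          # no accent
    "caf\u00e9",     # accented e as one character
    "cafe\u0301",    # plain e plus combining accent
]

plain = re.compile(r"cafe")

# Record the matched span (or None) for each sample.
spans = []
for text in samples:
    m = plain.search(text)
    spans.append(m.span() if m else None)

print(spans)  # [(0, 4), None, (0, 4)]
```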
Now look which one it picked. Now it said, all right, I'm looking for this specific encoding. It didn't match the second one, just that first one. Let's try the other encoding, \u0065. Remember, I told you that's just a plain old e; that's the character code for it. If we want the accent over it, we have to put \u0301 after it. Remember, 0301 is the number we saw in Special Characters for that accent. Now it matches just this version, not that version. So you can see why it's tricky, and why this creates problems.
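The same experiment can be sketched in Python, whose re module accepts \u escapes in patterns. The pattern names single and pair are just labels for this example; each pattern finds only its own encoding.

```python
import re

precomposed = "caf\u00e9"   # accented e as U+00E9
decomposed = "cafe\u0301"   # U+0065 followed by U+0301

single = re.compile(r"caf\u00E9")        # looks for the one-character encoding
pair = re.compile(r"caf\u0065\u0301")    # looks for plain e plus combining accent

# Try both patterns against both spellings.
results = {
    "single/precomposed": single.search(precomposed) is not None,
    "single/decomposed": single.search(decomposed) is not None,
    "pair/precomposed": pair.search(precomposed) is not None,
    "pair/decomposed": pair.search(decomposed) is not None,
}

for label, matched in results.items():
    print(label, matched)
```

Only single/precomposed and pair/decomposed come back True, which is the same behavior the demo shows.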
The main thing I wanted you to get out of this was an introduction to the concept, so that you realize why regular expressions might or might not find some of these characters, and some tools so you'd be prepared to deal with it. In the next movie we'll talk about how you could match both of these by using wildcards and properties.