In the last movie we talked about Unicode: how it works and some of the problems it presents when working with regular expressions. One issue we didn't address is how to use the wildcard character with Unicode. Remember that the single-dot wildcard matches only one character. So if we're dealing with a character that is actually stored as two characters side by side, such as a letter followed by a combining accent, then our wildcard won't work anymore. There are a couple of different solutions that one could use for this. The first and probably simplest one is just \X. You'll remember that Perl and PHP use a lowercase \x for Unicode instead of \u. Well, they use the uppercase \X for the Unicode wildcard. That matches any single character, regardless of how it's been encoded.
It also always matches line breaks. That's the same behavior we got from the dot when we used it with the s option, which makes the dot match everything. With \X that's always built in, so just make note of that. So, for example, caf\X matches both café with the accent and cafe without the accent. It handles both of those cases. Pretty slick, huh? One problem with it is the support. It's only supported in Perl and PHP at the moment.
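To make the problem that \X solves concrete, here is a minimal sketch in Python. It is an illustration only: Python's built-in re module does not support \X, so the sketch shows just the dot's behavior. The variable names are made up for this example.

```python
import re

precomposed = "caf\u00e9"   # é stored as one code point (U+00E9)
decomposed = "cafe\u0301"   # e followed by U+0301 COMBINING ACUTE ACCENT

# Both strings display as "café", but they differ in length.
print(len(precomposed), len(decomposed))          # 4 5

# The dot consumes exactly one code point, so "caf." covers only one form.
dot_pattern = re.compile(r"caf.")
print(bool(dot_pattern.fullmatch(precomposed)))   # True
print(bool(dot_pattern.fullmatch(decomposed)))    # False: the accent is left over

# The dot also skips line breaks unless the "s" option (re.DOTALL) is set.
print(bool(re.fullmatch(r"caf.", "caf\n")))             # False
print(bool(re.fullmatch(r"caf.", "caf\n", re.DOTALL)))  # True
```

In an engine that supports it, caf\X would match both spellings, because \X treats the e and its accent as one unit.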
I'd love to see more languages start supporting it, because I think it's a really useful option to have, but right now Perl and PHP are the only ones where you can use it. However, there is another set of options that is supported by more languages. It's not quite as flexible as this, but we can use properties, that is, Unicode properties. We search for a property using \p and then, in curly braces, the property that we want to look for. That matches all characters that have that property. So, for example, \p{Mark} with a capital M, or we can abbreviate it to just \p{M}. That matches any mark, such as an accent: not the letter itself, just the mark.
If we want a letter, we could use \p{Letter} or just \p{L}, and that would match any letter. Let's take a look at what those abbreviations are, so you can have them. The Unicode property for Letter is just L, Mark is just M, Separator (that's whitespace) is a capital Z, Symbol is a capital S, Number is a capital N, Punctuation is a capital P, and Other, something that's not one of those other things, is a capital C. If you use any of these, you can find the Unicode characters that have that property. That really helps you out, because now we can look for things in Unicode such as a letter followed by a mark.
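The one-letter abbreviations above are the first letter of each character's Unicode general category. As a rough sketch, Python's standard unicodedata module can report that category, which shows which property a given character would fall under. The PROPERTY_NAMES table and the unicode_property helper are names invented for this example, not part of any library.

```python
import unicodedata

# First letter of the general category -> property name from the list above.
PROPERTY_NAMES = {
    "L": "Letter",
    "M": "Mark",
    "Z": "Separator",
    "S": "Symbol",
    "N": "Number",
    "P": "Punctuation",
    "C": "Other",
}


def unicode_property(ch: str) -> str:
    """Return the top-level property name for a single character."""
    return PROPERTY_NAMES[unicodedata.category(ch)[0]]


# Print each sample with its full two-letter category and property name.
for ch in ["e", "\u0301", " ", "+", "5", "!", "\n"]:
    print(repr(ch), unicodedata.category(ch), unicode_property(ch))
```

The combining acute accent U+0301 comes back as a Mark, while the plain e comes back as a Letter, which is the distinction \p{M} and \p{L} rely on.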
We can also do it another way. We can use the not-property identifier. That's an uppercase \P followed by the property, and it matches any character that does not have that property. So, for example, with café, we can look for anything that is not a mark, such as the letter e, followed immediately by anything that is a mark: \P{M}\p{M}. That takes care of the case where we have an e followed by an accent, regardless of which way the accent might face, or whether it's an a with an accent. This basically says find any accented character.
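Here is a hedged sketch of what "not a mark, followed by a mark" means in practice. Python's built-in re module does not support \p or \P, so this emulates the idea by hand with unicodedata instead of running a real pattern. The function names is_mark and find_letter_plus_mark are hypothetical.

```python
import unicodedata


def is_mark(ch: str) -> bool:
    # General categories Mn, Mc and Me all start with "M" (the Mark property).
    return unicodedata.category(ch).startswith("M")


def find_letter_plus_mark(text: str) -> list[str]:
    # Emulates the idea of \P{M}\p{M}: one non-mark code point
    # followed immediately by one mark code point.
    hits = []
    for i in range(len(text) - 1):
        if not is_mark(text[i]) and is_mark(text[i + 1]):
            hits.append(text[i:i + 2])
    return hits


print(len(find_letter_plus_mark("cafe\u0301")))   # 1: the e plus its combining accent
print(len(find_letter_plus_mark("caf\u00e9")))    # 0: é is a single code point here
print(len(find_letter_plus_mark("cafe")))         # 0: no mark at all
```

Note the second case: when the accented letter is stored as one precomposed code point, there is no separate mark to find, so this approach only catches accents stored as a base character plus a combining mark.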