In this chapter we'll discuss Unicode and multibyte characters. If you live in a country whose language consists of characters outside of the Roman alphabet, characters besides simple a to z, then this information is going to be essential. But even if you only work with the Roman alphabet, this is still going to be an important concept for you to understand, and to understand how it impacts the work you do with regular expressions. First, let's talk about what a single-byte character encoding looks like. The idea is that you use one single byte of storage space, either in memory or on a hard drive, to represent a single character. 1 byte is equal to 8 bits.
Each bit has a single on or off state, 0 or 1. With two possibilities per bit and 8 bits total, the number of combinations that we can have is 2 to the 8th power. That allows for 256 characters to be stored in a single byte. In the early days of computers, that's exactly what they used, and that accommodated capital letters A through Z, lowercase letters, digits, punctuation, and common symbols, like dollar sign or the copyright symbol. But over time, people started to realize that 256 really wasn't that many.
We knew we were going to need more than that for all the characters that are out there. So we started using double-byte storage. That's another kind of character encoding, one that uses 2 bytes for each character, which is 16 bits to represent a character. That doesn't mean twice as much; it's not 256 times 2; it's exponential, so it's 2 to the 16th power, which is 65,536 characters. So for a while there were these two encoding systems. You could use either single-byte encoding or double-byte encoding. But we started to realize that that really didn't fully address the problem.
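The arithmetic here is easy to check for yourself. This is just a minimal sketch in Python; the variable names are made up for illustration.

```python
# Number of distinct values that fit in a given number of bits.
BITS_PER_BYTE = 8

single_byte_values = 2 ** (1 * BITS_PER_BYTE)  # one byte  = 8 bits
double_byte_values = 2 ** (2 * BITS_PER_BYTE)  # two bytes = 16 bits

print(single_byte_values)  # 256
print(double_byte_values)  # 65536, not 256 * 2 = 512
```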
There are so many more characters than what's in the English alphabet. Of course, just in Latin characters, we have all the accents that go over various characters. I've shown you just the ones that go over a. There are also a number of symbols: less than or equal to, greater than or equal to, not equal to, the Euro symbol, the Pound symbol. And the Euro symbol actually didn't come along till later. We need to be able to accommodate those in our character encoding. And then there are all of the characters of all the languages around the world. As computers spread around the world, it was no longer okay to just deal with the English alphabet.
Suddenly we had to allow people to work in their word processor and be able to type Arabic or Chinese or Greek, whatever language they were comfortable with. So when you take all of those together, that's over 100,000 characters that we have to deal with. Remember, double-byte encoding dealt with about 65,000, so clearly it's not enough to handle those. So a new system was needed, and that's where Unicode comes in. Unicode's most common encoding, UTF-8, uses a variable byte size. For some characters it uses just one byte; for some characters it uses two; for others it uses three or four.
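You can see the variable byte size for yourself. This is a minimal sketch that assumes the UTF-8 encoding of Unicode, which the transcript does not name explicitly; the sample characters and the dictionary names are chosen only for illustration.

```python
# Sample characters, written as escapes so the source stays ASCII-only.
samples = {
    "A": "A",                # U+0041, plain Latin capital letter
    "e-acute": "\u00e9",     # U+00E9, e with an acute accent
    "euro": "\u20ac",        # U+20AC, Euro sign
    "emoji": "\U0001F600",   # U+1F600, a character beyond U+FFFF
}

# Encode each character as UTF-8 (an assumption) and count the bytes.
byte_lengths = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}

for name, count in byte_lengths.items():
    print(name, count)
# A 1
# e-acute 2
# euro 3
# emoji 4
```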
And it does that in a way that stays compatible with ASCII, the most basic of those old single-byte encodings. But Unicode allows for over one million characters, so that handles not just all the characters that we know we have to deal with now, but also allows us plenty of room for expansion in the future. So how does Unicode work? Well, Unicode is a mapping between the characters that we want to represent and a number. The way we represent that number is with a capital U and a plus sign followed by a hexadecimal number, usually four digits long. Hexadecimal means each digit can be 0 through 9 or a through f.
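The mapping between a character and its number can be sketched in a few lines of Python. The helper name code_point_label is hypothetical, made up for this example; ord and chr are Python built-ins.

```python
def code_point_label(ch: str) -> str:
    """Return the U+XXXX label for a single character (hypothetical helper)."""
    # ord() gives the character's number; 04X formats it as
    # uppercase hexadecimal, padded to at least four digits.
    return f"U+{ord(ch):04X}"

infinity = "\u221e"  # the infinity sign

print(code_point_label(infinity))  # U+221E
print(code_point_label("e"))       # U+0065
print(chr(0x221E) == infinity)     # True: chr() maps the number back
```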
If you've worked with HTML before, you may have used hexadecimal numbers to specify colors. So for example, if you had the infinity sign, the Unicode code point for that would be U+221E. So you can see the 221E is the unique part. A character doesn't have to be a single code point, either; it can be a combination. So for example e with the acute accent over it can be written in two ways. It can either be U+00E9, which is a single code point for an e with an accent over it, or we can actually write it as two code points: U+0065, which is just a plain-old e, followed immediately by U+0301, which is the accent that goes over it. And the combination of those two is equivalent to the other one, but they are stored differently.
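Here is a minimal sketch of those two ways of writing e with an acute accent. It uses Python's standard unicodedata module to convert between them; the variable names are only illustrative.

```python
import unicodedata

composed = "\u00e9"      # U+00E9: e with acute accent, one code point
decomposed = "e\u0301"   # U+0065 followed by U+0301 (combining acute accent)

# They look the same on screen but are stored differently.
print(composed == decomposed)           # False
print(len(composed), len(decomposed))   # 1 2

# Normalization converts one form into the other.
as_composed = unicodedata.normalize("NFC", decomposed)
as_decomposed = unicodedata.normalize("NFD", composed)
print(as_composed == composed)          # True
print(as_decomposed == decomposed)      # True
```

Normalizing both strings to the same form before comparing or matching them is one common way to treat the two spellings as equal.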
It's a different sequence of code points, so to the computer it's like two different things. You can combine more than two as well. It's not limited to a pair; you can have three of those together, and in some languages you'll need three to be able to store the character. So now that we have an understanding of how Unicode and multibyte characters work, in the next movie, let's talk about how we can use regular expressions to match them.
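As an illustration of a character built from three code points, here is a minimal sketch using a Vietnamese letter, e with a circumflex and a dot below. The choice of letter is mine, not the transcript's, and the variable names are only illustrative.

```python
import unicodedata

precomposed = "\u1ec7"  # U+1EC7: e with circumflex and dot below

# NFD splits the letter into its base character plus combining marks.
parts = unicodedata.normalize("NFD", precomposed)
labels = [f"U+{ord(ch):04X}" for ch in parts]

print(len(parts))  # 3
print(labels)      # ['U+0065', 'U+0323', 'U+0302']

# NFC puts the three pieces back together into one code point.
print(unicodedata.normalize("NFC", parts) == precomposed)  # True
```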