Bytes and bytearrays are like tuples and lists, except that instead of containing arbitrary objects, bytes and bytearrays contain bytes: 8-bit words of data. An 8-bit word of data can hold one of 256 different values, and this is sometimes a very convenient thing. In particular, it's convenient for converting strings, and this is where you will see it used often. You will see it used for other binary things as well, but it's often used for converting strings. And we have a great example of this right here.
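As a quick illustration of the tuple/list comparison above, here is a minimal sketch. The variable names are made up for this example and are not from the lesson's file.

```python
# bytes is immutable, like a tuple; bytearray is mutable, like a list.
immutable = bytes([72, 105])      # two byte values: 72 ('H') and 105 ('i')
mutable = bytearray([72, 105])

mutable.append(33)                # works: a bytearray can grow (33 is '!')
mutable[0] = 104                  # works: items can be reassigned (104 is 'h')

try:
    immutable[0] = 104            # fails: a bytes object cannot be changed
    bytes_changed = True
except TypeError:
    bytes_changed = False

# Every element must be an integer in the range 0..255.
try:
    mutable.append(256)
    out_of_range_rejected = False
except ValueError:
    out_of_range_rejected = True

print(immutable, mutable)
```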
This is a text file that I created for this purpose, and when I created it on my Mac, it had this lovely little pattern of international characters that makes a little picture. It's a little viral thing that had been floating around on Facebook, and I thought it would be great for illustrating this problem, because there are some circumstances where you cannot display it and it doesn't look right, or if you try to read it as ASCII data in Python, you will get an exception. So I loaded it up on the PC that I am using here and I saw this and I went oh, drat.
It's not pretty. It doesn't look like it, but in fact this is a great illustration of the problem, because this particular system is not handling the UTF-8 international characters properly, whereas my other system was. They are both running the same software. They are both running Eclipse. They are both running the same version of Python, and yet here we are trying to display this file and it looks like this, whereas on my Mac, it looked different. And you will see it in a moment, because we are going to convert it in a way that it will display here, and we're going to use Python to do this.
So this is what the file looks like here on this PC, and if you are using a different operating system and you actually see the pretty fancy characters, just shh, don't tell anybody, and the example will still work just fine. I'll start by making a working copy of containers.py and we'll call this containers-working.py, and I'll just close this one and we'll open the working copy, and we are going to start by opening the file. I am going to call this fin.
fin = open, and the file is called utf8.txt, open it for read ('r'), and we are going to set its encoding as utf_8, and this is the exact character string that you need to use. This is meaningful inside of Python, and it tells Python that when it's reading this file, it needs to read it as UTF-8 and ignore whatever the default encoding is on your system, which may well be something other than UTF-8.
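The open call described above can be sketched like this. The lesson's actual utf8.txt is not reproduced here, so this example writes a small stand-in file into a temporary directory first; the sample text and the use of with blocks are assumptions made for the sketch.

```python
import os
import tempfile

# Hypothetical stand-in content; the real utf8.txt from the lesson is not shown.
sample = "caf\u00e9 \u2603\n"

path = os.path.join(tempfile.mkdtemp(), "utf8.txt")
with open(path, "w", encoding="utf_8") as f:
    f.write(sample)

# Name the encoding explicitly instead of relying on the system default.
with open(path, "r", encoding="utf_8") as fin:
    text = fin.read()

# Decoding the same bytes as ASCII raises an exception.
try:
    with open(path, "r", encoding="ascii") as fin:
        fin.read()
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True
```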
UTF-8 is a really, really useful encoding. When the Unicode people came up with Unicode, it was this double-wide character set that doesn't work right in normal ASCII systems, where text is 8 bits wide, and they tried to get the whole world to adopt it, and the whole world didn't adopt it. So they came up with UTF-8, which is a version of Unicode that works in an 8-bit encoding scenario. So the first 128 characters of it work exactly like ASCII does.
So you can set your encodings to UTF-8 safely and it will work just fine with normal ASCII text, and then it has this clever system of setting high bits in order to tell the system that it needs a couple more bytes to represent a particular character. And it all happens kind of transparently behind the scenes if your system is properly implementing UTF-8. These days most web browsers do handle UTF-8 just fine. But a lot of desktop systems don't, and this one here that I am working at obviously doesn't. So we are opening this file as UTF-8, and we are telling Python that the encoding is UTF-8 and that it should ignore its default encoding.
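To make the "couple more bytes" idea concrete, here is a small sketch that encodes a few characters and looks at how many bytes each one needs. The sample characters are chosen for illustration only.

```python
# 'A', e-acute, the euro sign, and an emoji, chosen only as examples.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

lengths = []
for ch in samples:
    encoded = ch.encode("utf_8")
    lengths.append(len(encoded))
    print("U+{:04X} -> {} byte(s): {}".format(ord(ch), len(encoded), encoded.hex()))

# Plain ASCII text produces identical bytes under both encodings.
ascii_same = "plain text".encode("utf_8") == "plain text".encode("ascii")

# Every byte of a multi-byte sequence has its high bit set (value >= 0x80).
high_bits_set = all(b >= 0x80 for b in "\u20ac".encode("utf_8"))
```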
I am going to go ahead and open an output file. I am going to call this utf8.html because we are going to open it in the browser, even though we are not going to put any actual HTML in it. And we'll open that for write. We are going to set up a bytearray, we'll call it outbytes, and initialize it with the bytearray constructor. A bytearray is a mutable sequence of bytes.
So it doesn't hold any other kind of object but bytes, and we'll start iterating through the file, for line in fin, and then we are going to immediately iterate through the line, for c in line, because a string is an iterable object, and we are going to use the ord built-in: if ord(c), and that gives us the integer code point of that character,
is greater than 127. So there are 128 values in UTF-8 that are just normal ASCII, and they are 0 through 127. So if this one is higher than 127, we are going to do something special with it. And otherwise, we are just going to append it to outbytes. We are going to say outbytes.append(ord(c)), like that. And then if it is greater than 127, we are going to do this fancy thing here: outbytes +=.
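The test on ord(c) can be tried in isolation. This sketch uses a made-up sample string and, instead of the "fancy thing", just collects the characters that fall outside the ASCII range.

```python
outbytes = bytearray()
special = []                      # characters that would need special handling

for c in "na\u00efve":             # "naive" with i-diaeresis, code point 239
    if ord(c) > 127:
        special.append(c)         # outside the ASCII range 0..127
    else:
        outbytes.append(ord(c))   # append() takes a single integer, 0..255

code_of_A = ord("A")
```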
When you use the += operator on a mutable sequence type, it has the same effect as appending, but you can append more than one element at a time this way. So what I am going to do here is create a bytes object, and bytes are immutable arrays of bytes, and I am going to encode a string. The constructor of bytes will expect a string with an encoding, and the string is going to be this XML entity with the ampersand and the pound sign. If you are familiar with XML entities, they look kind of like that, where inside of here you can put a decimal value that will be interpreted as a Unicode code point.
So in there I am going to have a format, and I am going to use this format here, 04d, a zero-padded decimal at least four digits wide. I know this is all looking very complicated. I told you this line is where all the magic happens. And I am going to use format(ord(c)), and then the bytes constructor is going to have an encoding, and that encoding is UTF-8, because we use UTF-8 for everything wherever we can. So now what we have done is, if the character is outside of the normal ASCII range, we are going to encode it with this XML entity, which can be used in an HTML context, and that will allow us to display our fancy little picture.
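Here is that one line pulled out on its own, as a sketch. The format string '&#{:04d};' is inferred from the narration ("ampersand and the pound", "04decimal") rather than copied from the lesson's file, and the snowman character is just a sample.

```python
outbytes = bytearray(b"snow: ")
c = "\u2603"                      # SNOWMAN, code point 9731 (hex 2603)

# Build the entity as a string, then turn it into bytes with an encoding.
entity = bytes("&#{:04d};".format(ord(c)), encoding="utf_8")

# += extends the bytearray by several bytes at once.
outbytes += entity

# 04d pads short values with zeros up to four digits; e-acute is code point 233.
short = "&#{:04d};".format(ord("\u00e9"))
```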
Otherwise, if it's not greater than 127, if it's in the normal ASCII range, we just append it to our outbytes. So now we have an outbytes bytearray which has all of the characters for our string, and what we need to do now is turn it into a string. We'll call it outstr, and we'll use the str constructor, and we'll construct it out of outbytes, and guess what? We are going to use encoding = 'utf_8'.
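The encoding argument matters in that str call. A short sketch, with a hand-built bytearray standing in for the real outbytes:

```python
outbytes = bytearray(b"snow &#9731;")

# With an encoding, str() decodes the bytes into text.
outstr = str(outbytes, encoding="utf_8")

# Without an encoding, str() returns the printable representation instead.
as_repr = str(outbytes)

# The decode() method is an equivalent spelling of the first form.
same = outbytes.decode("utf_8")
```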
Now all we need to do is print it to our outfile, print(outstr, file = fout), and we'll also print it to the screen here so we can see it, and we'll print the word Done. Let me go ahead and save this so no catastrophe happens. So this will read our UTF-8 text from the file that we are not able to display on this system, it'll read it with the UTF-8 encoding, and it will write it out to our utf8.html file, and for the characters that are outside of the normal ASCII range, it's going to replace them with an XML entity, and that's really all that we are doing here.
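Putting the steps together, the whole script might look roughly like the following. This is a reconstruction from the narration, not the lesson's exact file: the convert function, the with blocks, the temporary directory, and the sample input are all assumptions added so the sketch runs on its own.

```python
import os
import tempfile


def convert(in_path, out_path):
    """Copy in_path to out_path, replacing non-ASCII characters with entities."""
    outbytes = bytearray()
    with open(in_path, "r", encoding="utf_8") as fin:
        for line in fin:
            for c in line:
                if ord(c) > 127:
                    # Outside ASCII: append the decimal entity as several bytes.
                    outbytes += bytes("&#{:04d};".format(ord(c)), encoding="utf_8")
                else:
                    # Plain ASCII: append the single byte value.
                    outbytes.append(ord(c))
    outstr = str(outbytes, encoding="utf_8")
    with open(out_path, "w", encoding="utf_8") as fout:
        print(outstr, file=fout)
    return outstr


# Hypothetical stand-in for the lesson's utf8.txt.
workdir = tempfile.mkdtemp()
in_path = os.path.join(workdir, "utf8.txt")
out_path = os.path.join(workdir, "utf8.html")
with open(in_path, "w", encoding="utf_8") as f:
    f.write("snowman: \u2603\n")

result = convert(in_path, out_path)
print(result)
print("Done")
```

Because everything above 127 has been replaced, the output file contains only ASCII bytes, which is why it displays the same way regardless of the system's default encoding.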
So we saved it, we are going to run it, and it looks like I have got a typo some place here. Yes, right there. I needed an S. That's all right. Save that and we'll run it, and there we have our fancy string. So this stuff here got converted to numeric entities, and these are the Unicode code point values for each of those fancy characters, and now if we refresh our file system, because Eclipse doesn't like to do that for us, and we open this up in the little browser inside of Eclipse, there is our fancy little picture. And so this is what it looked like in the text file.
This UTF-8 file has some interesting characters in it, and we weren't able to see them on this system, and by encoding them with the Unicode XML entities, we are able to see them, and there we have it. So the way that we did this is by using a bytearray. The beauty of a bytearray is that you can operate on character data as bytes, and a bytearray is mutable, so you can insert things, you can change it up, and all we did here is we basically used it as an accumulator.
As we went through the string with the bad data in it, if we found an element that we needed to operate on, we pushed all of these characters onto the bytearray, using the bytes constructor and appending them to our outbytes, which is a bytearray. Otherwise, if it was within the ASCII range, we just appended the regular character. So these characters here just got appended in the normal way, but for these characters, we ended up using these XML entities which represent the Unicode characters, and we got our little fancy guy to display just the way that we needed him to display.
So that is a very common use of bytearrays. Bytearrays are a very effective way to do things like this. You will see an example very much like this one in our example code later on in the course.