- [Instructor] Python provides a built-in way to parse structured data such as HTML, just like we saw previously with JSON. In this example, we're going to see how to create our own HTML parser based on the HTMLParser class that Python provides. So, let's go ahead in chapter five and open up HTML parsing underscore start. You can see I've already created my main function, and I've got a variable named parser, which I'm instantiating as a MyHTMLParser class.
So we're going to create that class and feed it some HTML, and then watch as it parses the HTML that we give it. Let's begin by importing the HTMLParser class that we need. We do that by writing from html.parser import HTMLParser. Then, let's go ahead and define our class. So, I'm going to write class MyHTMLParser, and we're going to subclass the existing HTMLParser class that Python provides.
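As a minimal sketch of the import and the empty subclass described so far (the exact spelling MyHTMLParser is an assumption based on how the name is spoken):

```python
from html.parser import HTMLParser


class MyHTMLParser(HTMLParser):
    """Subclass of the standard-library parser.

    The class name is assumed from the narration; the handler
    methods are added to it step by step later on.
    """
    pass


# Instantiating the subclass works even before any handler is overridden.
parser = MyHTMLParser()
```

At this point the parser accepts HTML but does nothing visible with it, because the default handlers in HTMLParser are no-ops.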
So we'll come back to this in a moment. Back in the main function, let's add the code needed to open the HTML file and parse it. So I'm going to write f equals open, and I'm going to open my sample HTML file, sample HTML dot HTML. And if f.mode is equal to r, which means that the file is open for reading, I'm going to write contents equals f.read.
So we'll read the entire file. And then I'll call parser.feed and pass in the contents. Now, I could use urllib to open up a URL and read the HTML data straight from the web, but in this example I'm not going to do that. I'm simply going to open up a sample HTML file that we have right here called sample HTML dot HTML. And we've already seen how to do this with files earlier in the course. So if we look at that file for the moment, you can see it's a pretty standard HTML file.
There's a head and there's a body section, and in the head we've got some meta tags, a title tag, and some links. Down here in the body there's some content. So, it's just a skeleton HTML file that we're going to use to test out the parser. Once this file has been read into memory, we're going to pass the contents to our parser class. Our parser class has a function named feed on it. So we're going to pass in the string that represents the HTML, and the parser's going to work on it. When you pass the HTML content to this feed function, it's going to take the HTML and work through it in order, and each time it encounters a specific kind of data inside the HTML, like comments, or tags, or text data, it's going to call functions that you override in your subclass.
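The read-then-feed flow described above could be sketched roughly like this. The TagRecorder class, the tiny sample markup, and the temporary file are all stand-ins invented for illustration, not the course's files. The sketch also uses a with block instead of the mode check from the narration, since open raises an exception when a file cannot be opened.

```python
import os
import tempfile
from html.parser import HTMLParser


class TagRecorder(HTMLParser):
    """Hypothetical helper: remembers each start tag that feed() reports."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # Called by the parser once per opening tag.
        self.tags.append(tag)


# Stand-in for the course's sample file (contents are made up).
sample = "<html><body><p>Hello</p></body></html>"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "samplehtml.html")
    with open(path, "w", encoding="utf-8") as f:
        f.write(sample)

    # Read the whole file into memory, as in the narration.
    with open(path, encoding="utf-8") as f:
        contents = f.read()

recorder = TagRecorder()
recorder.feed(contents)  # triggers the overridden handler for each tag
recorder.close()         # flushes anything still buffered
print(recorder.tags)
```

The point of the sketch is only that feed takes a string and drives the callbacks; where that string comes from (a file, a URL, a literal) does not matter to the parser.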
Let's go back up to the class, and I'll define a method called handle_comment. So I'll write def handle_comment, and it takes in some data. And again, what I'm doing here is overriding the default implementation of handle_comment that's already in the HTMLParser class. So I'm going to just print out encountered comment, and the data that represents the comment. And then I'm going to write a line of code that says pos equals self.getpos.
I'm going to get the position where this comment was encountered, and then I'll print, with a tab, at line, and that's going to be pos zero, and then position, and that's going to be pos of one. So in this case, when the parser comes across a comment, it's going to call my handle_comment method, and it's going to pass the text data. So all I'm going to do is print out the string encountered comment along with whatever the data is.
And then we're going to get the position in the file where the comment was found using this function named getpos. This function comes back with two things. It comes back with a line number and a character position in the data where, in this case, the comment was encountered. Now we've got something that we can run, so let's save. And let's go to the debugger and run this. Open up the output window and we'll run. And we can see encountered comment, and here's the comment that we came across, at line nine and position four.
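A small sketch of the comment handler and getpos described above. The markup is invented for the example, and the found list is an addition so the result can be inspected; it is not part of the narrated code.

```python
from html.parser import HTMLParser


class MyHTMLParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.found = []  # illustrative only: keeps (comment text, position)

    def handle_comment(self, data):
        print("Encountered comment:", data)
        # getpos() returns a (line number, offset) tuple; the line number
        # starts at 1 and the offset within the line starts at 0.
        pos = self.getpos()
        print("\tAt line:", pos[0], "position", pos[1])
        self.found.append((data, pos))


# Made-up markup with one comment indented four spaces on line 3.
html = "<html>\n<body>\n    <!-- a comment -->\n</body>\n</html>"

parser = MyHTMLParser()
parser.feed(html)
parser.close()
```

The text handed to handle_comment is what sits between the comment delimiters, so surrounding spaces are preserved.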
Let's see if that's correct. I'm going to open up the HTML file. And so here's line nine. And there's position one, two, three, four. Yup, so that looks like it's correct. So let's add a few more handler functions to the class. I'll write one for handle_starttag, and that takes the tag and some attributes. And I'll define another one named handle_endtag.
And then I'll write another one named handle_data. And data is just text data. Okay, so we've added some functions to handle tags and text data. So let's fill these out, starting with the handle_data function, since it's basically the same as the one for handling comments. So I'm just going to copy this, paste it in here, and change this from encountered comment to encountered some data, and everything else is pretty much the same. Now the main difference is that we only want to print actual text content and skip over whitespace lines.
So I'm going to use Python to make sure the string doesn't consist entirely of whitespace, using the isspace function. So I'll write if data.isspace, then return. Otherwise, we'll print out the actual data. So now let's handle the end tag. This is kind of the same as the other two so far. So I'll just copy this. And paste it in.
And I'll just change it to handle the tag. And that just leaves the handle_starttag function. So this function gets called when the closing angle bracket of an opening tag is reached. So let's go back to the HTML and I'll show you what I mean. When this angle bracket right here is reached, after parsing this entire title tag, that is when handle_starttag is called. And of course start tags can have attributes on them. Like you can see on this meta tag, there are attributes all the way through.
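The text and end-tag handlers filled in so far might look something like the sketch below. The log list and the sample markup are assumptions added so the behavior can be checked; the narrated version only prints.

```python
from html.parser import HTMLParser


class MyHTMLParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.log = []  # illustrative only: records what each handler saw

    def handle_endtag(self, tag):
        print("Encountered end tag:", tag)
        pos = self.getpos()
        print("\tAt line:", pos[0], "position", pos[1])
        self.log.append(("end", tag))

    def handle_data(self, data):
        # Skip runs of text that are nothing but whitespace,
        # such as the newlines and indentation between tags.
        if data.isspace():
            return
        print("Encountered some data:", data)
        pos = self.getpos()
        print("\tAt line:", pos[0], "position", pos[1])
        self.log.append(("data", data))


# The first paragraph holds only whitespace; the second holds real text.
parser = MyHTMLParser()
parser.feed("<p>\n  </p><p>Hi</p>")
parser.close()
```

Without the isspace check, the whitespace between tags would be reported as data too, which tends to clutter the output of an indented HTML file.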
And so they will get passed in this attributes argument here. So I'll fill in the code for this function, and it starts out just the same as the others. So I'll just copy these lines and paste. Right. And in this case, again it's a tag. But there are two things I want to do in addition to this. First, I want to count the number of meta tags in the file. Let's suppose that for some reason I wanted to count this. So, I'll make a global variable.
And again, there are plenty of ways to do this, but I'm just going to demonstrate this way. So I'll have a variable named meta count, and I'll set that to zero. And then in my code to handle the start tag, I'll check to see if this is a meta tag, and increase the count if it is. So I'll simply say, if tag is equal to meta, then meta count plus equals one. And remember it's a global variable, so I have to say global meta count. The other thing I want to do is print out any attributes that the tag has on it.
And remember, in HTML only start tags can have attributes. So the attributes are passed in the attrs argument. I just need to see if there are any attributes by checking the length of that collection, and then print them out if there are any. So I will write if attrs dot, and there's a length function, so I'll write __len__, and call it, is greater than zero. Then I'll print, and again I'll indent with a tab.
Attributes. And then I'm going to loop over all the attributes. So I'll write for a in attrs. And then print. Once again I'm going to indent with a tab. The attribute name, followed by an equals sign, followed by the attribute value. Alright, so let's save all this, and then let's run the app one more time. So let's go to the debugger. And I'll show the output window. Let's clear that. And I'll run this.
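Putting the start-tag steps together, a sketch might look like this. The variable spelling metacount and the sample markup are assumptions. The sketch tests `if attrs:` rather than calling `__len__` directly; both check for a non-empty list, and the truthiness test is the more common spelling.

```python
from html.parser import HTMLParser

# Module-level counter, as in the narration (spelling assumed).
metacount = 0


class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        # Needed because the method assigns to the module-level name.
        global metacount
        if tag == "meta":
            metacount += 1

        print("Encountered tag:", tag)
        pos = self.getpos()
        print("\tAt line:", pos[0], "position", pos[1])

        # attrs is a list of (name, value) tuples; value is None for
        # attributes written without a value.
        if attrs:
            print("\tAttributes:")
            for name, value in attrs:
                print("\t", name, "=", value)


# Made-up markup with two meta tags.
html = (
    '<html lang="en"><head>'
    '<meta charset="utf-8">'
    '<meta name="viewport" content="width=device-width">'
    "</head></html>"
)

parser = MyHTMLParser()
parser.feed(html)
parser.close()
```

Unpacking each tuple as name, value in the loop avoids indexing with a[0] and a[1], but either form reads the same data.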
Looks like I have a syntax error on the data.isspace line. Hang on one second. Let me just fix that really quick. First of all, I don't need that. And I do need the colon. Here we go. So once again let's run. Alright, so let's take a look at the output. You can see here in the output that the HTML tag is first, and it has a lang attribute on it. So lang is equal to en, right. Then comes the head. And then comes a meta tag.
It's the first meta tag. And as we progress through the output, we can see more tags with attributes. And let's see. There's the comment that we saw earlier. And if we go all the way to the end we can see, I never printed out how many meta tags there were. So let me just print that out at the bottom. So here in the main function, what I'm going to do is print out the total number of meta tags that were found.
So I'm going to print meta tags found, plus the str of meta count. Alright, so let's go ahead and run. Let's clear this output. Run it again. And you can see that all the way at the end it says that there were four total meta tags counted. So let's go back to the HTML and see if that's correct. So there's one, two, three and four. Yup, looks like that worked properly.
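The narration mentions there are several ways to keep the count. One alternative to a global variable, sketched here under the assumption that the count only matters per parser, is to store it on the instance. The attribute name and sample markup are hypothetical.

```python
from html.parser import HTMLParser


class MetaCountingParser(HTMLParser):
    """Hypothetical variant that keeps the meta count on the instance."""

    def __init__(self):
        super().__init__()
        self.metacount = 0

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            self.metacount += 1


def main():
    # Made-up markup containing four meta tags.
    html = (
        "<head>"
        '<meta charset="utf-8">'
        '<meta name="description" content="A sample page">'
        '<meta name="author" content="Someone">'
        '<meta name="viewport" content="width=device-width">'
        "</head>"
    )
    parser = MetaCountingParser()
    parser.feed(html)
    parser.close()
    # Same style of summary line as in the narration.
    print("Meta tags found: " + str(parser.metacount))
    return parser.metacount


count = main()
```

Keeping the counter on the instance means two parsers can run without sharing state, at the cost of a slightly longer class.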
Alright, so that's a basic example of using Python to process HTML by building your own HTML parser subclass based on the HTMLParser class provided by the Python standard library. And once again, this is all documented on the python.org website. So if you want to learn more about how the HTMLParser class works, and other methods you can override in it, go check out those docs.