Learn how to parse HTML when scraping the web.
- Let's look at working with parsed data in Beautiful Soup. I've broken today's demonstration into three sections: parsing data, getting data from a parse tree, and searching and retrieving data from a parse tree. Parsing data is where you pass an HTML or XML document to a Beautiful Soup constructor. The constructor converts the document to Unicode and then parses it with an HTML parser (Python's built-in html.parser, by default). Looking closer at searching and retrieving data, I'm going to show you the find_all method, which searches a tag's descendants to retrieve the tags or strings that match your filters.
There are several ways to search and filter a parse tree. The ones I'm going to show you today are the name argument, keyword arguments, string arguments, lists, Boolean values, and regular expressions. You can pass any of these into the find_all method to use as filters and return either strings or tags. I'll show you in our demo. Data parsing is super simple with Beautiful Soup: you just pass an HTML or XML document to the Beautiful Soup constructor.
The constructor converts the document to Unicode and then parses it with a built-in HTML parser. In this demonstration we're going to use pandas, so we'll import that, and we also need to import our Beautiful Soup library: from bs4 import BeautifulSoup. I also want to show you how to use regular expression objects, so we need to import the regular expression operations. A regular expression defines a set of strings that match it.
To import the regular expression library you just say import re, and we'll run these so we have our libraries. Now I'm going to create an object called r and it's going to be our HTML document. It's going to have the same HTML code that we've been using throughout the web scraping discussion, so I'll just copy and paste that in, and run this. Just looking at this code here you can see that it's totally unstructured at this point, but we'll fix that by converting it to a Beautiful Soup object.
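The setup so far can be sketched as follows. The actual HTML document from the demo isn't reproduced here, so the string below is a hypothetical stand-in with the same kinds of elements (li, ul, b, a tags and id attributes) referred to later; the video also imports pandas, which this sketch omits to stay minimal.

```python
from bs4 import BeautifulSoup  # parser and parse-tree navigator
import re                      # regular expression operations

# Hypothetical stand-in for the demo's HTML document
r = """<html><body><b>Scraping data</b>
<ul id="nav">
<li><a href="https://example.com/1" id="link 1">Parsing data</a></li>
<li><a href="https://example.com/2" id="link 2">Getting data</a></li>
<li><a href="https://example.com/3" id="link 3">Searching data</a></li>
</ul></body></html>"""
```

At this point r is just an unstructured string, which is what the next step fixes.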
We'll call that soup, call our BeautifulSoup constructor, pass in our object r, and this time we're going to say we want it to use the 'lxml' parser, and then let's just look at the data type of the soup object. We print it out and we see we have created a Beautiful Soup object called soup. On to parsing our data. Let's look at the parse tree again.
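A minimal sketch of that step, using a toy stand-in document. The video passes 'lxml' as the parser; the built-in 'html.parser' behaves the same for a document this simple and needs no extra install.

```python
from bs4 import BeautifulSoup

r = "<ul><li>Parsing data</li></ul>"  # toy stand-in for the demo's HTML

# The video uses the 'lxml' parser here; 'html.parser' is the
# no-dependency built-in alternative.
soup = BeautifulSoup(r, "html.parser")
print(type(soup))  # <class 'bs4.BeautifulSoup'>
```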
We'll print out the first 100 characters to see what that looks like. To do that we say print(soup.prettify()[0:100]). Now I want to explain how you can get data from this parse tree. Imagine that you want the text part of the soup object. You can return all of the text, with the HTML tags stripped out, by using the get_text method. It returns all of the text in a document as a single Unicode string.
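A quick sketch of the prettify call, again on a stand-in document:

```python
from bs4 import BeautifulSoup

r = "<html><body><ul><li>Parsing data</li><li>Getting data</li></ul></body></html>"
soup = BeautifulSoup(r, "html.parser")

# prettify() renders the parse tree as one indented string;
# slicing keeps the printed output short.
print(soup.prettify()[0:100])
```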
So let's try that now, we'll create a new object called text_only and write the name of our soup object and then just call the get_text method off of it and print it out. And as you can see here, we've returned only the text from our HTML document. As you might imagine, this is a great first step in web scraping. Now I want to show you how to search and retrieve data from a parse tree. Let's start by retrieving tags by filtering with name arguments.
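The get_text step looks like this on a small stand-in document:

```python
from bs4 import BeautifulSoup

r = "<html><body><b>Scraping</b><ul><li>Parsing data</li></ul></body></html>"
soup = BeautifulSoup(r, "html.parser")

# get_text() strips every tag and returns the document's text
# as a single Unicode string
text_only = soup.get_text()
print(text_only)
```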
With this method we're going to search for tags by filtering based on the tag name. To return all the tags that contain HTML list items, call the find_all method off of the soup object and pass in the name of the tag, li. So here's our object, soup, and then we call the find_all method and pass in the name of the tag we're interested in, which is li. As you can see, this returns all of the tags in our HTML document that are li tags. Now let's retrieve tags by filtering with keyword arguments.
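Filtering on a tag name can be sketched like this:

```python
from bs4 import BeautifulSoup

r = "<ul><li>one</li><li>two</li><li>three</li></ul>"
soup = BeautifulSoup(r, "html.parser")

# A name filter returns every tag with that name
items = soup.find_all("li")
print(items)
```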
In this method you search for tags by filtering based on a tag attribute. To return all of the tags that have an id attribute of "link 3", we first write the name of our soup object, then call the find_all method off of it and pass in (id="link 3"). The find_all method then finds all tags with an id attribute that equals "link 3", and returns only those.
So you can see here that it's returned the record with id="link 3". You can also retrieve strings by filtering with string arguments. In this method you search by filtering based on an exact string. We do that by writing the name of our soup object, calling the find_all method off of it, and passing in string="ul". This returns every string in the document whose value is exactly "ul". You can see right here that this tag contains a "ul" string, and that's why it's been returned.
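Both filters from this passage, sketched on a stand-in document (the id values and strings below are illustrative, not the demo's actual data). Note that a string filter returns the matching strings themselves, while the keyword filter returns tags:

```python
from bs4 import BeautifulSoup

r = """<ul>
<li><a id="link 1">Parsing data</a></li>
<li><a id="link 3">Searching data</a></li>
</ul>"""
soup = BeautifulSoup(r, "html.parser")

# Keyword argument: filter tags on an attribute value
print(soup.find_all(id="link 3"))

# String argument: filter on the exact text of a string in the tree
print(soup.find_all(string="Searching data"))
```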
Now let's look at how to retrieve tags by filtering with list objects. In this method you search for tags by filtering based on lists. To return all of the tags that contain a set of string values you can use a list to do that. Let's practice here, we'll write the name of our Beautiful Soup object and call the find_all method off of it, and then we're going to pass in a list that contains the strings ul and also b. This is going to return all of the tags that have a tag name ul or b, so we see our b and our ul here.
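A list filter can be sketched like this; the tags come back in document order, which is why b prints before ul here:

```python
from bs4 import BeautifulSoup

r = "<b>bold first</b><ul><li>item</li></ul>"
soup = BeautifulSoup(r, "html.parser")

# A list matches any of the given tag names, in document order
tags = soup.find_all(["ul", "b"])
print([t.name for t in tags])  # ['b', 'ul']
```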
Very good. The reason that b got read in first is because it was higher up in the HTML document and ul came after it, so they print in the same order as the HTML. To return all of the tags whose names match a regular expression, you can pass in a regular expression object to use as a filter. We call re.compile to compile a regular expression pattern into a regular expression object, which Beautiful Soup can then match against. So we'll say l = re.compile, and then we pass in the string 'l'.
This is going to be compiled into a regular expression object. Now we're going to write a for loop that says: for each tag in the soup object, print the name of each tag that contains the letter l in its name. We do that by saying for tag in soup.find_all(l): print(tag.name).
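That loop can be sketched like this; only tag names containing an 'l' (html, ul, li) match the pattern:

```python
import re
from bs4 import BeautifulSoup

r = "<html><body><ul><li>one</li></ul><b>bold</b></body></html>"
soup = BeautifulSoup(r, "html.parser")

# A compiled pattern filters tags by matching against their names
l = re.compile("l")
for tag in soup.find_all(l):
    print(tag.name)
```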
So now we have a printed list with the name of each tag that contains the letter l in its name. You can also retrieve tags by filtering with Boolean values. In this method you search for tags by filtering based on True/False values. To return all of the tags contained in a parse tree, you can pass in a Boolean value to use as a filter. The find_all method accepts Boolean values, so if you want to print out all HTML tags from within the soup object, we can use that same loop but pass the value True as the argument to find_all.
So let me copy and paste the same loop, but instead of passing l, we're going to pass True. And when we print this out, you can see all of the HTML tags that were used in the original HTML document. Now I want to show you how to retrieve web links by filtering with string objects. To return all of the web links from within a parse tree, you can pass in a string object to use as a filter.
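The Boolean version of the loop, sketched on a stand-in document:

```python
from bs4 import BeautifulSoup

r = "<html><body><ul><li>one</li></ul></body></html>"
soup = BeautifulSoup(r, "html.parser")

# True matches every tag in the parse tree
for tag in soup.find_all(True):
    print(tag.name)
```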
Let's start by isolating only the web links from within the soup object. We do that by calling the find_all method off of the soup object and passing in the tag a. We'll write a for loop that goes through every a tag the search finds, gets its href value, and prints it out. To do that we say: for link in soup.find_all('a'): print(link.get('href')).
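The link-scraping loop can be sketched like this (the URLs are illustrative):

```python
from bs4 import BeautifulSoup

r = """<ul>
<li><a href="https://example.com/1">one</a></li>
<li><a href="https://example.com/2">two</a></li>
</ul>"""
soup = BeautifulSoup(r, "html.parser")

# Pull the href attribute out of every <a> tag
for link in soup.find_all("a"):
    print(link.get("href"))
```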
As you can see, all of the web links from the soup object have been retrieved. That is a simple mechanism you can use to scrape web links from a web page. The last thing I want to show you in this section is how to retrieve strings by filtering with regular expressions. To return all of the strings that contain a regular expression, you can pass in a regular expression object to use as a filter. We're again going to use our soup object and call the find_all method off of it. Now let's say we want to return strings from all of the tags that contain the word data.
To do that we pass in an argument that says string=re.compile, and inside the compile function we'll pass a string that says 'data'. The find_all method then returns a list of strings from the original web page, all of which contain the word data. If you've made it this far, then you've basically covered all of the mechanics of scraping web data with Beautiful Soup. Next I'm going to show you how to use this in action.
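That last filter can be sketched like this; note it returns the matching strings themselves, not tags:

```python
import re
from bs4 import BeautifulSoup

r = "<ul><li>Parsing data</li><li>Getting data</li><li>Summary</li></ul>"
soup = BeautifulSoup(r, "html.parser")

# string= with a compiled pattern returns every string whose
# text matches the pattern
matches = soup.find_all(string=re.compile("data"))
print(matches)
```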