Learn how to write to an output file when scraping the web.
- [Instructor] Let me show you web scraping in action. In the following demonstration, I'm going to show you how to scrape a webpage and then save your results in an external file. Let's get started. For this demonstration, you're going to need to import your Beautiful Soup library, so we'll do that by saying from bs4 import BeautifulSoup. Then we're going to need the urllib library in order to read in our data from the internet, so we'll say import urllib, and we also need to import the regular expression library, so we'll say import re.
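Here is a minimal sketch of those imports, assuming Python 3, where the URL-opening function lives in urllib.request rather than at the top level of urllib as in the video's Python 2:

```python
# Libraries for this demo: Beautiful Soup for parsing,
# urllib for fetching the page, re for regular expressions.
from bs4 import BeautifulSoup
import urllib.request  # Python 3; the video's Python 2 uses plain urllib
import re
```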
Run that, and then we have our libraries. Okay, we're going to scrape a page from analytics.usa.gov. Let's call our variable r, and then we're going to call the urlopen function, so we say urllib.urlopen, and then we'll pass in the URL of the page we want to scrape, analytics.usa.gov, and then call .read().
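In code, that step looks roughly like this; the video calls Python 2's urllib.urlopen, and the Python 3 equivalent shown here is urllib.request.urlopen:

```python
# Fetch the raw HTML of the page as bytes.
r = urllib.request.urlopen('https://analytics.usa.gov').read()
```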
Just remember, you can use any web link you want here, so you can basically scrape any page from the internet. Now, let's create a Beautiful Soup object. We'll call it soup, and we'll call our Beautiful Soup constructor, and we'll pass in r. We also want to tell Beautiful Soup to use the lxml parser, so we pass that in as an argument, and let's just check the type of our soup object. We run that, and as you can see, we've created a Beautiful Soup object.
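Continuing the sketch, constructing the soup object and checking its type looks like this (the lxml parser is a separate install, e.g. pip install lxml):

```python
# Parse the downloaded HTML with the lxml parser.
soup = BeautifulSoup(r, 'lxml')
print(type(soup))  # <class 'bs4.BeautifulSoup'>
```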
Now, let's print it out, and we'll use the prettify function so that we can add a little bit of structure and make it a bit easier to read: soup.prettify(), and then we'll just read out the first 100 characters. This is the first 100 characters from this webpage here.
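Sketched out, that looks like:

```python
# prettify() returns the whole document as one indented string;
# slicing it shows just the first 100 characters.
print(soup.prettify()[:100])
```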
Now, we want to use our library to find all the a tags and retrieve the href values from within those. To do that, we say for link in soup, and we call the find_all method, passing in the string 'a'. Then, for each of the a tags, we want to print link.get('href'). What Beautiful Soup has done here is it's gone through that page, found all of the a tags, and returned the href values for each of them.
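That first loop, as a sketch:

```python
# Find every <a> tag and print its href value.
# .get('href') returns None for tags without an href attribute.
for link in soup.find_all('a'):
    print(link.get('href'))
```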
Now, I'm going to use another for loop to pass through our soup object and find all of the a tags that have an href attribute, and of the tags that are returned, we want the loop to match each one against a regular expression that reads http and print out only those. In order to do that, we say re.compile, as you saw in the last section, and we create another string, 'http', and then for every tag that meets this criteria, we print it out. We say print link, run the code, and now we have all of the a tags that have an href attribute and an http match within it.
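And the filtered version of the loop, using re.compile to keep only hrefs containing http:

```python
# Only keep <a> tags whose href matches the pattern 'http'.
for link in soup.find_all('a', attrs={'href': re.compile('http')}):
    print(link)
```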
It isn't useful for you to have your results stuck within a Jupyter Notebook, so you need to know how to save them in an external file. To do that, we're going to create a new text file called parsed_data. We'll say file equals open, pass in a name for our file, parsed_data.txt, and then pass in 'wb' to tell Python we want to write into this text file. Then let's reuse the loop that we created up here.
I'll copy and paste it in. For each link that's found, we need to convert it to a string before printing it out, so here, I'll start on a new line: soup_link, and we're just going to convert the output of the loop to a string. We call str and pass in our link, and then print it out: print soup_link. Then, to write it into the file we're creating, we say file.write and pass in our soup_link object.
Then, we flush the file by saying file.flush, and close it with file.close. To find out where your data file is, you can just say %pwd, and the working directory prints out here, which tells you where you can find your output file. In my case, here it is; let me just open it up. We have each of the links that's been scraped off of that webpage.
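Here is the whole file-writing step as one sketch, again adapted for Python 3: text mode 'w' replaces the video's 'wb' (in Python 2 strings are bytes, so binary mode worked there), and a newline is appended to each write, an addition not in the video, so each link lands on its own line:

```python
# Write every matching link to an external text file.
file = open('parsed_data.txt', 'w')
for link in soup.find_all('a', attrs={'href': re.compile('http')}):
    soup_link = str(link)         # convert the Tag object to plain text
    print(soup_link)
    file.write(soup_link + '\n')  # '\n' added so links don't run together
file.flush()
file.close()

# In a Jupyter cell, the %pwd magic prints the working directory,
# which is where parsed_data.txt was saved.
```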
The only thing I would mention here is that you can see there are still stray tags, and a lot of times when you're doing web scraping, no matter how much data formatting you do, there are always stray characters. There are data processing requirements after you've scraped the data, so expect some data munging after you do web scraping. If you ever find yourself in a position where you can't get data from a website because it's spread across different pages or in weird formatting, remember how to use Beautiful Soup to scrape the data for you.