Join Curt Frye for an in-depth discussion in this video, Google Ngram Viewer, part of Learning Public Data Sets.
- I mentioned Google's public data service in another part of this course. In this movie, I'd like to point you to the Google Ngram Viewer. An n-gram is a sequence of items, such as characters or words, of a given length. For example, a two-item sequence is a bigram, a three-item sequence is a trigram, and so on. If you perform linguistic analysis and want to search word usage in books published from 1800 to about 2008, the Google Ngram Viewer is a great tool to keep in mind. The URL for this resource is books.google.com/ngrams.
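To make the definition concrete, here is a minimal Python sketch that splits a sequence into n-grams. The `ngrams` helper is a hypothetical name used only for illustration and is not part of any Google tool. It works on whatever sequence you hand it, so a string yields character n-grams and a list of words yields word n-grams.

```python
def ngrams(items, n):
    """Return every run of n consecutive items as a tuple."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Slide a window of width n across the sequence, one step at a time.
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


# Character bigrams of a short string.
char_bigrams = ngrams("data", 2)

# Word trigrams of a short phrase.
word_trigrams = ngrams("the quick brown fox".split(), 3)
```

A sequence shorter than `n` simply produces an empty list.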
The basic search that they use as an example is Albert Einstein, Sherlock Holmes, and Frankenstein, and you can see the frequency of each term over time. Frankenstein starts low and increases as it goes along. Sherlock Holmes increases, but stays reasonably steady. And Albert Einstein is on a steady upward slope. Let's say that I want to look at how language has changed over the years. I'll create a new analysis by looking at the word "awesome", which in the 1960s meant awe-inspiring
and later on came to take on the meaning of terrific, or great. So I will clear the value in the search box, and I'll type the word "awesome". I'll search between 1800 and 2008. The chart is smoothed with a running average, which is also called a moving average. You don't need to worry about it. I'll just click "Search lots of books", and we get the result. You can see that the word "awesome" was used, but not with any terrific frequency, going up to about 1940.
And then it increased markedly, peaking around 2003 or 2004. If I wanted to compare "awesome" to another word, such as "outstanding", I could type a comma, then "outstanding", and click the Search button. And I'll see two lines, one for each. So I see that "outstanding" is a much more common word in books than "awesome" is, even though "awesome" increased quite a bit coming into the 2000s.
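The running, or moving, average mentioned when setting up the search is what keeps the trend lines from looking jagged. As a rough sketch of the idea, and not necessarily the exact calculation the Ngram Viewer performs, the hypothetical `smooth` function below averages each year's value with up to `window` neighbors on either side. The yearly values are made up for illustration.

```python
def smooth(values, window):
    """Average each value with up to `window` neighbors on each side."""
    smoothed = []
    for i in range(len(values)):
        # Clamp the window at the ends of the series.
        lo = max(0, i - window)
        hi = min(len(values), i + window + 1)
        neighborhood = values[lo:hi]
        smoothed.append(sum(neighborhood) / len(neighborhood))
    return smoothed


# Made-up yearly frequencies with one sharp jump at the end.
yearly = [2.0, 4.0, 6.0, 8.0, 100.0]
smoothed = smooth(yearly, 1)
```

A window of 0 returns the raw values unchanged, and a larger window flattens short-lived spikes more aggressively.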
If I wanted to search for four words, I could do that. As a sample, I'll type "among", comma, "amongst", which is used in British English but not in American English. In fact, you can see that my browser flagged it as a misspelling. Then I'll type "while", comma, "whilst". It's always fun to take a look at the differences between American and British English, and that's what this is. So I'll click "Search lots of books", and I see my four trend lines. "While" and "among" vary,
but their usage is pretty consistent, whereas "amongst" and "whilst" stay almost exactly in tandem: they decline over time, increasing only a bit after the year 2000. So that's interesting. The n-gram that you search for doesn't have to be a full word. For example, if I type in the prefix narco, N-A-R-C-O, and click the Search button, I see that there is a sudden spike after about 1984, which coincides with world events.
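The searches in this movie, whether a single term or several terms separated by commas, can also be written as a link. The sketch below builds one with Python's standard library. The `/ngrams/graph` path and the `content`, `year_start`, and `year_end` parameter names are assumptions based on the links the viewer has produced in the browser's address bar; they are not a documented API and may change. The `ngram_url` helper is hypothetical.

```python
from urllib.parse import urlencode


def ngram_url(terms, year_start=1800, year_end=2008):
    """Build a viewer link for a comma-separated list of search terms."""
    # Parameter names are assumed from observed viewer links, not documented.
    query = urlencode({
        "content": ",".join(terms),
        "year_start": year_start,
        "year_end": year_end,
    })
    return "https://books.google.com/ngrams/graph?" + query


url = ngram_url(["awesome", "outstanding"])
```

Because `urlencode` escapes the comma, the terms appear in the link as `awesome%2Coutstanding`.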
If you want to download the raw data, you can do that. Just scroll down. I'm using my scroll wheel. Then you can click the "raw data is available for download here" link. Now be warned: there is a lot of data, and it's broken up into many different files. So if you're certain that you want to download the data, go ahead, but be advised that you will need a lot of disk space.
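Because the files are so large, it helps to process them one line at a time instead of loading a whole file into memory. The sketch below assumes each line is tab-separated as n-gram, year, match count, volume count. That layout is an assumption that matches one published version of the data set; check the format notes on the download page before relying on it. The sample rows are invented, and `counts_by_year` is a hypothetical helper.

```python
import io
from collections import defaultdict

# Invented sample rows in the assumed layout:
# ngram <TAB> year <TAB> match_count <TAB> volume_count
SAMPLE = (
    "awesome\t1999\t120\t80\n"
    "awesome\t2000\t150\t95\n"
    "outstanding\t2000\t900\t400\n"
)


def counts_by_year(lines, term):
    """Sum match counts per year for one term, reading line by line."""
    totals = defaultdict(int)
    for line in lines:
        ngram, year, match_count, _volume_count = line.rstrip("\n").split("\t")
        if ngram == term:
            totals[int(year)] += int(match_count)
    return dict(totals)


# io.StringIO stands in for an opened data file here.
awesome_counts = counts_by_year(io.StringIO(SAMPLE), "awesome")
```

With a real download, you would pass an open file object in place of `io.StringIO(SAMPLE)`, so that only one line is held in memory at a time.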