Many people think of data as numbers, but there is an entire universe full of another kind of data: text. Telling data stories based on the text of a book or tweets, for example, can lead to fascinating insights. Learn about how to think about text as a data source and ways to approach analyzing and visualizing the results.
(upbeat music) - So, you have some text. Say, a book, or a million tweets, or a year's worth of newspaper clippings, and you want to visualize it. What are you going to do? Word cloud! Of course! This is the go-to answer for many people when they think of analyzing text and creating a visualization of it. And no wonder. Word clouds can look kind of nice and they can communicate something about the text.
Specifically, they communicate how often different words appear. They're not perfect, but they do an okay job if the goal is to show the most commonly used words and therefore, maybe, the implied topics, themes, or sentiments in a collection of text. But what if you have more ground to cover? What if your data source is text, but you want to go beyond word-use frequency? There are a bunch of options, and later we'll be talking to my guest, Richard Brath, about some of those. But I just want to share with you a project I did that illustrates some of the different ways of thinking about working with text.
I took the novel Anna Karenina, which has one of the best opening lines in all of literature: "Happy families are all alike; every unhappy family is unhappy in its own way." So, you know this is going to be a book about happiness, or the lack thereof, and you may also know it's famously a love story. So, that's what I wanted to investigate: love and happiness in Anna Karenina.
With that as my premise, and the data source being the hundreds of pages of text in the book, I had to come up with not only a visualization plan but a data-analysis plan as well. Data analysis isn't the focus of this video, but let me explain briefly what I did. Originally, I was going to use Python and the Natural Language Toolkit (NLTK) to try to detect character happiness and love in some sort of automated way. But I'm not an advanced Python programmer and I had zero experience with NLTK. So, while this may be possible to do using machine learning, I quickly took another path.
Instead, I hired a half dozen people on a freelancing website to read a randomly assigned chapter of the book. That person's job was to make note of when the four key characters in the book entered and exited a scene, when they were thinking about or talking about their lover when he or she wasn't around, and how happy they were overall in the chapter. So, essentially, I was able to transform the raw data of hundreds of thousands of words into a spreadsheet of data entries tracking these characteristics.
In this way, my text data was transformed into numerical data. This is at the heart of most text visualizations: it's about counting frequencies of words, as you see in word clouds, or translating words into other countable variables, like levels of happiness, fervor of love, number of times actions were taken, et cetera. Text visualization takes many forms, but a few common ideas dominate, in addition to word clouds. Like arc diagrams, which are used to show the connections between different words, topics, and concepts throughout a linear text corpus.
Or traditional network diagrams, also showing connections, but without a linear progression. Or strip plots, showing where certain words, topics, emotions, word types, et cetera, might exist in a text. Other text visualizations take different tacks, like these great examples visualizing the text in the Bible in a few different ways. First, looking at more than 63,000 cross-references in the Bible, chapter by chapter. Next, looking at a network diagram of all the people and place names, allowing you to see the most common ones and how they all connect to each other.
And finally, a distribution diagram of those names, showing the average location and frequency of the words in the Bible. For instance, Israel is mentioned quite a bit and throughout the entire book, which is why it's large and in the middle. Now, coming back to Anna Karenina, I wanted to create something that would be easy to read and use simple shapes and would allow the viewer to get an overall impression, while also allowing them to spend some time digging into the details. I also wanted the reader to be able to see characters individually and collectively as couples and to compare the two couples to each other.
After some context-setting text, you see a large visualization of all of my data points. First, each vertical strip is a chapter. The key characters are color-coded, so you can easily see Anna in crimson, Alexei in blue, Kitty in pink, and Levin in green. And you can see patterns of happiness by the size of the bars. So, Anna and Alexei are very happy at times and not so happy at other times. You can also see when each character is thinking about his or her lover (those little circles) or talking about his or her lover (the dots) when they're not around.
That's a decent proxy for love, perhaps. And most importantly, I decided to tell a story, because the visualization itself can only go so far. So, I provided a little context along the timeline of the chart, since it represents time as it passes in the book. I also told four mini stories, outlining the four key components of my story. First, how happy are these people? Well, overall, Kitty and Levin are happier than Anna and Alexei.
And Anna is joyous less often than the other characters. Next I examined how happy our key characters are, with or without their lovers. Anna is happier with her lover, but Alexei? He's happier without Anna. Then I looked at my proxies for love. Anna is constantly thinking and talking about Alexei when he's not around, that feels like love. But Kitty almost never thinks about, or talks about Levin when he's not around. If you've read the book, you may remember not quite believing she loves him and this might be one reason why.
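The with-or-without-lover comparison comes down to grouping the spreadsheet rows and averaging them. Here is a minimal Python sketch of that step, not the method used in the project. The `ChapterNote` fields, the 1-to-5 happiness scale, and every value in the sample rows are invented for illustration; they are not the data from the project.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass
class ChapterNote:
    """One reader's notes on one character in one chapter (hypothetical schema)."""
    chapter: int
    character: str
    with_lover: bool  # did the character share a scene with their lover?
    happiness: int    # assumed scale: 1 (miserable) to 5 (joyous)


# Made-up sample rows, NOT the real project data.
NOTES = [
    ChapterNote(1, "Anna", True, 4),
    ChapterNote(2, "Anna", False, 2),
    ChapterNote(3, "Anna", True, 5),
    ChapterNote(1, "Levin", False, 3),
    ChapterNote(2, "Levin", True, 5),
    ChapterNote(3, "Levin", False, 4),
]


def mean_happiness(notes):
    """Average happiness for each (character, with_lover) group."""
    groups = defaultdict(list)
    for note in notes:
        groups[(note.character, note.with_lover)].append(note.happiness)
    return {key: mean(values) for key, values in groups.items()}


averages = mean_happiness(NOTES)
for (character, with_lover), value in sorted(averages.items()):
    label = "with lover" if with_lover else "without lover"
    print(f"{character:6} {label:14} {value:.1f}")
```

Each group's average could then drive the size of a bar in a chart, one bar per character and condition.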
Finally, I added a bonus feature, not related to my theme, which was just looking at who this book is really about. It's a well-known theory that Levin is really the author, Tolstoy. And the data seems to bear that out: no one is in the book as much as he is. And he's alone far more than anyone else, so we hear his inner monologue the most. This example takes text, transforms it into numerical variables, and then visualizes those variables. It converts text into emotions and actions that are quantifiable.
And the visualization presents those countable things, and that's the secret to text visualization. Think about what the text might reveal and how you might convert the text in a way that gets to the heart of those ideas, and then visualize those countable metrics in ways that connect the text and those ideas together while, as I tried to do, keeping hold of the central idea that drove the entire effort. There are many techniques for doing this: the manual way, as I did, or an automated way using text-analysis frameworks and machine learning.
Either way, you can go to very interesting places regardless of your technical capabilities. Word clouds count word frequencies, which can reveal how important different ideas are in a text, but taking it a step or two further can be much more revealing and rewarding for your audience. Next up, we'll talk to Richard Brath, who has done quite a bit of text analytics and visualization and who, I'm sure, will provide great insights for you.
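The word-frequency counting that sits underneath a word cloud can be sketched in a few lines of Python using only the standard library. This is a rough illustration, not how any particular word-cloud tool works. The `word_frequencies` helper and the tiny `STOP_WORDS` set are assumptions made up for the example; a real project would use a much longer stop-word list.

```python
import re
from collections import Counter

# A tiny, hypothetical stop-word list, just enough for this example.
STOP_WORDS = {"a", "all", "are", "every", "in", "is", "its", "own", "the"}


def word_frequencies(text, stop_words=STOP_WORDS):
    """Count how often each word appears, ignoring case and stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(word for word in words if word not in stop_words)


opening = ("Happy families are all alike; every unhappy family "
           "is unhappy in its own way.")
counts = word_frequencies(opening)
print(counts.most_common(3))
```

In a word cloud, each count would set the size of its word, so "unhappy" would be drawn larger than "happy" for this one sentence.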