From the course: Data Visualization: A Lesson and Listen Series

Listen: Richard Brath

and recently completed his PhD in data visualization, focusing on text analytics and visualization. So Richard, thank you very much for joining me here today. - Thank you, good to be here. - So I want to start off just by talking about text analytics and visualization sort of from a high level. or negative to a politician, to a company, to a news story, to a character on a TV show. There's things like topic analysis and summarization, where you try to extract the gist out of a news story or something like that. And then there's things like network analysis, where you try to connect all the different relationships and pieces together, which are all really interesting capabilities on the processing side. And then on the visualization side, it's really fantastic 'cause there's lots of great work coming out from various different people. If we wind the clock back 10, 15, 20 years, there was great work too, it just wasn't, it was a little bit few and far between. So you had people like Ben Fry doing fantastic stuff with Tendril, or Brad Paley doing a map of science. And in fact there's a lot of nice examples on Katy Borner's scimaps.org. But if we wind it forwards now, there's so much more possible. And there's great things happening in AI research, where we have people like Hendrik Strobelt doing automatic visualization of translation, and we've got great things like Giorgia Lupi and Stefanie Posavec doing all their wonderful things with Dear Data. I don't know if props are okay or not, but I think it's actually pretty cool what they're doing with their creative work. - Yeah, those are great examples, and yeah, Dear Data is an amazing project. I've spoken about them before in some of my other courses. So let's focus for a moment on the visualization side, 'cause yeah, there's two sides to this, there's text analytics and there's text visualization.
So if we take the visual approach first, I saw your Strata conference talk where you showed a couple of examples. You had your top 50 music singles and sort of demonstrated how you can visualize it either this way or that way, and you had bar charts made out of text, and you even had your approach to showing line charts where, instead of the line being just a line, the line was the text itself. Which, the designer in me who's always encouraging people to simplify and reduce visual distraction was sort of like, ooh, oh no, is that a good idea? But the reader in me loved it, and I'm being convinced that that's actually a really great technique. So I just wanted to ask you a little bit about what is so important, let's say, about using the text itself as the visual element, the example being the line chart where countries were being measured on some measure and literally it was the word Poland, Poland, Poland all along which made up the line shape itself. So why is that so important? Why do you think that that works? - Right, so I think there's a couple of different things that are going on here, and one of them is, we already talked about, on the analytic side We have HR people. We have some, obviously, technical people and people who do data visualization. But where can these people start to think about sort of starter level text analytics? Forget about the visualization for a moment. How do they really sort of start getting into this? - There's kind of a text analytics 101 and a visual analytics 101. And in the text analytics 101, what are the basic things that we're doing when we think about text? There's a lot of meaning in these sentences and paragraphs and this whole conversation that we're having, but we start at the basic unit of the words.
We extract the words themselves, then we're able to start looking at things about those words once we've extracted them, and we might be able to do things like detect sentiment: angry words, happy words, emotion. We might be able to detect things like topics in there based on commonality of words. And so, we're really using these words as the starting level to start building up some kind of meaning around them. And then, what you can start to do is connect these different words together. So, I actually show an example of Grimms' Fairy Tales, where you say instead of just creating a word cloud out of the text, you can say, "I'm just going to do a really simple text processing where the first pass I extract the nouns, the proper nouns, the characters, people in the fairy tales or things like king, queen, witch, fox." And then, the second pass is to find the adjectives that are close by to those words, so just within two or three words from them. And just by looking at those associations together, you get things like old king, beautiful queen, and wicked witch. - Yeah, so, in order to do that, people are using tools, right? - Yes. - And so, one other question I had for you, especially I think right after that, people are going to say, "Yeah, but how do I do that?" So, is it Python and Natural Language Toolkit? Is it using online services like IBM Watson? Like, if you were starting today, where would you go first? What would you learn first to try to get towards doing this work? - Well, yeah. So, you're talking to somebody who just finished a degree in computer science, so I do have a little bit of a bias here, and I do go to the tools like Python and NLTK. Python has a lot of tutorials. NLTK has been out there in the field for a long time and has been tested. So, if you've got any inkling to do some programming, then I would recommend an approach like Python and NLTK.
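The two-pass idea described for the fairy tales (find the nouns, then find the adjectives within a couple of words of them) can be sketched in a few lines of Python. This is only an illustration, not the code used for the example in the talk. To stay self-contained it assumes two tiny hand-written word lists, `NOUNS` and `ADJECTIVES`, in place of a real part-of-speech tagger such as the one in NLTK, and the helper name `adjective_noun_pairs` and the sample sentences are made up.

```python
import re
from collections import Counter

# Assumption: tiny hand-made word lists stand in for a real part-of-speech tagger.
NOUNS = {"king", "queen", "witch", "fox"}
ADJECTIVES = {"old", "beautiful", "wicked", "sly"}

def adjective_noun_pairs(text, window=2):
    """Count adjectives that occur within `window` words of a known noun."""
    words = re.findall(r"[a-z']+", text.lower())
    pairs = Counter()
    for i, word in enumerate(words):
        if word not in NOUNS:
            continue
        # Words up to `window` positions before and after the noun.
        start = max(0, i - window)
        nearby = words[start:i] + words[i + 1:i + 1 + window]
        for other in nearby:
            if other in ADJECTIVES:
                pairs[(other, word)] += 1
    return pairs

# Made-up sample text, not taken from the actual fairy tales.
sample = (
    "The old king ruled the land. The wicked witch lived in the forest. "
    "Everyone loved the beautiful queen, but the old king feared the wicked witch."
)
pairs = adjective_noun_pairs(sample)
for (adjective, noun), count in pairs.most_common():
    print(adjective, noun, count)
```

With a real tagger the word lists would come from the text itself, and a wider window would pick up more associations at the cost of more accidental pairings.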
But I've also figured out a tool set that works for me, and so those may have advanced more than from where I was starting in. So, for example, instead of just using straight Python, there may be tools like Jupyter Notebook which make that easier to do. And then, furthermore, there are sites where you can do some analysis and processing of text. Almost every tag cloud site allows you to just drop in a giant chunk of text and out will come the top frequencies of the words. And so, there are some different sites out there that allow you to input a block of text and get some light analysis out of that. Connecting all of those pieces together end-to-end is more effort. So then, you've got some text and maybe some properties about the text. You then need to do some visualization and take that text with the properties and stitch it together. - Yeah, it seems like there are new APIs launching every day, too. So, I'm actually in the process right now of working on a project taking essentially Woodstock, it's the 50th anniversary of Woodstock this year, and taking the lyrics from the music and trying to make some sense out of it. And so, I'm using NLTK to get the TFIDF, the term frequency-inverse document frequency, and then I'm running it through IBM Watson to get sentiment analysis. And, you know, it still needs something else. I still need more techniques. And so, that leads directly to my next question, which is going beyond some of these things. What are sort of some other really interesting, nuanced ideas that you've seen other people do or that you've done yourself that sort of even get to that next level? - Yeah, so I think you're right where you can start chopping things apart and start computing things like sentiment. And really, a lot of what's happening, I think, in the pieces that I'm doing is just kind of going to a slightly finer detail. So, instead of just the words and the frequencies, it might be the words and the frequencies in relationship to characters.
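For readers who want to try the basic "words and frequencies" step themselves, here is a minimal sketch of what a tag cloud site does with a pasted block of text, using only the Python standard library. The stop-word list is a deliberately tiny assumption, the function name `top_words` is hypothetical, and the sample text is invented rather than taken from any real lyrics.

```python
import re
from collections import Counter

# Assumption: a very small stop-word list; real tools ship much longer ones.
STOP_WORDS = {"the", "a", "and", "of", "to", "in", "is", "it"}

def top_words(text, n=5):
    """Return the n most frequent words, ignoring case and stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return counts.most_common(n)

# Invented sample text.
sample = "Peace and music in the rain, peace in the field, music all night"
print(top_words(sample, 2))
```

The resulting counts are the kind of "properties about the text" that a later visualization step would be stitched onto.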
There's an example that we have. Basically, we're just taking a map. We're looking at tweets on the map. And in each square on the map, we're just doing a TFIDF of that square versus the other squares and seeing what are people talking about in that particular square. And so, then that gives you a sense of like, here are the kinds of things that are happening here. And in fact, over top of a city like New York, the tweets, the things that stand out are the actual content about those areas. So, it'll be things like Central Park, or MoMA, or Brooklyn Bridge, or things like that. - Yeah. Yeah, so for those of you, I've already mentioned it a couple of times, Richard just mentioned it again. TFIDF, term frequency-inverse document frequency, means a word is being used a lot in this context, in this particular piece of text, and it's not used very much in the whole corpus of text that you're comparing against. So, you might see MoMA said a lot here, and it's not said a lot over here, so therefore it's really important for that area. 'Cause like the word "the" might be said a lot here, but it's also said a lot everywhere else, so therefore it's not that important. That's what that is in reference to.
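That description of TFIDF can be made concrete with a short sketch. It assumes one common variant of the formula (term frequency as count divided by document length, inverse document frequency as log(N / number of documents containing the word), no smoothing); libraries often use slightly different weightings. Each string stands in for all the tweets in one map square, and the tweets themselves are invented for illustration.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {word: score} dict per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word appear?
    df = Counter(word for tokens in tokenized for word in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            # Frequent here (tf) times rare elsewhere (idf).
            word: (count / len(tokens)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return scores

# Invented text, one string per map square.
squares = [
    "the moma show at the moma today",
    "walking the brooklyn bridge in the rain",
    "the park is the place for lunch",
]
scores = tf_idf(squares)
top_word = max(scores[0], key=scores[0].get)
print(top_word, round(scores[0][top_word], 3))
print("the", scores[0]["the"])
```

In the first square "moma" scores highest because it is frequent there and absent elsewhere, while "the" scores zero because it appears in every square.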
- Correct, correct. And I think a lot of times when you're working with machine learning and advanced analytics, having algorithms that are very simple and easy to explain and simple and easy to understand, like TFIDF, is actually an excellent example of something that people can intuitively grasp and understand, and you get a lot of insight and value by using something like that. - Yeah, absolutely. Can you share with us maybe your favorite, most interesting, most wow examples of text analytics and visualizations that you've seen out in the universe in the past, let's say, year or so? Any top of mind examples? - I kind of ran through a whole bunch right at the beginning, and I'm always hard-pressed to pick one. I think right now my favorites are actually the kind of things that Giorgia and Stefanie are going through, but those are more on the creative side.
I think where, if you look at where machine learning is going and where the leading, bleeding edge is, I think some of the research that's going on right now in explainable artificial intelligence is actually really very interesting. And there, you get into all kinds of challenges that people are dealing with. And so, some of those are things like machine translation. What is actually happening inside the black box? That makes me pretty excited and wow. The visualizations themselves are using techniques from a visualization perspective that have been around for a while. There's things like word trees that were done by Wattenberg 10, 15 years ago. So, there's a lot of leveraging of all the prior knowledge that we've built up to create these new kinds of things. And this prior knowledge could be coming from lots of different areas.
So, if you look at it, there's computer scientists that are generating techniques, but there's also people like designers who are creating techniques just by using papers and pens and doing analysis. So, you can look up some of Stefanie's work where she just highlights books and then marks things up to create her visualizations, right? So, you just need different ways to get going. You don't necessarily need to use the absolutely latest computer-based techniques if you're just trying to experiment and understand what might be some of the possibilities. - Yeah, that makes sense. And it's true. I mean, it's like all of data visualization. You have the spectrum, from artistry and design to really deep analytics, statistics, technical solutions, and there's this whole world of ideas in between those two ends of the spectrum. And text analytics is so interesting because people struggle with the idea of how do I visualize this amorphous thing that is words?
And the recommendation I gave in my lesson, which is right before this in LinkedIn Learning here, is essentially you have to think of text as something that can be measured. And you can measure it in many different ways, and you turn this text into quantifiable stuff. And then, that's what you are then displaying in your visualizations. In some cases, that isn't exactly true 'cause sentiment isn't necessarily quantifiable, but in many cases, you're measuring things and visualizing those measurements. And once you sort of think about it that way, you can sort of turn that abstract into something much more tangible. I think that may help people think about how to do some of this work. - Right, right. And it's about not just the words, but the words in their relationship to other words, whole sentences, phrases, paragraphs. It's about the letters within words.
There's so much context, different ways that you can look at the relationships that are going on within a text to gain insight and understanding about it. - Yeah, meaning, emotion, connections. Yeah, there's endless nuance, (laughs) I think, in text analytics and visualization. Well, Richard, I think we're out of time. I really wanted to thank you for sharing your insights. This is such a huge topic, but I think that we've touched on some important ideas here that people can now go out and do further research and find other great examples. So, I thank you very, very much for giving us your time here today. - Thank you.
