Learn what natural language processing (NLP) is by looking at its subfields and its relevance to data science. Jungwoo also explains the software tools available to help with your NLP tasks as a data scientist.
- [Voiceover] Natural language processing, or NLP, refers to a collection of techniques that let a computer make sense of its interactions with a human being in a natural language. NLP is a broad discipline in computer science and involves topics such as artificial intelligence, computational linguistics, and human-computer interaction, or HCI.
Some NLP subfields are particularly relevant to a data scientist, among them tokenization, parsing, sentence segmentation, and named entity recognition. Tokenization isolates each text symbol, or token, from a text, and parsing conducts a grammatical analysis of those tokens. Sentence segmentation separates one sentence from the next in a text. Named entity recognition identifies which tokens refer to proper names and what type of name each one is, such as a person, an organization, or a place.
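To make two of these subfields concrete, here is a minimal sketch of sentence segmentation and tokenization using only Python's standard `re` module. The function names `split_sentences` and `tokenize` are made up for this illustration, and the rules are deliberately naive: the segmenter would wrongly split after abbreviations such as "Dr.", which is one reason dedicated NLP libraries exist.

```python
import re

def split_sentences(text):
    # Naive rule (an assumption for this sketch): a sentence ends at
    # '.', '!' or '?' followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def tokenize(sentence):
    # A token is either a word (optionally with internal apostrophes,
    # as in "don't") or a single punctuation character.
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", sentence)

text = "NLP helps computers read text. Is it useful? Yes!"
for sentence in split_sentences(text):
    print(tokenize(sentence))
```

Each printed list holds the tokens of one sentence, with punctuation kept as separate tokens so that a later parsing step could still see it.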
A significant portion of the data you deal with as a data scientist is unstructured. That is, it is text extracted not from a database but from sources such as social media sites, text documents, pictures, and so on. Therefore, one of the biggest challenges for a data scientist is to sort through this unstructured data and pre-process it so that data mining and analytics tools can take over and extract the knowledge the data scientist is seeking.
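As a rough illustration of what such pre-processing can look like, the sketch below cleans up a made-up social media post before it is handed to an analytics tool. The function name `clean_post`, the sample post, and the specific cleaning rules are assumptions chosen for this example, not a standard recipe; real pipelines pick their rules to suit the data and the task.

```python
import re

def clean_post(raw):
    # Remove URLs, which rarely carry meaning for text analysis.
    text = re.sub(r"https?://\S+", " ", raw)
    # Remove @mentions of other users.
    text = re.sub(r"@\w+", " ", text)
    # Lowercase, then replace anything that is not a letter, digit,
    # whitespace, or apostrophe with a space (this keeps hashtag words
    # but drops the '#' itself).
    text = re.sub(r"[^a-z0-9\s']", " ", text.lower())
    # Collapse runs of whitespace into single spaces.
    return " ".join(text.split())

print(clean_post("Loving the new #NLP course!! https://example.com @friend"))
```

The cleaned string can then be tokenized or counted without URLs, user handles, and stray punctuation distorting the results.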
Luckily for data scientists, well-developed NLP tools are already available for programming languages such as Python. Some of these tools are also built into operating systems such as Unix or Linux.
Jungwoo Ryoo is a professor of information science and technology at Penn State. Here he reviews the history of data science and analytics, explores which markets are using big data the most, and reveals the five main skill areas: data mining, machine learning, natural language processing (NLP), statistics, and visualization. This leads to a discussion of the five biggest career opportunities, the four leading industry-recognized certifications available, and the most exciting emerging technologies. Along the way, Jungwoo discusses the importance of ethics and professional development, and provides pointers to online resources for learning more.
- A history of data science
- Why analytics is important
- How data science is used in social media, climate research, and more
- Data science skills
- Data science certifications
- The future of big data