Join Doug Rose for an in-depth discussion in this video, "Use statistics and software," part of Learning Data Science: Understanding the Basics.
- Because data science is still being defined by practice, there's an extra emphasis on using common software and tools. Data scientists are like early archaeologists. So think of software as the brushes and pickaxes you'll need to make discoveries. Try not to get too focused on learning all the tools. The tools in themselves will not make you a data scientist. It's the scientific method, and not the tools, that makes someone a data scientist. The tools basically fall into three categories: storing, scrubbing, and analyzing.
To store the data you can use spreadsheets, databases, and key-value stores. Some popular ones are Hadoop, Cassandra, and PostgreSQL. Scrubbing is a common practice to make the data easier to work with. Here you use text editors, scripting tools, and programming languages like Python and Scala. Finally, there are the statistical packages to help analyze the data. The most popular are the open-source package R, SPSS, and Python's data libraries. When you use these tools, you can also visualize the data and create nice charts and graphs.
Let's first look at the tools you'll need to know to hold the data. One term that you'll hear a lot is big data. These are data sets so large that they won't fit into most database management systems. The connection between data science and big data is so close that many people think that they're one and the same. But remember that data science is applying the scientific method to your data. This doesn't assume that your data has to be big. In fact, there's a great book called Data Smart, which introduces data science statistics using only spreadsheets.
Nevertheless, one of the most active areas in data science is around big data. The open-source software Hadoop is currently the most popular tool for it. Hadoop uses a distributed file system to store the data on a number of standard servers. This group of servers is typically called a Hadoop cluster. The cluster also splits up the tasks across its servers so that you can run applications on the data. That means that you can have petabytes of data on hundreds or even thousands of servers. Then you can run processes on the data in the cluster.
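To make the idea of splitting a task across servers a little more concrete, here is a minimal sketch in plain Python. It does not use Hadoop at all. The `shards` list is made-up sample data, where each inner list stands in for the records stored on one server, and the function names are hypothetical. Each shard is processed on its own, and then the partial results are combined.

```python
from collections import Counter
from functools import reduce

# Hypothetical sample data: each inner list stands in for the records on one server.
shards = [
    ["great service", "slow delivery"],
    ["great price", "great service"],
    ["slow website"],
]

def map_shard(records):
    """Map step: count the words in one shard, independently of the others."""
    counts = Counter()
    for record in records:
        counts.update(record.split())
    return counts

def merge_counts(left, right):
    """Reduce step: combine two partial results into one."""
    return left + right

# On a real cluster each shard would be processed on a different server.
# Here the shards are simply processed one after another.
partial_counts = [map_shard(shard) for shard in shards]
word_counts = reduce(merge_counts, partial_counts, Counter())

print(word_counts)
```

The point of the sketch is that no step needs all the data in one place. That is what lets this kind of job spread over many servers.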
The two most common processes that you'll see are MapReduce and Apache Spark. MapReduce works with the data in batches. And Spark can process the data in near real time. Once you have the data collected, you might want to scrub your data. Often the data you collect is not very usable. Imagine that you're collecting millions of your customers' Tweets. If you have a Twitter account, then you probably know that sometimes you get text and other times you get pictures. When you're collecting this data, you might want to create a script that divides all the incoming Tweets into text and pictures.
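A script like that could be sketched as follows. This is only an illustration: the records in `incoming_tweets` are made up, and the `text` and `media` field names are assumptions, not Twitter's actual data format.

```python
# Hypothetical records: real tweet data has many more fields and a different layout.
incoming_tweets = [
    {"id": 1, "text": "Love the new app!", "media": []},
    {"id": 2, "text": "", "media": ["photo_2.jpg"]},
    {"id": 3, "text": "Checkout was slow today", "media": []},
    {"id": 4, "text": "Look at this", "media": ["photo_4.png"]},
]

def split_tweets(tweets):
    """Divide tweets into text-only tweets and tweets that carry pictures."""
    text_tweets, picture_tweets = [], []
    for tweet in tweets:
        # Any tweet with at least one media item goes into the picture group.
        if tweet.get("media"):
            picture_tweets.append(tweet)
        else:
            text_tweets.append(tweet)
    return text_tweets, picture_tweets

text_tweets, picture_tweets = split_tweets(incoming_tweets)
print(len(text_tweets), "text tweets,", len(picture_tweets), "picture tweets")
```

A tweet that has both text and a picture lands in the picture group here. That is a design choice for the sketch, and you could just as easily create a third group.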
Then you might want to put it back in your cluster so that you can analyze each of these groups differently. If you do this often enough, then you might want to create a small Python application that does the job over and over again. Data scientists usually spend most of their time scrubbing the data. Some of them say that they spend up to 90% of their time scrubbing their data to make it more usable. Now we can use R or Python to analyze the data. R is a statistical programming language. This allows you to make connections and find correlations in the data.
Then you can present them using R's built-in data visualization. That way you'll have a nice report with a nice diagram. Let's say that you want to create an interesting report. Your company wants to see if there's a connection between its positive feedback and whether it's day or night. You might capture Twitter data in your Hadoop cluster, then use data scrubbing to categorize the Tweets as positive or negative. Finally, you could use a statistical package like R to compute a correlation and print out a report with a nice diagram.
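Here is a rough sketch of that last step, written in Python rather than R. Everything in it is an assumption made for illustration: the `tweets` list is invented sample data that has already been scrubbed into an hour of day and a sentiment label, and "day" is arbitrarily taken to mean 6:00 to 17:59. With two yes-or-no variables like these, the ordinary Pearson correlation is one reasonable measure to start with.

```python
from math import sqrt

# Hypothetical scrubbed data: (hour of day the tweet was posted, sentiment label).
tweets = [
    (9, "positive"), (11, "positive"), (14, "negative"), (16, "positive"),
    (20, "negative"), (22, "negative"), (23, "positive"), (2, "negative"),
]

def is_daytime(hour):
    """Assumption for this sketch: day runs from 6:00 up to, but not including, 18:00."""
    return 6 <= hour < 18

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists of numbers."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)

# Encode both variables as 1 or 0 so that they can be correlated.
day_flags = [1 if is_daytime(hour) else 0 for hour, _ in tweets]
positive_flags = [1 if label == "positive" else 0 for _, label in tweets]

r = pearson(day_flags, positive_flags)
print(f"Correlation between daytime and positive feedback: {r:.2f}")
```

A value near 1 would mean positive Tweets cluster in the daytime, a value near -1 would mean they cluster at night, and a value near 0 would suggest no clear connection.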
Keep in mind that these are some of the most popular tools. If you're on a data science team, then you'll hear of at least one of these tools. There are many more tools that automate collecting, scrubbing, and analyzing data. There are several organizations spending large sums of money trying to serve this growing market. Try to remember to focus on the analysis. The tools and the data are just the vehicle to gain greater insight. So spend cautiously when buying new software tools.
- What is data science?
- Making connections with relational databases
- Importing data into warehouses
- Recognizing different data types
- Applying statistical analysis
- Focusing on knowledge