Join Barton Poulson for an in-depth discussion in this video, Spreadsheets, part of Data Science Foundations: Fundamentals.
[Voiceover]- The first thing I wanna talk about in terms of Programming and Data Science is the role of spreadsheets in Data Science. Now, spreadsheets may not seem like a very fancy or sophisticated tool, but they are often the right tool for the job that you're working on. There are a few reasons for this. First off, spreadsheets are everywhere. Microsoft Excel is on nearly every PC ever made, and Google Sheets can be accessed from all of them. Also, spreadsheets are often the preferred format from the client.
When they give you the data, it'll often be in a spreadsheet, so you have to be able to work fluently with them, simply to get started on your analysis. Also, for data transfer, the spreadsheet format CSV, or Comma Separated Values, is sort of like a lingua franca between data programs. Any program can read a CSV file; any program can write one. It allows you to transfer data from one program to another. Also, they're easy to use; there are some operations that are actually easier to do in spreadsheets than in other programs, and I'll show you some examples of those operations.
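As a rough illustration of the CSV point above, here is a minimal sketch that writes a small table to CSV text and reads it back using Python's standard `csv` module. The sample records and the helper names `write_csv` and `read_csv` are made up for this example; they are not from the video.

```python
import csv
import io

# Hypothetical sample data: one dict per case, one key per variable.
rows = [
    {"name": "Ann", "city": "Salt Lake City, UT", "score": "91"},
    {"name": "Bob", "city": "Provo", "score": "78"},
]


def write_csv(records, fieldnames):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def read_csv(text):
    """Parse CSV text back into a list of dicts (every value is a string)."""
    return list(csv.DictReader(io.StringIO(text, newline="")))


csv_text = write_csv(rows, ["name", "city", "score"])
round_tripped = read_csv(csv_text)
print(csv_text)
print(round_tripped == rows)
```

Note that the value containing a comma is quoted automatically on the way out, and that everything comes back as a string, so numeric columns need converting after the read.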
But first, let me show you the results of a survey of software usage among working data-mining professionals. So here is our survey, it's by KDnuggets, and you'll see Excel is fifth on this list! It's above Hadoop and Spark, which are some of the prototypical big data tools. And here's why Excel might be so prominent among the rest of these. First off, it's got a lot of important uses. It's good for what is called data browsing, and that is simply a way of actually seeing the data in front of you and being able to scroll, left and right, up and down, and seeing what's there.
It's good for sorting the data. It's good for rearranging the rows and the columns manually. It's a hands-on way of doing it. It's good for finding and replacing, and these are jobs that are important for Data Science. And the fact that spreadsheets make it easy makes spreadsheets important. Now, there are some more uses for spreadsheets. One of them is formatting the data. Transposing the data, switching the rows and columns, is very easy in a spreadsheet. Tracking changes: if you're collaborating, you can see who worked on what.
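For comparison with the spreadsheet operations just listed, here is a small sketch of sorting, find-and-replace, and transposing on a table held as plain Python lists. The table contents and the function names `sort_rows`, `find_replace`, and `transpose` are assumptions for illustration only.

```python
# Hypothetical table: a header row plus one list per data row.
header = ["name", "dept", "sales"]
data = [
    ["Ann", "East", 120],
    ["Bob", "West", 95],
    ["Cy", "East", 143],
]


def sort_rows(rows, header, column, descending=False):
    """Return the rows sorted by the named column."""
    index = header.index(column)
    return sorted(rows, key=lambda row: row[index], reverse=descending)


def find_replace(rows, header, column, old, new):
    """Return a copy of the rows with exact matches of `old` replaced in one column."""
    index = header.index(column)
    return [
        [new if (i == index and cell == old) else cell for i, cell in enumerate(row)]
        for row in rows
    ]


def transpose(table):
    """Swap rows and columns; assumes every row has the same length."""
    return [list(column) for column in zip(*table)]


by_sales = sort_rows(data, header, "sales", descending=True)
renamed = find_replace(data, header, "dept", "East", "Eastern")
flipped = transpose([header] + data)
```

Each of these is a single menu action in a spreadsheet, which is the convenience the transcript is pointing at; in code, the same steps need a little helper apiece.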
Making pivot tables: this is a pivot table right here. For some of my colleagues, the reason spreadsheets exist is to make pivot tables as a way of interactively exploring and manipulating the data. Also, spreadsheets are great for arranging the output for presentation or sharing purposes. One of the important things, however, is that spreadsheets allow a lot of flexibility and that can be a bit of a problem, because when you're sharing data with programs, they don't always want that much variation. Instead, you want to have what's called 'tidy data', and the idea here is that tidy data is for transferring data between programs, and in it a column is equivalent to a variable and a row is equivalent to a case.
And you have one sheet per file, and you have one level of analysis per file. If you've ever worked with relational databases, then this will be a familiar setup, but it makes it very easy to transfer the data from one program to another. So, after this very brief presentation on spreadsheets, what are our conclusions? First, data scientists still need spreadsheets. They're still a critical tool, and you still need to be able to work with them, if for no other reason than that your client will probably give you the data in that format and will want it back in that format.
But on the other hand, the spreadsheet is the tool of choice for many procedures, such as data browsing and pivot tables. But don't forget, you need to have tidy data for transferring the data easily between programs.
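To make the tidy-data and pivot-table ideas concrete, here is a minimal sketch in plain Python. It reshapes a "wide" sheet, where quarters are spread across columns, into tidy form (one column per variable, one row per case), then builds a simple pivot-style summary from the tidy rows. The data and the helper names `to_tidy` and `pivot_sum` are hypothetical, chosen only for this example.

```python
from collections import defaultdict

# Hypothetical "wide" sheet: the quarter variable is hidden in the column names.
wide = [
    {"region": "East", "Q1": 10, "Q2": 15},
    {"region": "West", "Q1": 7, "Q2": 12},
]


def to_tidy(records, id_column, value_columns, variable_name, value_name):
    """Turn each (record, value column) pair into one row: one case per row."""
    tidy_rows = []
    for record in records:
        for column in value_columns:
            tidy_rows.append({
                id_column: record[id_column],
                variable_name: column,
                value_name: record[column],
            })
    return tidy_rows


def pivot_sum(records, row_key, column_key, value_key):
    """Sum `value_key` for each (row_key, column_key) combination."""
    table = defaultdict(dict)
    for record in records:
        cells = table[record[row_key]]
        label = record[column_key]
        cells[label] = cells.get(label, 0) + record[value_key]
    return dict(table)


tidy = to_tidy(wide, "region", ["Q1", "Q2"], "quarter", "sales")
by_region = pivot_sum(tidy, "region", "quarter", "sales")
by_quarter = pivot_sum(tidy, "quarter", "region", "sales")
```

Because the tidy rows name every variable explicitly, the same function can pivot by region or by quarter just by swapping the key arguments, which is roughly what dragging fields around in a spreadsheet pivot table does.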
- The demand for data science
- Roles and careers
- Ethical issues in data science
- Sourcing data
- Exploring data through graphs and statistics
- Programming with R, Python, and SQL
- Data science in math and statistics
- Data science and machine learning
- Communicating with data