Comma-separated values (CSV) files match the standard format of a spreadsheet, but place all of the information in a text file that may be read by any program, including a text editor. In this video, Mike Chapple explains the format of CSV files and how th
- [Instructor] Data scientists are often called upon to perform analysis on data stored in many different kinds of files, in databases, and on the web. In these next few videos, I'll show you how you can import data from each of these environments and use it in your R code. Let's begin with a file format known as comma-separated values, or CSV. Also called comma-delimited files, CSV files match the standard format of a spreadsheet, but place all the information in a text file that may be read by any program, including a text editor.
Each line in the CSV file is a row, or an observation, in our tidy data language. Text files don't have columns, per se, so the CSV file uses commas to separate data that would appear in different columns. So each row consists of the attribute values belonging to a single observation, separated by commas. For example, take this simple spreadsheet that shows the names, ages, genders, and zip codes for a group of three people.
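The spreadsheet itself appears on screen and is not reproduced in this transcript. As a rough stand-in, the sketch below builds a three-person table and writes it out as comma-separated text. The names, ages, and zip codes are made up for illustration, and Python's standard `csv` module is used only as a convenient way to show the idea; it is not the tool used in the course.

```python
import csv
import io

# Hypothetical stand-in for the on-screen spreadsheet: a row of column
# names followed by three made-up people.
rows = [
    ["name", "age", "gender", "zip"],
    ["Mary", 27, "F", "60601"],
    ["Tom", 32, "M", "60614"],
    ["Beth", 43, "F", "60657"],
]


def to_csv_text(table):
    """Write each row as one line, with commas between the column values."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(table)
    return buffer.getvalue()


csv_text = to_csv_text(rows)
print(csv_text)
```

Each printed line corresponds to one spreadsheet row, and the commas stand where the column boundaries were.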
The file is formatted in a fairly standard way. Each person has their own row in the table, and each table column corresponds to a single variable. If we wanted to transform this spreadsheet into a CSV file, we simply take away the table boundaries and replace them with commas. This CSV format may then be stored in a simple text file, instead of a proprietary spreadsheet format. Here's an example of a more complex CSV file.
This particular file contains the results of restaurant inspections conducted by the city of Chicago. It has over 140,000 lines, and we're just looking at the first few here. In this particular case, the first row contains the names of the variables, while each of the remaining rows in the file represents a single observation. In this case, that's a single restaurant inspection. There's one other thing that I want to point out while we're looking at this file. Sometimes, data in a field might contain a comma.
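As a brief aside on the header row mentioned above, here is a minimal sketch of how a program might use the first line as variable names and treat every later line as one observation. The three-line sample is hypothetical: its column names and values are invented and are not taken from the actual Chicago file. Python's standard `csv` module is assumed purely for illustration.

```python
import csv
import io

# Hypothetical miniature of an inspections file; the column names and
# values are invented for illustration and are not the real Chicago data.
sample = (
    "inspection_id,restaurant,result\n"
    "1001,Example Diner,Pass\n"
    "1002,Sample Cafe,Fail\n"
)

# DictReader treats the first line as the variable names and yields one
# dictionary per remaining line, that is, one per observation.
reader = csv.DictReader(io.StringIO(sample))
observations = list(reader)

print(reader.fieldnames)
for obs in observations:
    print(obs["restaurant"], obs["result"])
```

Here the header line is consumed as names rather than data, so the sample yields two observations, not three.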
Let me switch here to another window, where I have a single line of the CSV file pulled up. This record represents the inspection of a restaurant called Steak Bar. Notice this field that contains the listing of comments that the restaurant received during the inspection. It contains commas in several places. If we just put that field in the file as is, any program trying to read the CSV file would be confused by the commas within this field, thinking that they marked the boundaries of new fields.
We correct this by enclosing fields that contain commas inside of quotation marks, most commonly double quotation marks, although some programs also accept single quotation marks. So that gives you a quick understanding of how CSV files are formatted. CSV files are one of the most common formats that you'll encounter, because almost every program can read them. They're the de facto standard for sharing data between systems.
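The quoting rule described above can be sketched as follows. The record is a hypothetical one modeled loosely on the Steak Bar line; the ID and the comment text are invented. The sketch compares a naive split on every comma with Python's standard `csv.reader`, which is assumed here only for illustration.

```python
import csv
import io

# Hypothetical record; the ID and the comment text are invented. The last
# field is wrapped in double quotes because it contains commas.
line = '2001,Steak Bar,"Floors dirty, sink leaking, no soap"\n'

# Splitting on every comma ignores the quotes and yields too many fields.
naive_fields = line.strip().split(",")

# csv.reader honors the double quotes and keeps the comments together.
parsed_fields = next(csv.reader(io.StringIO(line)))

print(len(naive_fields))   # 5
print(len(parsed_fields))  # 3
print(parsed_fields[2])    # Floors dirty, sink leaking, no soap
```

By default this reader recognizes only the double quotation mark as the quote character, so a file that relies on single quotation marks would need the reader configured accordingly.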