Join Mike Chapple for an in-depth discussion in this video, Outliers in subgroups, part of Cleaning Bad Data in R.
- [Instructor] In addition to straightforward outlier detection, you should also examine your data set for outliers that might appear in subsets of your data. This is another case where applying domain knowledge is quite helpful. Consider, as an example, a data set containing test scores for students in an elementary school that were administered a grade-level standardized test. I've provided the code here to load that data file. Let's go ahead and load the tidyverse, set our working directory, and then read in the tests data set.
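The loading step described above might look roughly like the sketch below. The course's actual file name and column names are not shown in the transcript, so the temporary CSV, the column names (`student`, `age`, `grade`, `score`), the column types chosen for `grade` and `score`, and every value are made-up stand-ins so that the example can run without the original file.

```r
library(readr)  # part of the tidyverse; library(tidyverse) would load it too

# Hypothetical stand-in for the course's data file: write a few made-up rows
# to a temporary CSV so the example runs without the original file.
tests_path <- tempfile(fileext = ".csv")
writeLines(
  c("student,age,grade,score",
    "1,5,0,71",
    "2,7,2,84",
    "3,9,4,90",
    "4,39,1,65"),
  tests_path
)

# In the course you would first call setwd() on your own exercise folder;
# the full temporary path used here makes that unnecessary.
tests <- read_csv(
  tests_path,
  col_types = cols(
    student = col_integer(),  # student identifier stored as an integer
    age     = col_double(),   # age stored as a numeric value
    grade   = col_integer(),  # assumed type
    score   = col_double()    # assumed type
  )
)

tests
```

Declaring the column types explicitly is a design choice for this sketch; letting `read_csv()` guess them would also work, though it would read every column here as a double.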
I'm going to start by looking at some summary statistics. I see that I have a student identifier that's an integer value, an age that's a numeric value, a grade level, and a test score. And one thing that jumps out at me right away is that the ages in this data set range from five to 39. Now that sounds suspicious for an elementary school. Let's dig into that variable more by looking at a box plot. Now there certainly shouldn't be students in elementary school that are in their 20s and 30s, but I see here…
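A minimal sketch of the checks described so far, using made-up rows in place of the course data (the column names, the values, and the age cutoff of 15 are all assumptions for illustration). The transcript breaks off before the subgroup step, so the per-grade plot and summary are only one plausible way to look for outliers within subsets, not necessarily what the instructor does next.

```r
library(dplyr)
library(ggplot2)

# Made-up rows standing in for the course data (column names are assumptions)
tests <- tibble(
  student = 1:8,
  age     = c(5, 6, 7, 7, 9, 10, 26, 39),
  grade   = c(0L, 1L, 2L, 2L, 4L, 5L, 1L, 3L),
  score   = c(71, 80, 84, 77, 90, 88, 65, 72)
)

# Overall summary statistics: the range of age is the first warning sign
summary(tests)

# Box plot of age across the whole data set (print(age_plot) draws it)
age_plot <- ggplot(tests, aes(y = age)) +
  geom_boxplot()

# The same check within subgroups: one box per grade level
age_by_grade_plot <- ggplot(tests, aes(x = factor(grade), y = age)) +
  geom_boxplot()

# Numeric view of each subgroup
age_by_grade <- tests %>%
  group_by(grade) %>%
  summarize(n = n(), min_age = min(age), max_age = max(age))

# Flag ages that look implausible for an elementary school;
# the cutoff of 15 is an arbitrary illustrative choice
suspicious <- tests %>%
  filter(age > 15)
```

Whether a flagged row is an error or a legitimate record is a domain-knowledge question, so the filter only surfaces candidates for review rather than deleting anything.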
Released 8/22/2018
Where possible, instructor Mike Chapple shows how to correct the issues using R, but the same principles can be applied to any statistical programming language.
- Missing data
- Duplicate rows and values
- Converting data
- Formatting data
- Working with tidy data
- Tidying data sets
- Dealing with suspicious data
Skill Level Beginner
Introduction
- Data is messy (1m 10s)
- What you need to know (1m 9s)
1. Missing Data
- Types of missing data (3m 38s)
- Missing values (11m 25s)
- Missing rows (5m 58s)
2. Duplicated Data
- Duplicated rows and values (4m 50s)
- Aggregations in the data set (3m 42s)
3. Formatting Data
- Converting dates (5m 54s)
- Unit conversions (3m 50s)
- Numbers stored as text (3m 32s)
- Inconsistent spellings (6m 51s)
4. Outliers
- Screening for outliers (4m 53s)
- Handling outliers (1m 58s)
- Outliers use case (3m 34s)
- Outliers in subgroups (3m 33s)
- Detecting illogical values (3m 14s)
5. Tidy Data
- What is tidy data? (3m 59s)
- Common data problems (7m 57s)
- Wide vs. long data sets (3m 23s)
- Making wide data sets long (4m 37s)
- Making long data sets wide (3m 41s)
6. Red Flags
- Suspicious values (4m 49s)
- Suspicious multiples (2m 25s)
Conclusion
- What's next? (1m 5s)