Learn about applying functions to columns; simple regular expressions; datetime transformations; dropping null data; and pandas categorical variables.
- [Instructor] In this age of big data, our lives and behaviors are constantly being recorded in digital artifacts that we leave around us and in the cloud. It's sometimes easy to obtain this data and analyze it, perhaps to understand ourselves better and improve our lives. For instance, if you use Google mail, it's easy to retrieve all your messages at takeout.google.com. With these messages we will go through a combination of data cleaning and visualization to analyze mail behavior.
This video was inspired by a post by my ex-colleague, Justin Ellis. So I have now downloaded my mailbox from Google, and I have used the standard Python library module mailbox to convert my sent messages into a pandas data frame by way of a CSV file. For privacy I'm not giving you my actual sent mailbox, but an anonymized CSV file. However, in the exercise files notebook I'm showing you the code that I used to convert my mailbox.
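A minimal sketch of that kind of conversion, assuming the export is a single mbox file. The function name mbox_to_csv and the file paths are hypothetical, and this is not the code from the exercise files; it only uses the standard library modules mailbox and csv.

```python
import csv
import mailbox


def mbox_to_csv(mbox_path, csv_path):
    """Write the subject, from, to, and date headers of each message to a CSV file."""
    fields = ["subject", "from", "to", "date"]
    box = mailbox.mbox(mbox_path)
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(fields)
        for message in box:
            # Header lookup is case-insensitive; a missing header gives None,
            # which the csv module writes as an empty field.
            writer.writerow([message[field] for field in fields])
    box.close()
```

The resulting file can then be read with pandas.read_csv, where the empty fields show up as missing values.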
You may have to do something slightly different depending on the settings of your mail server. So I read the anonymized mailbox back into pandas and we'll play with it. I first need to load packages though. We have columns for subject, from, to, and date.
And we can see that the subjects are a little funky. That comes from anonymizing. I have also simplified the data by keeping only the first recipient of each email. We see that the email addresses use a few different formats. We should normalize them. For instance, in the first record I see my name in quotes followed by the email address. Here we'll use a simple regular expression. These are very powerful and you can learn about them in many places, but what we need to know is that in regular expressions the dot matches any character and the plus matches one or more of what came before.
So using the Python module re I will write the regular expression that grabs just the email. I also need parentheses to group the part that I really want. So this matches, and group zero will contain the entire matched expression, while group one contains just the part in parentheses.
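To make the grouping concrete, here is a small sketch. It assumes the address is wrapped in angle brackets after the quoted name, which the narration does not state explicitly; the sample strings are made up.

```python
import re

# Dot matches any character, plus means "one or more"; the parentheses
# capture only the text between the angle brackets (an assumed format).
pattern = r"<(.+)>"

match = re.search(pattern, '"Jane Doe" <jane.doe@example.com>')
whole = match.group(0)   # the entire match, brackets included
email = match.group(1)   # only the part inside the parentheses

# A bare address has no angle brackets, so the search finds nothing.
no_match = re.search(pattern, "jane.doe@example.com")
```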
By contrast, a different string, such as just the email, would not be matched. I can now write a simple function to clean an email address. It starts from the raw string. It searches for the regular expression.
And if it's not found, it just returns the string again, which must be an email; otherwise it returns the group. Again, let's try it out on the first record. Very good. So now we can apply this to entire columns in the data frame.
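A sketch of such a cleaning function, under the same assumption that decorated addresses look like a quoted name followed by the address in angle brackets. The name clean_address follows the narration; the pattern itself is an assumption.

```python
import re


def clean_address(raw):
    """Return the bare email address from a raw header string."""
    match = re.search(r"<(.+)>", raw)
    if match is None:
        # Nothing in angle brackets: assume the string is already a bare address.
        return raw
    return match.group(1)
```

Called on '"Jane Doe" <jane.doe@example.com>' it returns the address alone, and called on a bare address it returns the input unchanged.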
We do so by using the apply method of the data frame. This returns an error, and it is not immediately apparent what went wrong. So to find out, we drop into debug mode, go up in the frame, and print out the string that got us into trouble. The value is actually NaN, not a number, so it was a missing entry in the data frame.
So we'll just drop the missing data before applying the transformation. But also let's not forget to drop out of debug mode with quit. So the from column of messages is going to be replaced by the result of applying clean address to the same column after dropping missing records. Pandas will automatically match the indices.
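The failure and the fix can be reproduced on a toy data frame. This is a sketch with made-up rows; the column names follow the narration, and clean_address is redefined here (with an assumed pattern) so the block runs on its own.

```python
import re

import numpy as np
import pandas as pd


def clean_address(raw):
    match = re.search(r"<(.+)>", raw)
    return raw if match is None else match.group(1)


messages = pd.DataFrame({
    "from": ['"Jane Doe" <jane@example.com>', np.nan, "bob@example.com"],
    "to": ["carol@example.com", '"Dan" <dan@example.com>', np.nan],
})

# Applying directly fails: re.search cannot handle the NaN of a missing entry.
failed = False
try:
    messages["from"].apply(clean_address)
except TypeError:
    failed = True

# Drop missing entries first. On assignment pandas aligns on the index,
# so the rows that were dropped simply stay missing.
messages["from"] = messages["from"].dropna().apply(clean_address)
messages["to"] = messages["to"].dropna().apply(clean_address)
```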
Same for to. So let's have a look. Now we work on dates. Currently these are strings. We should turn them into the datetime objects used by pandas, which are really smart. For instance, they know all about daylight saving time. So let's try it on one. The method is to_datetime.
This is a universal time, so I need to localize it and then convert it to my time zone, which is most useful for understanding what I'm doing in terms of emails. Very good. Let's apply this transformation throughout. I will create a lambda function on the fly to apply the change.
And then add the localization. Correct. Looking at the extent of the dates tells us that we have about one year of sent emails. We will break apart the datetime objects in various ways: day of the week, time of day, and fractional year, which will be good for plotting.
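The date handling described above (parse, localize as universal time, convert to a local zone, then apply to the whole column) might look like the sketch below. The time zone America/Los_Angeles and the sample date strings are assumptions; the narration does not say which zone or string format is used.

```python
import pandas as pd

# Assumed target zone, chosen only for illustration.
LOCAL_TZ = "America/Los_Angeles"

messages = pd.DataFrame({"date": ["2017-03-31 22:08:46", "2017-12-01 08:15:00"]})

# One value: parse the string, mark it as UTC, then convert to local time.
one = pd.to_datetime(messages["date"].iloc[0]).tz_localize("UTC").tz_convert(LOCAL_TZ)

# Whole column: the same chain wrapped in a lambda and applied to each string.
messages["date"] = messages["date"].apply(
    lambda s: pd.to_datetime(s).tz_localize("UTC").tz_convert(LOCAL_TZ)
)

# Extent of the dates.
first, last = messages["date"].min(), messages["date"].max()
```

The two sample rows end up with different UTC offsets (seven hours in March, eight in December), which is the daylight saving awareness mentioned earlier.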
Remember that we access datetime methods in pandas through dt. The day of the week is a good application for pandas categorical variables. These are not just strings; rather, they are aware that the strings must take one of a limited number of levels. So let's create that. We'll call it day of week.
I'm missing a bracket here. And then I need to give the function the categories. This is another one of those cases where it may be easier to write them out by hand than to find a smarter way to do this. As for time of day and fractional year, we find these with simple operations on columns.
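A sketch of the day-of-week categorical, with the categories written out by hand. It uses dt.day_name() to get the day names, and it marks the categorical as ordered; both are choices of this sketch rather than something the narration confirms. The sample dates are made up.

```python
import pandas as pd

day_names = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

dates = pd.Series(pd.to_datetime(
    ["2017-03-31 15:30", "2017-04-03 09:00", "2017-04-08 11:45"]
))

# Day names come out as plain strings; the Categorical restricts them
# to the seven listed levels and remembers their calendar order.
dayofweek = pd.Categorical(dates.dt.day_name(), categories=day_names, ordered=True)
```

Because the levels are fixed, later tallies and plots can follow calendar order instead of alphabetical order.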
For time of day, I get the hour and I add the minutes divided by 60. For fractional year, I get the year, I add the day of the year divided by 365, and I add the time of day divided by 24 and by 365.
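Written out as column operations, the two derived quantities could look like this. The column names timeofday and nyear are assumed, and the sample dates are made up.

```python
import pandas as pd

messages = pd.DataFrame({
    "date": pd.to_datetime(["2017-03-31 15:30:00", "2017-07-02 06:00:00"])
})

# Hours since midnight, as a decimal number.
messages["timeofday"] = messages["date"].dt.hour + messages["date"].dt.minute / 60

# Year plus the elapsed fraction of it: whole days first, then the
# time of day expressed as a fraction of a 365-day year.
messages["nyear"] = (
    messages["date"].dt.year
    + messages["date"].dt.dayofyear / 365
    + messages["timeofday"] / 24 / 365
)
```

The fixed 365 in the denominator ignores leap years, which is a harmless approximation for plotting.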
Time of day is actually something that I just created, so I need to grab the column directly. Okay, we're ready to plot. Say, date against time of day. I'll make the dots a little smaller. This looks reasonable, with few emails at night and many in the morning.
Some in the very early morning may be related to trips overseas. We can also look at one-dimensional histograms. The name of the variable is actually nyear, not nhist. My most active hours are during the day, with a peak just before lunch. As for days of the week, we first tally up the number of sent emails on each day using value_counts on day of week.
Ah, Mondays and Tuesdays are the best. And then I can plot these in a histogram. I will collect the counts in a variable. Keep the ordering of days and then plot a bar plot. Indeed Monday is the busiest. On Fridays I'm tired and on the weekend I rest.
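The tally and the reordering can be sketched as follows, on made-up data. Using sort_index to restore calendar order is an assumption of this sketch; the narration only says that the ordering of days is kept.

```python
import pandas as pd

day_names = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

days = ["Monday", "Tuesday", "Monday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday", "Tuesday", "Monday"]
dayofweek = pd.Series(pd.Categorical(days, categories=day_names, ordered=True))

# value_counts sorts by frequency, busiest day first.
counts = dayofweek.value_counts()

# Sorting on the categorical index puts the days back in calendar order.
counts = counts.sort_index()

# counts.plot(kind="bar") would then draw the bar plot (needs matplotlib).
```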
- Installing and setting up Python
- Importing and cleaning data
- Visualizing data
- Describing distributions and categorical variables
- Using basic statistical inference and modeling techniques
- Bayesian inference