Join Barton Poulson for an in-depth discussion in this video Algebra, part of Data Science Foundations: Fundamentals.
- [Voiceover] Algebra is a key practice in data science. Algebra is the study of abstract quantities and the relationships between them. There are three kinds of algebra that are particularly relevant to data science. There is elementary algebra, the kind we all learned, which is used for calculating individual values: you multiply this, you add that. There is linear algebra, also known as matrix algebra, which is at the core of most of the calculations in statistical procedures. And then there are systems of linear equations.
These are critical to linear algebra and also to the practice of optimization, which we'll take a closer look at. Right now I'm gonna do an example on data science salaries, and this actually comes from real data. I'm gonna make a particular equation that goes: salary is equal to some constant, plus years (this actually has to do with age), plus bargaining, plus hours, plus an error term. You could write it like this with abbreviations, but it's more common to write it like this.
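The equation being narrated here, reconstructed from the description that follows (the video shows it on screen; the subscript labels for years, bargaining, and hours are inferred from the narration):

```latex
y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i} + \varepsilon_i
```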
Let me explain the terms that are here. On the far left is y sub i, that is the outcome on y, an outcome variable for person i, i is one, two, three, and so on. Next to it is beta sub zero, that's a Greek beta, and that stands for the intercept, the y intercept. Next to that is beta sub one, which is the regression coefficient for variable one, and next to that is x sub one i. That's the score on variable one for person i.
Then if we go way to the end we have epsilon sub i, that's the prediction error for person i. Now it's important in algebra to understand that there are several different structures available. A scalar is what we're used to dealing with; that's just a single number at a time. But you can also use vectors: one row or one column of numbers treated as a single unit. And there are matrices: multiple rows and columns of data in a single object. When you use these you can rewrite the equation this way.
These are vectors and matrices put together. This vector on the left is the outcomes y for both cases one and two. This matrix here in the middle is all of the data for the individuals. The top row contains a one to be multiplied against a constant plus the scores on the three variables for case one. This is case two on the second row. This vertical column consists of the regression coefficients that's beta sub zero, beta sub one, and so on.
Then finally we have a smaller vector for the error terms for cases one and two. Let's run through this with a fictional example. Let's assume I have two data scientists, Fatima, who's 28 years old, has good bargaining skills, four on a scale of five, works 50 hours a week, has a salary of $118,000. Then there is Ezra, who's 34, has moderate bargaining skills, works 35 hours a week and has a salary of $84,000. What I'm going to do is I'm going to come back here and I'm going to calculate the salary for Fatima.
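Putting the pieces just described together, the two-case system looks like this (reconstructed from the narration, with x sub one, two, three standing for years, bargaining, and hours):

```latex
\begin{bmatrix} y_1 \\ y_2 \end{bmatrix}
=
\begin{bmatrix}
1 & x_{11} & x_{21} & x_{31} \\
1 & x_{12} & x_{22} & x_{32}
\end{bmatrix}
\begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \end{bmatrix}
+
\begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \end{bmatrix}
```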
This is the format we're using here. Let me put in the actual numbers. Fatima's salary is $118,000. What I do now is I take each of these coefficients and I multiply them times her own values, and I go from left to right in the data and top to bottom on the coefficients. I multiply these, then I multiply those, I multiply those, I multiply those, add those all up, and then I add on an error term. You can see, by the way, that Fatima's error term is really big.
The reason for that is even though I'm using real data, there were 35 other variables in the equation, one of which was, for instance, whether you're the CEO, in which case you would expect to make about $30,000 more per year. What's really nice is being able to put this into matrix notation. I'm able to take that entire collection of matrices and vectors and write it down like this, where the bold y is the vector of outcomes, the bold X is the entire matrix of data from the individuals, the beta is the entire vector of regression coefficients, and the epsilon is the entire vector of error terms.
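In symbols, that compact matrix form is:

```latex
\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}
```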
This makes it extremely compact, and it's also how you would do it in most statistical programs. I'll show you the same idea in R. Here I am in R, and I'm going to be using the same data from my presentation. What I'm going to do is create a vector that has their outcome, their salaries. By the way, it's customary to use lower case letters for vectors and upper case letters for matrices. I'm going to create a vector and I'll just bring it up.
You can see there is the vector. I had to use cbind to make it a column. Then I'll create a matrix, and we'll bring up the matrix, and there is exactly what we wanted. Then I'll get a vector b for the regression coefficients, again using cbind to make it a column. That's important because whether things are columns or rows makes a very big difference in matrix algebra, as does the order of multiplication: A times B is not the same as B times A when you're dealing with matrices.
Now what I'm going to do is get each person's predicted value on y, the outcome, by multiplying the matrix of their data by the vector of regression coefficients. When I do that, you see these are their predicted scores. Then I can get the error terms by simply subtracting those predicted scores from their actual salaries. You can compare those to the answers we had in the presentation, and those are spot-on. It's very easy to do these kinds of calculations in R.
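The same steps can be sketched in Python with NumPy (the demo above is in R; here `@` plays the role of R's `%*%`). The coefficient values below, and Ezra's bargaining score of 3, are placeholders for illustration only, since the actual fitted values aren't shown in the narration:

```python
import numpy as np

# Outcomes (salaries) as a column vector, like cbind() in R
y = np.array([[118000.0],
              [84000.0]])

# Data matrix X: a leading column of 1s for the intercept, then
# years, bargaining, and hours for Fatima and Ezra.
# Ezra's bargaining score (3) is an assumption; the transcript
# only says "moderate".
X = np.array([[1.0, 28.0, 4.0, 50.0],
              [1.0, 34.0, 3.0, 35.0]])

# Regression coefficients as a column vector.
# PLACEHOLDER values, not the real fitted coefficients.
b = np.array([[20000.0],
              [1000.0],
              [5000.0],
              [500.0]])

# Predicted salaries: X %*% b in R is X @ b in NumPy
y_hat = X @ b

# Error terms: actual salaries minus predicted salaries
e = y - y_hat

print(y_hat)  # predicted scores
print(e)      # residuals
```

With these made-up coefficients, Fatima's predicted salary works out to $93,000, leaving a large positive error term, which echoes the point above about the omitted variables.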
What are our conclusions from this? First, algebra is critical; it's a central practice in data science. Second, matrices simplify notation. And third, linear algebra is at the core of many, many of the procedures used in data science.