Learn how to perform regression analysis using Python and how to interpret the results.
- [Instructor] We're going to build on what we learned in the last video. Now with Python, we've gone ahead and we've opened up the application, and we've set up our notebook by bringing in our packages. This time there are a few of them. Go ahead and execute this code by running the cell. If you recall, we'll select that cell and then shift + enter. This has brought our packages into the platform and has set us up to begin our work with the analysis here. Now let's connect to our data source. In this case, we'll be connecting to our CSV file.
Again, we've already written that line in, so we're just going to run it with shift + enter. Let's take a quick snapshot of our data, just to ensure we have access to what we expect. I'm going to type in the variable that we assigned to our data frame, call head() on it, and pass in three as the argument. That'll give us just the first three rows of our data. You can see we got something very similar to what we saw in the last video, in terms of our dataset.
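As a rough sketch of the load-and-preview step described above: the video does not show the CSV file name or its contents, so the in-memory text and its numbers below are a made-up stand-in. Only the variable name myRegressionData and the column names broadcast and sales come from the narration.

```python
import io

import pandas as pd

# Hypothetical stand-in for the course CSV; the real file and values are not shown.
csv_text = """broadcast,sales
1.0,12.0
2.0,19.5
3.0,31.0
4.0,38.5
5.0,52.0
"""

# In the notebook this would be pd.read_csv() pointed at your own CSV file.
myRegressionData = pd.read_csv(io.StringIO(csv_text))

# head(3) returns only the first three rows as a quick sanity check.
first_rows = myRegressionData.head(3)
print(first_rows)
```

Calling head() with no argument would return the first five rows instead.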
There have been a few transformations to this, so we don't have quite as many data points or quite as many columns here, but the general gist is very similar. That looks good. Now let's plot our data. Similar to the last video, we're going to assign broadcast to our X axis, because it's our independent variable, and we'll assign sales to our Y axis, because it's our dependent variable.
Let's go ahead and plot that data. I'm going to paste in the variable name for our data frame and call the plot() function on it, so dot plot. Now we need to specify which type of plot we want. We do that by typing kind equals 'scatter', in single quotes, so it's a scatter plot that we're drawing here. Next, we want to assign the data for our X axis to broadcast.
The way we do that is x equals 'broadcast', in single quotes, which we can see right here is the name of that specific column in this dataset, and then y equals 'sales', which again we can see in the printout of our data above. With that, I'm going to go ahead and run this. That brings a data visualization into our notebook. This is a scatter plot, just like we saw in the last video.
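The plotting call just described might look like the sketch below. The data frame here is a small made-up stand-in with the same column names as in the narration, and the Agg backend line is an assumption added only so the example runs outside a notebook.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook

import pandas as pd

# Hypothetical values; only the column names come from the video.
myRegressionData = pd.DataFrame({
    "broadcast": [1.0, 2.0, 3.0, 4.0, 5.0],
    "sales": [12.0, 19.5, 31.0, 38.5, 52.0],
})

# kind selects the plot type; x and y name the columns for each axis.
ax = myRegressionData.plot(kind="scatter", x="broadcast", y="sales")
```

In a notebook the chart renders under the cell; elsewhere you could save it with ax.figure.savefig() and a file name of your choice.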
Before I do any additional analysis, I want to introduce a new concept, that of R squared. R squared is a statistical measure of how close the data is to the fitted line. It is also known as the coefficient of determination. For a regression like this one, R squared values are always between zero and one, but let's interpret that as between 0% and 100%. Zero means the fitted line explains none of the variation in our dependent variable, while 100% means it explains all of it, so every data point sits exactly on the line.
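To make the definition concrete, here is a small sketch that computes R squared by hand as one minus the ratio of unexplained variation to total variation. The numbers are invented for illustration, and NumPy is used here only to show the arithmetic; it is not necessarily the package used in the course notebook.

```python
import numpy as np

# Hypothetical data for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([12.0, 19.5, 31.0, 38.5, 52.0])

# Fit a straight line (degree-1 polynomial) by least squares.
slope, intercept = np.polyfit(x, y, 1)
predicted = slope * x + intercept

# Variation left over after the fit vs. total variation around the mean.
ss_res = np.sum((y - predicted) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

# For a single predictor, this matches the squared correlation coefficient.
r = np.corrcoef(x, y)[0, 1]
print("R squared:", r_squared, "r ** 2:", r ** 2)
```

The closer the points sit to the fitted line, the smaller ss_res becomes and the closer R squared gets to one.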
In general, the higher the R squared, the better the model fits your data. I have already included the line of code that you will need here. Let me scroll down and bring that into view. What this particular line is doing is calling the stats package that we included in cell one above. It's fitting the line and handing back the values we need, the slope and the intercept, for example, along with the R value that we'll use to calculate R squared.
We of course need to assign our X and our Y values, which we've done right here, with broadcast and sales. Let's go ahead and run that command. Then we'll print out R squared for assessment. We type out Python's print() function, which is simply print and then parentheses. We can type in whatever makes the most sense as a label, so I'll just do R squared colon. After the label, I'm going to feed in that R underscore value, then two asterisks, which is Python's exponent operator, and the number two, so we're squaring the R value.
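The exact line in the notebook is not shown in the transcript. A plausible sketch, assuming the stats package is scipy.stats and the function is linregress, looks like this; the data values are made up, and names other than r_value are illustrative.

```python
import pandas as pd
from scipy import stats

# Hypothetical values; only the column names come from the video.
myRegressionData = pd.DataFrame({
    "broadcast": [1.0, 2.0, 3.0, 4.0, 5.0],
    "sales": [12.0, 19.5, 31.0, 38.5, 52.0],
})

# linregress takes x and y and returns the fitted line plus the R value.
slope, intercept, r_value, p_value, std_err = stats.linregress(
    myRegressionData["broadcast"], myRegressionData["sales"]
)

# ** is the exponent operator, so this squares the correlation coefficient.
r_squared = r_value ** 2
print("R squared:", r_squared)
```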
That's how we calculate R squared. Let's go ahead and run that. This value indicates a relatively strong relationship between our broadcast advertising and sales, so it's another way to interpret the model. Now, let's create some parity between what we're doing here in Python and what we did in R. We can also calculate the coefficients here. We can do so by running an OLS, or what's known as an ordinary least-squares regression, which is what we did in R.
We do that this way. Let me go ahead and type this in here. I'm going to assign it a variable name of myLinearModel, equals, and then I'm going to type in smf.ols. The smf is how we refer to our statsmodels package, and ols is the ordinary least squares function within it. Next I'm going to add in a formula.
We give the formula our dependent and independent variables. The dependent variable comes first, so that's sales, then a tilde, and then the predictor, which is broadcast, our X value or independent variable. So the formula is sales tilde broadcast. Then we need to pass in our specific data frame, which again, from above, is myRegressionData, and finally we call fit() so that the model fits the line.
Let me run this. That information is there; we just need to reveal it so we can output those values. I'm just going to call the variable here and then add dot params to it. Note that params is an attribute rather than a function, so it takes no parentheses. That will reveal the parameters that are stored within this myLinearModel variable, so enter that. We have values here that are similar to what we saw in our regression in R. This says that for each one-unit increase in broadcast, the model estimates an increase of $12,141.93 in net sales.
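Put together, the OLS step described above can be sketched as follows. The import alias smf, the formula, and the names myLinearModel and myRegressionData follow the narration; the data values are invented, so the coefficients printed here will not match the figure quoted in the video.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical values; only the column names come from the video.
myRegressionData = pd.DataFrame({
    "broadcast": [1.0, 2.0, 3.0, 4.0, 5.0],
    "sales": [12.0, 19.5, 31.0, 38.5, 52.0],
})

# "sales ~ broadcast" reads as: model sales as a function of broadcast.
myLinearModel = smf.ols(formula="sales ~ broadcast", data=myRegressionData).fit()

# params is an attribute holding the intercept and the broadcast coefficient.
print(myLinearModel.params)
```

The broadcast entry in params is the slope, which is the number interpreted as the change in sales per one-unit change in broadcast.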
So we've done a couple things here. We modeled our data, we generated and assessed R squared, and we ran an OLS regression analysis to determine the relationship with some specificity. We now have the necessary information to begin to model our broadcast's return on investment.
In this course, discover how to gain valuable insights from large data sets using specific languages and tools. Follow Chris DallaVilla as he walks through how to use R, Python, and Tableau to perform data modeling and assess performance. As Chris dives into these concepts, he shares specific case studies that come directly from his own work with clients. Plus, he shares three essential—and practical—best practices for data-driven marketing that you can use to bolster your organization's marketing performance.
- Installing R, Python, and Tableau
- Navigating the UI for R, Python, and Tableau
- Using R, Python, and Tableau
- Exploratory analysis
- Performing regression analysis
- Performing a cluster analysis
- Performing a conjoint assessment
- Stakeholder alignment