From the course: SQL Server Machine Learning Services: Python

Manipulate a data frame

From the course: SQL Server Machine Learning Services: Python

Start my 1-month free trial

Manipulate a data frame

- [Instructor] The easiest way to start working with a data set in Python is to first figure out the select query that will retrieve the data from your database that you want to work with. I'm going to go ahead and take a look inside of the Wide World Importers database and in the tables folder we'll see that there's a table called application.cities. I'm going to create a new query to pull some information out of this table. I'm going to pull out the distinct city name and state province IDs from it, and I also want to order the records by the city name values. I can then execute the query. I get the results down below, and this is the data that I want to work with in Python. Now we can go ahead and select all three of these lines and cut them to my clipboard. Now we can start writing the Python script and we're going to start with the same empty script template that we've been working with. We'll start with E-X-E-C, or execute, sp.execute_external_script, then we have our two parameters here, at language, and that's going to be set to the value of Python. Next up, we have the script parameter. And I'm just going to leave that one blank here for the time being. And now we're going to add one more parameter. So I'm going to type in a comma after the close of the script parameter, and we'll come down to line number six and the next parameter is called @input_data_one. Remember to include underscores between the words input, data and one. I'm going to set its value, it's also a Unicode character, so I'm going to type in the capital letter N and open up a single quotation mark. And now I can paste in the contents of the select statement that we just generated. I'll finish the parameter by closing it with a single quotation mark here at the very end. The InputDataOne property is going to take the results of the select statement, and convert it into a pandas data frame and store it in a variable called InputDataSet. Now we can use the data in the Python script by referring to the InputDataSet variable. Let's start by simply printing the contents of the data frame to the messages window. I'll come back up here to line number four and I'm going to use the print function, and I want to print InputDataSet. The variable name doesn't have any spaces or underscores in it, and it uses a capital letter I, capital D, and a capital S. Let's execute the Python script and see what the results look like. It'll open up the messages window and I can see that I have a column for the city name and a column for the state province ID data. On the left are integers that are used to identify each row of the data frame. Notice that the first row is zero. Python uses zero-based indexing and we can use these index numbers to pull individual rows from the data frame using pandas data frame properties. One property is called iloc, or i-l-o-c, which stands for integer location. You can use the iloc property by typing a period after the data frame name which in our case is InputDataSet, type the period there, and then i-l-o-c. After that, I need to open up two sets of square brackets followed by the integer that represents the index position of the record that I'd like to return. How about number seven? We'll close the two square brackets, and now I can execute this script. That returns just the single row from the data frame. We can retrieve multiple rows by separating the index integers with a comma, so come back up here after seven, I'll type in a comma, how about also number 10, and number 15. We'll execute the commands here and I get those three records returned. We can also retrieve a range of rows by using only a single set of square brackets. Now highlight all of this here and get rid of it, and this time I'll type in the range of five to 15. To specify the range we're going to use a colon between those two integers. I'll execute the statement and we get the results here. Notice that the range is not inclusive. It starts at index number five, but it does not include index 15, it stops at 14. This is similar behavior to what we saw with the four loop in the last chapter. Now if you're just interested in seeing the first several rows from the data frame, you can use the head function instead of iloc. Go ahead and highlight all of this here, and instead of InputDataSet.iloc, I'm going to type in InputDataSet.head. Since this is a function, dataframe.head uses parentheses to define how many rows to retrieve from the beginning of the data frame. Let's retrieve the first 10 rows. I'll type in a 10 in parenthesis, we'll close the parenthesis and we'll have that second closing parenthesis here for the print function. We'll execute statement and we get the first 10 rows return from the data frame. For more general information about the data frame, you can use a property called columns to see just what the column labels are. Let's change this print statement to InputDataSet.columns. I'll execute it and I'll see that the two columns are city name and state province ID. And finally, you might also want to know how many rows and columns are in the data frame. You can get that information with the shape property. Now I'll print InputDataSet.shape and in the messages window we'll see that we have 37,940 rows and two columns. So there's a number of different ways that we can start working with the data frame in a Python script. Once you've loaded the InputDataSet variable from a sequal server select statement using the InputDataOne property, you can then manipulate it with the properties and functions available in the pandas library.

Contents