From the course: Apache PySpark by Example

Built-in functions - Spark DataFrames Tutorial


- [Instructor] With PySpark, the built-in functions are available from the pyspark.sql.functions library. All you need to do is import them so that you can use them. With pandas, many of these functions are available to you as well; you just enter the DataFrame, the series or column name, and then the function. So if you want to do some string operations in PySpark, there are options for substring, upper, and lower. If you're looking at date functions, then you can get the day of the week, the day of the month, and so on. And if you're looking at mathematical functions, then you have options for the sine, the cosine, the log, and so on. We can also create our own functions, called user-defined functions, and we'll get into this later.

Let's head over to our notebook to look at a few examples. We've listed the first five rows of our DataFrame, so let's head over to the built-in functions section. Remember that we need to import the functions that we want, so from pyspark.sql import functions, and I'm just going to show you all of the functions available to you. So, for example, you can see you've got Column, DataFrame, StringType, and you've got functions such as atan, asin, and so on.

Let's look at some of these string functions available to us. We want to display the primary type column in lower and upper characters, and the first four characters of the column. I have no idea what these functions are called, so let's take a look at the library and make a guess. So from pyspark.sql.functions import, and I'm guessing the function to convert to lowercase is called lower, so that's probably our lower function. Again, to convert characters to uppercase, upper sounds like a sensible guess. And if we want to capture the first four characters of the column, we want something around the substring function, so I'm going to type sub and see if there's an option there. And it looks like we have a substring function.

Now, I'm just going to add an additional cell of code here. Because I don't know how to use some of these functions, I'm going to use the help function, so help(substring) will give me the documentation on the substring function. You can see that, if you want to use the substring function, you enter the string as the first argument, then the position, and then the length. The important thing to note is that the position is not zero-based but one-based. That means the first character is at position 1, not 0, as is common in many programming languages.

I want to confirm, firstly, that the primary type is in fact a string, so I'm going to use printSchema, so rc.printSchema(). And you can see that the primary type is in fact a string, so let's head down now and add an additional cell. Let's now display the primary type column in lower and upper characters, and the first four characters of the column. So let's select the first couple of rows of the primary type column: rc.select, with lower of the column Primary Type, then upper of the column Primary Type, and then substring. I need to enter the first argument as the string that I want to look at, so Primary Type. The next argument is the starting position, which is 1 in this case, and we want the first four characters, so 1 and 4. And I want to see the first five rows.
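Here is a minimal sketch of that cell, assuming the DataFrame is named rc and the column is Primary Type, as in the notebook:

    from pyspark.sql.functions import col, lower, upper, substring

    # lower/upper transform the string column;
    # substring(str, pos, len) uses a 1-based starting position
    rc.select(
        lower(col('Primary Type')),
        upper(col('Primary Type')),
        substring(col('Primary Type'), 1, 4)
    ).show(5)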
And we can see that the first column is the primary type in lowercase, the next column is the primary type in uppercase, and the third column is the first four characters of the primary type, which we know is in uppercase.

So let's take a look at the next question: show the oldest date and the most recent date. For this we'll need to import the min and max functions, so from pyspark.sql.functions import min and max. Now, if I want to get the oldest date, I actually want the min value of the Date column, and if I want the most recent date, I want the max value of the Date column. So, rc.select, the min of column Date, and the max of column Date. Right, and there we have it. The oldest date is the 1st of January, 2001, and, not surprisingly, the most recent date is on the 11th of November, 2018, and we find that it's actually at 11:50 pm.

Okay, now let's head over to the date functions and look at this question: what is 3 days earlier than the oldest date and 3 days later than the most recent date? Now, this is the beauty of PySpark. I don't need to know all of the functions, but I can hazard a guess that this is related to a date function. So I'm going to go to the PySpark functions library, from pyspark.sql.functions import, and I'm guessing it's something to do with date. We've got a couple of options here: date_add, date_format, date_sub, and datediff. I would assume that it's date_add and date_sub that we want here. So let's import date_add and date_sub, and we can take a quick look at the documentation with help(date_add). We can see that it returns the date that is the given number of days after the start date.

So if I want three days earlier than the oldest date and three days later than the most recent date, I do rc.select, and three days earlier than the oldest date is going to be date_sub of the minimum, which is what we calculated earlier, by three days. And three days later than the most recent date is date_add of the maximum of the Date column, again by three days. Before we get the output, we're expecting it to be three days earlier than the 1st of January, 2001, so counting three days back, that's the 29th of December, 2000. And three days later than the most recent date, the 11th of November, 2018, means we're expecting to get back the 14th of November, 2018.

So let's run this query. Now, I need to actually specify the columns that I'm looking at, so I need col for the Date column here, and I need col for this date function as well. And there we have it. So in the next video, we will delve into working with dates in a little bit more detail.
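To recap, here is a minimal sketch of the two cells from this section, assuming the DataFrame is named rc and the date column is Date, as in the notebook; the .show(1) calls are just one way to display the single-row results:

    from pyspark.sql.functions import col, min, max, date_add, date_sub
    # note: importing min and max this way shadows Python's built-in min/max in the cell

    # oldest and most recent dates in the Date column
    rc.select(min(col('Date')), max(col('Date'))).show(1)

    # 3 days earlier than the oldest date and 3 days later than the most recent date,
    # expected to be 2000-12-29 and 2018-11-14 respectively
    rc.select(
        date_sub(min(col('Date')), 3),
        date_add(max(col('Date')), 3)
    ).show(1)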
