From the course: Apache PySpark by Example


Schemas

- [Narrator] A schema defines the column names and what data type each of them is. So, for example, are the columns going to have IntegerTypes, StringTypes, DateTypes, and so on? In Spark, we just type df.dtypes or df.printSchema(). It's very similar to what you would have done in pandas with df.dtypes. Spark can infer the schema by default: it takes a look at a couple of rows of the data and tries to determine what kind of column each should be. What I've found is that in a production environment, you want to explicitly define your schemas. Defining schemas in Spark is easy; you just need to remember a couple of things. You need to import the different types from pyspark.sql.types. A schema is a StructType made up of a number of fields of type StructField, and each StructField has three components: the name of the column, the type of that column (is it a string, a float, and so on), and finally, whether that column can contain missing or null values. You can also…
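Here is a minimal sketch of what explicitly defining a schema looks like in PySpark. The file path and column names below are only illustrative, not the course's dataset.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, BooleanType

spark = SparkSession.builder.appName("SchemaExample").getOrCreate()

# Each StructField takes the column name, its data type,
# and whether the column may contain null values.
schema = StructType([
    StructField("ID", IntegerType(), False),
    StructField("Description", StringType(), True),
    StructField("Arrest", BooleanType(), True),
])

# Supplying the schema skips inference, so Spark does not scan sample rows.
df = spark.read.csv("data.csv", header=True, schema=schema)

df.printSchema()
print(df.dtypes)
```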
