How to Change Column Type in PySpark Dataframe


Python is a versatile and powerful programming language that has gained immense popularity in the field of data analysis and processing. With its extensive range of libraries and frameworks, Python gives developers robust tools for handling complex data operations efficiently. PySpark, the Python API for Apache Spark, extends these capabilities further by adding distributed computing for big data processing. One of the fundamental components of PySpark is the DataFrame, a tabular data structure that allows seamless manipulation and analysis of large datasets.

In this tutorial, we will explore an essential aspect of working with PySpark DataFrames: changing column types. Understanding and modifying column types is crucial for data transformation, validation, and analysis. By altering the data types of specific columns, we can ensure data consistency, perform calculations accurately, and optimize memory usage. In the following sections, we will delve into the methods PySpark offers for changing column types and discuss their advantages and limitations.

Method 1: Change Column Type in PySpark Dataframe Using the cast() function

In this section, we will explore the first method to change column types in PySpark DataFrame: using the cast() function. The cast() function allows us to convert a column from one data type to another, facilitating data transformation and manipulation.

The cast() function in PySpark is used to explicitly change the data type of a column. It is called on a column, takes the desired data type as an argument, and returns a new column expression; combined with withColumn(), this produces a new DataFrame with the modified column type. The cast() function is especially useful when we want to convert a column to a specific type before performing operations, or when the column type needs to be aligned with downstream processing requirements.

Here's the syntax for using the cast() function:

df.withColumn("new_column_name", df["column_name"].cast("desired_data_type"))

Let's consider an example where we have a DataFrame with a column named "age" of type string, and we want to convert it to an integer type using the cast() function.

Example

from pyspark.sql import SparkSession

# Creating a SparkSession, the entry point for DataFrame operations
spark = SparkSession.builder.getOrCreate()

# Creating a data frame with a string column
data = [("Prince", "25"), ("Mukul", "30"), ("Rohit", "35")]
df = spark.createDataFrame(data, ["name", "age"])
df.show()

# Converting the "age" column from string to integer
df = df.withColumn("age", df["age"].cast("integer"))
df.printSchema()

Output

+------+---+
|  name|age|
+------+---+
|Prince| 25|
| Mukul| 30|
| Rohit| 35|
+------+---+

root
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)

In the above example, we used the cast() function to change the "age" column's data type from string to integer. The resulting DataFrame has the modified column type, as shown in the printed schema.
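
Note that with Spark's default settings, cast() does not raise an error when a value cannot be converted; the result for that row is simply null. A small sketch illustrating this behaviour, reusing the SparkSession created above:

# Values that cannot be parsed as integers become null after the cast
bad_data = [("Prince", "25"), ("Mukul", "not_a_number")]
bad_df = spark.createDataFrame(bad_data, ["name", "age"])
bad_df.withColumn("age", bad_df["age"].cast("integer")).show()
# Mukul's age is expected to appear as null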

Method 2: Change Column Type in PySpark Dataframe Using the withColumn() function

In this section, we will explore another method to change column types in PySpark DataFrame: using the withColumn() function. The withColumn() function allows us to create a new column with the desired data type while retaining the existing columns in the DataFrame.

The withColumn() function takes two arguments: the column name and an expression that defines the column's values. By including a cast to the desired data type in that expression, we can effectively change the column type.

Here's the syntax for using the withColumn() function to change column types:

df.withColumn("new_column_name", expression)

Let's consider an example where we have a DataFrame with a column named "price" of type string, and we want to convert it to a float type using the withColumn() function.

Example

# Creating a data frame with a string column
data = [("Apple", "2.99"), ("Orange", "1.99"), ("Banana", "0.99")]
df = spark.createDataFrame(data, ["product", "price"])
df.show()

# Converting the "price" column from string to float
df = df.withColumn("price", df["price"].cast("float"))
df.printSchema()

Output

+-------+-----+
|product|price|
+-------+-----+
|  Apple| 2.99|
| Orange| 1.99|
| Banana| 0.99|
+-------+-----+

root
 |-- product: string (nullable = true)
 |-- price: float (nullable = true)

In the above example, we used the withColumn() function together with cast() to replace the existing "price" column with one of the desired data type. The resulting DataFrame has the updated column type, as shown in the printed schema.
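
When several columns need new types, the same pattern can be applied in a loop. A minimal sketch, assuming a hypothetical mapping of column names to target types:

# Hypothetical mapping of column names to the types we want them cast to
type_map = {"price": "float", "quantity": "integer"}

for column_name, new_type in type_map.items():
    # Only cast columns that actually exist in the DataFrame
    if column_name in df.columns:
        df = df.withColumn(column_name, df[column_name].cast(new_type))

df.printSchema()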

Method 3: Change Column Type in PySpark Dataframe Using SQL expressions

In this section, we will explore the last and most powerful method to change column types in PySpark DataFrame: using SQL expressions. SQL expressions in PySpark allow us to leverage the expressive power of SQL queries to perform various operations, including type conversions.

SQL expressions in PySpark provide a convenient and familiar way to manipulate data within DataFrames. These expressions resemble standard SQL syntax and enable us to perform complex computations, aggregations, and transformations on our data.

To change column types using SQL expressions, we can utilize the `select()` function along with the `expr()` function to define the desired data type. The `expr()` function allows us to write SQL-like expressions within PySpark, making it straightforward to manipulate column values and change their types.

Here's an example that demonstrates how to change the column type using SQL expressions:

Example

from pyspark.sql.functions import expr

# Creating a data frame with a string column
data = [("Prince", "25"), ("Mukul", "30"), ("Rohit", "35")]
df = spark.createDataFrame(data, ["name", "age"])
df.show()

# Converting the "age" column from string to integer using SQL expressions
df = df.select("name", expr("CAST(age AS INT) AS age"))
df.printSchema()

Output

+------+---+
|  name|age|
+------+---+
|Prince| 25|
| Mukul| 30|
| Rohit| 35|
+------+---+

root
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)

In the above example, we used the `select()` function along with the `expr()` function to change the column type. We applied the SQL expression `CAST(age AS INT)` within the `expr()` function to convert the "age" column from string to integer. The resulting DataFrame has the modified column type, as shown in the printed schema.
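
The same conversion can be written more concisely with the DataFrame's selectExpr() method, which accepts SQL expressions directly as strings. A minimal equivalent sketch:

# Equivalent conversion using selectExpr()
df = df.selectExpr("name", "CAST(age AS INT) AS age")
df.printSchema()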

SQL expressions are particularly useful when you need to perform advanced data manipulation or combine multiple operations in a single statement. They allow for fine-grained control over column transformations and work efficiently at scale, since Spark optimizes them with the same engine used for the DataFrame API.
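
If you prefer writing full SQL statements, you can also register the DataFrame as a temporary view and perform the cast through spark.sql(). A minimal sketch, using an arbitrary view name:

# Casting via a SQL query against a temporary view
df.createOrReplaceTempView("people")
df = spark.sql("SELECT name, CAST(age AS INT) AS age FROM people")
df.printSchema()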

Conclusion

In this tutorial, we explored the various methods available in PySpark for changing column types in a data frame. We provided examples for each method to make it easier for you to understand and apply them in your own projects. First, we discussed the `cast()` function, which allows us to explicitly convert a column from one data type to another. Next, we explored the `withColumn()` function, which enables us to create a new column with the desired data type while retaining the existing columns in the DataFrame. Lastly, we introduced SQL expressions in PySpark, which provide a powerful way to manipulate data within DataFrames. We showcased how to leverage SQL expressions to change column types by using the `select()` function along with the `expr()` function. By understanding and utilizing these methods, you can ensure data consistency, perform accurate calculations, and optimize memory usage in your PySpark projects.
