How to Change Column Type in a PySpark DataFrame

PySpark DataFrames are powerful data structures for big data processing. One common task when working with DataFrames is changing column data types to ensure data consistency, perform accurate calculations, and optimize memory usage.

In this tutorial, we will explore three methods to change column types in PySpark DataFrames: using cast(), withColumn(), and SQL expressions.

Method 1: Using cast() Function

The cast() function is the most straightforward way to convert a column from one data type to another. It takes the desired data type as an argument and returns a new column with the modified type.

Syntax

df.withColumn("column_name", df["column_name"].cast("desired_data_type"))

Example

Let's convert a string column to integer type:

from pyspark.sql import SparkSession

# Initialize Spark session
spark = SparkSession.builder.appName("ColumnTypeChange").getOrCreate()

# Creating a DataFrame with a string column
data = [("Prince", "25"), ("Mukul", "30"), ("Rohit", "35")]
df = spark.createDataFrame(data, ["name", "age"])

print("Original DataFrame:")
df.show()
print("Original Schema:")
df.printSchema()

# Converting the "age" column from string to integer
df_converted = df.withColumn("age", df["age"].cast("integer"))

print("After conversion:")
df_converted.show()
print("Updated Schema:")
df_converted.printSchema()
Output

Original DataFrame:
+------+---+
|  name|age|
+------+---+
|Prince| 25|
| Mukul| 30|
| Rohit| 35|
+------+---+

Original Schema:
root
 |-- name: string (nullable = true)
 |-- age: string (nullable = true)

After conversion:
+------+---+
|  name|age|
+------+---+
|Prince| 25|
| Mukul| 30|
| Rohit| 35|
+------+---+

Updated Schema:
root
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)

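A caveat worth knowing: cast() does not raise an error when a value cannot be parsed; it silently produces null instead, so a malformed value like "twenty-five" becomes null after conversion. The helper below is a plain-Python illustration of that behavior (safe_cast is a hypothetical name for this sketch, not part of the PySpark API):

```python
# Plain-Python sketch of PySpark's cast() semantics: unparseable
# values become None (null) instead of raising an exception.
def safe_cast(value, target=int):
    """Convert value to target type, returning None on failure."""
    try:
        return target(value)
    except (TypeError, ValueError):
        return None

print(safe_cast("25"))            # 25
print(safe_cast("twenty-five"))   # None, like cast("integer") on bad input
print(safe_cast("2.99", float))   # 2.99
```

In a real DataFrame, you can detect such failures after casting by filtering for rows where the converted column is null but the original column was not.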
Method 2: Using withColumn() Function

The withColumn() function creates a new column or replaces an existing one. Combined with cast(), it gives the same result as Method 1; in fact, Method 1 already relies on withColumn() to attach the converted column. This example applies the same pattern to a float conversion.

Example

Let's convert a string price column to float type:

from pyspark.sql import SparkSession

# Initialize Spark session
spark = SparkSession.builder.appName("ColumnTypeChange").getOrCreate()

# Creating a DataFrame with a string column
data = [("Apple", "2.99"), ("Orange", "1.99"), ("Banana", "0.99")]
df = spark.createDataFrame(data, ["product", "price"])

print("Original DataFrame:")
df.show()

# Converting the "price" column from string to float
df_converted = df.withColumn("price", df["price"].cast("float"))

print("After conversion:")
df_converted.show()
print("Updated Schema:")
df_converted.printSchema()
Output

Original DataFrame:
+-------+-----+
|product|price|
+-------+-----+
|  Apple| 2.99|
| Orange| 1.99|
| Banana| 0.99|
+-------+-----+

After conversion:
+-------+-----+
|product|price|
+-------+-----+
|  Apple| 2.99|
| Orange| 1.99|
| Banana| 0.99|
+-------+-----+

Updated Schema:
root
 |-- product: string (nullable = true)
 |-- price: float (nullable = true)

Method 3: Using SQL Expressions

SQL expressions provide a powerful way to change column types using familiar SQL syntax. The expr() function lets you write SQL-like expressions within PySpark; the selectExpr() method is a shorthand that accepts the same expression strings directly.

Example

Let's use the SQL CAST expression to convert column types:

from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

# Initialize Spark session
spark = SparkSession.builder.appName("ColumnTypeChange").getOrCreate()

# Creating a DataFrame with a string column
data = [("Prince", "25"), ("Mukul", "30"), ("Rohit", "35")]
df = spark.createDataFrame(data, ["name", "age"])

print("Original DataFrame:")
df.show()

# Converting the "age" column from string to integer using SQL expressions
df_converted = df.select("name", expr("CAST(age AS INT) AS age"))

print("After conversion:")
df_converted.show()
print("Updated Schema:")
df_converted.printSchema()
Output

Original DataFrame:
+------+---+
|  name|age|
+------+---+
|Prince| 25|
| Mukul| 30|
| Rohit| 35|
+------+---+

After conversion:
+------+---+
|  name|age|
+------+---+
|Prince| 25|
| Mukul| 30|
| Rohit| 35|
+------+---+

Updated Schema:
root
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)

Common Data Types in PySpark

Here are the commonly used data types for casting:

Data Type   PySpark Type          SQL Type
Integer     "integer" or "int"    INT
Float       "float"               FLOAT
Double      "double"              DOUBLE
String      "string"              STRING
Boolean     "boolean"             BOOLEAN

Conclusion

All three methods achieve the same result: changing column data types in PySpark DataFrames. Use cast() for simple type conversions, withColumn() for explicit column operations, and SQL expressions for complex transformations. Choose the method that best fits your coding style and requirements.

Updated on: 2026-03-27T09:06:21+05:30
