How to Change Column Type in PySpark Dataframe
PySpark DataFrames are powerful data structures for big data processing. One common task when working with DataFrames is changing column data types to ensure data consistency, perform accurate calculations, and optimize memory usage.
In this tutorial, we will explore three methods to change column types in PySpark DataFrames: using cast(), withColumn(), and SQL expressions.
Method 1: Using cast() Function
The cast() function is the most straightforward way to convert a column from one data type to another. It takes the desired data type as an argument and returns a new column with the modified type.
Syntax
df.withColumn("column_name", df["column_name"].cast("desired_data_type"))
Example
Let's convert a string column to integer type −
from pyspark.sql import SparkSession
# Initialize Spark session
spark = SparkSession.builder.appName("ColumnTypeChange").getOrCreate()
# Creating a DataFrame with a string column
data = [("Prince", "25"), ("Mukul", "30"), ("Rohit", "35")]
df = spark.createDataFrame(data, ["name", "age"])
print("Original DataFrame:")
df.show()
print("Original Schema:")
df.printSchema()
# Converting the "age" column from string to integer
df_converted = df.withColumn("age", df["age"].cast("integer"))
print("After conversion:")
df_converted.show()
print("Updated Schema:")
df_converted.printSchema()
Original DataFrame:
+------+---+
|  name|age|
+------+---+
|Prince| 25|
| Mukul| 30|
| Rohit| 35|
+------+---+

Original Schema:
root
 |-- name: string (nullable = true)
 |-- age: string (nullable = true)

After conversion:
+------+---+
|  name|age|
+------+---+
|Prince| 25|
| Mukul| 30|
| Rohit| 35|
+------+---+

Updated Schema:
root
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
Method 2: Using withColumn() Function
The withColumn() function creates a new column or replaces an existing one. Combined with cast(), it is the same pattern used in Method 1; this example applies it to a different target type to show that the approach works for any numeric conversion.
Example
Let's convert a string price column to float type −
from pyspark.sql import SparkSession
# Initialize Spark session
spark = SparkSession.builder.appName("ColumnTypeChange").getOrCreate()
# Creating a DataFrame with a string column
data = [("Apple", "2.99"), ("Orange", "1.99"), ("Banana", "0.99")]
df = spark.createDataFrame(data, ["product", "price"])
print("Original DataFrame:")
df.show()
# Converting the "price" column from string to float
df_converted = df.withColumn("price", df["price"].cast("float"))
print("After conversion:")
df_converted.show()
print("Updated Schema:")
df_converted.printSchema()
Original DataFrame:
+-------+-----+
|product|price|
+-------+-----+
|  Apple| 2.99|
| Orange| 1.99|
| Banana| 0.99|
+-------+-----+

After conversion:
+-------+-----+
|product|price|
+-------+-----+
|  Apple| 2.99|
| Orange| 1.99|
| Banana| 0.99|
+-------+-----+

Updated Schema:
root
 |-- product: string (nullable = true)
 |-- price: float (nullable = true)
Method 3: Using SQL Expressions
SQL expressions provide a powerful way to change column types using familiar SQL syntax. The expr() function allows you to write SQL-like expressions within PySpark.
Example
Let's use the SQL CAST expression to convert column types −
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr
# Initialize Spark session
spark = SparkSession.builder.appName("ColumnTypeChange").getOrCreate()
# Creating a DataFrame with a string column
data = [("Prince", "25"), ("Mukul", "30"), ("Rohit", "35")]
df = spark.createDataFrame(data, ["name", "age"])
print("Original DataFrame:")
df.show()
# Converting the "age" column from string to integer using SQL expressions
df_converted = df.select("name", expr("CAST(age AS INT) AS age"))
print("After conversion:")
df_converted.show()
print("Updated Schema:")
df_converted.printSchema()
Original DataFrame:
+------+---+
|  name|age|
+------+---+
|Prince| 25|
| Mukul| 30|
| Rohit| 35|
+------+---+

After conversion:
+------+---+
|  name|age|
+------+---+
|Prince| 25|
| Mukul| 30|
| Rohit| 35|
+------+---+

Updated Schema:
root
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
Common Data Types in PySpark
Here are the commonly used data types for casting:
| Data Type | PySpark Type | SQL Type |
|---|---|---|
| Integer | "integer" or "int" | INT |
| Float | "float" | FLOAT |
| Double | "double" | DOUBLE |
| String | "string" | STRING |
| Boolean | "boolean" | BOOLEAN |
Conclusion
All three methods achieve the same result: changing column data types in PySpark DataFrames. The withColumn() plus cast() pattern is the most common choice for everyday conversions, while SQL expressions are handy when the cast is part of a larger SQL-style transformation or when your team is more fluent in SQL. Choose the method that best fits your coding style and pipeline.
