How to Verify PySpark DataFrame Column Type?
PySpark, the Python API for Apache Spark, provides a powerful framework for big data processing. When working with PySpark DataFrames, verifying column data types is essential for data integrity and accurate operations. This article explores various methods to verify PySpark DataFrame column types with practical examples.
Overview of PySpark DataFrame Column Types
A PySpark DataFrame represents distributed data organized into named columns. Each column has a specific data type like IntegerType, StringType, BooleanType, etc. Understanding column types enables proper data operations and transformations.
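For instance, every type in pyspark.sql.types is a class whose instances report the SQL name that Spark displays in schemas. Here is a minimal sketch of that mapping, assuming only a standard PySpark installation:
from pyspark.sql.types import IntegerType, StringType, BooleanType, DoubleType
# Each DataType instance exposes simpleString(), the SQL name used in schemas
for t in [IntegerType(), StringType(), BooleanType(), DoubleType()]:
    print(type(t).__name__, "->", t.simpleString())
IntegerType -> int
StringType -> string
BooleanType -> boolean
DoubleType -> double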
Using the printSchema() Method
The printSchema() method displays the DataFrame's schema structure, showing column names, data types, and null constraints. It's the simplest way to verify column types.
Syntax
df.printSchema()
Example
Here we create a DataFrame with an explicit schema definition and display its structure:
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, StringType, DoubleType, StructType, StructField

# Create a SparkSession
spark = SparkSession.builder.appName("ColumnTypeVerification").getOrCreate()

# Sample data
data = [
    (1, "John", 3.14),
    (2, "Jane", 2.71),
    (3, "Alice", 1.23)
]

# Define the schema explicitly
schema = StructType([
    StructField("col1", IntegerType(), True),
    StructField("col2", StringType(), True),
    StructField("col3", DoubleType(), True)
])

# Create the DataFrame with the explicit schema
df = spark.createDataFrame(data, schema)

# Print the schema
df.printSchema()
root
 |-- col1: integer (nullable = true)
 |-- col2: string (nullable = true)
 |-- col3: double (nullable = true)
Inspecting Column Types with dtypes
The dtypes attribute returns a list of tuples containing column names and their data types. This provides programmatic access to column type information.
Syntax
column_types = df.dtypes
for column_name, data_type in column_types:
    print(f"Column '{column_name}' has data type: {data_type}")
Example
This example demonstrates retrieving and displaying column types programmatically:
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("ColumnTypes").getOrCreate()

# Sample data
data = [
    (1, "John", 3.14),
    (2, "Jane", 2.71),
    (3, "Alice", 1.23)
]

# Create a DataFrame and let PySpark infer the types
df = spark.createDataFrame(data, ["col1", "col2", "col3"])

# Get the list of (column name, type name) tuples
column_types = df.dtypes

# Display column types
for column_name, data_type in column_types:
    print(f"Column '{column_name}' has data type: {data_type}")
Column 'col1' has data type: bigint
Column 'col2' has data type: string
Column 'col3' has data type: double
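Because dtypes is a plain Python list of tuples, it can also be turned into a dictionary for direct lookups by column name. A small convenience sketch, reusing the df from the example above:
# Convert the (name, type) pairs into a dict for direct lookups
type_map = dict(df.dtypes)
print(type_map["col2"])              # string
print(type_map["col1"] == "bigint")  # True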
Verifying Column Types with selectExpr()
The selectExpr() method, combined with the typeof() SQL function (available in Spark 3.0 and later), allows direct inspection of column data types within SQL expressions.
Syntax
column_names = ["col1", "col2", "col3"]
exprs = [f"typeof({col}) as {col}_type" for col in column_names]
df.selectExpr(*exprs).show()
Example
Using the typeof() function to display data types as DataFrame columns:
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("TypeofExample").getOrCreate()

# Sample data
data = [
    (1, "John", 3.14),
    (2, "Jane", 2.71),
    (3, "Alice", 1.23)
]

# Create a DataFrame with inferred types
df = spark.createDataFrame(data, ["col1", "col2", "col3"])

# Build one typeof() expression per column and verify the types
column_names = ["col1", "col2", "col3"]
exprs = [f"typeof({col}) as {col}_type" for col in column_names]
df.selectExpr(*exprs).show()
+---------+---------+---------+
|col1_type|col2_type|col3_type|
+---------+---------+---------+
|   bigint|   string|   double|
|   bigint|   string|   double|
|   bigint|   string|   double|
+---------+---------+---------+
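Note that typeof() is evaluated for every row, so the type string repeats once per row. If a single row is enough, you can limit the scan first:
df.limit(1).selectExpr(*exprs).show()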
Checking Specific Column Type
You can check if a specific column matches an expected data type by accessing the schema directly:
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType

# Create a SparkSession
spark = SparkSession.builder.appName("SpecificTypeCheck").getOrCreate()

# Sample data
data = [
    (1, "John", 3.14),
    (2, "Jane", 2.71),
    (3, "Alice", 1.23)
]

# Create a DataFrame with inferred types
df = spark.createDataFrame(data, ["col1", "col2", "col3"])

# Look up col2's DataType in the schema and check it against StringType
col2_type = df.schema["col2"].dataType
print(f"col2 data type: {col2_type}")
print(f"Is col2 StringType? {isinstance(col2_type, StringType)}")
col2 data type: StringType()
Is col2 StringType? True
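Extending this idea, you can validate every column against an expected mapping. Here is a minimal sketch that reuses the df above; the expected dictionary is hypothetical and should match your own schema (col1 is LongType because PySpark infers Python ints as bigint):
from pyspark.sql.types import LongType, StringType, DoubleType

# Hypothetical expected types; adjust these to your own schema
expected = {"col1": LongType(), "col2": StringType(), "col3": DoubleType()}

# DataType instances support equality, so a direct comparison works
for field in df.schema.fields:
    if field.dataType != expected[field.name]:
        raise TypeError(f"{field.name}: expected {expected[field.name]}, got {field.dataType}")
print("All column types match the expected schema")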
Comparison of Methods
| Method | Output Format | Best For |
|---|---|---|
| printSchema() | Tree structure | Quick visual inspection |
| dtypes | List of tuples | Programmatic access |
| selectExpr() | DataFrame output | SQL-style verification |
| schema | StructType object | Type validation logic |
Conclusion
Verifying PySpark DataFrame column types is crucial for data integrity and correct operations. Use printSchema() for quick visual inspection, dtypes for programmatic access, selectExpr() with typeof() for SQL-style verification, and the schema attribute for type validation logic.
