How to verify a PySpark DataFrame column type?

PySpark, the Python API for Apache Spark, provides a powerful framework for big data processing. When working with PySpark DataFrames, verifying column data types is essential for data integrity and accurate operations. This article explores various methods to verify PySpark DataFrame column types with practical examples.

Overview of PySpark DataFrame Column Types

A PySpark DataFrame represents distributed data organized into named columns. Each column has a specific data type like IntegerType, StringType, BooleanType, etc. Understanding column types enables proper data operations and transformations.

Using the printSchema() Method

The printSchema() method displays the DataFrame's schema structure, showing each column's name, data type, and whether it is nullable. It's the simplest way to verify column types.

Syntax

df.printSchema()

Example

Here we create a DataFrame with an explicit schema definition and display its structure:

from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, StringType, DoubleType, StructType, StructField

# Create SparkSession
spark = SparkSession.builder.appName("ColumnTypeVerification").getOrCreate()

# Sample data
data = [
    (1, "John", 3.14),
    (2, "Jane", 2.71),
    (3, "Alice", 1.23)
]

# Define schema explicitly
schema = StructType([
    StructField("col1", IntegerType(), True),
    StructField("col2", StringType(), True),
    StructField("col3", DoubleType(), True)
])

# Create DataFrame with schema
df = spark.createDataFrame(data, schema)

# Print schema
df.printSchema()

Output

root
 |-- col1: integer (nullable = true)
 |-- col2: string (nullable = true)
 |-- col3: double (nullable = true)

Inspecting Column Types with dtypes

The dtypes attribute returns a list of tuples containing column names and their data types. This provides programmatic access to column type information.

Syntax

column_types = df.dtypes
for column_name, data_type in column_types:
    print(f"Column '{column_name}' has data type: {data_type}")

Example

This example demonstrates retrieving and displaying column types programmatically:

from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder.appName("ColumnTypes").getOrCreate()

# Sample data
data = [
    (1, "John", 3.14),
    (2, "Jane", 2.71),
    (3, "Alice", 1.23)
]

# Create DataFrame (PySpark infers types)
df = spark.createDataFrame(data, ["col1", "col2", "col3"])

# Get column types
column_types = df.dtypes

# Display column types
for column_name, data_type in column_types:
    print(f"Column '{column_name}' has data type: {data_type}")

Output

Column 'col1' has data type: bigint
Column 'col2' has data type: string
Column 'col3' has data type: double
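Because dtypes is a plain list of (name, type) tuples, it is easy to build a small validation helper on top of it. The function below is a hypothetical sketch (validate_dtypes is not part of the PySpark API) that compares a DataFrame's dtypes against an expected mapping:

```python
def validate_dtypes(actual_dtypes, expected):
    """Compare a [(column, type_string), ...] list against an expected mapping.

    Returns a list of mismatch messages; an empty list means all checks passed.
    """
    actual = dict(actual_dtypes)  # column name -> type string
    mismatches = []
    for column, expected_type in expected.items():
        found = actual.get(column)
        if found is None:
            mismatches.append(f"missing column: {column}")
        elif found != expected_type:
            mismatches.append(f"{column}: expected {expected_type}, got {found}")
    return mismatches

# Usage with the DataFrame above:
# errors = validate_dtypes(df.dtypes, {"col1": "bigint", "col2": "string", "col3": "double"})
# if errors:
#     raise ValueError("; ".join(errors))
```

Keeping the comparison logic separate from Spark makes it trivial to unit-test without a running cluster.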

Verifying Column Types with selectExpr()

The selectExpr() method, combined with the typeof() SQL function (available in Spark 3.0 and later), allows direct inspection of column data types within SQL expressions.

Syntax

column_names = ["col1", "col2", "col3"]
exprs = [f"typeof({col}) as {col}_type" for col in column_names]
df.selectExpr(*exprs).show()

Example

This example uses the typeof() function to display each column's data type as a DataFrame column:

from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder.appName("TypeofExample").getOrCreate()

# Sample data
data = [
    (1, "John", 3.14),
    (2, "Jane", 2.71),
    (3, "Alice", 1.23)
]

# Create DataFrame
df = spark.createDataFrame(data, ["col1", "col2", "col3"])

# Verify column types using selectExpr()
column_names = ["col1", "col2", "col3"]
exprs = [f"typeof({col}) as {col}_type" for col in column_names]
df.selectExpr(*exprs).show()

Output

+---------+---------+---------+
|col1_type|col2_type|col3_type|
+---------+---------+---------+
|   bigint|   string|   double|
|   bigint|   string|   double|
|   bigint|   string|   double|
+---------+---------+---------+

Checking Specific Column Type

You can check whether a specific column matches an expected data type by accessing the schema directly:

from pyspark.sql import SparkSession
from pyspark.sql.types import StringType

# Create SparkSession
spark = SparkSession.builder.appName("SpecificTypeCheck").getOrCreate()

# Sample data
data = [
    (1, "John", 3.14),
    (2, "Jane", 2.71),
    (3, "Alice", 1.23)
]

# Create DataFrame
df = spark.createDataFrame(data, ["col1", "col2", "col3"])

# Check if col2 is StringType
col2_type = df.schema["col2"].dataType
print(f"col2 data type: {col2_type}")
print(f"Is col2 StringType? {isinstance(col2_type, StringType)}")

Output

col2 data type: StringType()
Is col2 StringType? True

Comparison of Methods

Method           Output Format        Best For
printSchema()    Tree structure       Quick visual inspection
dtypes           List of tuples       Programmatic access
selectExpr()     DataFrame output     SQL-style verification
schema           StructType object    Type validation logic

Conclusion

Verifying PySpark DataFrame column types is crucial for data integrity and proper operations. Use printSchema() for quick visual inspection, dtypes for programmatic access, selectExpr() with typeof() for SQL-style verification, and the schema attribute when you need type validation logic.

Updated on: 2026-03-27T15:19:30+05:30
