How to get name of dataframe column in PySpark?


A dataframe column in PySpark is a named collection of data values arranged in tabular fashion. A column represents an individual variable or attribute of the data, such as a person's age, a product's price, or a customer's location.

You can add columns to a PySpark dataframe with the withColumn method, which lets you name the new column and specify the expression that generates its values. Once a column exists, you can use it to carry out a number of operations on the data, including filtering, grouping, and aggregating. Because columns in PySpark dataframes are processed in parallel across multiple nodes, this enables quicker and more effective data analysis.

Algorithms to Get the Name of a Dataframe Column in PySpark

To obtain the name of a dataframe column in PySpark, follow these techniques and steps −

Step 1 − Recall that a dataframe column in PySpark is a named collection of data values arranged in tabular fashion, where each column represents an individual variable or attribute of the data.

Step 2 − The columns attribute in PySpark returns a list of all the column names in the dataframe and can be used to retrieve the name of a dataframe column. Since no additional calculations or transformations are necessary, this method is straightforward and efficient.

Step 3 − Another way to obtain the name of a particular dataframe column is to call the select method with that column as an argument. select returns a new dataframe containing only the selected column, and the column name can then be read as a string from its columns attribute.

Step 4 − A third way to obtain the name of a column in a dataframe is the printSchema method, which displays the dataframe's schema in a tree-like format. It shows the name and data type of every column, making it simple to identify particular columns.

Step 5 − Finally, the describe method returns a new dataframe containing summary statistics for each column, together with the names of all the columns. Using the columns attribute of that result, the column names can be obtained as a list of strings.

Syntax

df.columns

The column names can also be obtained from the dataframe's list of schema fields (StructField objects), each of which exposes the name of one column.

Syntax

df.schema.fields

Approaches

Approach 1

We use the columns attribute to obtain the names of the columns present in the dataframe. This attribute returns a list of every column name in the dataframe.

Example

from pyspark.sql import SparkSession

# Create a SparkSession object
spark = SparkSession.builder.appName("Get Column Names").getOrCreate()

# Create a sample dataframe
data = [("Alice", 25), ("Bob", 30), ("Charlie", 35)]
df = spark.createDataFrame(data, ["Name", "Age"])

# Get the column names
column_names = df.columns

# Print the column names
print(column_names)

Output

['Name', 'Age']

In this example, we first create a sample dataframe called df with two columns: "Name" and "Age". The list of column names is then obtained using the columns attribute and saved in the column_names variable. Finally, we use the print function to output the column names.

Approach 2

In this example, the column names are obtained using the select() function of the dataframe object. Using a list comprehension, we iterate over the column names of the dataframe and wrap each one with the col() function to build a list of Column objects, which we pass to select(). The resulting dataframe contains exactly the selected columns, and its columns attribute yields their names as strings. Finally, we use the print function to output the column names.

Example

from pyspark.sql.functions import col
from pyspark.sql import SparkSession

# Create a SparkSession object
spark = SparkSession.builder.appName("Get Column Names").getOrCreate()

# Create a sample dataframe
data = [("Alice", 25), ("Bob", 30), ("Charlie", 35)]
df = spark.createDataFrame(data, ["Name", "Age"])

# Use the select() function to get column names
column_names = df.select([col(c) for c in df.columns]).columns

# Print the column names
print(column_names)

Output

['Name', 'Age']

Conclusion

The columns attribute can be used to obtain the name of a PySpark DataFrame column. This attribute returns a list of strings representing the column names in the DataFrame.

To use this attribute, create a DataFrame with PySpark's createDataFrame() method, supplying the data and the column names as arguments. The columns attribute then returns the DataFrame's column names as a list of strings.

Updated on: 24-Jul-2023
