How To Convert Sklearn Dataset To Pandas Dataframe in Python?

Scikit-learn (sklearn) is one of the most popular machine learning libraries for Python. It provides efficient tools for machine learning and statistical modeling, along with a set of built-in datasets. The dataset loaders return Bunch objects whose features and targets are NumPy arrays, which can be awkward to work with for tasks such as exploratory data analysis.
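To make the starting point concrete, the short sketch below inspects what load_iris() returns. It assumes scikit-learn is installed; the variable name iris is only illustrative.

```python
from sklearn.datasets import load_iris

# load_iris() returns a Bunch, a dictionary-like container
iris = load_iris()

print(type(iris).__name__)        # type of the container
print(type(iris.data).__name__)   # the features are stored in a NumPy array
print(iris.data.shape)            # (number of samples, number of features)
print(iris.target.shape)          # one label per sample
print(list(iris.feature_names))   # names that can serve as column labels
```

The feature names live apart from the array itself, which is why the conversions that follow pass them in explicitly as column labels.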

Pandas is a popular library for data analysis and manipulation. Its data structures store and manipulate large datasets efficiently, and it offers a wide range of tools for data cleaning, transformation, and analysis.

Below are the two main approaches for converting a sklearn dataset to a pandas DataFrame:

  • Using pd.DataFrame() directly: Pass the feature array and feature names from the sklearn Bunch object to the pd.DataFrame() constructor.

  • Using pd.DataFrame.from_records(): Load the dataset and convert the rows into records before creating a pandas DataFrame.

Method 1: Using pd.DataFrame() Directly

This is the most straightforward way to convert sklearn datasets to pandas DataFrames:

Example

from sklearn.datasets import load_iris
import pandas as pd

# Load the iris dataset from sklearn
iris = load_iris()

# Convert the iris dataset to a pandas dataframe
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Add the target variable to the dataframe
df['target'] = iris.target

# Print the first 5 rows of the dataframe
print(df.head())

Output

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  target
0                5.1               3.5                1.4               0.2       0
1                4.9               3.0                1.4               0.2       0
2                4.7               3.2                1.3               0.2       0
3                4.6               3.1                1.5               0.2       0
4                5.0               3.6                1.4               0.2       0

How It Works

  • Load the iris dataset using the load_iris() function

  • Create a pandas dataframe using iris.data as the data and iris.feature_names as column names

  • Add the target variable using iris.target as a new column

  • Display the first 5 rows using the head() method
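The steps above leave the target as integer codes. As a hedged extension, the sketch below adds a readable label column by indexing iris.target_names with those codes. The column name species is a hypothetical choice, not part of the original example.

```python
from sklearn.datasets import load_iris
import pandas as pd

iris = load_iris()

# Same steps as above: features as columns, then the integer target
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['target'] = iris.target

# Look up each integer code in target_names to get a readable label
# ('species' is a hypothetical column name chosen for illustration)
df['species'] = iris.target_names[iris.target]

# Show one row per class to confirm the code-to-name mapping
print(df[['target', 'species']].drop_duplicates())
```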

Method 2: Using pd.DataFrame.from_records()

This method converts the data to records first, then creates the DataFrame:

Example

from sklearn.datasets import load_iris
import pandas as pd

# Load the iris dataset from sklearn
iris = load_iris()

# Convert the dataset to a list of tuples
data = [tuple(row) for row in iris.data]

# Create a Pandas DataFrame from the list of tuples
df = pd.DataFrame.from_records(data, columns=iris.feature_names)

# Add the target variable to the DataFrame
df['target'] = iris.target

# Show the first five rows of the DataFrame
print(df.head())

Output

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  target
0                5.1               3.5                1.4               0.2       0
1                4.9               3.0                1.4               0.2       0
2                4.7               3.2                1.3               0.2       0
3                4.6               3.1                1.5               0.2       0
4                5.0               3.6                1.4               0.2       0
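Both methods print the same first rows. As a quick sanity check, the sketch below builds the DataFrame both ways and compares shape, column names, and values. The names df_direct and df_records are illustrative only.

```python
from sklearn.datasets import load_iris
import pandas as pd

iris = load_iris()

# Method 1: pass the NumPy array straight to the constructor
df_direct = pd.DataFrame(iris.data, columns=iris.feature_names)

# Method 2: convert each row to a tuple, then build from records
records = [tuple(row) for row in iris.data]
df_records = pd.DataFrame.from_records(records, columns=iris.feature_names)

# Compare the two results piece by piece
same_shape = df_direct.shape == df_records.shape
same_columns = list(df_direct.columns) == list(df_records.columns)
same_values = bool((df_direct.to_numpy() == df_records.to_numpy()).all())

print(same_shape, same_columns, same_values)
```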

Using Different Sklearn Datasets

You can apply the same approach to other sklearn datasets:

from sklearn.datasets import load_wine
import pandas as pd

# Load wine dataset
wine = load_wine()
wine_df = pd.DataFrame(wine.data, columns=wine.feature_names)
wine_df['target'] = wine.target

print("Wine dataset shape:", wine_df.shape)
print("Wine columns:", wine_df.columns.tolist()[:5])  # Show first 5 columns

Output

Wine dataset shape: (178, 14)
Wine columns: ['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium']
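Since the conversion is the same for each dataset, it can be wrapped in a small function. The sketch below is one possible way to do that. The helper bunch_to_dataframe and its target_column parameter are hypothetical names, and it assumes the Bunch exposes data, feature_names, and target, as the classification loaders used here do.

```python
from sklearn.datasets import load_wine, load_breast_cancer
import pandas as pd

def bunch_to_dataframe(bunch, target_column="target"):
    """Hypothetical helper: build a DataFrame from a sklearn Bunch.

    Assumes the Bunch has data, feature_names, and target attributes.
    """
    df = pd.DataFrame(bunch.data, columns=bunch.feature_names)
    df[target_column] = bunch.target
    return df

# Apply the same conversion to two different datasets
wine_df = bunch_to_dataframe(load_wine())
cancer_df = bunch_to_dataframe(load_breast_cancer())

print("Wine dataset shape:", wine_df.shape)
print("Breast cancer dataset shape:", cancer_df.shape)
```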

Comparison

Method                         Performance   Code Simplicity   Best For
pd.DataFrame()                 Faster        Simple            Most use cases
pd.DataFrame.from_records()    Slower        More complex      When data needs preprocessing
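The performance column can be checked on your own machine. The sketch below is a rough measurement using the standard timeit module. The function names and the run count are arbitrary choices for illustration, and timings will vary with hardware and library versions.

```python
import timeit

from sklearn.datasets import load_iris
import pandas as pd

iris = load_iris()

def build_direct():
    # Method 1: construct directly from the NumPy array
    return pd.DataFrame(iris.data, columns=iris.feature_names)

def build_from_records():
    # Method 2: convert rows to tuples first, then build from records
    records = [tuple(row) for row in iris.data]
    return pd.DataFrame.from_records(records, columns=iris.feature_names)

runs = 50  # arbitrary number of repetitions for this sketch
t_direct = timeit.timeit(build_direct, number=runs)
t_records = timeit.timeit(build_from_records, number=runs)

print(f"pd.DataFrame():              {t_direct / runs * 1e3:.3f} ms per call")
print(f"pd.DataFrame.from_records(): {t_records / runs * 1e3:.3f} ms per call")
```

The second function includes the tuple conversion on purpose, because that step is part of the work Method 2 requires.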

Conclusion

Converting sklearn datasets to pandas DataFrames is straightforward with pd.DataFrame(): pass the feature array as the data and the feature names as the column labels, then add the target as a new column. Use pd.DataFrame.from_records() only when you need to preprocess the data into records first.

Updated on: 2026-03-27T11:03:36+05:30
