How To Convert Sklearn Dataset To Pandas Dataframe in Python?


Scikit-learn (sklearn) is one of the most popular machine learning libraries for Python. It provides a range of efficient tools for machine learning and statistical modelling, and it ships with several sample datasets. The loader functions return these datasets as Bunch objects, dictionary-like containers that hold the data as NumPy arrays, which can be awkward to work with for tasks such as exploratory data analysis.

Pandas is a popular data manipulation library. Its data structures store and manipulate large datasets efficiently, and it offers a wide range of tools for data cleaning, transformation, and analysis.

Below are two approaches for converting a sklearn dataset to a pandas dataframe.

  • Converting sklearn Bunch object to pandas DataFrame: In this approach, we will pass the data held in the sklearn Bunch object directly to the pd.DataFrame() constructor.

  • Using load_iris() method to load iris dataset into pandas DataFrame: In this approach, we will load the iris dataset using the load_iris() method provided by sklearn, convert its rows into a list of tuples, and build the dataframe with pd.DataFrame.from_records().

Now that we are aware of both approaches, let's apply them with the help of examples.

Using sklearn Bunch object

Consider the code shown below.

Example

from sklearn.datasets import load_iris
import pandas as pd

# Load the iris dataset from sklearn
iris = load_iris()

# Convert the iris dataset to a pandas dataframe
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Add the target variable to the dataframe
df['target'] = iris.target

# Print the first 5 rows of the dataframe
print(df.head())

Explanation

  • First, we import the load_iris function from the sklearn.datasets module and the pandas library.

  • Then, we load the iris dataset into the iris variable using the load_iris() function.

  • We create a pandas dataframe df using the iris data and feature names. Here, we pass iris.data as the data and iris.feature_names as the columns parameter in the pd.DataFrame() method.

  • Next, we add the target variable to the pandas dataframe using iris.target and assign it to a new column target in the dataframe df.

  • Finally, we print the first 5 rows of the pandas dataframe df using the head() method.
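The steps above can be wrapped in a small reusable function. The sketch below is only an illustration: bunch_to_dataframe and its target_column parameter are hypothetical names, not part of sklearn or pandas, and the function assumes the Bunch exposes data, feature_names, and target the way the one returned by load_iris() does.

```python
from sklearn.datasets import load_iris
import pandas as pd


def bunch_to_dataframe(bunch, target_column="target"):
    """Hypothetical helper: build a DataFrame from a sklearn Bunch.

    Assumes the Bunch has data, feature_names, and target attributes,
    as the object returned by load_iris() does.
    """
    # The feature matrix supplies the values; feature_names supplies the column labels
    df = pd.DataFrame(bunch.data, columns=bunch.feature_names)
    # The target values go into one extra column
    df[target_column] = bunch.target
    return df


iris_df = bunch_to_dataframe(load_iris())
print(iris_df.shape)  # (150, 5): 150 samples, 4 features plus the target
```

Keeping the conversion in one function avoids repeating the same three lines for every dataset that follows this layout.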

Output

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  target
0                5.1               3.5                1.4               0.2       0
1                4.9               3.0                1.4               0.2       0
2                4.7               3.2                1.3               0.2       0
3                4.6               3.1                1.5               0.2       0
4                5.0               3.6                1.4               0.2       0
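The target column in this output holds integer class codes. As a minimal sketch, the codes can be mapped to readable labels through the target_names attribute of the loaded dataset. The column name species is chosen here purely for illustration.

```python
from sklearn.datasets import load_iris
import pandas as pd

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["target"] = iris.target

# iris.target_names is an array of class names; indexing it with the
# integer codes yields one name per row.
# "species" is an illustrative column name, not something sklearn defines.
df["species"] = iris.target_names[iris.target]

# Show one row per distinct code/name pair
print(df[["target", "species"]].drop_duplicates())
```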

Using load_iris() method

Consider the code shown below.

Example

from sklearn.datasets import load_iris
import pandas as pd

# Load the iris dataset from sklearn
iris = load_iris()

# Convert the dataset to a list of tuples
data = [tuple(row) for row in iris.data]

# Create a Pandas DataFrame from the list of tuples
df = pd.DataFrame.from_records(data, columns=iris.feature_names)

# Add the target variable to the DataFrame
df['target'] = iris.target

# Show the first five rows of the DataFrame
print(df.head())

Explanation

  • Import the required libraries: We first import the load_iris function from the sklearn.datasets module and the pandas library.

  • Load the dataset using the load_iris function from the sklearn.datasets module: We use the load_iris function to load the iris dataset into a variable called iris.

  • Convert the data into a list of tuples: We turn each row of iris.data into a tuple and collect the tuples in a list called data.

  • Create the dataframe with the feature names as column names: We pass the list to the pd.DataFrame.from_records() method and set the column names using the feature_names attribute of the iris dataset.

  • Add the target variable to the dataframe: We add the target variable to the dataframe by creating a new column called "target" and setting its values to the target variable in the iris dataset.

  • Display the first few rows of the dataframe: We use the head() function to display the first five rows of the newly created Pandas dataframe.
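To check how the two approaches in this article relate, the sketch below builds the feature columns both ways and compares the results with pd.testing.assert_frame_equal. The variable names df_direct and df_records are illustrative.

```python
from sklearn.datasets import load_iris
import pandas as pd

iris = load_iris()

# Approach 1: pass the NumPy array straight to the DataFrame constructor
df_direct = pd.DataFrame(iris.data, columns=iris.feature_names)

# Approach 2: go through a list of tuples and from_records()
records = [tuple(row) for row in iris.data]
df_records = pd.DataFrame.from_records(records, columns=iris.feature_names)

# Raises AssertionError if values, column labels, or dtypes differ
pd.testing.assert_frame_equal(df_direct, df_records)
print("Both approaches produce the same DataFrame")
```

For a dataset that is already a NumPy array, the tuple conversion is an extra step. from_records() is more useful when the data arrives as a list of records in the first place.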

Output

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  target
0                5.1               3.5                1.4               0.2       0
1                4.9               3.0                1.4               0.2       0
2                4.7               3.2                1.3               0.2       0
3                4.6               3.1                1.5               0.2       0
4                5.0               3.6                1.4               0.2       0
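As a point of comparison, scikit-learn 0.23 and later can also return pandas objects directly when the loader is called with as_frame=True. The sketch below assumes such a version is installed. The frame attribute then holds the features and the target in a single dataframe.

```python
from sklearn.datasets import load_iris

# as_frame=True asks sklearn to return pandas objects (scikit-learn 0.23+)
iris = load_iris(as_frame=True)

# iris.frame combines the feature columns with the target column
df = iris.frame
print(df.head())
```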

Conclusion

In conclusion, converting a sklearn dataset to a pandas dataframe is a simple process that can be done in multiple ways. Whether you pass the data to the pd.DataFrame() constructor or build the dataframe with pd.DataFrame.from_records(), the resulting pandas dataframe can be easily manipulated and analysed using various data science libraries in Python.

Updated on: 03-Aug-2023
