How To Convert Sklearn Dataset To Pandas Dataframe in Python?
Scikit-learn (sklearn) is one of the most popular machine learning libraries for Python. It provides efficient tools for machine learning and statistical modeling, along with several built-in datasets. The loader functions return these datasets as Bunch objects whose data and target attributes are NumPy arrays, which can be awkward to work with for tasks such as exploratory data analysis.
Pandas is a popular library for data analysis and manipulation. Its data structures store large datasets efficiently, and it offers a wide range of tools for data cleaning, transformation, and analysis.
Below are the two main approaches to convert a sklearn dataset to a pandas DataFrame:
- Using pd.DataFrame() directly: Pass the data array from the sklearn Bunch object straight to the pd.DataFrame() constructor.
- Using pd.DataFrame.from_records(): Load the dataset and convert the data into records before creating a pandas DataFrame.
Method 1: Using pd.DataFrame() Directly
This is the most straightforward approach to convert sklearn datasets to pandas DataFrames:
Example
from sklearn.datasets import load_iris
import pandas as pd

# Load the iris dataset from sklearn
iris = load_iris()

# Convert the iris dataset to a pandas dataframe
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Add the target variable to the dataframe
df['target'] = iris.target

# Print the first 5 rows of the dataframe
print(df.head())
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  target
0                5.1               3.5                1.4               0.2       0
1                4.9               3.0                1.4               0.2       0
2                4.7               3.2                1.3               0.2       0
3                4.6               3.1                1.5               0.2       0
4                5.0               3.6                1.4               0.2       0
How It Works
- Load the iris dataset using the load_iris() function
- Create a pandas DataFrame using iris.data as the data and iris.feature_names as column names
- Add the target variable using iris.target as a new column
- Display the first 5 rows using the head() method
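The target column built in the steps above holds integer class codes. As a small optional extension, the sketch below maps those codes to readable class names using iris.target_names. The column name species is chosen here for illustration and is not part of the original example.

```python
from sklearn.datasets import load_iris
import pandas as pd

# Rebuild the DataFrame exactly as in Method 1
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["target"] = iris.target

# Map each integer code to its class name; "species" is an illustrative column name
df["species"] = pd.Categorical.from_codes(iris.target, categories=iris.target_names)

# Show one row per class to confirm the code-to-name mapping
print(df[["target", "species"]].drop_duplicates())
```

Keeping both columns can be convenient: the integer target suits model training, while the categorical species column tends to read better in plots and summaries.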
Method 2: Using pd.DataFrame.from_records()
This method converts the data to a list of records (tuples) first, then creates the DataFrame:
Example
from sklearn.datasets import load_iris
import pandas as pd

# Load the iris dataset from sklearn
iris = load_iris()

# Convert the dataset to a list of tuples
data = [tuple(row) for row in iris.data]

# Create a Pandas DataFrame from the list of tuples
df = pd.DataFrame.from_records(data, columns=iris.feature_names)

# Add the target variable to the DataFrame
df['target'] = iris.target

# Show the first five rows of the DataFrame
print(df.head())
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  target
0                5.1               3.5                1.4               0.2       0
1                4.9               3.0                1.4               0.2       0
2                4.7               3.2                1.3               0.2       0
3                4.6               3.1                1.5               0.2       0
4                5.0               3.6                1.4               0.2       0
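Since both methods print the same first five rows, it may be worth confirming that the full DataFrames match. The sketch below builds the iris DataFrame both ways and compares them with pd.testing.assert_frame_equal. The variable names df_direct and df_records are chosen here for illustration.

```python
from sklearn.datasets import load_iris
import pandas as pd

iris = load_iris()

# Method 1: pass the NumPy array straight to the constructor
df_direct = pd.DataFrame(iris.data, columns=iris.feature_names)
df_direct["target"] = iris.target

# Method 2: convert each row to a tuple, then build from records
records = [tuple(row) for row in iris.data]
df_records = pd.DataFrame.from_records(records, columns=iris.feature_names)
df_records["target"] = iris.target

# Raises AssertionError if values, dtypes, or column labels differ
pd.testing.assert_frame_equal(df_direct, df_records)
print("Both methods produce identical DataFrames")
```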
Using Different Sklearn Datasets
You can apply the same approach to other sklearn datasets:
from sklearn.datasets import load_wine
import pandas as pd
# Load wine dataset
wine = load_wine()
wine_df = pd.DataFrame(wine.data, columns=wine.feature_names)
wine_df['target'] = wine.target
print("Wine dataset shape:", wine_df.shape)
print("Wine columns:", wine_df.columns.tolist()[:5]) # Show first 5 columns
Wine dataset shape: (178, 14)
Wine columns: ['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium']
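Because the conversion steps are the same for each dataset, they can be wrapped in a small function. The sketch below is one possible way to do that. bunch_to_frame is a hypothetical helper name, not a scikit-learn or pandas function, and it assumes the Bunch exposes data, feature_names, and target attributes, as the classification loaders used in this article do.

```python
from sklearn.datasets import load_breast_cancer
import pandas as pd


def bunch_to_frame(bunch, target_column="target"):
    """Hypothetical helper: build a DataFrame from a Bunch that has
    data, feature_names, and target attributes."""
    frame = pd.DataFrame(bunch.data, columns=bunch.feature_names)
    frame[target_column] = bunch.target
    return frame


# Apply the helper to the breast cancer dataset
cancer_df = bunch_to_frame(load_breast_cancer())
print("Breast cancer dataset shape:", cancer_df.shape)
```

The target_column parameter is an illustrative addition that lets the caller pick a different column name if a feature is already called target.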
Comparison
| Method | Performance | Code Simplicity | Best For |
|---|---|---|---|
| pd.DataFrame() | Faster | Simple | Most use cases |
| pd.DataFrame.from_records() | Slower | More complex | When data needs preprocessing |
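The performance column can be checked on your own machine with a rough timing sketch like the one below. It uses the standard library timeit module. The function names and the repetition count are arbitrary choices made for this illustration, and the measured times will vary with hardware and library versions.

```python
import timeit

from sklearn.datasets import load_iris
import pandas as pd

iris = load_iris()


def build_direct():
    # Method 1: constructor receives the NumPy array as-is
    return pd.DataFrame(iris.data, columns=iris.feature_names)


def build_from_records():
    # Method 2: includes the cost of converting every row to a tuple
    records = [tuple(row) for row in iris.data]
    return pd.DataFrame.from_records(records, columns=iris.feature_names)


runs = 200  # arbitrary repetition count chosen for this sketch
direct_seconds = timeit.timeit(build_direct, number=runs)
records_seconds = timeit.timeit(build_from_records, number=runs)

print(f"pd.DataFrame():              {direct_seconds / runs * 1e6:.1f} microseconds per call")
print(f"pd.DataFrame.from_records(): {records_seconds / runs * 1e6:.1f} microseconds per call")
```

The timing for Method 2 deliberately includes the tuple conversion, since that step is part of the method as presented in this article.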
Conclusion
Converting sklearn datasets to pandas DataFrames is straightforward using pd.DataFrame() directly. This approach preserves feature names and allows easy addition of target variables. Use pd.DataFrame.from_records() only when you need to preprocess the data first.
