Article Categories

Selected Reading

Creating a Dataframe using Excel files

Python Server Side Programming Programming

A DataFrame is a two-dimensional tabular data structure in Python used for data analysis. One of the most common ways to create a DataFrame is by importing data from Excel files using the pandas library.

Why DataFrames are Important for Data Analysis

DataFrames are essential for data analysis because they provide ?

Easy data manipulation Two-dimensional table-like structure for organizing and manipulating complex data with missing values or different data types.
Efficient processing Support vectorized operations that perform computations on entire arrays rather than iterating row by row.
Library integration Seamless integration with pandas, NumPy, Matplotlib, and Scikit-learn for data cleaning, visualization, and machine learning.
Data exploration Built-in tools for filtering, sorting, grouping, and creating various visualizations.

Prerequisites

Before starting, ensure you have ?

Python 3.x installed
Required libraries: pip install pandas openpyxl
Basic Python knowledge

Reading Excel Files with pandas

The read_excel() function reads data from both .xls and .xlsx formats ?

Basic Excel File Reading

First, let's create a sample Excel file and read it ?

import pandas as pd

# Create sample data
data = {'Name': ['John', 'Smith', 'Alex', 'James', 'Peter'],
        'Age': [25, 30, 27, 22, 32],
        'City': ['New York', 'London', 'Paris', 'Tokyo', 'Sydney']}

df = pd.DataFrame(data)

# Save to Excel file
df.to_excel('sample.xlsx', sheet_name='Sheet1', index=False)

# Read the Excel file
df_read = pd.read_excel('sample.xlsx', sheet_name='Sheet1')
print(df_read)

    Name  Age      City
0   John   25  New York
1  Smith   30    London
2   Alex   27     Paris
3  James   22     Tokyo
4  Peter   32    Sydney

Reading Specific Columns and Rows

You can specify which columns and rows to read using parameters ?

import pandas as pd

# Read only specific columns (A and B = Name and Age)
df_subset = pd.read_excel('sample.xlsx', sheet_name='Sheet1', usecols='A:B')
print("First two columns:")
print(df_subset)

print("\nSkipping first row as header:")
# Read with different header row
df_no_header = pd.read_excel('sample.xlsx', sheet_name='Sheet1', header=None)
print(df_no_header)

First two columns:
    Name  Age
0   John   25
1  Smith   30
2   Alex   27
3  James   22
4  Peter   32

Skipping first row as header:
       0    1         2
0   Name  Age      City
1   John   25  New York
2  Smith   30    London
3   Alex   27     Paris
4  James   22     Tokyo
5  Peter   32    Sydney

Reading Multiple Sheets

To read multiple sheets from an Excel file ?

import pandas as pd

# Create multi-sheet Excel file
df1 = pd.DataFrame({'Product': ['A', 'B', 'C'], 'Price': [100, 200, 300]})
df2 = pd.DataFrame({'Product': ['D', 'E', 'F'], 'Price': [400, 500, 600]})

with pd.ExcelWriter('multi_sheet.xlsx') as writer:
    df1.to_excel(writer, sheet_name='Quarter1', index=False)
    df2.to_excel(writer, sheet_name='Quarter2', index=False)

# Read multiple sheets
sheets_dict = pd.read_excel('multi_sheet.xlsx', sheet_name=['Quarter1', 'Quarter2'])

print("Quarter1 data:")
print(sheets_dict['Quarter1'])
print("\nQuarter2 data:")
print(sheets_dict['Quarter2'])

Quarter1 data:
  Product  Price
0       A    100
1       B    200
2       C    300

Quarter2 data:
  Product  Price
0       D    400
1       E    500
2       F    600

Data Cleaning Operations

Removing Duplicates

Remove duplicate rows using drop_duplicates() ?

import pandas as pd

# Create DataFrame with duplicates
data_with_dupes = {'Name': ['John', 'Smith', 'Alex', 'John', 'Smith'],
                   'Age': [25, 30, 27, 25, 30]}
df_dupes = pd.DataFrame(data_with_dupes)

print("Original data:")
print(df_dupes)

# Remove duplicates
df_clean = df_dupes.drop_duplicates()
print("\nAfter removing duplicates:")
print(df_clean)

Original data:
    Name  Age
0   John   25
1  Smith   30
2   Alex   27
3   John   25
4  Smith   30

After removing duplicates:
    Name  Age
0   John   25
1  Smith   30
2   Alex   27

Handling Missing Values

Detect and handle missing values using pandas functions ?

import pandas as pd
import numpy as np

# Create DataFrame with missing values
data_missing = {'Name': ['John', 'Smith', None, 'James'],
                'Age': [25, None, 27, 22],
                'Salary': [50000, 60000, 55000, None]}
df_missing = pd.DataFrame(data_missing)

print("Data with missing values:")
print(df_missing)
print("\nMissing values count:")
print(df_missing.isnull().sum())

# Fill missing values
df_filled = df_missing.fillna({'Name': 'Unknown', 'Age': df_missing['Age'].mean(), 'Salary': 0})
print("\nAfter filling missing values:")
print(df_filled)

Data with missing values:
    Name   Age   Salary
0   John  25.0  50000.0
1  Smith   NaN  60000.0
2   None  27.0  55000.0
3  James  22.0      NaN

Missing values count:
Name      1
Age       1
Salary    1
dtype: int64

After filling missing values:
      Name   Age   Salary
0     John  25.0  50000.0
1    Smith  24.7  60000.0
2  Unknown  27.0  55000.0
3    James  22.0      0.0

Common Parameters for read_excel()

Parameter	Description	Example
`sheet_name`	Specify sheet to read	`'Sheet1'` or `0`
`header`	Row to use as column names	`0` (first row)
`usecols`	Columns to read	`'A:C'` or `[0,1,2]`
`skiprows`	Rows to skip	`2` (skip first 2 rows)

Conclusion

Creating DataFrames from Excel files is straightforward using pandas' read_excel() function. This approach enables efficient data import, cleaning, and analysis for real-world data science projects.

Tamoghna Das

Updated on: 2026-03-27T05:58:24+05:30

5K+ Views

Kickstart Your Career

Get certified by completing the course

Get Started

Previous Next