Creating a Dataframe using CSV files


In this technical document, we will explore the process of creating a dataframe using CSV files in Python. Specifically, we will cover the following subsections −

  • Introduction to dataframes and CSV files

  • Reading CSV files into dataframes

  • Exploring dataframes

  • Manipulating dataframes

  • Writing dataframes to CSV files

Throughout this document, we will use real-world examples and provide code snippets to illustrate each subsection.

What are dataframes and CSV files?

Before diving into the details of creating a dataframe from a CSV file, let's first define what a dataframe is and what a CSV file is.

A dataframe is a two-dimensional, size-mutable, tabular data structure with labeled rows and columns, where each column can hold a different type. In Python, it is provided by the `pandas` library as the `DataFrame` class. It is similar to a spreadsheet or SQL table, and is commonly used to store and manipulate data.

A CSV (comma-separated values) file, on the other hand, is a plain text file that stores data in a tabular format, with each line representing a record and the fields within a record separated by commas. CSV files are a common way to store data because they are easy to read and write, and can be opened by many different tools, including spreadsheet applications such as Excel and programming languages such as Python.
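As a minimal sketch of the dataframe definition above, the snippet below builds a small dataframe by hand with the `pandas` library. The column names and values are illustrative only, borrowed from the movie example used later in this document.

```python
import pandas as pd

# Illustrative data only: three movies with one text column and two integer columns.
movies = pd.DataFrame({
    "Title": ["The Godfather", "The Dark Knight", "12 Angry Men"],
    "Year": [1972, 2008, 1957],
    "Runtime": [175, 152, 96],
})

# Two-dimensional: shape is a (rows, columns) tuple.
print(movies.shape)   # (3, 3)

# Columns can hold different types (text vs. integers).
print(movies.dtypes)

# Size-mutable: a new column can be added after creation.
movies["Genre"] = ["Crime", "Action", "Drama"]
print(movies.shape)   # (3, 4)
```

The same structure is what `pandas` produces when it parses a CSV file; the only difference is where the rows come from.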

Reading CSV files into dataframes

The first step in creating a dataframe from a CSV file is to read the file into Python. This can be done using the `pandas` library, which provides a simple way to read in CSV files as dataframes.

Syntax

import pandas as pd
df = pd.read_csv('filename.csv')

In this example, we first import the `pandas` library and then read in a CSV file named `filename.csv` using the `pd.read_csv` function. The resulting object, `df`, is a dataframe that contains the data from the CSV file.

It's worth noting that the `read_csv` function has many optional parameters that can be used to customize how the CSV file is read. For example, you can specify the delimiter used in the file with `sep` (in case it's not a comma), the text encoding with `encoding`, and whether or not the file contains a header row with `header`.
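The optional parameters can be sketched as follows. To keep the example runnable without a file on disk, it reads from an in-memory string through `io.StringIO`; the semicolon-delimited, header-less contents are a hypothetical stand-in for a real file.

```python
import io
import pandas as pd

# Hypothetical file contents: semicolon-delimited, with no header row.
raw = "The Godfather;1972;Crime;175\n12 Angry Men;1957;Drama;96\n"

df = pd.read_csv(
    io.StringIO(raw),   # a path such as 'filename.csv' is passed the same way
    sep=";",            # the delimiter is a semicolon, not a comma
    header=None,        # the data has no header row
    names=["Title", "Year", "Genre", "Runtime"],  # so column names are supplied
)
print(df)
```

When reading from a real path, the `encoding` parameter (for example `encoding='utf-8'`) can be passed in the same call.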

Exploring dataframes

Once we have read in a CSV file as a dataframe, we can begin to explore and analyze the data. Some common operations include −

  • Viewing the first few rows of the dataframe using the `head` method

  • Checking the shape of the dataframe (number of rows and columns) using the `shape` attribute

  • Viewing summary statistics of the dataframe using the `describe` method

  • Selecting a subset of columns or rows using indexing and slicing

Let's take a look at an example. Suppose we have a CSV file that contains information about movies, including the title, year, genre, and runtime. We can read in the file as a dataframe and then view the first few rows using the `head` method −

Example

df = pd.read_csv('movies.csv')
print(df.head())

This will output the first 5 rows of the dataframe, since `head` returns 5 rows by default −

Output

                      Title  Year      Genre  Runtime
0  The Shawshank Redemption  1994      Drama      142
1             The Godfather  1972      Crime      175
2    The Godfather: Part II  1974      Crime      202
3           The Dark Knight  2008     Action      152
4              12 Angry Men  1957      Drama       96

We can also check the shape of the dataframe, which is returned as a `(rows, columns)` tuple −

print(df.shape)
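As a small runnable sketch, the snippet below uses only the five rows shown in the `head` output above as a stand-in for `movies.csv`, so the shape it reports is for that sample rather than for the full file.

```python
import io
import pandas as pd

# Stand-in for movies.csv: only the five rows shown in the head() output.
csv_text = """Title,Year,Genre,Runtime
The Shawshank Redemption,1994,Drama,142
The Godfather,1972,Crime,175
The Godfather: Part II,1974,Crime,202
The Dark Knight,2008,Action,152
12 Angry Men,1957,Drama,96
"""
df = pd.read_csv(io.StringIO(csv_text))

# shape is an attribute, not a method, so there are no parentheses.
n_rows, n_cols = df.shape
print(df.shape)          # (5, 4)
print(list(df.columns))  # ['Title', 'Year', 'Genre', 'Runtime']
```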

To view summary statistics of the dataframe, we can use the `describe` method, which by default summarizes only the numeric columns −

print(df.describe())

This will output the following −

Output

              Year     Runtime
count   250.000000  250.000000
mean   1984.356000  118.840000
std      24.012321   23.118059
min    1921.000000   69.000000
25%    1964.000000  100.000000
50%    1995.000000  116.000000
75%    2003.000000  131.000000
max    2017.000000  229.000000
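The summary above covers only `Year` and `Runtime` because `describe` skips text columns by default. One way to summarize a text column, sketched here on a five-row stand-in for the movie data, is to call `describe` or `value_counts` on that column directly.

```python
import io
import pandas as pd

# Stand-in for movies.csv: five sample rows, not the full dataset.
csv_text = """Title,Year,Genre,Runtime
The Shawshank Redemption,1994,Drama,142
The Godfather,1972,Crime,175
The Godfather: Part II,1974,Crime,202
The Dark Knight,2008,Action,152
12 Angry Men,1957,Drama,96
"""
df = pd.read_csv(io.StringIO(csv_text))

# For a text column, describe() reports count, unique, top, and freq.
genre_summary = df["Genre"].describe()
print(genre_summary)

# value_counts() gives the number of rows per distinct value.
genre_counts = df["Genre"].value_counts()
print(genre_counts)
```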

Finally, we can select a subset of columns or rows using indexing and slicing. For example, to select only the title and genre columns −

Example

subset = df[['Title', 'Genre']]
print(subset.head())

Output

                      Title      Genre
0  The Shawshank Redemption      Drama
1             The Godfather      Crime
2    The Godfather: Part II      Crime
3           The Dark Knight     Action
4              12 Angry Men      Drama
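Rows can be selected in a similar way. The sketch below, again on a five-row stand-in for the movie data, uses `iloc` for position-based slicing and `loc` for label-based selection.

```python
import io
import pandas as pd

# Stand-in for movies.csv: five sample rows, not the full dataset.
csv_text = """Title,Year,Genre,Runtime
The Shawshank Redemption,1994,Drama,142
The Godfather,1972,Crime,175
The Godfather: Part II,1974,Crime,202
The Dark Knight,2008,Action,152
12 Angry Men,1957,Drama,96
"""
df = pd.read_csv(io.StringIO(csv_text))

# iloc slices by position; the end position is excluded, as in Python lists.
first_two = df.iloc[0:2]

# loc selects by label; here the row labels are the default integers 0..4.
one_cell = df.loc[3, "Title"]
print(one_cell)  # The Dark Knight

# loc slices include BOTH endpoints, so this returns the rows labeled 1 and 2.
block = df.loc[1:2, ["Title", "Year"]]
print(block)
```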

Manipulating dataframes

Beyond simply exploring the data, we may want to manipulate it in various ways, such as sorting, filtering, merging, and pivoting. In this subsection, we will cover a few common dataframe manipulation operations using real-world examples.

Sorting

To sort a dataframe by one or more columns, we can use the `sort_values` method. For example, to sort our movie dataframe by year in descending order −

Example

sorted_df = df.sort_values('Year', ascending=False)
print(sorted_df.head())

This will output the first 5 rows of the dataframe, sorted by year in descending order −

Output

                          Title  Year      Genre  Runtime
15                        Logan  2017     Action      137
127                The Revenant  2015  Adventure      156
117                    Whiplash  2014      Drama      107
111  X-Men: Days of Future Past  2014     Action      132
95               The Lego Movie  2014  Animation      100
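Sorting by more than one column works the same way, with lists passed for both the column names and the sort directions. This is a sketch on a five-row stand-in for the movie data.

```python
import io
import pandas as pd

# Stand-in for movies.csv: five sample rows, not the full dataset.
csv_text = """Title,Year,Genre,Runtime
The Shawshank Redemption,1994,Drama,142
The Godfather,1972,Crime,175
The Godfather: Part II,1974,Crime,202
The Dark Knight,2008,Action,152
12 Angry Men,1957,Drama,96
"""
df = pd.read_csv(io.StringIO(csv_text))

# Sort by genre A-Z, then by year newest-first within each genre.
by_genre_then_year = df.sort_values(["Genre", "Year"], ascending=[True, False])
print(by_genre_then_year)

# sort_values returns a new dataframe and keeps the original row labels;
# reset_index(drop=True) renumbers the rows from 0 if that is preferred.
renumbered = by_genre_then_year.reset_index(drop=True)
```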

Filtering

To filter a dataframe based on one or more conditions, we can use boolean indexing. For example, to select only the action movies in our movie dataframe −

Example

subset = df[df['Genre'] == 'Action']
print(subset.head())

This will output the first 5 action movies in the dataframe −

Output

                         Title  Year   Genre  Runtime
3              The Dark Knight  2008  Action      152
6     The Silence of the Lambs  1991  Action      118
7                    Inception  2010  Action      148
16  Terminator 2: Judgment Day  1991  Action      137
20                Forrest Gump  1994  Action      142
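Several conditions can be combined with `&` (and) and `|` (or). Each condition needs its own parentheses, because these operators bind more tightly than comparisons such as `==` and `>`. The sketch below uses a five-row stand-in for the movie data.

```python
import io
import pandas as pd

# Stand-in for movies.csv: five sample rows, not the full dataset.
csv_text = """Title,Year,Genre,Runtime
The Shawshank Redemption,1994,Drama,142
The Godfather,1972,Crime,175
The Godfather: Part II,1974,Crime,202
The Dark Knight,2008,Action,152
12 Angry Men,1957,Drama,96
"""
df = pd.read_csv(io.StringIO(csv_text))

# AND: dramas that run longer than 100 minutes.
long_dramas = df[(df["Genre"] == "Drama") & (df["Runtime"] > 100)]
print(long_dramas)

# OR: movies released in 2000 or later, or running longer than 200 minutes.
recent_or_long = df[(df["Year"] >= 2000) | (df["Runtime"] > 200)]
print(recent_or_long)
```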

Merging

To combine two or more dataframes into a single dataframe, we can use the `merge` function. For example, suppose we have a second CSV file that contains the ratings for each movie in our original dataframe. We can read in this file as a separate dataframe and then merge it with our original dataframe based on a common column (in this case, the title of the movie) −

Example

ratings_df = pd.read_csv('ratings.csv')
merged_df = pd.merge(df, ratings_df, on='Title')
print(merged_df.head())

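This will output the first 5 rows of the merged dataframe, which contains both the movie information and the ratings information. By default, `merge` performs an inner join, so only titles that appear in both dataframes are kept −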

Output

                      Title  Year   Genre  Runtime  Rating
0  The Shawshank Redemption  1994   Drama      142     9.3
1             The Godfather  1972   Crime      175     9.2
2    The Godfather: Part II  1974   Crime      202     9.0
3           The Dark Knight  2008  Action      152     9.0
4              12 Angry Men  1957   Drama       96     8.9
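When the two files do not list exactly the same titles, the join type matters. The sketch below uses two small hand-built dataframes; the unmatched title "Some Other Film" and its rating are hypothetical values added only to show the difference between an inner join (the default) and a left join.

```python
import pandas as pd

# Illustrative data: one movie has no rating, and one rating has no movie.
movies = pd.DataFrame({
    "Title": ["The Godfather", "The Dark Knight", "12 Angry Men"],
    "Year": [1972, 2008, 1957],
})
ratings = pd.DataFrame({
    "Title": ["The Godfather", "The Dark Knight", "Some Other Film"],  # hypothetical
    "Rating": [9.2, 9.0, 7.5],
})

# Default (how='inner'): only titles present in both dataframes survive.
inner = pd.merge(movies, ratings, on="Title")
print(inner)

# how='left' keeps every movie; a missing rating becomes NaN.
# indicator=True adds a _merge column showing where each row came from.
left = pd.merge(movies, ratings, on="Title", how="left", indicator=True)
print(left)
```

A left join is a reasonable choice when the movie list is the reference and ratings are optional extras.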

Pivoting

To pivot a dataframe, we can use the `pivot_table` function. For example, suppose we want to see the average runtime of movies by genre. We can pivot our original movie dataframe −

Example

pivot_df = pd.pivot_table(df, values='Runtime', columns='Genre', aggfunc='mean')
print(pivot_df)

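This will output a single-row table that shows the average runtime of movies by genre. Because the table is wide, pandas wraps it across several blocks, marking each continued line with a backslash −

Output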

Genre           Action   Adventure   Animation       Comedy        Crime  \
Runtime     126.304348  118.054054   98.250000  107.111111  128.666667

Genre      Documentary       Drama      Family     Fantasy   Film-Noir  \
Runtime      85.333333  126.539326  111.666667  126.300000  105.000000

Genre      History      Horror       Music     Musical     Mystery  \
Runtime    123.375  108.204545  131.133333  121.714286  114.200000

Genre      Romance      Sci-Fi       Sport    Thriller      War     Western
Runtime      116.6  121.266667  129.428571  120.046875  134.125  117.833333
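A wide, wrapped table like this can be hard to read. One alternative, sketched here on a five-row stand-in for the movie data, is to pass `index='Genre'` instead of `columns='Genre'` so that each genre becomes a row. A `groupby` gives the same averages.

```python
import io
import pandas as pd

# Stand-in for movies.csv: five sample rows, not the full dataset.
csv_text = """Title,Year,Genre,Runtime
The Shawshank Redemption,1994,Drama,142
The Godfather,1972,Crime,175
The Godfather: Part II,1974,Crime,202
The Dark Knight,2008,Action,152
12 Angry Men,1957,Drama,96
"""
df = pd.read_csv(io.StringIO(csv_text))

# One row per genre instead of one column per genre.
by_genre = pd.pivot_table(df, values="Runtime", index="Genre", aggfunc="mean")
print(by_genre)

# The same averages computed with groupby, returned as a Series.
same_means = df.groupby("Genre")["Runtime"].mean()
print(same_means)
```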

Writing dataframes to CSV files

Finally, after we have manipulated and analyzed our dataframe, we may want to write it back to a CSV file for future use. This can be done using the `to_csv` method.

df.to_csv('new_file.csv', index=False)

In this example, we write our dataframe to a new CSV file named `new_file.csv`, with `index=False` to exclude the index column from the file.
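The effect of `index=False` can be seen in the sketch below. It relies on the fact that `to_csv` returns the CSV text as a string when no file path is given, which avoids writing to disk; the two-row dataframe is illustrative only.

```python
import io
import pandas as pd

# Illustrative two-row dataframe.
df = pd.DataFrame({
    "Title": ["The Godfather", "12 Angry Men"],
    "Year": [1972, 1957],
})

# With no path argument, to_csv returns the CSV text instead of writing a file.
without_index = df.to_csv(index=False)
with_index = df.to_csv()

# The default output has an extra, unnamed first column holding the row labels.
print(without_index.splitlines()[0])  # Title,Year
print(with_index.splitlines()[0])     # ,Title,Year

# Reading the index-free text back gives the same columns and values.
round_trip = pd.read_csv(io.StringIO(without_index))
print(round_trip)
```

If the index is written and the file is later read back with a plain `read_csv` call, that unnamed column shows up as an extra column, which is why `index=False` is a common choice for the default integer index.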

Conclusion

In this document, we covered how to create a dataframe from a CSV file in Python. We defined dataframes and CSV files, read CSV files into dataframes with `read_csv`, explored and manipulated the resulting dataframes, and wrote them back to CSV files with `to_csv`. With these techniques, you are well on your way to working proficiently with dataframes and CSV files in Python.

Updated on: 25-Apr-2023
