Creating a Dataframe using CSV files
This document explains how to create a dataframe from a CSV file in Python. It covers the following topics −
- Introduction to dataframes and CSV files
- Reading CSV files into dataframes
- Exploring and manipulating dataframes
- Writing dataframes to CSV files
Each topic is illustrated with real-world examples and code snippets.
What are dataframes and CSV files?
Before diving into the details of creating a dataframe from a CSV file, let's first define what a dataframe is and what a CSV file is.
A dataframe is a two-dimensional, size-mutable, tabular data structure with columns of potentially different types. It is similar to a spreadsheet or SQL table, and is commonly used to store and manipulate data in Python.
A CSV (comma-separated values) file, on the other hand, is a plain text file that stores tabular data: each line is a record, and the fields within a record are separated by commas. CSV files are a common way to store data because they are easy to read and write, and can be opened by many tools, including Excel and Python libraries such as `pandas`.
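To make the two definitions concrete, here is a minimal sketch that treats a short piece of CSV text as a file and loads it into a dataframe. It assumes the `pandas` library is installed; the variable names `csv_text` and `movies` and the derived `Hours` column are illustrative only.

```python
import io
import pandas as pd

# A CSV file is plain text: one header line, then one record per line.
csv_text = (
    "Title,Year,Runtime\n"
    "The Godfather,1972,175\n"
    "12 Angry Men,1957,96\n"
)

# io.StringIO lets pandas read the text as if it were a file on disk.
movies = pd.read_csv(io.StringIO(csv_text))

# Columns can hold different types: text for Title, integers for the rest.
print(movies.dtypes)

# Size-mutable: a column can be added after the dataframe is created.
movies["Hours"] = movies["Runtime"] / 60
print(movies.shape)
```

After the new column is added, the dataframe has two rows and four columns.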
Reading CSV files into dataframes
The first step in creating a dataframe from a CSV file is to read the file into Python. This can be done using the `pandas` library, which provides a simple way to read in CSV files as dataframes.
import pandas as pd
df = pd.read_csv('filename.csv')
In this example, we first import the `pandas` library and then read in a CSV file named `filename.csv` using the `pd.read_csv` function. The resulting object, `df`, is a dataframe that contains the data from the CSV file.
It's worth noting that the `read_csv` function has many optional parameters that customize how the CSV file is read. For example, you can specify the delimiter used in the file if it is not a comma (`sep`), the text encoding (`encoding`), and whether or not the file contains a header row (`header`).
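As a rough sketch of these options, the example below writes a small semicolon-delimited file with no header row and reads it back. The file name `movies_semicolon.csv`, the variable `df_custom`, and the column names passed to `names` are assumptions made for illustration; the file is created in a temporary directory so the snippet is self-contained.

```python
import os
import tempfile
import pandas as pd

# Hypothetical file content: semicolon-delimited, no header row.
raw = "The Godfather;1972;Crime;175\n12 Angry Men;1957;Drama;96\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "movies_semicolon.csv")
    with open(path, "w", encoding="utf-8") as f:
        f.write(raw)

    df_custom = pd.read_csv(
        path,
        sep=";",            # delimiter other than a comma
        encoding="utf-8",   # text encoding of the file
        header=None,        # the file has no header row
        names=["Title", "Year", "Genre", "Runtime"],  # supply column names
    )

print(df_custom)
```

Without `header=None`, `read_csv` would treat the first record as column names, so the first movie would be missing from the data.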
Once we have read in a CSV file as a dataframe, we can begin to explore and analyze the data. Some common operations include −
- Viewing the first few rows of the dataframe using the `head` method
- Checking the shape of the dataframe (number of rows and columns) using the `shape` attribute
- Viewing summary statistics of the dataframe using the `describe` method
- Selecting a subset of columns or rows using indexing and slicing
Let's take a look at an example. Suppose we have a CSV file that contains information about movies, including the title, year, genre, and runtime. We can read in the file as a dataframe and then view the first few rows using the `head` method −
df = pd.read_csv('movies.csv')
print(df.head())
This will output the first 5 rows of the dataframe −
                      Title  Year   Genre  Runtime
0  The Shawshank Redemption  1994   Drama      142
1             The Godfather  1972   Crime      175
2    The Godfather: Part II  1974   Crime      202
3           The Dark Knight  2008  Action      152
4              12 Angry Men  1957   Drama       96
We can also check the shape of the dataframe −
print(df.shape)
The result is a tuple of the form `(rows, columns)`.
To view summary statistics of the dataframe, we can use the `describe` method −
print(df.describe())
This will output the following −
              Year     Runtime
count   250.000000  250.000000
mean   1984.356000  118.840000
std      24.012321   23.118059
min    1921.000000   69.000000
25%    1964.000000  100.000000
50%    1995.000000  116.000000
75%    2003.000000  131.000000
max    2016.000000  229.000000
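By default, `describe` summarizes only the numeric columns, which is why `Title` and `Genre` do not appear in the output above. The sketch below shows one way to summarize a text column as well. It uses a five-row sample taken from the first rows of the movie table rather than the full `movies.csv`, and the variable names are illustrative.

```python
import io
import pandas as pd

# Five-row sample based on the first rows of the movie table.
csv_text = """Title,Year,Genre,Runtime
The Shawshank Redemption,1994,Drama,142
The Godfather,1972,Crime,175
The Godfather: Part II,1974,Crime,202
The Dark Knight,2008,Action,152
12 Angry Men,1957,Drama,96
"""
df = pd.read_csv(io.StringIO(csv_text))

# Default describe(): numeric columns only (Year and Runtime).
numeric_summary = df.describe()
print(numeric_summary)

# For a text column, describe() reports count, unique, top and freq.
genre_summary = df["Genre"].describe()
print(genre_summary)

# value_counts() gives the number of rows per distinct value.
genre_counts = df["Genre"].value_counts()
print(genre_counts)
```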
Finally, we can select a subset of columns or rows using indexing and slicing. For example, to select only the title and genre columns −
subset = df[['Title', 'Genre']]
print(subset.head())
                      Title   Genre
0  The Shawshank Redemption   Drama
1             The Godfather   Crime
2    The Godfather: Part II   Crime
3           The Dark Knight  Action
4              12 Angry Men   Drama
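The example above selects columns; rows can be selected in a similar way. The following sketch, again on a five-row sample of the movie table, uses `iloc` for position-based slicing and `loc` for label-based selection. The variable names `first_two` and `by_label` are illustrative.

```python
import io
import pandas as pd

csv_text = """Title,Year,Genre,Runtime
The Shawshank Redemption,1994,Drama,142
The Godfather,1972,Crime,175
The Godfather: Part II,1974,Crime,202
The Dark Knight,2008,Action,152
12 Angry Men,1957,Drama,96
"""
df = pd.read_csv(io.StringIO(csv_text))

# iloc slices by integer position; the end position is excluded.
first_two = df.iloc[0:2]
print(first_two)

# loc slices by index label; both end labels are included.
# Rows and columns can be selected in one step.
by_label = df.loc[1:3, ["Title", "Year"]]
print(by_label)
```

Note the difference in slice behavior: `iloc[0:2]` returns two rows, while `loc[1:3]` returns three.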
Beyond simply exploring the data, we may want to manipulate it in various ways, such as sorting, filtering, merging, and pivoting. In this subsection, we will cover a few common dataframe manipulation operations using real world examples.
To sort a dataframe by one or more columns, we can use the `sort_values` method. For example, to sort our movie dataframe by year in descending order −
sorted_df = df.sort_values('Year', ascending=False)
print(sorted_df.head())
This will output the first 5 rows of the dataframe, sorted by year in descending order −
                          Title  Year      Genre  Runtime
15                        Logan  2017     Action      137
127                The Revenant  2015  Adventure      156
117                    Whiplash  2014      Drama      107
111  X-Men: Days of Future Past  2014     Action      132
95               The Lego Movie  2014  Animation      100
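Sorting by more than one column works the same way, with a list of column names and a matching list of sort directions. This is a small sketch on a five-row sample of the movie table; `sorted_multi` is an illustrative name.

```python
import io
import pandas as pd

csv_text = """Title,Year,Genre,Runtime
The Shawshank Redemption,1994,Drama,142
The Godfather,1972,Crime,175
The Godfather: Part II,1974,Crime,202
The Dark Knight,2008,Action,152
12 Angry Men,1957,Drama,96
"""
df = pd.read_csv(io.StringIO(csv_text))

# Sort by genre (A to Z), then by year (newest first) within each genre.
sorted_multi = df.sort_values(["Genre", "Year"], ascending=[True, False])

# sort_values keeps the original index labels; reset_index renumbers them.
sorted_multi = sorted_multi.reset_index(drop=True)
print(sorted_multi)
```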
To filter a dataframe based on one or more conditions, we can use boolean indexing. For example, to select only the action movies in our movie dataframe −
subset = df[df['Genre'] == 'Action']
print(subset.head())
This will output the first 5 action movies in the dataframe −
                         Title  Year   Genre  Runtime
3              The Dark Knight  2008  Action      152
6     The Silence of the Lambs  1991  Action      118
7                    Inception  2010  Action      148
16  Terminator 2: Judgment Day  1991  Action      137
20                Forrest Gump  1994  Action      142
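Several conditions can be combined with `&` (and) and `|` (or), as long as each condition is wrapped in parentheses. The sketch below runs on a five-row sample of the movie table; the thresholds and variable names are chosen only for illustration.

```python
import io
import pandas as pd

csv_text = """Title,Year,Genre,Runtime
The Shawshank Redemption,1994,Drama,142
The Godfather,1972,Crime,175
The Godfather: Part II,1974,Crime,202
The Dark Knight,2008,Action,152
12 Angry Men,1957,Drama,96
"""
df = pd.read_csv(io.StringIO(csv_text))

# Dramas released before 1990: both conditions must hold.
old_dramas = df[(df["Genre"] == "Drama") & (df["Year"] < 1990)]
print(old_dramas)

# isin() matches any value from a list of allowed values.
crime_or_action = df[df["Genre"].isin(["Crime", "Action"])]
print(crime_or_action)
```

The parentheses are needed because `&` and `|` bind more tightly than comparison operators in Python.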
To combine two or more dataframes into a single dataframe, we can use the `merge` function. For example, suppose we have a second CSV file that contains the ratings for each movie in our original dataframe. We can read in this file as a separate dataframe and then merge it with our original dataframe based on a common column (in this case, the title of the movie) −
ratings_df = pd.read_csv('ratings.csv')
merged_df = pd.merge(df, ratings_df, on='Title')
print(merged_df.head())
This will output the merged dataframe, which contains both the movie information and the ratings information −
                      Title  Year   Genre  Runtime  Rating
0  The Shawshank Redemption  1994   Drama      142     9.3
1             The Godfather  1972   Crime      175     9.2
2    The Godfather: Part II  1974   Crime      202     9.0
3           The Dark Knight  2008  Action      152     9.0
4              12 Angry Men  1957   Drama       96     8.9
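By default, `pd.merge` performs an inner join, so movies without a matching rating are dropped. The sketch below illustrates the difference between the default and `how='left'`, assuming a ratings table that covers only some of the movies. Both dataframes are small in-memory samples built for this example rather than the actual `movies.csv` and `ratings.csv` files.

```python
import pandas as pd

df = pd.DataFrame({
    "Title": ["The Shawshank Redemption", "The Godfather", "The Godfather: Part II",
              "The Dark Knight", "12 Angry Men"],
    "Year": [1994, 1972, 1974, 2008, 1957],
})

# Assumed for illustration: ratings are available for only two movies.
ratings_df = pd.DataFrame({
    "Title": ["The Godfather", "12 Angry Men"],
    "Rating": [9.2, 8.9],
})

# Inner join (default): keeps only titles present in both dataframes.
inner = pd.merge(df, ratings_df, on="Title")
print(inner)

# Left join: keeps every movie; missing ratings become NaN.
left = pd.merge(df, ratings_df, on="Title", how="left")
print(left)
```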
To pivot a dataframe, we can use the `pivot_table` function. For example, suppose we want to see the average runtime of movies by genre. We can pivot our original movie dataframe −
pivot_df = pd.pivot_table(df, values='Runtime', columns='Genre', aggfunc='mean')
print(pivot_df)
This will output a table that shows the average runtime of movies by genre −
Genre        Action   Adventure  Animation      Comedy       Crime  \
Runtime  126.304348  118.054054  98.250000  107.111111  128.666667

Genre    Documentary       Drama      Family     Fantasy   Film-Noir  \
Runtime    85.333333  126.539326  111.666667  126.300000  105.000000

Genre    History      Horror       Music     Musical     Mystery  \
Runtime  123.375  108.204545  131.133333  121.714286  114.200000

Genre    Romance      Sci-Fi       Sport    Thriller      War     Western
Runtime    116.6  121.266667  129.428571  120.046875  134.125  117.833333
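With many genres, the wide layout above wraps across several blocks. One alternative, sketched here on a five-row sample of the movie table, is to place the genres on the rows by passing `index='Genre'`. A `groupby` gives the same averages as a Series. The variable names are illustrative.

```python
import io
import pandas as pd

csv_text = """Title,Year,Genre,Runtime
The Shawshank Redemption,1994,Drama,142
The Godfather,1972,Crime,175
The Godfather: Part II,1974,Crime,202
The Dark Knight,2008,Action,152
12 Angry Men,1957,Drama,96
"""
df = pd.read_csv(io.StringIO(csv_text))

# One row per genre instead of one column per genre.
by_genre = pd.pivot_table(df, values="Runtime", index="Genre", aggfunc="mean")
print(by_genre)

# Equivalent aggregation with groupby; the result is a Series.
grouped = df.groupby("Genre")["Runtime"].mean()
print(grouped)
```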
Writing dataframes to CSV files
Finally, after we have manipulated and analyzed our dataframe, we may want to write it back to a CSV file for future use. This can be done using the `to_csv` method −
df.to_csv('new_file.csv', index=False)
In this example, we write our dataframe to a new CSV file named `new_file.csv`, with `index=False` to exclude the index column from the file.
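To see the effect of `index=False`, the following sketch writes a small dataframe twice, with and without the option, and reads each file back. It uses a temporary directory so that no file is left behind; the two-row sample and the variable names `round_trip` and `with_index` are illustrative.

```python
import io
import os
import tempfile
import pandas as pd

csv_text = """Title,Year,Genre,Runtime
The Godfather,1972,Crime,175
12 Angry Men,1957,Drama,96
"""
df = pd.read_csv(io.StringIO(csv_text))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "new_file.csv")

    # index=False writes only the data columns.
    df.to_csv(path, index=False)
    round_trip = pd.read_csv(path)

    # Without index=False, the row labels are written as an unnamed first column.
    df.to_csv(path)
    with_index = pd.read_csv(path)

print(list(round_trip.columns))
print(list(with_index.columns))
```

When the second file is read back, the saved index shows up as an extra column named `Unnamed: 0`, which is usually not wanted.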
In this document, we covered the process of creating a dataframe from a CSV file in Python. We began by defining what dataframes and CSV files are, and then explored how to read CSV files into dataframes, how to explore and manipulate dataframes, and how to write dataframes back to CSV files. Together, these techniques cover the basic workflow for working with dataframes and CSV files in Python.