- Trending Categories
Data Structure
Networking
RDBMS
Operating System
Java
MS Excel
iOS
HTML
CSS
Android
Python
C Programming
C++
C#
MongoDB
MySQL
Javascript
PHP
Physics
Chemistry
Biology
Mathematics
English
Economics
Psychology
Social Studies
Fashion Studies
Legal Studies
- Selected Reading
- UPSC IAS Exams Notes
- Developer's Best Practices
- Questions and Answers
- Effective Resume Writing
- HR Interview Questions
- Computer Glossary
- Who is Who
How to append new rows to DataFrame using a Template In Python Pandas
How to append new rows to DataFrame using a Template In Python Pandas.
Introduction
Being a data engineering specialist, i often end up creating more derived columns than rows as the role of creating and sending the data to me for analysis should be taken care of other database specialists. However,it is not true during all time.
We have to create sample rows rather than waiting for data specialists team to send us the data. In this topic i will be showing the neat tricks for creating rows.
How to do it..
In this recipe, we will begin by appending rows to a small dataset with the .loc attribute and then will be using the .append method.
1.Let us begin by creating a dataframe to add rows later on.
Example
import pandas as pd import numpy as np players_info = pd.DataFrame(data=[ {"players": "Roger Federer", "titles": 20}, {"players": "Rafael Nadal", "titles": 20}, {"players": "Novak Djokovic", "titles": 17}, {"players": "Andy Murray", "titles": 3}], columns=["players", "titles"])
output
print(players_info.info())
Example
<class 'pandas.core.frame.DataFrame'> RangeIndex: 4 entries, 0 to 3 Data columns (total 2 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 players 4 non-null object 1 titles 4 non-null int64 dtypes: int64(1), object(1) memory usage: 192.0+ bytes None
1. Now let us add new player "Dominic Theim" to the dataframe using .loc attribute.
new_Player = ['Dominic Theim', 1] players_info.loc[4] = new_Player
output
print(players_info)
players titles 0 Roger Federer 20 1 Rafael Nadal 20 2 Novak Djokovic 17 3 Andy Murray 3 4 Dominic Theim 1
1. Using the same .loc attribute, let us add the new row to the end of dataframe. Here i will show how to add a dictionary to the dataframe.
new_player = {'players': 'Daniel Medvedev', 'titles': 0} players_info.loc[len(players_info)] = new_player
Output
print(players_info)
players titles 0 Roger Federer 20 1 Rafael Nadal 20 2 Novak Djokovic 17 3 Andy Murray 3 4 Dominic Theim 1 5 Daniel Medvedev 0
1. We can also add pandas series holding data to the dataframe.
players_info.loc[len(players_info)] = pd.Series({'players': 'Andy Zverev', 'titles': 0})
Output
print(players_info)
players titles 0 Roger Federer 20 1 Rafael Nadal 20 2 Novak Djokovic 17 3 Andy Murray 3 4 Dominic Theim 1 5 Daniel Medvedev 0 6 Andy Zverev 0
Conclusion
We have added data in above 4 steps using .loc method. The .loc attribute makes changes to the DataFrame in place.
In the next few steps, we will look at the .append method, which does not modify the calling DataFrame, rather it returns a new copy of the DataFrame with the appended row/s.
The first argument to .append must be either another DataFrame, Series, dictionary, or a list.
Example
# Create a DataFrame with index players_info = pd.DataFrame(data=[ {"players": "Roger Federer", "titles": 20}, {"players": "Rafael Nadal", "titles": 20}, {"players": "Novak Djokovic", "titles": 17}, {"players": "Andy Murray", "titles": 3}], columns=["players", "titles"], index=["roger", "nadal", "djokovic", "murray"])
# Add a new row(dictionary) to DataFrame using .append method. players_info.append({'players': 'Daniel Medvedev', 'titles': 0})
--------------------------------------------------------------------------- TypeError Traceback (most recent call last) --------------------------------------------------------------------------- TypeError Traceback (most recent call last) in 1 # Add a new row(dictionary) to DataFrame using .append method. ----> 2 players_info.append({'players': 'Daniel Medvedev', 'titles': 0}) ~\anaconda3\lib\site-packages\pandas\core\frame.py in append(self, other, ignore_index, verify_integrity, sort) 7046 other = Series(other) 7047 if other.name is None and not ignore_index: -> 7048 raise TypeError( 7049 "Can only append a Series if ignore_index=True " 7050 "or if the Series has a name"
TypeError Can only append a Series if ignore_index=True or if the Series has a name
When i tried to append a dictionary, it raised an exception asking us to use the parameter ignore_index=True. So let me add this suggested parameter and see what it does.
new_df = players_info.append({'players': 'Daniel Medvedev', 'titles': 0}, ignore_index=True)
Output
print(f" *** Original with index \n {players_info} \n\n\n *** Modified index \n {new_df}")
*** Original with index players titles roger Roger Federer 20 nadal Rafael Nadal 20 djokovic Novak Djokovic 17 murray Andy Murray 3 *** Modified index players titles 0 Roger Federer 20 1 Rafael Nadal 20 2 Novak Djokovic 17 3 Andy Murray 3 4 Daniel Medvedev 0
What did you observed from the output after i used ignore_index=True parameter? Yes, When ignore_index set to True, the old index will be removed completely and replaced with a RangeIndex from 0 to n-1.
The .append method is quite useful when you want to add many rows to the DataFrame in a single go.
player1 = pd.Series({'players': 'Andy Zverev', 'titles': 0}, name='zverev') player2 = pd.Series({'players': 'Dominic Theim', 'titles': 1}, name='theim') new_df_1 = players_info.append([player1, player2])
Output
print(new_df_1)
players titles roger Roger Federer 20 nadal Rafael Nadal 20 djokovic Novak Djokovic 17 murray Andy Murray 3 zverev Andy Zverev 0 theim Dominic Theim 1
Ok, now that you have seen basics on how to add rows, we will dive into how to add rows on a dataframe with many columns.
df = pd.read_csv("https://raw.githubusercontent.com/sasankac/TestDataSet/master/movies_data.csv")
Output
print(df.info())
Example
<class 'pandas.core.frame.DataFrame'> RangeIndex: 4803 entries, 0 to 4802 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 budget 4803 non-null int64 1 id 4803 non-null int64 2 original_language 4803 non-null object 3 original_title 4803 non-null object 4 popularity 4803 non-null float64 5 release_date 4802 non-null object 6 revenue 4803 non-null int64 7 runtime 4801 non-null float64 8 status 4803 non-null object 9 title 4803 non-null object 10 vote_average 4803 non-null float64 11 vote_count 4803 non-null int64 dtypes: float64(3), int64(4), object(5) memory usage: 450.4+ KB None
This dataset from Google have 12 columns and it is very easy to mistype a column name or forget one altogether if you were manually entering new rows of data. So how can we avoid this problem? Well there is a way, just create a template of columns names.
columns_dictionary = df.iloc[0].to_dict()
##### Output:
print(columns_dictionary)
{'budget': 237000000, 'id': 19995, 'original_language': 'en', 'original_title': 'Avatar', 'popularity': 150.437577, 'release_date': '10/12/2009', 'revenue': 2787965087, 'runtime': 162.0, 'status': 'Released', 'title': 'Avatar', 'vote_average': 7.2, 'vote_count': 11800}
So, you might understand now that we have taken the first row and converted it into a dictionary. Ok we got the columns and values as well, let us now Clear the old values with a dictionary comprehension assigning any previous string value as an empty string and all others as missing values.
This dictionary can now serve as a template for any new data you would like to enter.
Example
import datetime new_data_dict = {} for a, b in columns_dictionary.items(): if isinstance(b, str): new_data_dict[a] = np.random.choice(list('abcde')) elif isinstance(b, datetime.date): new_data_dict[a] = np.nan else: new_data_dict[a] = np.nan
Output
print(new_data_dict)
{'budget': nan, 'id': nan, 'original_language': 'e', 'original_title': 'a', 'popularity': nan, 'release_date': 'b', 'revenue': nan, 'runtime': nan, 'status': 'e', 'title': 'c', 'vote_average': nan, 'vote_count': nan}