How to append new rows to DataFrame using a Template In Python Pandas

PythonServer Side ProgrammingProgramming

How to append new rows to DataFrame using a Template In Python Pandas.

Introduction

Being a data engineering specialist, i often end up creating more derived columns than rows as the role of creating and sending the data to me for analysis should be taken care of other database specialists. However,it is not true during all time.

We have to create sample rows rather than waiting for data specialists team to send us the data. In this topic i will be showing the neat tricks for creating rows.

How to do it..

In this recipe, we will begin by appending rows to a small dataset with the .loc attribute and then will be using the .append method.

1.Let us begin by creating a dataframe to add rows later on.

Example

import pandas as pd
import numpy as np

players_info = pd.DataFrame(data=[
{"players": "Roger Federer", "titles": 20},
{"players": "Rafael Nadal", "titles": 20},
{"players": "Novak Djokovic", "titles": 17},
{"players": "Andy Murray", "titles": 3}], columns=["players", "titles"])

output

print(players_info.info())

Example

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 players 4 non-null object
1 titles 4 non-null int64
dtypes: int64(1), object(1)
memory usage: 192.0+ bytes
None

1. Now let us add new player "Dominic Theim" to the dataframe using .loc attribute.

new_Player = ['Dominic Theim', 1]
players_info.loc[4] = new_Player

output

print(players_info)


players titles
0 Roger Federer 20
1 Rafael Nadal 20
2 Novak Djokovic 17
3 Andy Murray 3
4 Dominic Theim 1

1. Using the same .loc attribute, let us add the new row to the end of dataframe. Here i will show how to add a dictionary to the dataframe.

new_player = {'players': 'Daniel Medvedev', 'titles': 0}
players_info.loc[len(players_info)] = new_player

Output

print(players_info)


players titles
0 Roger Federer 20
1 Rafael Nadal 20
2 Novak Djokovic 17
3 Andy Murray 3
4 Dominic Theim 1
5 Daniel Medvedev 0

1. We can also add pandas series holding data to the dataframe.

players_info.loc[len(players_info)] = pd.Series({'players': 'Andy Zverev', 'titles': 0})

Output

print(players_info)
players titles
0 Roger Federer 20
1 Rafael Nadal 20
2 Novak Djokovic 17
3 Andy Murray 3
4 Dominic Theim 1
5 Daniel Medvedev 0
6 Andy Zverev 0

Conclusion 

We have added data in above 4 steps using .loc method. The .loc attribute makes changes to the DataFrame in place.

In the next few steps, we will look at the .append method, which does not modify the calling DataFrame, rather it returns a new copy of the DataFrame with the appended row/s.

The first argument to .append must be either another DataFrame, Series, dictionary, or a list.

Example

# Create a DataFrame with index
players_info = pd.DataFrame(data=[
{"players": "Roger Federer", "titles": 20},
{"players": "Rafael Nadal", "titles": 20},
{"players": "Novak Djokovic", "titles": 17},
{"players": "Andy Murray", "titles": 3}], columns=["players", "titles"],
index=["roger", "nadal", "djokovic", "murray"])


# Add a new row(dictionary) to DataFrame using .append method.
players_info.append({'players': 'Daniel Medvedev', 'titles': 0})


---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
in
1 # Add a new row(dictionary) to DataFrame using .append method.
----> 2 players_info.append({'players': 'Daniel Medvedev', 'titles': 0})

~\anaconda3\lib\site-packages\pandas\core\frame.py in append(self, other, ignore_index, verify_integrity, sort)
7046 other = Series(other)
7047 if other.name is None and not ignore_index:
-> 7048 raise TypeError(
7049 "Can only append a Series if ignore_index=True "
7050 "or if the Series has a name"

TypeError Can only append a Series if ignore_index=True or if the Series has a name

When i tried to append a dictionary, it raised an exception asking us to use the parameter ignore_index=True. So let me add this suggested parameter and see what it does.

new_df = players_info.append({'players': 'Daniel Medvedev', 'titles': 0}, ignore_index=True)

Output

print(f" *** Original with index \n {players_info} \n\n\n *** Modified index \n {new_df}")


*** Original with index
players titles
roger Roger Federer 20
nadal Rafael Nadal 20
djokovic Novak Djokovic 17
murray Andy Murray 3

*** Modified index
players titles
0 Roger Federer 20
1 Rafael Nadal 20
2 Novak Djokovic 17
3 Andy Murray 3
4 Daniel Medvedev 0

What did you observed from the output after i used ignore_index=True parameter? Yes, When ignore_index set to True, the old index will be removed completely and replaced with a RangeIndex from 0 to n-1.

The .append method is quite useful when you want to add many rows to the DataFrame in a single go.

player1 = pd.Series({'players': 'Andy Zverev', 'titles': 0}, name='zverev')
player2 = pd.Series({'players': 'Dominic Theim', 'titles': 1}, name='theim')
new_df_1 = players_info.append([player1, player2])

Output

print(new_df_1)
players titles
roger Roger Federer 20
nadal Rafael Nadal 20
djokovic Novak Djokovic 17
murray Andy Murray 3
zverev Andy Zverev 0
theim Dominic Theim 1

Ok, now that you have seen basics on how to add rows, we will dive into how to add rows on a dataframe with many columns.

df = pd.read_csv("https://raw.githubusercontent.com/sasankac/TestDataSet/master/movies_data.csv")

Output

print(df.info())

Example

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 budget 4803 non-null int64
1 id 4803 non-null int64
2 original_language 4803 non-null object
3 original_title 4803 non-null object
4 popularity 4803 non-null float64
5 release_date 4802 non-null object
6 revenue 4803 non-null int64
7 runtime 4801 non-null float64
8 status 4803 non-null object
9 title 4803 non-null object
10 vote_average 4803 non-null float64
11 vote_count 4803 non-null int64
dtypes: float64(3), int64(4), object(5)
memory usage: 450.4+ KB
None

This dataset from Google have 12 columns and it is very easy to mistype a column name or forget one altogether if you were manually entering new rows of data. So how can we avoid this problem? Well there is a way, just create a template of columns names.

columns_dictionary = df.iloc[0].to_dict()
##### Output:


print(columns_dictionary)


{'budget': 237000000, 'id': 19995, 'original_language': 'en', 'original_title': 'Avatar', 'popularity': 150.437577, 'release_date': '10/12/2009', 'revenue': 2787965087, 'runtime': 162.0, 'status': 'Released', 'title': 'Avatar', 'vote_average': 7.2, 'vote_count': 11800}

So, you might understand now that we have taken the first row and converted it into a dictionary. Ok we got the columns and values as well, let us now Clear the old values with a dictionary comprehension assigning any previous string value as an empty string and all others as missing values.

This dictionary can now serve as a template for any new data you would like to enter.

Example

import datetime

new_data_dict = {}
for a, b in columns_dictionary.items():
if isinstance(b, str):
new_data_dict[a] = np.random.choice(list('abcde'))
elif isinstance(b, datetime.date):
new_data_dict[a] = np.nan
else:
new_data_dict[a] = np.nan

Output

print(new_data_dict)


{'budget': nan, 'id': nan, 'original_language': 'e', 'original_title': 'a', 'popularity': nan, 'release_date': 'b', 'revenue': nan, 'runtime': nan, 'status': 'e', 'title': 'c', 'vote_average': nan, 'vote_count': nan}
raja
Published on 09-Nov-2020 10:06:16
Advertisements