Python Pandas - IO Tools


Advertisements


The Pandas I/O API is a set of top level reader functions accessed like pd.read_csv() that generally return a Pandas object.

The two workhorse functions for reading text files (or the flat files) are read_csv() and read_table(). They both use the same parsing code to intelligently convert tabular data into a DataFrame object −

pandas.read_csv(filepath_or_buffer, sep=',', delimiter=None, header='infer',
names=None, index_col=None, usecols=None

pandas.read_csv(filepath_or_buffer, sep='\t', delimiter=None, header='infer',
names=None, index_col=None, usecols=None

Here is how the csv file data looks like −

S.No,Name,Age,City,Salary
1,Tom,28,Toronto,20000
2,Lee,32,HongKong,3000
3,Steven,43,Bay Area,8300
4,Ram,38,Hyderabad,3900

Save this data as temp.csv and conduct operations on it.

S.No,Name,Age,City,Salary
1,Tom,28,Toronto,20000
2,Lee,32,HongKong,3000
3,Steven,43,Bay Area,8300
4,Ram,38,Hyderabad,3900

Save this data as temp.csv and conduct operations on it.

read.csv

read.csv reads data from the csv files and creates a DataFrame object.

import pandas as pd
df=pd.read_csv("temp.csv")
print df

Its output is as follows −

   S.No     Name   Age       City   Salary
0     1      Tom    28    Toronto    20000
1     2      Lee    32   HongKong     3000
2     3   Steven    43   Bay Area     8300
3     4      Ram    38  Hyderabad     3900

custom index

This specifies a column in the csv file to customize the index using index_col.

import pandas as pd

df=pd.read_csv("temp.csv",index_col=['S.No'])
print df

Its output is as follows −

S.No   Name   Age       City   Salary
1       Tom    28    Toronto    20000
2       Lee    32   HongKong     3000
3    Steven    43   Bay Area     8300
4       Ram    38  Hyderabad     3900

Converters

dtype of the columns can be passed as a dict.

import pandas as pd

df = pd.read_csv("temp.csv", dtype={'Salary': np.float64})
print df.dtypes

Its output is as follows −

S.No       int64
Name      object
Age        int64
City      object
Salary   float64
dtype: object

By default, the dtype of the Salary column is int, but the result shows it as float because we have explicitly casted the type.

Thus, the data looks like float −

  S.No   Name   Age      City    Salary
0   1     Tom   28    Toronto   20000.0
1   2     Lee   32   HongKong    3000.0
2   3  Steven   43   Bay Area    8300.0
3   4     Ram   38  Hyderabad    3900.0

header_names

Specify the names of the header using the names argument.

import pandas as pd
 
df=pd.read_csv("temp.csv", names=['a', 'b', 'c','d','e'])
print df

Its output is as follows −

       a        b    c           d        e
0   S.No     Name   Age       City   Salary
1      1      Tom   28     Toronto    20000
2      2      Lee   32    HongKong     3000
3      3   Steven   43    Bay Area     8300
4      4      Ram   38   Hyderabad     3900

Observe, the header names are appended with the custom names, but the header in the file has not been eliminated. Now, we use the header argument to remove that.

If the header is in a row other than the first, pass the row number to header. This will skip the preceding rows.

import pandas as pd 

df=pd.read_csv("temp.csv",names=['a','b','c','d','e'],header=0)
print df

Its output is as follows −

      a        b    c           d        e
0  S.No     Name   Age       City   Salary
1     1      Tom   28     Toronto    20000
2     2      Lee   32    HongKong     3000
3     3   Steven   43    Bay Area     8300
4     4      Ram   38   Hyderabad     3900

skiprows

skiprows skips the number of rows specified.

import pandas as pd

df=pd.read_csv("temp.csv", skiprows=2)
print df

Its output is as follows −

    2      Lee   32    HongKong   3000
0   3   Steven   43    Bay Area   8300
1   4      Ram   38   Hyderabad   3900


Advertisements