Python Pandas - Quick Guide

Python Pandas - Introduction

Pandas is an open-source Python library that provides high-performance data manipulation and analysis tools built on its powerful data structures. The name Pandas is derived from "panel data", an econometrics term for multidimensional data.

In 2008, developer Wes McKinney started developing pandas when he needed a high-performance, flexible tool for data analysis.

Prior to Pandas, Python was mainly used for data munging and preparation and contributed very little to data analysis itself. Pandas solved this problem. Using Pandas, we can accomplish five typical steps in the processing and analysis of data, regardless of its origin: load, prepare, manipulate, model, and analyze.

Python with Pandas is used in a wide range of academic and commercial fields, including finance, economics, statistics, and analytics.

Key Features of Pandas

Fast and efficient DataFrame object with default and customized indexing.
Tools for loading data into in-memory data objects from different file formats.
Data alignment and integrated handling of missing data.
Reshaping and pivoting of data sets.
Label-based slicing, indexing and subsetting of large data sets.
Columns from a data structure can be deleted or inserted.
Group by data for aggregation and transformations.
High performance merging and joining of data.
Time Series functionality.

Python Pandas - Environment Setup

The standard Python distribution doesn't come bundled with the Pandas module. A lightweight option is to install Pandas (which pulls in NumPy as a dependency) using the popular Python package installer, pip.

pip install pandas
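
Once the command finishes, a quick check like the sketch below can confirm that the package imports. The pd alias is only the common convention; pip does not set it up.

```python
import pandas as pd

# Print the installed version to confirm that the import works.
print(pd.__version__)
```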

If you install the Anaconda Python distribution, Pandas is installed by default. The following distributions bundle the SciPy stack for each platform −

Windows

Anaconda (from https://www.continuum.io) is a free Python distribution for the SciPy stack. It is also available for Linux and Mac.
Canopy (https://www.enthought.com/products/canopy/) is available as both a free and a commercial distribution with the full SciPy stack for Windows, Linux and Mac.
Python (x,y) is a free Python distribution with the SciPy stack and the Spyder IDE for Windows. (Downloadable from http://python-xy.github.io/)

Linux

On Linux, each distribution's package manager can be used to install one or more packages from the SciPy stack.

For Ubuntu Users

sudo apt-get install python-numpy python-scipy python-matplotlib ipython ipython-notebook \
python-pandas python-sympy python-nose

For Fedora Users

sudo yum install numpy scipy python-matplotlib ipython python-pandas sympy \
python-nose atlas-devel

Introduction to Data Structures

Pandas deals with the following three data structures −

Series
DataFrame
Panel

These data structures are built on top of NumPy arrays, which makes them fast.

Dimension & Description

The best way to think of these data structures is that each higher-dimensional structure is a container of the lower-dimensional one. For example, DataFrame is a container of Series, and Panel is a container of DataFrame.

Data Structure	Dimensions	Description
Series	1	1D labeled homogeneous array, size-immutable.
DataFrame	2	General 2D labeled, size-mutable tabular structure with potentially heterogeneously typed columns.
Panel	3	General 3D labeled, size-mutable array.

Building and handling arrays of two or more dimensions is a tedious task: the burden is placed on the user to consider the orientation of the data set when writing functions. Pandas data structures reduce this mental effort.

For example, with tabular data (DataFrame) it is more semantically helpful to think of the index (the rows) and the columns rather than axis 0 and axis 1.
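
As a minimal sketch of that idea, the example below sums a small, made-up table both ways. The column names and values are hypothetical; the point is that the named axes "index" and "columns" are interchangeable with 0 and 1.

```python
import pandas as pd

# Hypothetical two-column table, used only for illustration.
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]}, index=["x", "y", "z"])

# axis="index" (same as axis=0) collapses the rows: one total per column.
col_totals = df.sum(axis="index")

# axis="columns" (same as axis=1) collapses the columns: one total per row.
row_totals = df.sum(axis="columns")

print(col_totals)
print(row_totals)
```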

Mutability

All Pandas data structures are value-mutable (their contents can be changed), and all except Series are size-mutable. Series is size-immutable.
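
A short sketch of both kinds of mutation, using made-up labels and values: a value inside a Series is overwritten in place, and a DataFrame gains and then loses a column.

```python
import pandas as pd

# Value mutation: overwrite one element of an existing Series.
s = pd.Series([1, 2, 3], index=["a", "b", "c"])
s["b"] = 20

# Size mutation: insert a new column into a DataFrame, then delete an old one.
df = pd.DataFrame({"x": [1, 2, 3]})
df["y"] = df["x"] * 2
del df["x"]

print(s)
print(df)
```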

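Note − DataFrame is the most widely used and one of the most important data structures. Panel is used much less; it was deprecated in pandas 0.20 and removed in pandas 0.25, so Panel examples only run on older releases.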

Series

Series is a one-dimensional array-like structure with homogeneous data. For example, the following series is a collection of integers 10, 23, 56, …

Key Points

Homogeneous data
Size Immutable
Values of Data Mutable
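
To illustrate what homogeneous means here, the sketch below builds a Series from only the three integers named above and inspects its single dtype. The second Series, with made-up values, shows integers and floats being stored under one common type.

```python
import pandas as pd

# All elements share one dtype.
s = pd.Series([10, 23, 56])
print(s.dtype)

# Mixing an integer with a float upcasts the whole Series to a float dtype.
mixed = pd.Series([10, 23.5])
print(mixed.dtype)
```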

DataFrame

DataFrame is a two-dimensional array with heterogeneous data. For example,

Name	Age	Gender	Rating
Steve	32	Male	3.45
Lia	28	Female	4.6
Vin	45	Male	3.9
Katie	38	Female	2.78

The table represents the data of a sales team of an organization with their overall performance rating. The data is represented in rows and columns. Each column represents an attribute and each row represents a person.

Data Type of Columns

The data types of the four columns are as follows −

Column	Type
Name	String
Age	Integer
Gender	String
Rating	Float
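
The table above can be reproduced as a DataFrame with the sketch below, which then prints the type pandas infers for each column. Depending on the pandas version, the string columns are reported as object or as a dedicated string dtype.

```python
import pandas as pd

# One dictionary key per column; each list holds that column's values.
df = pd.DataFrame({
    "Name": ["Steve", "Lia", "Vin", "Katie"],
    "Age": [32, 28, 45, 38],
    "Gender": ["Male", "Female", "Male", "Female"],
    "Rating": [3.45, 4.6, 3.9, 2.78],
})

# Each column carries its own dtype.
print(df.dtypes)
```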

Key Points

Heterogeneous data
Size Mutable
Data Mutable

Panel

Panel is a three-dimensional data structure with heterogeneous data. It is hard to represent a panel graphically, but it can be illustrated as a container of DataFrame objects.

Key Points

Heterogeneous data
Size Mutable
Data Mutable

Python Pandas - Series

Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, Python objects, etc.). The axis labels are collectively called the index.

pandas.Series

A pandas Series can be created using the following constructor −

pandas.Series(data, index, dtype, copy)

The parameters of the constructor are as follows −

Sr.No	Parameter	Description
1	data	Data takes various forms like ndarray, list, constants.
2	index	Index values must be hashable and the same length as data; duplicate labels are allowed. Defaults to np.arange(n) if no index is passed.
3	dtype	Data type of the values. If None, the data type is inferred.
4	copy	Whether to copy the input data. Default False.
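
A minimal sketch of the four parameters in use, with made-up values and labels. The explicit copy=True means that later changes to the source array do not show up in the Series.

```python
import numpy as np
import pandas as pd

data = np.array([1, 2, 3])

# With no index argument, the labels default to 0, 1, 2.
default_index = pd.Series(data)

# Explicit labels, an explicit dtype, and an explicit copy of the input.
s = pd.Series(data, index=["a", "b", "c"], dtype="float64", copy=True)

# Changing the source array afterwards leaves the copied Series untouched.
data[0] = 99
print(s)
```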

A series can be created using various inputs like −

Array
Dict
Scalar value or constant
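
Each of the three input kinds can be sketched as follows. The values and labels are made up for illustration; note that a scalar needs an index so pandas knows how many times to repeat it.

```python
import numpy as np
import pandas as pd

# From an ndarray: labels default to 0..n-1.
from_array = pd.Series(np.array(["a", "b", "c"]))

# From a dict: the keys become the index labels.
from_dict = pd.Series({"a": 0.0, "b": 1.0, "c": 2.0})

# From a scalar: the value is repeated once per index label.
from_scalar = pd.Series(5, index=[0, 1, 2, 3])

print(from_array)
print(from_dict)
print(from_scalar)
```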

Create an Empty Series

The most basic series that can be created is an empty Series.
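
A minimal sketch of an empty Series follows. The dtype is passed explicitly because the default dtype of an empty Series has differed between pandas versions.

```python
import pandas as pd

# An empty Series; naming the dtype avoids version-dependent defaults.
s = pd.Series(dtype="float64")

print(s)
print(s.empty)
```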

The parameters of the pandas.DataFrame constructor are as follows −

Sr.No	Parameter	Description
1	data	Data takes various forms like ndarray, series, map, lists, dict, constants and also another DataFrame.
2	index	Row labels for the resulting frame. Optional; defaults to np.arange(n) if no index is passed.
3	columns	Column labels for the resulting frame. Optional; defaults to np.arange(n) if no column labels are passed.
4	dtype	Data type of each column.
5	copy	Whether to copy the input data. Default False.
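
The sketch below exercises these parameters on made-up data: first with default labels, then with explicit row and column labels, and finally with a forced dtype on purely numeric input.

```python
import pandas as pd

data = [["Alex", 10], ["Bob", 12], ["Clarke", 13]]

# No index or columns given: both default to 0..n-1.
default = pd.DataFrame(data)

# Explicit row labels and column labels.
df = pd.DataFrame(data, index=["r1", "r2", "r3"], columns=["Name", "Age"])

# dtype forces one type on every column, so it suits all-numeric input.
numeric = pd.DataFrame([[1, 2], [3, 4]], columns=["a", "b"], dtype="float64")

print(default)
print(df)
print(numeric.dtypes)
```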

The parameters of the pandas.Panel constructor are as follows −

Parameter	Description
data	Data takes various forms like ndarray, series, map, lists, dict, constants and also another DataFrame.
items	axis=0
major_axis	axis=1
minor_axis	axis=2
dtype	Data type of each column.
copy	Whether to copy the input data. Default False.

The basic attributes and methods of a Series are as follows −

Sr.No.	Attribute or Method	Description
1	axes	Returns a list of the row axis labels.
2	dtype	Returns the dtype of the object.
3	empty	Returns True if the series is empty.
4	ndim	Returns the number of dimensions of the underlying data, by definition 1.
5	size	Returns the number of elements in the underlying data.
6	values	Returns the Series as an ndarray.
7	head()	Returns the first n rows.
8	tail()	Returns the last n rows.
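
Each attribute and method in the table can be tried on a small Series, as in the sketch below. The values and labels are made up for illustration.

```python
import pandas as pd

s = pd.Series([0.5, 1.5, 2.5, 3.5], index=["a", "b", "c", "d"])

print(s.axes)     # list holding the single row index
print(s.dtype)    # dtype shared by all values
print(s.empty)    # False, because the Series has elements
print(s.ndim)     # always 1 for a Series
print(s.size)     # number of elements
print(s.values)   # underlying NumPy array
print(s.head(2))  # first two rows
print(s.tail(2))  # last two rows
```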

The basic attributes and methods of a DataFrame are as follows −

Sr.No.	Attribute or Method	Description
1	T	Transposes rows and columns.
2	axes	Returns a list with the row axis labels and column axis labels as the only members.
3	dtypes	Returns the dtypes in this object.
4	empty	True if the NDFrame is entirely empty (no items), that is, if any of the axes has length 0.
5	ndim	Number of axes / array dimensions.
6	shape	Returns a tuple representing the dimensionality of the DataFrame.
7	size	Number of elements in the NDFrame.
8	values	NumPy representation of the NDFrame.
9	head()	Returns the first n rows.
10	tail()	Returns the last n rows.
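
The same attributes on a DataFrame can be sketched as follows, using three of the rows from the sales-team table earlier in this guide.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Steve", "Lia", "Vin"], "Age": [32, 28, 45]})

print(df.T)        # columns become rows and rows become columns
print(df.axes)     # [row index, column index]
print(df.dtypes)   # one dtype per column
print(df.empty)    # False
print(df.ndim)     # 2
print(df.shape)    # (rows, columns)
print(df.size)     # rows * columns
print(df.values)   # NumPy array of the data
print(df.head(2))  # first two rows
print(df.tail(1))  # last row
```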

The descriptive statistics functions are as follows −

Sr.No.	Function	Description
1	count()	Number of non-null observations
2	sum()	Sum of values
3	mean()	Mean of values
4	median()	Median of values
5	mode()	Mode of values
6	std()	Standard deviation of the values
7	min()	Minimum value
8	max()	Maximum value
9	abs()	Absolute value
10	prod()	Product of values
11	cumsum()	Cumulative sum
12	cumprod()	Cumulative product
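
A short sketch of these functions on a made-up Series follows. The missing value is included on purpose: the reducing functions skip it by default, which is why count() reports four observations for five entries.

```python
import pandas as pd

# The None entry becomes NaN and is skipped by the reductions below.
s = pd.Series([1, 2, 2, -4, None])

print(s.count())    # non-null observations
print(s.sum())
print(s.mean())
print(s.median())
print(s.mode())     # returns a Series, since there may be several modes
print(s.std())
print(s.min(), s.max())
print(s.abs())      # element-wise, so the result is a Series
print(s.prod())
print(s.cumsum())   # running total; NaN stays NaN at its position
print(s.cumprod())
```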

The string functions available on a Series or Index are as follows −

Sr.No	Function	Description
1	lower()	Converts strings in the Series/Index to lower case.
2	upper()	Converts strings in the Series/Index to upper case.
3	len()	Computes the length of each string.
4	strip()	Strips whitespace (including newlines) from both sides of each string in the Series/Index.
5	split(' ')	Splits each string with the given pattern.
6	cat(sep=' ')	Concatenates the Series/Index elements with the given separator.
7	get_dummies()	Returns a DataFrame with one-hot encoded values.
8	contains(pattern)	Returns True for each element that contains the pattern, else False.
9	replace(a,b)	Replaces the value a with the value b.
10	repeat(value)	Repeats each element the specified number of times.
11	count(pattern)	Returns the number of times the pattern appears in each element.
12	startswith(pattern)	Returns True if the element in the Series/Index starts with the pattern.
13	endswith(pattern)	Returns True if the element in the Series/Index ends with the pattern.
14	find(pattern)	Returns the position of the first occurrence of the pattern.
15	findall(pattern)	Returns a list of all occurrences of the pattern.
16	swapcase()	Swaps the case lower/upper.
17	islower()	Checks whether all characters in each string in the Series/Index are lower case. Returns Boolean.
18	isupper()	Checks whether all characters in each string in the Series/Index are upper case. Returns Boolean.
19	isnumeric()	Checks whether all characters in each string in the Series/Index are numeric. Returns Boolean.
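
These functions are reached through the .str accessor of a Series. The sketch below applies several of them to a made-up Series of names; regex=False is passed to replace() so the pattern is treated as plain text.

```python
import pandas as pd

s = pd.Series([" Tom ", "Ann Lee", "bob"])

# strip() removes the surrounding whitespace from each element.
clean = s.str.strip()

print(s.str.len())                    # lengths before stripping
print(clean.str.upper())
print(clean.str.split(" "))           # one list of parts per element
print(clean.str.cat(sep="_"))         # a single joined string
print(clean.str.contains("o"))        # Boolean per element
print(clean.str.replace("o", "0", regex=False))
print(clean.str.count("e"))
print(clean.str.find("o"))            # -1 where the pattern is absent
print(clean.str.swapcase())
print(clean.str.islower())
```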

The frequently used display options are as follows −

Sr.No	Parameter	Description
1	display.max_rows	Maximum number of rows to display
2	display.max_columns	Maximum number of columns to display
3	display.expand_frame_repr	Whether wide DataFrames are printed across multiple lines (stretched across pages)
4	display.max_colwidth	Maximum column width
5	display.precision	Precision for displaying decimal numbers
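
A minimal sketch of reading, changing, and restoring these options follows. The chosen values (20, 3, 10) are arbitrary; option_context applies settings only inside its with block.

```python
import pandas as pd

# Read the current settings so they can be compared later.
default_rows = pd.get_option("display.max_rows")
default_precision = pd.get_option("display.precision")

# Change an option globally, then restore its default.
pd.set_option("display.max_rows", 20)
changed_rows = pd.get_option("display.max_rows")
pd.reset_option("display.max_rows")

# Change options temporarily; they revert when the block exits.
with pd.option_context("display.precision", 3, "display.max_columns", 10):
    inside_precision = pd.get_option("display.precision")

print(changed_rows, inside_precision)
```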

The .loc indexers for each object are as follows −

Object	Indexers	Return Type
Series	s.loc[indexer]	Scalar value
DataFrame	df.loc[row_index, col_index]	Series object
Panel	p.loc[item_index, major_index, minor_index]	p.loc[item_index,major_index, minor_index]
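
The sketch below tries .loc on a made-up Series and DataFrame. It suggests that the return type depends on the kind of indexer: single labels narrow the result down to a scalar or a Series, while lists of labels keep it a DataFrame. Panel is left out because it is not available in current pandas releases.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=["a", "b", "c"])
df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]}, index=["a", "b", "c"])

value = s.loc["b"]                      # single label on a Series: scalar
cell = df.loc["b", "x"]                 # row label and column label: scalar
row = df.loc["b"]                       # row label only: Series
column = df.loc[:, "x"]                 # all rows, one column: Series
block = df.loc[["a", "c"], ["x", "y"]]  # lists of labels: DataFrame

print(value, cell)
print(row)
print(column)
print(block)
```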

The how argument of merge accepts the following methods −

Merge Method	SQL Equivalent	Description
left	LEFT OUTER JOIN	Use keys from left object
right	RIGHT OUTER JOIN	Use keys from right object
outer	FULL OUTER JOIN	Use union of keys
inner	INNER JOIN	Use intersection of keys
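
The four methods can be compared with a sketch like the one below. The two frames and their id, name, and subject columns are made up; only the set of surviving keys differs between the results.

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "name": ["Alex", "Amy", "Allen"]})
right = pd.DataFrame({"id": [2, 3, 4], "subject": ["math", "art", "bio"]})

# Run the same merge once per method and keep each result by name.
results = {
    how: pd.merge(left, right, on="id", how=how)
    for how in ["left", "right", "outer", "inner"]
}

for how, merged in results.items():
    print(how)
    print(merged)
```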

The offset aliases for common time series frequencies are as follows −

Alias	Description	Alias	Description
B	business day frequency	BQS	business quarter start frequency
D	calendar day frequency	A	annual (year) end frequency
W	weekly frequency	BA	business year end frequency
M	month end frequency	BAS	business year start frequency
SM	semi-month end frequency	BH	business hour frequency
BM	business month end frequency	H	hourly frequency
MS	month start frequency	T, min	minutely frequency
SMS	semi-month start frequency	S	secondly frequency
BMS	business month start frequency	L, ms	milliseconds
Q	quarter end frequency	U, us	microseconds
BQ	business quarter end frequency	N	nanoseconds
QS	quarter start frequency
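
An alias is typically passed as the freq argument of a function such as pd.date_range. The sketch below sticks to D, B, and MS, because some other aliases in the table (for example M and H) emit deprecation warnings in recent pandas releases. The start dates are arbitrary; 3 January 2020 was a Friday, so the business-day range skips the weekend.

```python
import pandas as pd

# Calendar days: every day, weekends included.
daily = pd.date_range("2020-01-03", periods=3, freq="D")

# Business days: Saturday and Sunday are skipped.
business = pd.date_range("2020-01-03", periods=3, freq="B")

# Month start: the first day of each month.
month_start = pd.date_range("2020-01-01", periods=3, freq="MS")

print(daily)
print(business)
print(month_start)
```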
