How to Utilize Time Series in Pandas?


Time series data is data recorded at successive points in time, and handling it well is a central part of many data analysis tasks. Pandas, a popular data manipulation and analysis library for Python, provides robust functionality for working with time series data. In this article, we will work through examples and explanations of how to use time series effectively in Pandas.

Ways to Utilize Time Series Data

The examples below use the Electric_Production time series data set, which can be downloaded from Kaggle.

Importing and Manipulating Time Series Data

To work with time series data in Pandas, we first import the necessary libraries and load the data into a DataFrame. Pandas provides various methods to read time series data from different sources, including CSV files, databases, and web APIs. Once the data is loaded, Pandas offers powerful tools to manipulate, clean, and preprocess it.

import pandas as pd

# Load time series data from a CSV file
data = pd.read_csv('Electric_Production.csv')

# Display the first few rows of the DataFrame
print(data.head())

# Convert the 'DATE' column to datetime and set it as the index
data['DATE'] = pd.to_datetime(data['DATE'])
data.set_index('DATE', inplace=True)

# Resample the data to a daily frequency
# (the source data is monthly, so days without an observation become NaN)
daily_data = data.resample('D').mean()

Output

       DATE  IPG2211A2N
0  1/1/1985     72.5052
1  2/1/1985     70.6720
2  3/1/1985     62.4502
3  4/1/1985     57.4714
4  5/1/1985     55.3151
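
The parsing and indexing steps can also be folded into the read itself. The following is a minimal sketch, assuming the CSV has the DATE and IPG2211A2N columns shown in the output above; the inline text `csv_text` is a stand-in for the file so that the snippet runs on its own.

```python
import io
import pandas as pd

# Inline stand-in for Electric_Production.csv (the five rows shown above)
csv_text = """DATE,IPG2211A2N
1/1/1985,72.5052
2/1/1985,70.6720
3/1/1985,62.4502
4/1/1985,57.4714
5/1/1985,55.3151
"""

# parse_dates converts DATE to datetime values and index_col makes it the index,
# so no separate to_datetime / set_index calls are needed
data = pd.read_csv(io.StringIO(csv_text), parse_dates=['DATE'], index_col='DATE')

print(type(data.index).__name__)
print(data.head())
```

With a real file, the first argument would simply be the file path instead of the `io.StringIO` wrapper.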

Indexing and Slicing Time Series Data

Pandas provides various indexing and slicing methods to extract specific time periods or observations from time series data. The DatetimeIndex in Pandas enables intuitive indexing and selection based on time.

import pandas as pd

# Load time series data from a CSV file
data = pd.read_csv('Electric_Production.csv')

# Convert the 'DATE' column to datetime and set it as the index
data['DATE'] = pd.to_datetime(data['DATE'])
data.set_index('DATE', inplace=True)

# Resample the data to a daily frequency
daily_data = data.resample('D').mean()

# Select data for a specific date range (both endpoints are inclusive)
subset_1 = data.loc['2017-01-01':'2017-10-30']
print(subset_1)

# Select data for a specific month
subset_2 = data[data.index.month == 3]
print(subset_2)

# Select data for a specific year
subset_3 = data[data.index.year == 2016]
print(subset_3)

Output

            IPG2211A2N
DATE
2017-01-01    114.8505
2017-02-01     99.4901
2017-03-01    101.0396
2017-04-01     88.3530
2017-05-01     92.0805
2017-06-01    102.1532
2017-07-01    112.1538
2017-08-01    108.9312
2017-09-01     98.6154
2017-10-01     93.6137
            IPG2211A2N
DATE
1985-03-01     62.4502
1986-03-01     62.2221
1987-03-01     65.6100
1988-03-01     70.2928
1989-03-01     73.3523
1990-03-01     73.1964
1991-03-01     73.3650
1992-03-01     74.5275
1993-03-01     79.4747
1994-03-01     79.2456
1995-03-01     81.2661
1996-03-01     86.9356
1997-03-01     83.0125
1998-03-01     86.5549
1999-03-01     90.7381
2000-03-01     88.0927
2001-03-01     92.8283
2002-03-01     93.2556
2003-03-01     94.5532
2004-03-01     95.4029
2005-03-01     98.9565
2006-03-01     98.4017
2007-03-01     99.1925
2008-03-01    100.4386
2009-03-01     97.8529
2010-03-01     98.2672
2011-03-01     99.1028
2012-03-01     93.5772
2013-03-01    102.9948
2014-03-01    104.7631
2015-03-01    104.4706
2016-03-01     95.3548
2017-03-01    101.0396
            IPG2211A2N
DATE
2016-01-01    117.0837
2016-02-01    106.6688
2016-03-01     95.3548
2016-04-01     89.3254
2016-05-01     90.7369
2016-06-01    104.0375
2016-07-01    114.5397
2016-08-01    115.5159
2016-09-01    102.7637
2016-10-01     91.4867
2016-11-01     92.8900
2016-12-01    112.7694
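
A DatetimeIndex also accepts partial date strings through `.loc`, which is often shorter than building a boolean mask. The sketch below illustrates this on a small made-up monthly frame; the column name `value` and the numbers are hypothetical and are not taken from the data set.

```python
import pandas as pd

# Hypothetical monthly data covering 2016 and 2017 (values are made up)
idx = pd.date_range('2016-01-01', '2017-12-01', freq='MS')
data = pd.DataFrame({'value': range(len(idx))}, index=idx)

# A year string selects every row in that year
year_2016 = data.loc['2016']

# A year-month string selects every row in that month
march_2017 = data.loc['2017-03']

# Slices with partial strings include both endpoints
first_half_2017 = data.loc['2017-01':'2017-06']

print(len(year_2016), len(march_2017), len(first_half_2017))
```

Note that `data.loc['2016']` returns one specific year, whereas the mask `data.index.month == 3` used above picks the same month across all years, so the two approaches complement each other.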

Handling Missing Data

Time series data often contains missing values, which can hinder analysis and modeling. Pandas offers several methods to handle missing data, such as interpolation, forward-fill, and backward-fill. These methods help preserve the continuity of the time series.

import pandas as pd

# Load time series data from a CSV file
data = pd.read_csv('Electric_Production.csv')

# Display the first few rows of the DataFrame
# print(data.head())

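# Convert the 'DATE' column to datetime and set it as the index
data['DATE'] = pd.to_datetime(data['DATE'])
data.set_index('DATE', inplace=True)

# The raw CSV names its data column 'IPG2211A2N'; rename it to 'value'
# so the lookups below work (a no-op if the column is already 'value')
data = data.rename(columns={'IPG2211A2N': 'value'})

# Resample the data to a daily frequency
daily_data = data.resample('D').mean()

# Interpolate missing values (linear interpolation by default)
data['value'] = data['value'].interpolate()
print(data.head())

# Forward-fill missing values
# Note: these calls are chained on the same column, so once interpolate()
# has filled a gap, ffill() and bfill() leave it unchanged
data['value'] = data['value'].ffill()
print(data.head())

# Backward-fill missing values
data['value'] = data['value'].bfill()
print(data.head())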

Output

              value
DATE
1985-01-01  72.5052
1985-02-01  70.6720
1985-03-01  64.0717
1985-04-01  57.4714
1985-05-01  55.3151
              value
DATE
1985-01-01  72.5052
1985-02-01  70.6720
1985-03-01  64.0717
1985-04-01  57.4714
1985-05-01  55.3151
              value
DATE
1985-01-01  72.5052
1985-02-01  70.6720
1985-03-01  64.0717
1985-04-01  57.4714
1985-05-01  55.3151
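
The three outputs above are identical because each fill runs on the result of the previous one. To see how the strategies differ, each can be applied to the original series separately. This is a minimal sketch, assuming a five-row series in which the March value has been blanked out for illustration; the names `s` and `filled` are hypothetical.

```python
import numpy as np
import pandas as pd

# Five monthly values with the March observation removed for illustration
idx = pd.date_range('1985-01-01', periods=5, freq='MS')
s = pd.Series([72.5052, 70.6720, np.nan, 57.4714, 55.3151], index=idx, name='value')

# Apply each strategy to the original series, not to the previous result
filled = pd.DataFrame({
    'original': s,
    'interpolate': s.interpolate(),  # linear: midpoint of the two neighbours
    'ffill': s.ffill(),              # carry the last known value forward
    'bfill': s.bfill(),              # pull the next known value backward
})
print(filled)
```

Interpolation suits smoothly varying quantities, while forward-fill is the safer choice when a later value must not leak into an earlier period.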

Resampling and Frequency Conversion

Resampling involves changing the frequency of the time series data. Pandas provides methods for both upsampling (increasing the frequency) and downsampling (decreasing the frequency) of time series data. This allows for aggregation or interpolation of data at different time intervals.

import pandas as pd

# Load time series data from a CSV file
data = pd.read_csv('Electric_Production.csv')

# Display the first few rows of the DataFrame
# print(data.head())

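# Convert the 'DATE' column to datetime and set it as the index
data['DATE'] = pd.to_datetime(data['DATE'])
data.set_index('DATE', inplace=True)

# The raw CSV names its data column 'IPG2211A2N'; rename it to 'value'
# to match the column name shown in the output
data = data.rename(columns={'IPG2211A2N': 'value'})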

# Resample the data to a daily frequency
daily_data = data.resample('D').mean()
print(daily_data.head())

# Resample the data to a weekly frequency, taking the mean value
weekly_data = data.resample('W').mean()
print(weekly_data.head())

# Resample the data to a monthly frequency, taking the sum
# ('M' labels each bin by its month end; pandas 2.2 and later prefer the alias 'ME')
monthly_data = data.resample('M').sum()
print(monthly_data.head())

Output

              value
DATE
1985-01-01  72.5052
1985-01-02      NaN
1985-01-03      NaN
1985-01-04      NaN
1985-01-05      NaN
              value
DATE
1985-01-06  72.5052
1985-01-13      NaN
1985-01-20      NaN
1985-01-27      NaN
1985-02-03  70.6720
              value
DATE
1985-01-31  72.5052
1985-02-28  70.6720
1985-03-31  62.4502
1985-04-30  57.4714
1985-05-31  55.3151
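
Upsampling with `mean()` leaves NaN for every period that has no observation, as the daily and weekly outputs show. The sketch below, which uses made-up monthly values and a hypothetical `value` column, shows one way to fill those gaps when upsampling and how downsampling aggregates several observations into one.

```python
import pandas as pd

# Hypothetical monthly data (values are made up for easy checking)
idx = pd.date_range('1985-01-01', periods=6, freq='MS')
data = pd.DataFrame({'value': [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]}, index=idx)

# Downsampling: monthly -> quarterly, aggregating three months into one mean
quarterly = data.resample('QS').mean()

# Upsampling: monthly -> daily, repeating the last known monthly value
daily_ffill = data.resample('D').ffill()

# Upsampling: monthly -> daily, interpolating linearly between monthly values
daily_interp = data.resample('D').interpolate()

print(quarterly)
print(daily_ffill.head(3))
print(daily_interp.head(3))
```

Which upsampling fill is appropriate depends on the data: forward-fill treats each monthly figure as constant until the next one, while interpolation assumes a gradual change between observations.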

Plotting and Visualizing Time Series Data

Pandas integrates with Matplotlib, a popular data visualization library, making it easy to create plots of time series data. Visualizations can aid in understanding trends, patterns, and anomalies in the data.

import pandas as pd
import matplotlib.pyplot as plt

# Load time series data from a CSV file
data = pd.read_csv('Electric_Production.csv')

# Display the first few rows of the DataFrame
# print(data.head())

# Convert the 'DATE' column to datetime and set it as the index
data['DATE'] = pd.to_datetime(data['DATE'])
data.set_index('DATE', inplace=True)

# Plot the time series data
data.plot()
plt.title('Time Series Data')
plt.xlabel('Date')
plt.ylabel('Value')
plt.show()

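Output

[Figure: line plot of the series, titled "Time Series Data", with "Date" on the x-axis and "Value" on the y-axis]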
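
The same kind of plot can also be built on an explicit Matplotlib Axes object, which makes it easier to label and save the figure from a script. This is a minimal sketch using made-up monthly values; the `value` column, the title, and the in-memory buffer `buf` are illustrative assumptions, and the `Agg` backend is selected only so that the code runs without a display.

```python
import io

import matplotlib
matplotlib.use('Agg')  # non-interactive backend: render without opening a window
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical monthly data (values are made up)
idx = pd.date_range('1985-01-01', periods=24, freq='MS')
data = pd.DataFrame({'value': range(24)}, index=idx)

# Create the figure and axes explicitly, then let pandas draw onto them
fig, ax = plt.subplots(figsize=(8, 4))
data['value'].plot(ax=ax)
ax.set_title('Monthly values')
ax.set_xlabel('Date')
ax.set_ylabel('Value')

# Save the rendered figure as PNG bytes; pass a file path instead to write to disk
buf = io.BytesIO()
fig.savefig(buf, format='png')
plt.close(fig)
```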

Conclusion

In this article, we discussed how to work with time series data in Pandas. From importing and preprocessing data to indexing, resampling, and visualization, Pandas covers the core steps of a time series analysis workflow. By using the functionality discussed in this article, analysts and data scientists can gain valuable insights and make informed decisions based on time-based data.

Updated on: 18-Jul-2023
