How to Utilize Time Series in Pandas?
Time series data is data recorded at successive points in time, and handling it correctly is an important part of data analysis. Pandas, a popular data manipulation and analysis library in Python, provides robust functionality for working with time series data. In this article, we will understand through examples and explanations how to effectively utilize time series in Pandas.
Ways to Utilize Time Series Data
In the methods below we use the Electric_Production time series data set, which is available for download from Kaggle.
Importing and Manipulating Time Series Data
When working with time series data in Pandas, we first import the necessary libraries and load the data into a DataFrame. Pandas provides various methods to read time series data from different sources, including CSV files, databases, and web APIs. Once the data is loaded, Pandas offers powerful tools to manipulate, clean, and preprocess it.
import pandas as pd

# Load time series data from a CSV file
data = pd.read_csv('Electric_Production.csv')

# Display the first few rows of the DataFrame
print(data.head())

# Convert the 'DATE' column to datetime and set it as the index
data['DATE'] = pd.to_datetime(data['DATE'])
data.set_index('DATE', inplace=True)

# Resample the data to a daily frequency
daily_data = data.resample('D').mean()
Output
       DATE  IPG2211A2N
0  1/1/1985     72.5052
1  2/1/1985     70.6720
2  3/1/1985     62.4502
3  4/1/1985     57.4714
4  5/1/1985     55.3151
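As an alternative to converting the column after loading, `read_csv` can parse the dates and set the index in one step. The following is a minimal sketch under stated assumptions: the inline `csv_text` sample is hypothetical and only mimics the first rows shown above, so that the snippet runs without the downloaded file.

```python
import io
import pandas as pd

# Hypothetical inline sample mimicking the first rows of the CSV file.
csv_text = """DATE,IPG2211A2N
1/1/1985,72.5052
2/1/1985,70.6720
3/1/1985,62.4502
"""

# parse_dates converts the DATE column to datetimes while reading;
# index_col makes that column the index of the resulting DataFrame.
data = pd.read_csv(io.StringIO(csv_text), parse_dates=["DATE"], index_col="DATE")

print(type(data.index).__name__)
print(data)
```

With a real file, the `io.StringIO(csv_text)` argument would simply be replaced by the file path.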
Indexing and Slicing Time Series Data
Pandas provides various indexing and slicing methods to extract specific time periods or observations from time series data. The DatetimeIndex in Pandas enables intuitive indexing and selection based on time.
import pandas as pd

# Load time series data from a CSV file
data = pd.read_csv('Electric_Production.csv')

# Convert the 'DATE' column to datetime and set it as the index
data['DATE'] = pd.to_datetime(data['DATE'])
data.set_index('DATE', inplace=True)

# Select data for a specific date range (both endpoints are inclusive)
subset_1 = data.loc['2017-01-01':'2017-10-30']
print(subset_1)

# Select data for a specific month (March of every year)
subset_2 = data[data.index.month == 3]
print(subset_2)

# Select data for a specific year
subset_3 = data[data.index.year == 2016]
print(subset_3)
Output
            IPG2211A2N
DATE
2017-01-01    114.8505
2017-02-01     99.4901
2017-03-01    101.0396
2017-04-01     88.3530
2017-05-01     92.0805
2017-06-01    102.1532
2017-07-01    112.1538
2017-08-01    108.9312
2017-09-01     98.6154
2017-10-01     93.6137

            IPG2211A2N
DATE
1985-03-01     62.4502
1986-03-01     62.2221
1987-03-01     65.6100
1988-03-01     70.2928
1989-03-01     73.3523
1990-03-01     73.1964
1991-03-01     73.3650
1992-03-01     74.5275
1993-03-01     79.4747
1994-03-01     79.2456
1995-03-01     81.2661
1996-03-01     86.9356
1997-03-01     83.0125
1998-03-01     86.5549
1999-03-01     90.7381
2000-03-01     88.0927
2001-03-01     92.8283
2002-03-01     93.2556
2003-03-01     94.5532
2004-03-01     95.4029
2005-03-01     98.9565
2006-03-01     98.4017
2007-03-01     99.1925
2008-03-01    100.4386
2009-03-01     97.8529
2010-03-01     98.2672
2011-03-01     99.1028
2012-03-01     93.5772
2013-03-01    102.9948
2014-03-01    104.7631
2015-03-01    104.4706
2016-03-01     95.3548
2017-03-01    101.0396

            IPG2211A2N
DATE
2016-01-01    117.0837
2016-02-01    106.6688
2016-03-01     95.3548
2016-04-01     89.3254
2016-05-01     90.7369
2016-06-01    104.0375
2016-07-01    114.5397
2016-08-01    115.5159
2016-09-01    102.7637
2016-10-01     91.4867
2016-11-01     92.8900
2016-12-01    112.7694
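A DatetimeIndex also supports partial-string selection through `.loc`, which can be shorter than building boolean masks. The sketch below uses a small hypothetical monthly DataFrame (the `value` column and the date range are made up for illustration) rather than the Kaggle file.

```python
import pandas as pd

# Hypothetical monthly data standing in for the electric production series.
idx = pd.date_range("2016-01-01", periods=24, freq="MS")
data = pd.DataFrame({"value": range(24)}, index=idx)

# A year string selects every row in that year.
year_2017 = data.loc["2017"]

# A year-month string selects every row in that month.
march_2016 = data.loc["2016-03"]

# String slices work too, and both endpoints are inclusive.
first_half_2016 = data.loc["2016-01":"2016-06"]

print(len(year_2017), len(march_2016), len(first_half_2016))
```

Note that, unlike positional slicing, label-based slicing on dates includes the end point.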
Handling Missing Data
Time series data often contains missing values, which can hinder analysis and modeling. Pandas offers several methods to handle missing data, such as interpolation, forward-fill, or backward-fill. These methods help ensure the continuity of the time series.
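import pandas as pd

# Load time series data from a CSV file
data = pd.read_csv('Electric_Production.csv')

# Convert the 'DATE' column to datetime and set it as the index
data['DATE'] = pd.to_datetime(data['DATE'])
data.set_index('DATE', inplace=True)

# Rename the measurement column to 'value', the name used in the rest of this example
data = data.rename(columns={'IPG2211A2N': 'value'})

# Assumption: blank out one observation so that there is a gap to fill.
# The output shown for this example is consistent with a missing March 1985 value.
data.loc[pd.Timestamp('1985-03-01'), 'value'] = float('nan')

# Interpolate missing values (linear interpolation by default)
data['value'] = data['value'].interpolate()
print(data.head())

# Forward-fill missing values (no gaps remain after interpolation, so nothing changes)
data['value'] = data['value'].ffill()
print(data.head())

# Backward-fill missing values (again a no-op here, shown for completeness)
data['value'] = data['value'].bfill()
print(data.head())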
Output
              value
DATE
1985-01-01  72.5052
1985-02-01  70.6720
1985-03-01  64.0717
1985-04-01  57.4714
1985-05-01  55.3151

              value
DATE
1985-01-01  72.5052
1985-02-01  70.6720
1985-03-01  64.0717
1985-04-01  57.4714
1985-05-01  55.3151

              value
DATE
1985-01-01  72.5052
1985-02-01  70.6720
1985-03-01  64.0717
1985-04-01  57.4714
1985-05-01  55.3151
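Because the three fills give identical results once the gaps are gone, it can help to compare them side by side on the same gappy series. The following is a small sketch on made-up numbers (the series `s` is hypothetical, not taken from the data set), applying each method independently to the original values.

```python
import pandas as pd

# Hypothetical monthly series with two consecutive missing values.
idx = pd.date_range("1985-01-01", periods=5, freq="MS")
s = pd.Series([10.0, float("nan"), float("nan"), 40.0, 50.0], index=idx)

# Apply each strategy to the same original series and compare the columns.
filled = pd.DataFrame({
    "original": s,
    "interpolate": s.interpolate(),  # straight line between known neighbours
    "ffill": s.ffill(),              # repeat the last known value
    "bfill": s.bfill(),              # use the next known value
})

print(filled)
```

Interpolation fills the gap with 20 and 30, forward-fill repeats 10, and backward-fill uses 40; which one is appropriate depends on how the underlying quantity behaves between observations.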
Resampling and Frequency Conversion
Resampling involves changing the frequency of the time series data. Pandas provides methods for both upsampling (increasing the frequency) and downsampling (decreasing the frequency) of time series data. This allows for aggregation or interpolation of data at different time intervals.
import pandas as pd

# Load time series data from a CSV file
data = pd.read_csv('Electric_Production.csv')

# Convert the 'DATE' column to datetime and set it as the index
data['DATE'] = pd.to_datetime(data['DATE'])
data.set_index('DATE', inplace=True)

# Rename the measurement column to 'value' to match the output below
data = data.rename(columns={'IPG2211A2N': 'value'})

# Resample the data to a daily frequency (upsampling leaves NaN on days with no observation)
daily_data = data.resample('D').mean()
print(daily_data.head())

# Resample the data to a weekly frequency, taking the mean value
weekly_data = data.resample('W').mean()
print(weekly_data.head())

# Resample the data to a month-end frequency, taking the sum
# ('M' is spelled 'ME' in recent Pandas versions)
monthly_data = data.resample('M').sum()
print(monthly_data.head())
Output
              value
DATE
1985-01-01  72.5052
1985-01-02      NaN
1985-01-03      NaN
1985-01-04      NaN
1985-01-05      NaN

              value
DATE
1985-01-06  72.5052
1985-01-13      NaN
1985-01-20      NaN
1985-01-27      NaN
1985-02-03  70.6720

              value
DATE
1985-01-31  72.5052
1985-02-28  70.6720
1985-03-31  62.4502
1985-04-30  57.4714
1985-05-31  55.3151
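The NaN rows above appear because upsampling creates new timestamps that have no observation. The sketch below, on a small hypothetical daily series, shows downsampling with an aggregation and then upsampling with two different ways of filling the new rows.

```python
import pandas as pd

# Hypothetical daily series used only for illustration.
idx = pd.date_range("2020-01-01", periods=6, freq="D")
daily = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], index=idx)

# Downsample: aggregate each 2-day bin into a single value.
two_day_sum = daily.resample("2D").sum()

# Upsample back to daily: the new rows must be filled explicitly.
upsampled_ffill = two_day_sum.resample("D").ffill()         # repeat last value
upsampled_interp = two_day_sum.resample("D").interpolate()  # linear fill

print(two_day_sum)
print(upsampled_ffill)
print(upsampled_interp)
```

Downsampling always needs an aggregation (sum, mean, and so on), while upsampling needs a fill rule; leaving the fill out, as with `.mean()` on an upsampled series, produces NaN for the new timestamps.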
Plotting and Visualizing Time Series Data
Pandas integrates with Matplotlib, a popular data visualization library, making it easy to create insightful plots and visualizations of time series data. Visualizations can aid in understanding trends, patterns, and anomalies in the data.
import pandas as pd
import matplotlib.pyplot as plt

# Load time series data from a CSV file
data = pd.read_csv('Electric_Production.csv')

# Convert the 'DATE' column to datetime and set it as the index
data['DATE'] = pd.to_datetime(data['DATE'])
data.set_index('DATE', inplace=True)

# Plot the time series data
data.plot()
plt.title('Time Series Data')
plt.xlabel('Date')
plt.ylabel('Value')
plt.show()
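Output
[Figure: line plot of the series over time, titled 'Time Series Data', with 'Date' on the x-axis and 'Value' on the y-axis.]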
Conclusion
In this article, we discussed how to work with time series data using Pandas. From importing and preprocessing data to resampling and visualization, Pandas simplifies the time series analysis workflow. By leveraging the functionality discussed in this article, analysts and data scientists can gain valuable insights and make informed decisions based on time-based data.