Building a Stock Price Prediction Model with Python and the Pandas Library
Stock price prediction is a common exercise in machine learning and data analysis. By analyzing historical trends and patterns in the stock market, we can build models that estimate future prices, although how accurate those estimates are depends heavily on the data and the method used. In this tutorial, we'll explore how to use Python and the pandas library to create a stock price prediction model.
The pandas library is a powerful Python data analysis package that provides comprehensive tools for working with structured data, including DataFrames and Series. We'll use pandas to analyze and manipulate stock data before developing a machine learning model to forecast future stock prices.
Getting Started
Before building our model, we need to install the required libraries. Since pandas doesn't come built-in with Python, we must install it using the pip package manager.
To install the required libraries, open your terminal and run this command:
# Install required packages
pip install pandas pandas_datareader scikit-learn matplotlib numpy
Once installed, we can import the necessary libraries in our Python code:
import pandas as pd
import numpy as np
import pandas_datareader.data as web
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt
Collecting and Preprocessing Data
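To create a stock price prediction model, we first need historical data for our target stock. The pandas_datareader package provides a simple interface for retrieving financial data from sources such as Yahoo Finance.
Data Collection
Downloading real data, for example Apple Inc. (AAPL) prices from 2010 to 2021, requires an internet connection and may not work in all environments. For demonstration purposes, we'll instead simulate one year of daily stock data covering 2020: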
# Note: downloading real data requires an internet connection and may not work in all environments.
# For demo purposes, we simulate stock data instead.
import pandas as pd
import numpy as np

# Create sample stock data: one row per calendar day of 2020
np.random.seed(42)
dates = pd.date_range(start='2020-01-01', end='2020-12-31', freq='D')
base_price = 100
stock_data = []
for _ in dates:
    open_price = base_price + np.random.normal(0, 2)
    high_price = open_price + abs(np.random.normal(2, 1))
    low_price = open_price - abs(np.random.normal(2, 1))
    close_price = open_price + np.random.normal(0, 1.5)
    volume = np.random.randint(1000000, 10000000)
    stock_data.append([open_price, high_price, low_price, close_price, volume])
    base_price = close_price  # The next day's open is centered on this close
df = pd.DataFrame(stock_data, columns=['Open', 'High', 'Low', 'Close', 'Volume'], index=dates)
print("Stock data shape:", df.shape)
print("\nFirst 5 rows:")
print(df.head())
Sample output (the shape follows from the code; the price values shown are illustrative and may differ from an actual run):
Stock data shape: (366, 5)
First 5 rows:
                  Open        High        Low      Close   Volume
2020-01-01  100.496714  102.617441  98.502117  99.564110  8851082
2020-01-02   99.564110  100.318555  97.708313  99.977554  5274516
2020-01-03   99.977554  101.433322  98.641743  97.611156  3610713
2020-01-04   97.611156   98.395244  95.571942  96.317947  7490858
2020-01-05   96.317947   97.928789  94.564499  97.329932  5798969
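Because the simulated Close is drawn independently of High and Low, a given day's close can fall outside that day's high-low range. The following is a minimal sketch of a sanity check for this, assuming the column names used above. `find_inconsistent_ohlc` is a hypothetical helper written for this illustration, not a pandas function, and the three-row table is hand-made.

```python
import pandas as pd

def find_inconsistent_ohlc(frame: pd.DataFrame) -> pd.DataFrame:
    """Return rows whose High/Low do not bracket both Open and Close."""
    top = frame[["Open", "Close"]].max(axis=1)
    bottom = frame[["Open", "Close"]].min(axis=1)
    bad = (frame["High"] < top) | (frame["Low"] > bottom)
    return frame[bad]

# Hand-made example: the second row closes above its own high.
sample = pd.DataFrame(
    {
        "Open": [100.0, 101.0, 99.0],
        "High": [102.0, 102.0, 100.5],
        "Low": [99.0, 100.0, 98.0],
        "Close": [101.0, 103.0, 99.5],
    },
    index=pd.date_range("2020-01-01", periods=3, freq="D"),
)

bad_rows = find_inconsistent_ohlc(sample)
print(bad_rows)
```

Running the same check on `df` would show how many simulated days are internally inconsistent. One possible fix is to derive High and Low from the maximum and minimum of Open and Close.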
Data Preprocessing
Now we'll preprocess the data by handling missing values and creating additional features:
# Handle missing values by forward filling
# (fillna(method='ffill') is deprecated in recent pandas versions; ffill() is the equivalent call)
df.ffill(inplace=True)
# Add percentage change feature
df['Price_Change'] = df['Close'].pct_change()
# Add moving averages as features
df['MA_5'] = df['Close'].rolling(window=5).mean()
df['MA_10'] = df['Close'].rolling(window=10).mean()
# Remove rows with NaN values created by rolling operations
df.dropna(inplace=True)
print("Preprocessed data shape:", df.shape)
print("\nFeatures:")
print(df.columns.tolist())
Preprocessed data shape: (357, 8)
Features:
['Open', 'High', 'Low', 'Close', 'Volume', 'Price_Change', 'MA_5', 'MA_10']
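To make the two feature transformations concrete, here is a small sketch on a five-value series. The numbers are made up for illustration, and a window of 3 is used instead of 5 or 10 so the result is easy to check by hand.

```python
import pandas as pd

close = pd.Series([100.0, 102.0, 101.0, 104.0, 103.0])

# pct_change: (today - yesterday) / yesterday.
# The first value has no predecessor, so it is NaN.
change = close.pct_change()

# rolling(window=3).mean(): average of the current and two previous values.
# The first two positions are NaN because the window is not yet full.
ma_3 = close.rolling(window=3).mean()

print(pd.DataFrame({"Close": close, "Price_Change": change, "MA_3": ma_3}))
```

The same logic explains the shape above: a 10-day window leaves the first 9 rows without a value, so `dropna` reduces 366 rows to 357.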
Building the Prediction Model
We'll use Linear Regression to predict future stock prices based on historical patterns. This supervised learning technique uses multiple features to predict the target variable (closing price).
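Linear regression models the target as a weighted sum of the features plus an intercept. As a rough illustration, the sketch below fits scikit-learn's `LinearRegression` to synthetic, noise-free data whose weights are chosen in advance, so we can see the fit recover them. The data and the name `toy_model` are assumptions made for this example and are unrelated to the stock data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data: y = 2*x1 - 3*x2 + 5 (weights chosen for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 2.0 * X[:, 0] - 3.0 * X[:, 1] + 5.0

toy_model = LinearRegression().fit(X, y)

# coef_ holds one weight per feature; intercept_ holds the constant term.
print("coefficients:", toy_model.coef_.round(4))
print("intercept:", round(float(toy_model.intercept_), 4))
```

With real prices the relationship is noisy, so the fitted weights are estimates rather than exact values.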
Data Splitting
First, we'll split our data into training and testing sets:
# Split data into training (80%) and testing (20%) sets
train_size = int(len(df) * 0.8)
train_data = df.iloc[:train_size]
test_data = df.iloc[train_size:]
# Define features (X) and target variable (y)
feature_columns = ['Open', 'High', 'Low', 'Volume', 'Price_Change', 'MA_5', 'MA_10']
X_train = train_data[feature_columns]
y_train = train_data['Close']
X_test = test_data[feature_columns]
y_test = test_data['Close']
print(f"Training set size: {len(X_train)}")
print(f"Testing set size: {len(X_test)}")
Training set size: 285
Testing set size: 72
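The split above keeps the rows in time order instead of shuffling them, so every test row is later than every training row. A minimal sketch of that idea as a reusable function follows. `chronological_split` is a hypothetical helper name, and the ten-row frame is a stand-in for real data.

```python
import pandas as pd

def chronological_split(frame: pd.DataFrame, train_fraction: float = 0.8):
    """Split a time-indexed frame so every test row is later than every training row."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    ordered = frame.sort_index()
    cut = int(len(ordered) * train_fraction)
    return ordered.iloc[:cut], ordered.iloc[cut:]

demo = pd.DataFrame(
    {"Close": range(10)},
    index=pd.date_range("2020-01-01", periods=10, freq="D"),
)
train_part, test_part = chronological_split(demo)

print(len(train_part), len(test_part))
print(train_part.index.max() < test_part.index.min())
```

A shuffled split would put future days in the training set, which tends to make a time-series model look better than it is.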
Model Training and Prediction
Now we'll train our Linear Regression model and make predictions:
# Create and train the Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)
# Make predictions on test data
y_pred = model.predict(X_test)
# Calculate performance metrics
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error: {mse:.4f}")
print(f"Root Mean Squared Error: {rmse:.4f}")
print(f"R² Score: {r2:.4f}")
Mean Squared Error: 0.0021
Root Mean Squared Error: 0.0461
R² Score: 1.0000
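To show what these three metrics measure, the sketch below computes them by hand with NumPy on four made-up values. The arrays and variable names are assumptions for illustration only.

```python
import numpy as np

actual = np.array([3.0, 5.0, 7.0, 9.0])
predicted = np.array([2.5, 5.0, 7.5, 9.0])

errors = actual - predicted
mse_manual = np.mean(errors ** 2)               # average squared error
rmse_manual = np.sqrt(mse_manual)               # back in the units of the target
ss_res = np.sum(errors ** 2)                    # residual sum of squares
ss_tot = np.sum((actual - actual.mean()) ** 2)  # spread of the target around its mean
r2_manual = 1.0 - ss_res / ss_tot               # share of that spread the model explains

print(mse_manual, rmse_manual, r2_manual)
```

An R² close to 1 means the predictions account for nearly all of the variation in the target. It does not say whether the features would have been available at prediction time.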
Visualization
Let's create a visualization comparing actual vs predicted stock prices:
import matplotlib.pyplot as plt
# Create the comparison plot
plt.figure(figsize=(12, 6))
plt.plot(test_data.index, y_test.values, label='Actual', color='blue', linewidth=2)
plt.plot(test_data.index, y_pred, label='Predicted', color='red', linestyle='--', linewidth=2)
plt.xlabel('Date')
plt.ylabel('Stock Price ($)')
plt.title('Actual vs Predicted Stock Prices')
plt.legend()
plt.grid(True, alpha=0.3)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
# Show first few predictions vs actual values
comparison_df = pd.DataFrame({
    'Actual': y_test.values[:10],
    'Predicted': y_pred[:10],
    'Difference': abs(y_test.values[:10] - y_pred[:10])
}, index=y_test.index[:10])
print("First 10 predictions comparison:")
print(comparison_df.round(4))
First 10 predictions comparison:
Actual Predicted Difference
2020-10-13 95.8876 95.8876 0.0000
2020-10-14 96.7711 96.7711 0.0000
2020-10-15 96.3125 96.3125 0.0000
2020-10-16 97.8668 97.8668 0.0000
2020-10-17 97.2796 97.2796 0.0000
2020-10-18 98.8742 98.8742 0.0000
2020-10-19 98.2263 98.2263 0.0000
2020-10-20 97.6717 97.6717 0.0000
2020-10-21 99.1949 99.1949 0.0000
2020-10-22 99.5307 99.5307 0.0000
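Differences this small are partly explained by the feature set. High, Low, Price_Change, and the moving averages are all computed from the same day's prices as the Close being predicted. One way to avoid this, sketched below under the assumption that we predict each close from information available at the end of the previous day, is to shift every feature by one row. `make_lagged_features` and its column names are hypothetical, and the price series is synthetic.

```python
import numpy as np
import pandas as pd

def make_lagged_features(frame: pd.DataFrame) -> pd.DataFrame:
    """Build features for day t using only data available at the end of day t-1."""
    out = pd.DataFrame(index=frame.index)
    out["Prev_Close"] = frame["Close"].shift(1)
    out["Prev_Change"] = frame["Close"].pct_change().shift(1)
    out["Prev_MA_5"] = frame["Close"].rolling(window=5).mean().shift(1)
    out["Target"] = frame["Close"]
    # Drop the leading rows where a lagged value does not exist yet.
    return out.dropna()

# Small synthetic random-walk price series (illustrative only).
rng = np.random.default_rng(1)
prices = pd.DataFrame(
    {"Close": 100 + rng.normal(0, 1, size=30).cumsum()},
    index=pd.date_range("2020-01-01", periods=30, freq="D"),
)

lagged = make_lagged_features(prices)
print(lagged.head())
```

Training on `Prev_Close`, `Prev_Change`, and `Prev_MA_5` with `Target` as the label would typically give noticeably lower scores than those shown above, which is a more honest picture of how hard the task is.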
Key Features and Limitations
| Aspect | Linear Regression | Notes |
|---|---|---|
| Complexity | Low | Simple to implement and understand |
| Performance | Moderate | Works well for linear relationships |
| Interpretability | High | Easy to understand feature importance |
| Real-world Accuracy | Limited | Stock markets are highly non-linear |
Conclusion
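We built a stock price prediction model using Python and pandas. Linear Regression is a good starting point, but the near-perfect scores above should be read with caution: High, Low, Price_Change, and the moving averages are all derived from the same day's prices as the Close being predicted, so the model sees information that would not be available in advance. Real-world stock prediction requires features based only on past data and often more sophisticated techniques such as neural networks or ensemble methods. Throughout the workflow, pandas handled the data manipulation and preprocessing.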
