Building a Stock Price Prediction Model with Python and the Pandas Library

Stock price prediction is a common use case in machine learning and data analysis. By analyzing historical trends and patterns in the stock market, we can build models that estimate future prices, although how accurate they are depends heavily on the data and the method. In this tutorial, we'll explore how to use Python and the pandas library to create a stock price prediction model.

The pandas library is a powerful Python data analysis package that provides comprehensive tools for working with structured data, including DataFrames and Series. We'll use pandas to analyze and manipulate stock data before developing a machine learning model to forecast future stock prices.

Getting Started

Before building our model, we need to install the required libraries. Since pandas is not part of Python's standard library, we install it with the pip package manager.

To install the required libraries, open your terminal and run this command:

# Install required packages
pip install pandas pandas_datareader scikit-learn matplotlib numpy

Once installed, we can import the necessary libraries in our Python code:

import pandas as pd
import numpy as np
import pandas_datareader.data as web
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt

Collecting and Preprocessing Data

To create a stock price prediction model, we first need historical data for our target stock. A common source is Yahoo Finance, accessed through the pandas_datareader package, which provides a simple interface for retrieving financial data.

Data Collection

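A live download would fetch data for Apple Inc. (AAPL) from 2010 to 2021. Because that needs an internet connection and may not work in all environments, the code below instead simulates one year of daily prices (2020) so the example runs anywhere: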

# Note: downloading real data requires an internet connection and may not work
# in all environments, so we create sample data for demonstration purposes.
import pandas as pd
import numpy as np

# Create sample stock data (one row per calendar day in 2020)
np.random.seed(42)
dates = pd.date_range(start='2020-01-01', end='2020-12-31', freq='D')
base_price = 100

stock_data = []
for _ in dates:
    open_price = base_price + np.random.normal(0, 2)
    high_price = open_price + abs(np.random.normal(2, 1))
    low_price = open_price - abs(np.random.normal(2, 1))
    close_price = open_price + np.random.normal(0, 1.5)
    volume = np.random.randint(1000000, 10000000)

    stock_data.append([open_price, high_price, low_price, close_price, volume])
    base_price = close_price  # The next day's open is drawn around this close

df = pd.DataFrame(stock_data, columns=['Open', 'High', 'Low', 'Close', 'Volume'], index=dates)
print("Stock data shape:", df.shape)
print("\nFirst 5 rows:")
print(df.head())
Stock data shape: (366, 5)

First 5 rows:
                 Open       High        Low      Close    Volume
2020-01-01  100.496714  102.617441  98.502117  99.564110   8851082
2020-01-02   99.564110  100.318555  97.708313  99.977554   5274516
2020-01-03   99.977554  101.433322  98.641743  97.611156   3610713
2020-01-04   97.611156   98.395244  95.571942  96.317947   7490858
2020-01-05   96.317947   97.928789  94.564499  97.329932   5798969
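
Because the simulation draws the high and low relative to the open only, a simulated close can occasionally land outside the day's high/low range. The sketch below shows one way to spot such rows. The helper name `ohlc_violations` and the two demo rows are hypothetical, introduced here only for illustration.

```python
import pandas as pd


def ohlc_violations(frame):
    """Return rows where High or Low does not bound both Open and Close."""
    top = frame[["Open", "Close"]].max(axis=1)
    bottom = frame[["Open", "Close"]].min(axis=1)
    bad = (frame["High"] < top) | (frame["Low"] > bottom)
    return frame[bad]


# Two hand-made rows: the first is consistent, the second closes above its high
demo = pd.DataFrame(
    {
        "Open": [100.0, 100.0],
        "High": [102.0, 101.0],
        "Low": [98.0, 99.0],
        "Close": [101.0, 103.0],
    },
    index=pd.to_datetime(["2020-01-01", "2020-01-02"]),
)

bad_rows = ohlc_violations(demo)
print(bad_rows)
```

Calling `ohlc_violations(df)` on the simulated frame would list any days that need attention before modeling.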

Data Preprocessing

Now we'll preprocess the data by handling missing values and creating additional features:

# Handle missing values by forward filling
# (fillna(method='ffill') is deprecated in recent pandas versions)
df.ffill(inplace=True)

# Add percentage change feature
df['Price_Change'] = df['Close'].pct_change()

# Add moving averages as features
df['MA_5'] = df['Close'].rolling(window=5).mean()
df['MA_10'] = df['Close'].rolling(window=10).mean()

# Remove rows with NaN values created by rolling operations
df.dropna(inplace=True)

print("Preprocessed data shape:", df.shape)
print("\nFeatures:")
print(df.columns.tolist())
Preprocessed data shape: (357, 8)

Features:
['Open', 'High', 'Low', 'Close', 'Volume', 'Price_Change', 'MA_5', 'MA_10']
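
To make the two derived features concrete, here is a small worked example on a hand-made series. The five prices are made up for illustration and are not taken from the dataset above, and a 3-day window is used instead of 5 or 10 so the arithmetic is easy to follow.

```python
import pandas as pd

# Five made-up closing prices, purely for illustration
close = pd.Series([100.0, 102.0, 101.0, 103.0, 104.0])

# pct_change: (today - yesterday) / yesterday; the first value has no predecessor
pct = close.pct_change()

# 3-day moving average: mean of the current value and the two before it
ma_3 = close.rolling(window=3).mean()

features = pd.DataFrame({"Close": close, "Price_Change": pct, "MA_3": ma_3})
print(features)

# Rows left after dropping the NaNs introduced by both operations
clean = features.dropna()
print(len(clean))
```

A window of 3 leaves the first two rows without a moving average, so three of the five rows survive `dropna()`. In the same way, a window of 10 removes the first nine rows, which is why the 366 simulated days shrink to 357.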

Building the Prediction Model

We'll use Linear Regression, a supervised learning technique that fits a linear combination of several input features to the target variable (the closing price). In this walkthrough, the features come from the same trading day as the target.

Data Splitting

First, we'll split our data chronologically into training and testing sets, keeping the earliest 80% of rows for training and the most recent 20% for testing:

# Split data into training (80%) and testing (20%) sets
train_size = int(len(df) * 0.8)
train_data = df.iloc[:train_size]
test_data = df.iloc[train_size:]

# Define features (X) and target variable (y)
feature_columns = ['Open', 'High', 'Low', 'Volume', 'Price_Change', 'MA_5', 'MA_10']
X_train = train_data[feature_columns]
y_train = train_data['Close']
X_test = test_data[feature_columns]
y_test = test_data['Close']

print(f"Training set size: {len(X_train)}")
print(f"Testing set size: {len(X_test)}")
Training set size: 285
Testing set size: 72
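
The positional split above can be wrapped in a small reusable function. This is a minimal sketch: the name `chronological_split` and the 10-day demo frame are hypothetical, and the function assumes the frame is indexed by date.

```python
import numpy as np
import pandas as pd


def chronological_split(frame, train_fraction=0.8):
    """Split a date-indexed frame into earlier (train) and later (test) parts."""
    if not 0 < train_fraction < 1:
        raise ValueError("train_fraction must be between 0 and 1")
    ordered = frame.sort_index()
    cut = int(len(ordered) * train_fraction)
    return ordered.iloc[:cut], ordered.iloc[cut:]


# Hypothetical 10-day frame used only to demonstrate the helper
demo = pd.DataFrame(
    {"Close": np.arange(10, dtype=float)},
    index=pd.date_range("2020-01-01", periods=10, freq="D"),
)
train, test = chronological_split(demo)

print(len(train), len(test))
# Every training date should come before every test date
print(train.index.max() < test.index.min())
```

Slicing by position keeps the time order intact. A shuffled split, which is the default behavior of scikit-learn's `train_test_split`, would mix later rows into the training set.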

Model Training and Prediction

Now we'll train our Linear Regression model and make predictions:

# Create and train the Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions on test data
y_pred = model.predict(X_test)

# Calculate performance metrics
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse:.4f}")
print(f"Root Mean Squared Error: {rmse:.4f}")
print(f"R² Score: {r2:.4f}")
Mean Squared Error: 0.0021
Root Mean Squared Error: 0.0461
R² Score: 1.0000
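
To see what these three metrics measure, the sketch below computes them by hand with NumPy on four made-up values and compares the results with scikit-learn. The arrays `y_true` and `y_hat` are illustrative and unrelated to the model output above.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up actual and predicted values, for illustration only
y_true = np.array([10.0, 12.0, 14.0, 16.0])
y_hat = np.array([11.0, 12.0, 13.0, 16.0])

errors = y_true - y_hat
mse_manual = np.mean(errors ** 2)                # average squared error
rmse_manual = np.sqrt(mse_manual)                # same units as the target
ss_res = np.sum(errors ** 2)                     # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2_manual = 1 - ss_res / ss_tot                  # share of variance explained

print(mse_manual, rmse_manual, r2_manual)

# The hand-computed values should agree with scikit-learn's
print(np.isclose(mse_manual, mean_squared_error(y_true, y_hat)))
print(np.isclose(r2_manual, r2_score(y_true, y_hat)))
```

RMSE is expressed in the same units as the price, so it is usually the easiest of the three to interpret. R² compares the model's squared error with that of always predicting the mean, so it approaches 1 whenever the errors are small relative to the overall spread of prices.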

Visualization

Let's create a visualization comparing actual vs predicted stock prices:

import matplotlib.pyplot as plt

# Create the comparison plot
plt.figure(figsize=(12, 6))
plt.plot(test_data.index, y_test.values, label='Actual', color='blue', linewidth=2)
plt.plot(test_data.index, y_pred, label='Predicted', color='red', linestyle='--', linewidth=2)
plt.xlabel('Date')
plt.ylabel('Stock Price ($)')
plt.title('Actual vs Predicted Stock Prices')
plt.legend()
plt.grid(True, alpha=0.3)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

# Show first few predictions vs actual values
comparison_df = pd.DataFrame({
    'Actual': y_test.values[:10],
    'Predicted': y_pred[:10],
    'Difference': abs(y_test.values[:10] - y_pred[:10])
}, index=y_test.index[:10])

print("First 10 predictions comparison:")
print(comparison_df.round(4))
First 10 predictions comparison:
             Actual  Predicted  Difference
2020-10-13  95.8876    95.8876      0.0000
2020-10-14  96.7711    96.7711      0.0000
2020-10-15  96.3125    96.3125      0.0000
2020-10-16  97.8668    97.8668      0.0000
2020-10-17  97.2796    97.2796      0.0000
2020-10-18  98.8742    98.8742      0.0000
2020-10-19  98.2263    98.2263      0.0000
2020-10-20  97.6717    97.6717      0.0000
2020-10-21  99.1949    99.1949      0.0000
2020-10-22  99.5307    99.5307      0.0000
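
The feature set used above (Open, High, Low, Price_Change, and the moving averages) is measured on the same day as the Close it predicts, which helps explain why the predictions track the actual prices so closely. To forecast the next day instead, one option is to shift every feature back by one day so that only information available beforehand is used. The sketch below is one possible variant, not the method used in this tutorial. It runs on a synthetic random walk, the column names `Prev_Close`, `Prev_MA_5`, and `Prev_Change` are hypothetical, and it compares the model with a naive baseline that predicts today's close to equal yesterday's.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic random-walk closes (illustrative only, not real market data)
rng = np.random.default_rng(0)
dates = pd.date_range("2020-01-01", periods=300, freq="D")
close = pd.Series(100 + rng.normal(0, 1.5, size=300).cumsum(), index=dates)

frame = pd.DataFrame({"Close": close})
# Features known at the end of day t-1, used to predict the close of day t
frame["Prev_Close"] = frame["Close"].shift(1)
frame["Prev_MA_5"] = frame["Close"].rolling(5).mean().shift(1)
frame["Prev_Change"] = frame["Close"].pct_change().shift(1)
frame = frame.dropna()

# Chronological 80/20 split, as in the tutorial
cut = int(len(frame) * 0.8)
train, test = frame.iloc[:cut], frame.iloc[cut:]
lagged_cols = ["Prev_Close", "Prev_MA_5", "Prev_Change"]

lagged_model = LinearRegression().fit(train[lagged_cols], train["Close"])
lagged_pred = lagged_model.predict(test[lagged_cols])
model_rmse = np.sqrt(mean_squared_error(test["Close"], lagged_pred))

# Naive baseline: predict that today's close equals yesterday's close
naive_rmse = np.sqrt(mean_squared_error(test["Close"], test["Prev_Close"]))

print(f"Lagged-feature model RMSE: {model_rmse:.4f}")
print(f"Naive baseline RMSE:       {naive_rmse:.4f}")
```

On a random walk like this one, the two errors would be expected to be of similar size. Comparing against the naive baseline gives a more realistic sense of how much a model adds than the R² score alone.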

Key Features and Limitations

Aspect              | Linear Regression | Notes
Complexity          | Low               | Simple to implement and understand
Performance         | Moderate          | Works well for linear relationships
Interpretability    | High              | Easy to understand feature importance
Real-world Accuracy | Limited           | Stock markets are highly non-linear

Conclusion

We built a stock price prediction model using Python and pandas. Linear Regression provides a good starting point, but real-world stock prediction typically calls for more sophisticated techniques such as neural networks or ensemble methods. Throughout the workflow, pandas handled the data manipulation and preprocessing.

Updated on: 2026-03-27T14:11:12+05:30
