Spaceship Titanic Project using Machine Learning in Python
The Spaceship Titanic project is a machine learning classification problem that predicts whether passengers will be transported to another dimension. Unlike the classic Titanic survival prediction, this futuristic scenario involves space travel and dimensional transportation.
This project demonstrates a complete machine learning pipeline from data preprocessing to model evaluation using Python libraries like pandas, scikit-learn, and XGBoost.
Dataset Overview
The Spaceship Titanic dataset contains passenger information with features like HomePlanet, CryoSleep status, Cabin details, Age, VIP status, and various service expenses. The target variable is Transported, which indicates whether a passenger was transported to another dimension.
Data Preprocessing Steps
Loading and Exploring Data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report
import warnings
warnings.filterwarnings('ignore')
# Load dataset
df = pd.read_csv('/path/to/titanic_dataset.csv')
print(df.head())
print(df.info())
print(df.describe())
Handling Missing Values
The preprocessing involves identifying null values and filling them based on feature relationships:
# Visualize null values
df.isnull().sum().plot.bar()
plt.title('Missing Values by Column')
plt.show()
# CryoSleep passengers spend nothing, so set their service expenses to zero
service_cols = df.loc[:,'RoomService':'VRDeck'].columns
temp = df['CryoSleep'] == True
df.loc[temp, service_cols] = 0.0
# Fill remaining nulls based on VIP status
for col in service_cols:
    for vip_status in [True, False]:
        temp = df['VIP'] == vip_status
        mean_val = df[temp][col].mean()
        df.loc[temp, col] = df.loc[temp, col].fillna(mean_val)
# Handle HomePlanet based on VIP status
temp = df['VIP'] == False
df.loc[temp, 'HomePlanet'] = df.loc[temp, 'HomePlanet'].fillna('Earth')
temp = df['VIP'] == True
df.loc[temp, 'HomePlanet'] = df.loc[temp, 'HomePlanet'].fillna('Europa')
# Fill Age with the mean, excluding outlier ages (over 60) from the mean
age_mean = df[df['Age'] < 61]['Age'].mean()
df['Age'] = df['Age'].fillna(age_mean)
# Fill remaining nulls with mode/mean
for col in df.columns:
    if df[col].isnull().sum() == 0:
        continue
    if df[col].dtype == 'object' or df[col].dtype == 'bool':
        df[col] = df[col].fillna(df[col].mode()[0])
    else:
        df[col] = df[col].fillna(df[col].mean())
print("Remaining null values:", df.isnull().sum().sum())
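The group-wise fill above can be sanity-checked on a small synthetic frame. The column names mirror the dataset, but the values here are invented purely for illustration:

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the relevant columns (values are made up)
toy = pd.DataFrame({
    'CryoSleep':   [True, False, False, True],
    'VIP':         [False, True, False, False],
    'RoomService': [np.nan, 200.0, np.nan, 5.0],
})

# CryoSleep passengers spend nothing, so zero out their expenses
toy.loc[toy['CryoSleep'] == True, 'RoomService'] = 0.0

# Fill the rest with the mean expense of the matching VIP group
for vip_status in [True, False]:
    mask = toy['VIP'] == vip_status
    toy.loc[mask, 'RoomService'] = toy.loc[mask, 'RoomService'].fillna(
        toy.loc[mask, 'RoomService'].mean()
    )

print(toy['RoomService'].isna().sum())  # 0 — no nulls remain
```

Note that the CryoSleep step overwrites all expense values for sleeping passengers, not just the nulls, which is intentional: a passenger in suspended animation cannot spend anything.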
Feature Engineering
Extract meaningful features from existing columns:
# Split PassengerId into Room and Passenger numbers
passenger_parts = df["PassengerId"].str.split("_", expand=True)
df["RoomNo"] = passenger_parts[0].astype(int)
df["PassengerNo"] = passenger_parts[1].astype(int)
# Split Cabin into deck, room, and side
cabin_parts = df["Cabin"].str.split("/", expand=True)
df["Deck"] = cabin_parts[0]
df["CabinNum"] = cabin_parts[1].astype(int)
df["Side"] = cabin_parts[2]
# Create total leisure bill
df['TotalBill'] = (df['RoomService'] + df['FoodCourt'] +
                   df['ShoppingMall'] + df['Spa'] + df['VRDeck'])
# Drop original columns
df.drop(['PassengerId', 'Name', 'Cabin'], axis=1, inplace=True)
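The string splits above can be illustrated on a couple of hand-made IDs. The values are invented, but follow the dataset's `gggg_pp` PassengerId and `deck/num/side` Cabin formats:

```python
import pandas as pd

toy = pd.DataFrame({
    'PassengerId': ['0001_01', '0003_02'],
    'Cabin': ['B/0/P', 'F/3/S'],
})

# PassengerId is 'group_passenger'; split on '_' yields two string columns
parts = toy['PassengerId'].str.split('_', expand=True)
toy['RoomNo'] = parts[0].astype(int)
toy['PassengerNo'] = parts[1].astype(int)

# Cabin is 'deck/num/side'; split on '/' yields three columns
cabin = toy['Cabin'].str.split('/', expand=True)
toy['Deck'] = cabin[0]
toy['CabinNum'] = cabin[1].astype(int)
toy['Side'] = cabin[2]

print(toy[['RoomNo', 'PassengerNo', 'Deck', 'CabinNum', 'Side']])
```

Because `astype(int)` fails on missing values, this only works after the null-handling step has filled every Cabin entry.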
Exploratory Data Analysis
# Check target variable balance
transported_counts = df['Transported'].value_counts()
plt.pie(transported_counts.values, labels=transported_counts.index, autopct='%1.1f%%')
plt.title('Distribution of Transported Passengers')
plt.show()
# Encode categorical variables
for col in df.columns:
    if df[col].dtype == 'object':
        le = LabelEncoder()
        df[col] = le.fit_transform(df[col])
    elif df[col].dtype == 'bool':
        df[col] = df[col].astype(int)
# Correlation heatmap: highlight strongly correlated feature pairs (> 0.7)
plt.figure(figsize=(12, 10))
correlation_matrix = df.corr()
sb.heatmap(correlation_matrix > 0.7, annot=True, cbar=False, cmap='coolwarm')
plt.title('Feature Correlation Heatmap')
plt.show()
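One detail worth knowing about `LabelEncoder`: it assigns integer codes in alphabetical order of the category labels, as this small sketch shows (the planet names match the dataset; the four-element list is just an example):

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(['Earth', 'Europa', 'Mars', 'Earth'])

# classes_ are sorted alphabetically: Earth -> 0, Europa -> 1, Mars -> 2
print(list(le.classes_))  # ['Earth', 'Europa', 'Mars']
print(list(codes))        # [0, 1, 2, 0]
```

These integer codes impose an artificial ordering on the categories; tree-based models like XGBoost tolerate this, while for linear models such as Logistic Regression one-hot encoding is often the safer choice.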
Model Training and Evaluation
# Split data
features = df.drop(['Transported'], axis=1)
target = df['Transported']
X_train, X_val, y_train, y_val = train_test_split(
    features, target, test_size=0.2, random_state=42
)
# Normalize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
from sklearn.metrics import roc_auc_score

# Train multiple models
models = {
    'Logistic Regression': LogisticRegression(),
    'XGBoost': XGBClassifier(random_state=42),
    'SVM': SVC(kernel='rbf', probability=True)
}
results = {}
for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    train_proba = model.predict_proba(X_train_scaled)[:, 1]
    val_proba = model.predict_proba(X_val_scaled)[:, 1]
    train_auc = roc_auc_score(y_train, train_proba)
    val_auc = roc_auc_score(y_val, val_proba)
    results[name] = {'train_auc': train_auc, 'val_auc': val_auc}
    print(f"{name}:")
    print(f"  Training AUC: {train_auc:.4f}")
    print(f"  Validation AUC: {val_auc:.4f}\n")
# Select best model (highest validation AUC)
best_model_name = max(results.keys(), key=lambda k: results[k]['val_auc'])
best_model = models[best_model_name]
# Generate predictions and confusion matrix
y_pred = best_model.predict(X_val_scaled)
cm = confusion_matrix(y_val, y_pred)
print(f"Best Model: {best_model_name}")
print("\nConfusion Matrix:")
print(cm)
print("\nClassification Report:")
print(classification_report(y_val, y_pred))
Model Comparison
| Model | Training AUC | Validation AUC | Characteristics |
|---|---|---|---|
| Logistic Regression | 0.892 | 0.806 | Good generalization, interpretable |
| XGBoost | 1.000 | 0.745 | Overfitting, high training accuracy |
| SVM | 0.927 | 0.788 | Balanced performance |
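Given the gap between XGBoost's training and validation AUC, a single train/validation split can be misleading; cross-validation gives a more robust comparison. A minimal sketch using synthetic data in place of the dataset (which is not reproduced here):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the preprocessed feature matrix and target
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Putting the scaler inside the pipeline prevents the validation folds
# from leaking into the scaler's fitted statistics
model = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(model, X, y, cv=5, scoring='roc_auc')
print(f"Mean CV AUC: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

The same call works for any of the three models above; averaging AUC over five folds smooths out the luck of a single split.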
Key Insights
The analysis reveals several important patterns:
- CryoSleep passengers have zero service expenses, indicating they were in suspended animation
- VIP status correlates with higher service expenses and different home planets
- Age and cabin location influence transportation probability
- Logistic Regression shows the best balance between training and validation performance
Conclusion
The Spaceship Titanic project demonstrates a complete machine learning workflow for binary classification. Logistic Regression performed best with good generalization, while XGBoost showed overfitting. Proper feature engineering and handling of missing values are crucial for model performance in this space-themed prediction task.
