Spaceship Titanic Project using Machine Learning in Python
The Spaceship Titanic project is a machine learning classification problem that predicts whether passengers will be transported to another dimension. Unlike the classic Titanic survival prediction, this futuristic scenario involves space travel and dimensional transportation.
This project demonstrates a complete machine learning pipeline from data preprocessing to model evaluation using Python libraries like pandas, scikit-learn, and XGBoost.
Dataset Overview
The Spaceship Titanic dataset contains passenger information with features like HomePlanet, CryoSleep status, Cabin details, Age, VIP status, and various service expenses. The target variable is Transported, which indicates whether a passenger was transported to another dimension.
Data Preprocessing Steps
Loading and Exploring Data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report
import warnings
warnings.filterwarnings('ignore')
# Load dataset
df = pd.read_csv('/path/to/titanic_dataset.csv')
print(df.head())
print(df.info())
print(df.describe())
Handling Missing Values
The preprocessing involves identifying null values and filling them based on feature relationships:
# Visualize null values
df.isnull().sum().plot.bar()
plt.title('Missing Values by Column')
plt.show()
# CryoSleep passengers spend nothing, so set their service expenses to zero
service_cols = df.loc[:,'RoomService':'VRDeck'].columns
temp = df['CryoSleep'] == True
df.loc[temp, service_cols] = 0.0
# Fill remaining nulls based on VIP status
for col in service_cols:
    for vip_status in [True, False]:
        temp = df['VIP'] == vip_status
        mean_val = df[temp][col].mean()
        df.loc[temp, col] = df.loc[temp, col].fillna(mean_val)
# Handle HomePlanet based on VIP status
temp = df['VIP'] == False
df.loc[temp, 'HomePlanet'] = df.loc[temp, 'HomePlanet'].fillna('Earth')
temp = df['VIP'] == True
df.loc[temp, 'HomePlanet'] = df.loc[temp, 'HomePlanet'].fillna('Europa')
# Fill Age with the mean, excluding outlier ages (over 60) from the mean
age_mean = df[df['Age'] < 61]['Age'].mean()
df['Age'] = df['Age'].fillna(age_mean)
# Fill remaining nulls with mode/mean
for col in df.columns:
    if df[col].isnull().sum() == 0:
        continue
    if df[col].dtype == 'object' or df[col].dtype == 'bool':
        df[col] = df[col].fillna(df[col].mode()[0])
    else:
        df[col] = df[col].fillna(df[col].mean())
print("Remaining null values:", df.isnull().sum().sum())
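The group-wise fill above can be sanity-checked on a small synthetic frame. The column names mirror the dataset, but the values here are invented purely for illustration:

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the relevant columns (values are made up)
toy = pd.DataFrame({
    'CryoSleep':   [True, False, False, True],
    'VIP':         [False, True, False, False],
    'RoomService': [np.nan, 200.0, np.nan, 5.0],
})

# CryoSleep passengers spend nothing, so zero out their expenses
toy.loc[toy['CryoSleep'] == True, 'RoomService'] = 0.0

# Fill the rest with the mean expense of the matching VIP group
for vip_status in [True, False]:
    mask = toy['VIP'] == vip_status
    toy.loc[mask, 'RoomService'] = toy.loc[mask, 'RoomService'].fillna(
        toy.loc[mask, 'RoomService'].mean()
    )

print(toy['RoomService'].isna().sum())  # 0 — no nulls remain
```

Note that the CryoSleep step overwrites all expense values for sleeping passengers, not just the nulls, which is intentional: a passenger in suspended animation cannot spend anything.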
Feature Engineering
Extract meaningful features from existing columns:
# Split PassengerId into Room and Passenger numbers
passenger_parts = df["PassengerId"].str.split("_", expand=True)
df["RoomNo"] = passenger_parts[0].astype(int)
df["PassengerNo"] = passenger_parts[1].astype(int)
# Split Cabin into deck, room, and side
cabin_parts = df["Cabin"].str.split("/", expand=True)
df["Deck"] = cabin_parts[0]
df["CabinNum"] = cabin_parts[1].astype(int)
df["Side"] = cabin_parts[2]
# Create total leisure bill
df['TotalBill'] = (df['RoomService'] + df['FoodCourt'] +
                   df['ShoppingMall'] + df['Spa'] + df['VRDeck'])
# Drop original columns
df.drop(['PassengerId', 'Name', 'Cabin'], axis=1, inplace=True)
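The string splits above can be illustrated on a couple of hand-made IDs. The values are invented, but follow the dataset's `gggg_pp` PassengerId and `deck/num/side` Cabin formats:

```python
import pandas as pd

toy = pd.DataFrame({
    'PassengerId': ['0001_01', '0003_02'],
    'Cabin': ['B/0/P', 'F/3/S'],
})

# PassengerId is 'group_passenger'; split on '_' yields two string columns
parts = toy['PassengerId'].str.split('_', expand=True)
toy['RoomNo'] = parts[0].astype(int)
toy['PassengerNo'] = parts[1].astype(int)

# Cabin is 'deck/num/side'; split on '/' yields three columns
cabin = toy['Cabin'].str.split('/', expand=True)
toy['Deck'] = cabin[0]
toy['CabinNum'] = cabin[1].astype(int)
toy['Side'] = cabin[2]

print(toy[['RoomNo', 'PassengerNo', 'Deck', 'CabinNum', 'Side']])
```

Because `astype(int)` fails on missing values, this only works after the null-handling step has filled every Cabin entry.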
Exploratory Data Analysis
# Check target variable balance
transported_counts = df['Transported'].value_counts()
plt.pie(transported_counts.values, labels=transported_counts.index, autopct='%1.1f%%')
plt.title('Distribution of Transported Passengers')
plt.show()
# Encode categorical variables
for col in df.columns:
    if df[col].dtype == 'object':
        le = LabelEncoder()
        df[col] = le.fit_transform(df[col])
    elif df[col].dtype == 'bool':
        df[col] = df[col].astype(int)
# Correlation heatmap: highlight strongly correlated feature pairs (> 0.7)
plt.figure(figsize=(12, 10))
correlation_matrix = df.corr()
sb.heatmap(correlation_matrix > 0.7, annot=True, cbar=False, cmap='coolwarm')
plt.title('Feature Correlation Heatmap')
plt.show()
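One detail worth knowing about `LabelEncoder`: it assigns integer codes in alphabetical order of the category labels, as this small sketch shows (the planet names match the dataset; the four-element list is just an example):

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(['Earth', 'Europa', 'Mars', 'Earth'])

# classes_ are sorted alphabetically: Earth -> 0, Europa -> 1, Mars -> 2
print(list(le.classes_))  # ['Earth', 'Europa', 'Mars']
print(list(codes))        # [0, 1, 2, 0]
```

These integer codes impose an artificial ordering on the categories; tree-based models like XGBoost tolerate this, while for linear models such as Logistic Regression one-hot encoding is often the safer choice.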
Model Training and Evaluation
# Split data
features = df.drop(['Transported'], axis=1)
target = df['Transported']
X_train, X_val, y_train, y_val = train_test_split(
    features, target, test_size=0.2, random_state=42
)
# Normalize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
from sklearn.metrics import roc_auc_score

# Train multiple models
models = {
    'Logistic Regression': LogisticRegression(),
    'XGBoost': XGBClassifier(random_state=42),
    'SVM': SVC(kernel='rbf', probability=True)
}
results = {}
for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    train_proba = model.predict_proba(X_train_scaled)[:, 1]
    val_proba = model.predict_proba(X_val_scaled)[:, 1]
    train_auc = roc_auc_score(y_train, train_proba)
    val_auc = roc_auc_score(y_val, val_proba)
    results[name] = {'train_auc': train_auc, 'val_auc': val_auc}
    print(f"{name}:")
    print(f"  Training AUC: {train_auc:.4f}")
    print(f"  Validation AUC: {val_auc:.4f}\n")
# Select best model (highest validation AUC)
best_model_name = max(results.keys(), key=lambda k: results[k]['val_auc'])
best_model = models[best_model_name]
# Generate predictions and confusion matrix
y_pred = best_model.predict(X_val_scaled)
cm = confusion_matrix(y_val, y_pred)
print(f"Best Model: {best_model_name}")
print("\nConfusion Matrix:")
print(cm)
print("\nClassification Report:")
print(classification_report(y_val, y_pred))
Model Comparison
| Model | Training AUC | Validation AUC | Characteristics |
|---|---|---|---|
| Logistic Regression | 0.892 | 0.806 | Good generalization, interpretable |
| XGBoost | 1.000 | 0.745 | Overfitting, high training accuracy |
| SVM | 0.927 | 0.788 | Balanced performance |
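Given the gap between XGBoost's training and validation AUC, a single train/validation split can be misleading; cross-validation gives a more robust comparison. A minimal sketch using synthetic data in place of the dataset (which is not reproduced here):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the preprocessed feature matrix and target
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Putting the scaler inside the pipeline prevents the validation folds
# from leaking into the scaler's fitted statistics
model = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(model, X, y, cv=5, scoring='roc_auc')
print(f"Mean CV AUC: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

The same call works for any of the three models above; averaging AUC over five folds smooths out the luck of a single split.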
Key Insights
The analysis reveals several important patterns:
- CryoSleep passengers have zero service expenses, indicating they were in suspended animation
- VIP status correlates with higher service expenses and different home planets
- Age and cabin location influence transportation probability
- Logistic Regression shows the best balance between training and validation performance
Conclusion
The Spaceship Titanic project demonstrates a complete machine learning workflow for binary classification. Logistic Regression performed best with good generalization, while XGBoost showed overfitting. Proper feature engineering and handling of missing values are crucial for model performance in this space-themed prediction task.
