How can data be split and inspected to predict fuel efficiency with the Auto MPG dataset using TensorFlow?
TensorFlow is a machine learning framework developed by Google for implementing machine learning algorithms, deep learning applications, and neural networks. It uses multi-dimensional arrays called tensors to perform complex mathematical operations efficiently.
The Auto MPG dataset contains fuel efficiency data of automobiles from the 1970s and 1980s. It includes attributes like weight, horsepower, displacement, and cylinders. Our goal is to predict the fuel efficiency (MPG) of vehicles using regression techniques.
We are using Google Colaboratory to run the code. Google Colab provides free access to GPUs and requires zero configuration for running Python code.
Dataset Preparation and Splitting
Before training a model, we need to split our data into training and testing sets. Here's how to split and inspect the Auto MPG dataset:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
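# Load the Auto MPG dataset. The original example assumes a cleaned DataFrame
# named 'dataset' already exists; the loading step below is an assumption that
# shows one common way to obtain it (UCI repository copy, rows with missing
# values dropped).
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'
column_names = ['MPG', 'Cylinders', 'Displacement', 'Horsepower', 'Weight',
                'Acceleration', 'Model Year', 'Origin']
raw_dataset = pd.read_csv(url, names=column_names, na_values='?',
                          comment='\t', sep=' ', skipinitialspace=True)
dataset = raw_dataset.dropna()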
print("Splitting the training and testing dataset")
train_dataset = dataset.sample(frac=0.7, random_state=0)
test_dataset = dataset.drop(train_dataset.index)
print(f"Training set size: {len(train_dataset)}")
print(f"Test set size: {len(test_dataset)}")
Splitting the training and testing dataset
Training set size: 274
Test set size: 118
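The split above relies on `sample` drawing 70% of the rows and `drop` removing exactly those rows by index. The following is a minimal sketch of how that could be checked; it uses a synthetic 392-row DataFrame as a hypothetical stand-in for the cleaned Auto MPG data, so the values themselves are not real measurements.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the cleaned Auto MPG DataFrame (392 synthetic rows).
rng = np.random.default_rng(0)
dataset = pd.DataFrame({
    "MPG": rng.uniform(9.0, 46.6, size=392),
    "Weight": rng.uniform(1613, 5140, size=392),
})

# Same split as in the article: 70% sampled for training, the rest for testing.
train_dataset = dataset.sample(frac=0.7, random_state=0)
test_dataset = dataset.drop(train_dataset.index)

# The two sets should share no rows and together cover the whole dataset.
overlap = train_dataset.index.intersection(test_dataset.index)
covered = len(train_dataset) + len(test_dataset)

print(f"Training rows: {len(train_dataset)}")
print(f"Test rows: {len(test_dataset)}")
print(f"Overlapping rows: {len(overlap)}")
print(f"Rows covered: {covered} of {len(dataset)}")
```

Fixing `random_state` makes the split reproducible, which matters when results from different runs are compared.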
Data Visualization
Visualizing the training data helps us understand relationships between different features:
print("Plotting the training data as a visualization")
sns.pairplot(train_dataset[['MPG', 'Cylinders', 'Displacement', 'Weight']], diag_kind='kde')
plt.show()
Plotting the training data as a visualization
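A pair plot shows relationships visually; a correlation matrix can put approximate numbers on the same relationships. The sketch below is an illustration only: it builds a synthetic `train_dataset` (an assumption, not the real Auto MPG values) in which MPG falls as weight rises, then computes Pearson correlations with pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for train_dataset: synthetic values, not real Auto MPG data.
rng = np.random.default_rng(0)
weight = rng.uniform(1600, 5200, size=200)
train_dataset = pd.DataFrame({
    "Weight": weight,
    "Displacement": 0.08 * weight - 50 + rng.normal(0, 20, size=200),
    "MPG": 45 - 0.007 * weight + rng.normal(0, 2, size=200),
})

# Pearson correlation matrix: a numeric counterpart to the pair plot's scatter panels.
corr = train_dataset.corr()
print(corr.round(2))

# A negative value here means heavier cars tend to have lower MPG.
mpg_weight_corr = corr.loc["MPG", "Weight"]
print(f"MPG vs. Weight correlation: {mpg_weight_corr:.2f}")
```

Running `train_dataset.corr()` on the real training split would show whether the trends seen in the pair plot hold numerically.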
Statistical Analysis
Understanding the statistical properties of our data is crucial for preprocessing:
print("Understanding the statistics associated with the data")
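# Restrict the summary to the columns shown in the output below
stats_summary = train_dataset[['MPG', 'Cylinders', 'Displacement', 'Weight']].describe().transpose()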
print(stats_summary)
Understanding the statistics associated with the data
count mean std min 25% 50% 75% max
MPG 274.0 23.51 7.83 9.00 17.50 23.00 29.00 46.60
Cylinders 274.0 5.48 1.70 3.00 4.00 4.00 8.00 8.00
Displacement 274.0 193.43 104.27 68.00 104.25 148.50 265.75 455.00
Weight 274.0 2990.25 843.90 1613.00 2256.50 2822.50 3608.00 5140.00
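One reason these statistics matter for preprocessing is that the features sit on very different scales (Cylinders in single digits, Weight in thousands). A common follow-up step is to standardize each feature with the training set's mean and standard deviation. The sketch below assumes small, made-up `train_dataset` and `test_dataset` frames and a hypothetical helper named `normalize`; it is one possible approach, not necessarily the one used later in this series.

```python
import pandas as pd

# Hypothetical miniature frames standing in for the real train/test split.
train_dataset = pd.DataFrame({
    "MPG": [18.0, 26.0, 33.5, 15.0, 24.0],
    "Cylinders": [8, 4, 4, 8, 6],
    "Displacement": [307.0, 97.0, 85.0, 350.0, 200.0],
    "Weight": [3504.0, 2130.0, 1945.0, 4100.0, 3000.0],
})
test_dataset = pd.DataFrame({
    "MPG": [21.0, 30.0],
    "Cylinders": [6, 4],
    "Displacement": [225.0, 90.0],
    "Weight": [3200.0, 2000.0],
})

# Separate the label (MPG) from the input features.
train_features = train_dataset.drop(columns="MPG")
train_labels = train_dataset["MPG"]
test_features = test_dataset.drop(columns="MPG")
test_labels = test_dataset["MPG"]

# Statistics come from the training features only, so no test information leaks in.
train_stats = train_features.describe().transpose()

def normalize(features):
    """Standardize each column using the training mean and standard deviation."""
    return (features - train_stats["mean"]) / train_stats["std"]

normed_train = normalize(train_features)
normed_test = normalize(test_features)

print(normed_train.round(3))
print(normed_test.round(3))
```

The test features are scaled with the training statistics on purpose: the model should see test inputs transformed exactly as the training inputs were.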
Data Split Analysis
| Dataset | Percentage | Purpose | Size (approx.) |
|---|---|---|---|
| Training | 70% | Model training | 274 samples |
| Testing | 30% | Model evaluation | 118 samples |
Key Insights from Statistics
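The pair plot and statistical summary reveal important characteristics of the training data:
- MPG range: 9.0 to 46.6 miles per gallon, with a mean of about 23.5
- Cylinders: The median is 4 and the 75th percentile is 8, so four- and eight-cylinder cars make up a large share of the data
- Weight correlation: In the pair plot, heavier cars typically have lower fuel efficiency
- Displacement: Wide range from 68 to 455 cubic inches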
Conclusion
Splitting the data into 70% training and 30% testing sets holds back examples the model never sees during training, so the test set can be used to estimate how well the model generalizes. The pair plot and summary statistics show how the features relate to MPG and how they are distributed, which informs preprocessing and the design of a regression model to predict fuel efficiency.
