How can data be cleaned to predict the fuel efficiency with Auto MPG dataset using TensorFlow?
TensorFlow is an open-source machine learning framework developed by Google. It is used in conjunction with Python to implement algorithms, deep learning applications, and much more. The Auto MPG dataset contains fuel efficiency data for automobiles from the 1970s and 1980s, which we'll clean to prepare it for predicting vehicle fuel efficiency.
Installing TensorFlow
The 'tensorflow' package can be installed on Windows with the following command:
pip install tensorflow
About the Auto MPG Dataset
The Auto MPG dataset contains fuel efficiency information for automobiles from the 1970s and 1980s. It includes attributes like:
MPG - Miles per gallon (target variable)
Cylinders - Number of cylinders
Displacement - Engine displacement
Horsepower - Engine horsepower
Weight - Vehicle weight
Origin - Country of origin (1=USA, 2=Europe, 3=Japan)
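To make the schema concrete, here is a minimal sketch that builds a tiny in-memory sample with the same columns (the two rows are hypothetical values for illustration, not taken from the real dataset) and inspects its dtypes:

```python
import pandas as pd

# Illustrative sample rows (hypothetical values, not the actual dataset)
sample = pd.DataFrame({
    'MPG':          [18.0, 27.0],   # target variable: miles per gallon
    'Cylinders':    [8, 4],
    'Displacement': [307.0, 97.0],
    'Horsepower':   [130.0, 88.0],
    'Weight':       [3504.0, 2130.0],
    'Origin':       [1, 3],         # 1=USA, 2=Europe, 3=Japan
})

# Note that Origin is stored as an integer code, not a category -
# this is why the cleaning step below maps and one-hot encodes it.
print(sample.dtypes)
```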
Data Cleaning Process
Data cleaning is essential before training any machine learning model. Here's how to clean the Auto MPG dataset:
import pandas as pd
import tensorflow as tf
# Load the dataset
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'
column_names = ['MPG', 'Cylinders', 'Displacement', 'Horsepower', 'Weight',
                'Acceleration', 'Model Year', 'Origin']
dataset = pd.read_csv(url, names=column_names, na_values='?', comment='\t',
                      sep=' ', skipinitialspace=True)
print("Data cleaning has begun")
print("Missing values per column:")
print(dataset.isna().sum())
# Remove rows with missing values
dataset = dataset.dropna()
# Map origin numbers to country names
dataset['Origin'] = dataset['Origin'].map({1: 'USA', 2: 'Europe', 3: 'Japan'})
print("Data cleaning complete!")
# Convert categorical variables to dummy variables
dataset = pd.get_dummies(dataset, prefix='', prefix_sep='')
print("A sample of dataset after data cleaning:")
print(dataset.head(4))
Data cleaning has begun
Missing values per column:
MPG             0
Cylinders       0
Displacement    0
Horsepower      6
Weight          0
Acceleration    0
Model Year      0
Origin          0
dtype: int64
Data cleaning complete!
A sample of dataset after data cleaning:
|   | MPG | Cylinders | Displacement | Horsepower | Weight | Acceleration | Model Year | Europe | Japan | USA |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504.0 | 12.0 | 70 | 0 | 0 | 1 |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693.0 | 11.5 | 70 | 0 | 0 | 1 |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436.0 | 11.0 | 70 | 0 | 0 | 1 |
| 3 | 16.0 | 8 | 304.0 | 150.0 | 3433.0 | 12.0 | 70 | 0 | 0 | 1 |
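The same cleaning steps can be verified offline on a small in-memory sample, without downloading the file. This is a minimal sketch: the three rows below are hypothetical stand-ins, with one missing Horsepower value to mimic the real dataset.

```python
import pandas as pd
import numpy as np

# Hypothetical stand-in for the downloaded file; one Horsepower is missing
raw = pd.DataFrame({
    'MPG':        [18.0, 25.0, 24.0],
    'Horsepower': [130.0, np.nan, 95.0],
    'Origin':     [1, 2, 3],
})

clean = raw.dropna().copy()                       # remove the row with a missing value
clean['Origin'] = clean['Origin'].map({1: 'USA', 2: 'Europe', 3: 'Japan'})
clean = pd.get_dummies(clean, prefix='', prefix_sep='')  # one-hot encode Origin

print(clean)
```

After these steps the `Origin` column disappears and is replaced by one binary column per country that survives the `dropna()` (here `Japan` and `USA`), matching the table above.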
Key Data Cleaning Steps
1. Handle Missing Values - Use `dropna()` to remove rows with missing data
2. Map Categorical Data - Convert origin codes (1, 2, 3) to meaningful country names
3. One-Hot Encoding - Use `pd.get_dummies()` to convert categorical variables into binary columns
4. Data Validation - Check for missing values using `isna().sum()`
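The steps above can be wrapped in a small reusable function. This is a sketch under the assumption that the input frame has an integer `Origin` column like the Auto MPG data; the helper name `clean_auto_mpg` is hypothetical.

```python
import pandas as pd

def clean_auto_mpg(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning steps above (hypothetical helper, not part of any library)."""
    df = df.dropna().copy()                                  # 1. handle missing values
    df['Origin'] = df['Origin'].map({1: 'USA', 2: 'Europe', 3: 'Japan'})  # 2. map codes
    return pd.get_dummies(df, prefix='', prefix_sep='')      # 3. one-hot encode

# 4. data validation on a toy frame with one missing MPG value
toy = pd.DataFrame({'MPG': [18.0, None], 'Origin': [1, 2]})
print(toy.isna().sum())      # reports one missing value in MPG
cleaned = clean_auto_mpg(toy)
print(cleaned)
```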
Conclusion
Data cleaning is crucial for accurate fuel efficiency prediction. The key steps include handling missing values, mapping categorical variables, and creating dummy variables for machine learning models. Clean data ensures better model performance and reliable predictions.
