How to use ML for Wine Quality Prediction?


This tutorial uses a wine quality dataset from Kaggle: the "Wine Quality Dataset," available at https://www.kaggle.com/datasets/yasserh/wine-quality-dataset.

The dataset is a .csv file containing measurements for each wine, such as 'fixed acidity,' 'volatile acidity,' 'pH,' and 'density,' along with a 'quality' score. The 'quality' column is separated from the other columns to serve as the target, and the model is trained on the remaining columns.

Here is the Python code to predict the wine quality.

  • Importing the necessary libraries.

import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
  • Import the wine quality dataset

wine = pd.read_csv('/Users/someswarpal/Downloads/WineQT.csv')
  • Separate the features (X) from the target column, quality (y).

X = wine.drop(columns=['quality'])
y = wine['quality']
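
Some copies of this CSV may also carry a row identifier column (assumed here to be named `Id`), which has no predictive meaning and is best left out of the features. The following is a minimal sketch of the split that drops such a column only when it is present. The helper name `split_features_target` and the small stand-in frame with made-up values are illustrative and not part of the original code.

```python
import pandas as pd

def split_features_target(df, target="quality", id_columns=("Id",)):
    """Return (X, y), dropping the target and any identifier columns that exist."""
    # Only drop identifier columns that are actually in the frame.
    present_ids = [c for c in id_columns if c in df.columns]
    X = df.drop(columns=[target] + present_ids)
    y = df[target]
    return X, y

# Tiny stand-in frame with made-up values, used only to show the call.
demo = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.5],
    "pH": [3.51, 3.20, 3.26],
    "quality": [5, 5, 6],
    "Id": [0, 1, 2],
})
X_demo, y_demo = split_features_target(demo)
print(list(X_demo.columns))
```

With the real data, the call would be `X, y = split_features_target(wine)`.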
  • Split the data into testing and training sets.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
  • Create a linear regression model

model = LinearRegression()
  • Train the model

model.fit(X_train, y_train)
  • Make predictions on the test set.

y_pred = model.predict(X_test)
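
Linear regression returns continuous values, while the quality scores in the dataset are whole numbers. If whole-number predictions are wanted, one possible approach is to round and clip them. This is a sketch only: the helper name `to_quality_scores` is hypothetical, and the 0 to 10 range is an assumption about the scoring scale.

```python
import numpy as np

def to_quality_scores(predictions, low=0, high=10):
    """Round continuous predictions to whole scores within [low, high]."""
    # np.rint rounds to the nearest integer; np.clip keeps values in range.
    return np.clip(np.rint(predictions), low, high).astype(int)

# Made-up raw predictions, used only to show the call.
raw = np.array([5.2, 6.7, -0.4, 10.9])
print(to_quality_scores(raw).tolist())
```

With the model above, the call would be `to_quality_scores(y_pred)`.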
  • Evaluate the model

mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)
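
Output

Mean Squared Error: 0.38242835212919696
  • Calculate the mean quality for each category

# Grouping by 'quality' and averaging 'quality' returns each score itself,
# so the result simply lists the quality categories present in the data.
mean_quality = wine.groupby('quality')['quality'].mean()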
  • Find the category with the highest mean quality

best_quality = mean_quality.idxmax()
best_mean_quality = mean_quality.max()
  • Print the summary for the best wine.

print("Summary of Wine Quality:")
print("----------------------------")
print("Best Wine Quality Category:", best_quality)
print("Mean Quality Score:", best_mean_quality)

Output

Summary of Wine Quality:
----------------------------
Best Wine Quality Category: 8
Mean Quality Score: 8.0
  • Find the category with the lowest mean quality

worst_quality = mean_quality.idxmin()
worst_mean_quality = mean_quality.min()
  • Print the summary for the worst wine.

print("Summary of Wine Quality:")
print("----------------------------")
print("Worst Wine Quality Category:", worst_quality)
print("Mean Quality Score:", worst_mean_quality)

Output

Summary of Wine Quality:
----------------------------
Worst Wine Quality Category: 3
Mean Quality Score: 3.0
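
Because the quality column is grouped by itself, the summaries above only report the highest and lowest scores present in the data. A grouping that may be more informative averages the feature columns within each quality level. The sketch below uses a small stand-in frame with made-up values; the column names follow the dataset description, and the choice of columns is illustrative.

```python
import pandas as pd

# Made-up stand-in rows; with the real data, use the `wine` DataFrame instead.
demo = pd.DataFrame({
    "alcohol": [9.0, 10.0, 11.0, 12.0],
    "volatile acidity": [0.8, 0.6, 0.5, 0.3],
    "quality": [5, 5, 6, 6],
})

# Average each selected feature within every quality level.
feature_means = demo.groupby("quality")[["alcohol", "volatile acidity"]].mean()
print(feature_means)
```

Each row of `feature_means` then describes a typical wine at that quality level, which makes differences between levels visible.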

Conclusion

In conclusion, the code reads the wine quality dataset and separates it into input features (X) and the target variable (y). The data are split into training and test sets, and a linear regression model is trained on the training set. Predictions are then made on the test set, and the mean squared error measures how far those predictions are from the actual quality scores.
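
A mean squared error is easier to judge next to a reference point. One common check, sketched here, compares the model against a baseline that always predicts the mean of the training targets, and also reports the RMSE and R² score. The helper name `regression_report` and the sample numbers are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

def regression_report(y_train, y_test, y_pred):
    """Compare model error with a predict-the-training-mean baseline."""
    baseline_pred = np.full(len(y_test), np.mean(y_train))
    mse = mean_squared_error(y_test, y_pred)
    return {
        "baseline_mse": mean_squared_error(y_test, baseline_pred),
        "mse": mse,
        "rmse": float(np.sqrt(mse)),  # same units as the quality score
        "r2": r2_score(y_test, y_pred),
    }

# Made-up values, used only to show the call.
report = regression_report(
    y_train=[5, 6, 5, 6],
    y_test=[5, 6, 7],
    y_pred=[5.5, 6.0, 6.5],
)
print(report)
```

With the tutorial's variables, the call would be `regression_report(y_train, y_test, y_pred)`. A model MSE well below the baseline MSE suggests the features carry useful signal.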

The code also groups the wines by quality score and reports the highest and lowest categories found in the data (8 and 3 in the output above). As a next step, scatter plots, histograms, box plots, bar charts, line plots, correlation heatmaps, and pie charts can be made to show how different features relate to the quality of the wine.
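
As one example of such plots, the sketch below draws a bar chart of how many wines fall in each quality category and a scatter plot of one feature against quality, using matplotlib. The stand-in frame holds made-up values, and the choice of `alcohol` as the feature is an assumption made for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; remove this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in rows; with the real data, use the `wine` DataFrame instead.
demo = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.5, 11.2, 12.0, 12.8],
    "quality": [5, 5, 6, 6, 6, 7],
})

# Number of wines in each quality category, ordered by score.
counts = demo["quality"].value_counts().sort_index()

fig, (ax_bar, ax_scatter) = plt.subplots(1, 2, figsize=(8, 3))
ax_bar.bar(counts.index.astype(str), counts.values)
ax_bar.set_xlabel("quality")
ax_bar.set_ylabel("number of wines")

ax_scatter.scatter(demo["alcohol"], demo["quality"])
ax_scatter.set_xlabel("alcohol")
ax_scatter.set_ylabel("quality")

fig.tight_layout()
# Call plt.show() or fig.savefig(...) here to view or save the figure.
```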

Overall, the code walks through a basic analysis of the wine quality dataset, from loading the data to modeling and evaluating it. It shows how widely used libraries such as Pandas and scikit-learn make the analysis process more accessible, and plotting libraries such as matplotlib and Seaborn can extend it with views of how the data are distributed and how the features relate to each other.

Updated on: 12-Oct-2023
