Exploratory Data Analysis on Iris Dataset
Introduction
In Machine Learning and Data Science, Exploratory Data Analysis (EDA) is the process of examining a dataset and summarizing its main characteristics. It often uses visual methods to represent those characteristics and to build a general understanding of the dataset. It is an essential step in the Data Science lifecycle and often takes up a significant share of a project's time.
In this article, we are going to see some of the characteristics of the Iris dataset through Exploratory Data Analysis.
The Iris Dataset
The Iris Dataset is a very simple dataset, often referred to as the "Hello World" of Machine Learning. It records 4 features for three species of flowers, namely Iris setosa, Iris virginica, and Iris versicolor. These features are sepal length, sepal width, petal length, and petal width. There are 150 data points in the dataset, 50 for each species.
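The class balance described above is easy to check once the data is in a DataFrame. The sketch below is only illustrative: it builds a nine-row stand-in (three rows per species) rather than the full 150-row file, and `class_balance` is a hypothetical helper name, not part of pandas.

```python
import pandas as pd

# Small stand-in sample (three rows per species); the real file has 150 rows.
sample = pd.DataFrame(
    {
        "sepallength": [5.1, 4.9, 4.7, 7.0, 6.4, 6.9, 6.3, 5.8, 7.1],
        "sepalwidth": [3.5, 3.0, 3.2, 3.2, 3.2, 3.1, 3.3, 2.7, 3.0],
        "petallength": [1.4, 1.4, 1.3, 4.7, 4.5, 4.9, 6.0, 5.1, 5.9],
        "petalwidth": [0.2, 0.2, 0.2, 1.4, 1.5, 1.5, 2.5, 1.9, 2.1],
        "class": ["Iris-setosa"] * 3
        + ["Iris-versicolor"] * 3
        + ["Iris-virginica"] * 3,
    }
)


def class_balance(df, label_col="class"):
    """Return the number of rows per class label as a plain dict."""
    return df[label_col].value_counts().to_dict()


# Every column except the label is treated as a feature.
feature_columns = [c for c in sample.columns if c != "class"]
balance = class_balance(sample)
print(len(feature_columns), "features:", feature_columns)
print(balance)
```

On the full dataset the same call should report 50 rows for each of the three species.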
EDA on Iris Dataset
First, let's load the dataset from the CSV file "iris_csv.csv" using pandas and have a general overview of it.
The dataset can be downloaded from the link below.
https://datahub.io/machine-learning/iris/r/iris.csv
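Before exploring, it can help to confirm that the loaded file really has the columns the rest of the article relies on. This is a minimal sketch, assuming the five column names used throughout this article; the inline CSV text stands in for the downloaded file, and `load_iris_csv` and `EXPECTED_COLUMNS` are hypothetical names introduced for illustration.

```python
import io

import pandas as pd

# Column names assumed from the article's DataFrame output.
EXPECTED_COLUMNS = ["sepallength", "sepalwidth", "petallength", "petalwidth", "class"]


def load_iris_csv(source):
    """Read the CSV and fail early if an expected column is missing."""
    df = pd.read_csv(source)
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return df


# Two-row stand-in for the real file; pass a file path instead in practice.
csv_text = (
    "sepallength,sepalwidth,petallength,petalwidth,class\n"
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
)
iris = load_iris_csv(io.StringIO(csv_text))
print(iris.dtypes)
```

Failing at load time keeps a renamed or missing column from surfacing later as a confusing plotting error.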
Code Implementation
Example 1
    import pandas as pd
    import numpy as np
    import seaborn as sns
    import matplotlib.pyplot as plt

    # Jupyter/Colab magic: render plots inline (omit when running as a plain script)
    %matplotlib inline

    df = pd.read_csv("/content/iris_csv.csv")
    df.head()
| | sepallength | sepalwidth | petallength | petalwidth | class |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
Example 2
    df.info()

    RangeIndex: 150 entries, 0 to 149
    Data columns (total 5 columns):
     #   Column       Non-Null Count  Dtype
    ---  ------       --------------  -----
     0   sepallength  150 non-null    float64
     1   sepalwidth   150 non-null    float64
     2   petallength  150 non-null    float64
     3   petalwidth   150 non-null    float64
     4   class        150 non-null    object
    dtypes: float64(4), object(1)
    memory usage: 6.0+ KB

    df.shape

    (150, 5)

    ## Statistics about dataset
    df.describe()
| | sepallength | sepalwidth | petallength | petalwidth |
|---|---|---|---|---|
| count | 150.000000 | 150.000000 | 150.000000 | 150.000000 |
| mean | 5.843333 | 3.054000 | 3.758667 | 1.198667 |
| std | 0.828066 | 0.433594 | 1.764420 | 0.763161 |
| min | 4.300000 | 2.000000 | 1.000000 | 0.100000 |
| 25% | 5.100000 | 2.800000 | 1.600000 | 0.300000 |
| 50% | 5.800000 | 3.000000 | 4.350000 | 1.300000 |
| 75% | 6.400000 | 3.300000 | 5.100000 | 1.800000 |
| max | 7.900000 | 4.400000 | 6.900000 | 2.500000 |
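Each row of the `describe()` summary can be reproduced with an individual pandas call, which makes it clearer what the table reports. The sketch below uses a short made-up series, not the Iris data. Note that pandas computes the sample standard deviation (`ddof=1`) and linearly interpolated quantiles by default.

```python
import pandas as pd

# Hypothetical values chosen only to keep the arithmetic easy to follow.
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

manual = {
    "count": float(values.count()),
    "mean": values.mean(),
    "std": values.std(),  # sample standard deviation, ddof=1
    "min": values.min(),
    "25%": values.quantile(0.25),
    "50%": values.quantile(0.50),  # the median
    "75%": values.quantile(0.75),
    "max": values.max(),
}

# Compare each hand-computed statistic with the describe() row of the same name.
summary = values.describe()
for name, value in manual.items():
    print(f"{name:>5}: {value:.4f}  (describe: {summary[name]:.4f})")
```

Reading the Iris table this way, the gap between the 25% and 50% values of petallength (1.6 versus 4.35) already hints that the column mixes two quite different groups of flowers.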
Example 3
    ## Checking for null values
    df.isnull().sum()

    sepallength    0
    sepalwidth     0
    petallength    0
    petalwidth     0
    class          0
    dtype: int64

    ## Univariate analysis
    df.groupby('class').agg(['mean', 'median'])  # passing a list of recognized strings

    # Same aggregation passing NumPy functions (newer pandas versions may warn and suggest the string names)
    df.groupby('class').agg([np.mean, np.median])
| class | sepallength mean | sepallength median | sepalwidth mean | sepalwidth median | petallength mean | petallength median | petalwidth mean | petalwidth median |
|---|---|---|---|---|---|---|---|---|
| Iris-setosa | 5.006 | 5.0 | 3.418 | 3.4 | 1.464 | 1.50 | 0.244 | 0.2 |
| Iris-versicolor | 5.936 | 5.9 | 2.770 | 2.8 | 4.260 | 4.35 | 1.326 | 1.3 |
| Iris-virginica | 6.588 | 6.5 | 2.974 | 3.0 | 5.552 | 5.55 | 2.026 | 2.0 |
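When only a few statistics are needed, pandas named aggregation gives flat, self-describing column names instead of a two-level header. The sketch below runs on a six-row stand-in sample rather than the full dataset, and the output column names are hypothetical choices.

```python
import pandas as pd

# Six-row stand-in (two rows per species) for the full Iris DataFrame.
sample = pd.DataFrame(
    {
        "petallength": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
        "class": [
            "Iris-setosa",
            "Iris-setosa",
            "Iris-versicolor",
            "Iris-versicolor",
            "Iris-virginica",
            "Iris-virginica",
        ],
    }
)

# Named aggregation: new_column=(source_column, aggregation_name)
per_class = sample.groupby("class").agg(
    petallength_mean=("petallength", "mean"),
    petallength_median=("petallength", "median"),
)
print(per_class)
```

String names such as `"mean"` are used here on purpose, since they avoid the warning that some recent pandas versions emit when NumPy functions are passed to `agg`.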
Example 4
    ## Box plot
    plt.figure(figsize=(8, 4))
    sns.boxplot(x='class', y='sepalwidth', data=df, palette='YlGnBu')
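A box plot is a picture of a handful of numbers: the quartiles, the interquartile range (IQR), and the whisker limits. The sketch below computes them for one column under the common 1.5 × IQR whisker convention. The data is a short made-up series, and `box_stats` is a hypothetical helper rather than a seaborn or matplotlib function.

```python
import pandas as pd


def box_stats(series, whis=1.5):
    """Quartiles, IQR, whisker ends, and outliers as a box plot would show them."""
    q1, median, q3 = series.quantile([0.25, 0.50, 0.75])
    iqr = q3 - q1
    lower_fence = q1 - whis * iqr
    upper_fence = q3 + whis * iqr
    # Whiskers stop at the most extreme points that are still inside the fences.
    inside = series[(series >= lower_fence) & (series <= upper_fence)]
    outliers = series[(series < lower_fence) | (series > upper_fence)]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": inside.min(),
        "whisker_high": inside.max(),
        "outliers": outliers.tolist(),
    }


# Hypothetical widths in cm; 20.0 is planted as an obvious outlier.
widths = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 20.0])
stats = box_stats(widths)
print(stats)
```

Calling `box_stats(df.loc[df['class'] == 'Iris-setosa', 'sepalwidth'])` on the real data would give the numbers behind a single box in the plot.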
Example 5
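    ## Distribution of a particular feature (petal width)
    # sns.distplot is deprecated; histplot with kde=True and stat='density' draws a comparable plot
    sns.histplot(df['petalwidth'], bins=40, color='b', kde=True, stat='density')
    plt.title('petal width distribution plot')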
Example 6
    ## Count of the number of observations of each species
    sns.countplot(x='class', data=df)
Example 7
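    ## Correlation map using a heatmap matrix
    # numeric_only=True skips the text 'class' column, which corr() rejects in pandas 2.0 and later
    sns.heatmap(df.corr(numeric_only=True), linecolor='white', linewidths=1)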
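Each cell of the heatmap is a Pearson correlation coefficient between two numeric columns. As a sketch of what `corr()` computes, the block below derives one coefficient by hand on a made-up pair of columns and compares it with the pandas result. The `label` column is included to show why `numeric_only=True` (available in pandas 1.5 and later) is useful when a text column is present; `pearson` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements; "label" is non-numeric, like the article's class column.
toy = pd.DataFrame(
    {
        "length": [1.0, 2.0, 3.0, 4.0, 5.0],
        "width": [2.0, 4.0, 5.0, 4.0, 5.0],
        "label": ["a", "a", "b", "b", "b"],
    }
)


def pearson(x, y):
    """Pearson r: sum of co-deviations over the root of the product of squared deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))


r_manual = pearson(toy["length"], toy["width"])
corr_matrix = toy.corr(numeric_only=True)  # the text column is skipped
print(round(r_manual, 4))
print(corr_matrix)
```

Values near 1 or -1 indicate a strong linear relationship, while values near 0 indicate little linear relationship; the diagonal is always 1 because every column is perfectly correlated with itself.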
Example 8
    ## Multivariate analysis: analysis between two or more variables or features
    ## Scatter plot to see the relation between two or more features like sepal length, petal length, etc.
    axis = plt.axes()
    axis.scatter(df.sepallength, df.sepalwidth)
    axis.set(xlabel='Sepal_Length (cm)', ylabel='Sepal_Width (cm)', title='Sepal-Length vs Width');
Example 9
    sns.scatterplot(x='sepallength', y='sepalwidth', hue='class', data=df)
    plt.show()
Example 10
    ## From the above graph we can see that
    # Iris-virginica has the longest sepals, while Iris-setosa has the widest
    # Setosa's sepals are still longer than they are wide (mean 5.0 cm vs 3.4 cm), but they are wide relative to their length

    ## Below is the frequency histogram plot of all features
    axis = df.plot.hist(bins=30, alpha=0.5)
    axis.set_xlabel('Size in cm');
Example 11
    # From the above graph we can see that sepalwidth reaches the highest frequency of any feature, followed by petalwidth

    ## Examining correlation
    sns.pairplot(df, hue='class')
Example 12
    figure, ax = plt.subplots(2, 2, figsize=(8, 8))

    ax[0, 0].set_title("sepallength")
    ax[0, 0].hist(df['sepallength'], bins=8)

    ax[0, 1].set_title("sepalwidth")
    ax[0, 1].hist(df['sepalwidth'], bins=6);

    ax[1, 0].set_title("petallength")
    ax[1, 0].hist(df['petallength'], bins=5);

    ax[1, 1].set_title("petalwidth")
    ax[1, 1].hist(df['petalwidth'], bins=5);
Example 13
    # From the above plot we can see that:
    # - Sepal length: the highest frequency lies between 5.5 cm and 6 cm, at about 30-35 observations
    # - Petal length: the highest frequency lies between 1 cm and 2 cm, at about 50 observations
    # - Sepal width: the highest frequency lies between 3 cm and 3.5 cm, at about 70 observations
    # - Petal width: the highest frequency lies between 0 cm and 0.5 cm, at about 40-45 observations
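The tallest bar of a histogram can also be read off numerically, which avoids estimating counts by eye. This is a minimal sketch using `numpy.histogram` on made-up values; `busiest_bin` is a hypothetical helper, and on the real data one would pass a column such as `df['sepallength']` together with the same `bins` value used in the plot.

```python
import numpy as np


def busiest_bin(values, bins):
    """Return (left_edge, right_edge, count) for the fullest histogram bin."""
    counts, edges = np.histogram(values, bins=bins)
    i = int(np.argmax(counts))  # index of the first bin with the maximum count
    return float(edges[i]), float(edges[i + 1]), int(counts[i])


# Hypothetical lengths in cm, clustered between 1 and 2.
lengths = [1.0, 1.2, 1.4, 1.5, 1.6, 3.0, 4.5, 5.0]
left, right, count = busiest_bin(lengths, bins=4)
print(f"{left:.1f}-{right:.1f} cm: {count} observations")
```

Because the bin edges depend on the number of bins and the range of the data, the reported interval changes whenever `bins` changes, so the counts should always be quoted together with the bin setting.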
Conclusion
Exploratory Data Analysis is widely used by both Data Scientists and Analysts. It reveals a lot about the characteristics of the given data, its distribution, and how it can be useful.
