Regression Analysis and the Best Fitting Line using C++


Regression analysis is one of the most basic forms of predictive analysis.

In Statistics, linear regression is the approach of modeling the relationship between a scalar response (the dependent variable) and one or more explanatory variables.

In Machine learning, Linear Regression is a supervised algorithm. Such an algorithm predicts a target value based on independent variables.

More About Linear Regression and Regression Analysis

In linear regression, the target is a real, continuous value such as salary or BMI. It is generally used to model the relationship between one dependent variable and one or more independent variables. These models usually fit a linear equation; other types of regression, such as higher-order polynomial regression, also exist.

Before fitting a linear model to the data, it is necessary to check whether the variables are roughly linearly related. A scatterplot of the data usually makes this evident. The goal of the algorithm/model is to find the best-fitting line.
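One common numeric complement to the scatterplot is the Pearson correlation coefficient r, which is close to +1 or -1 when the points lie near a straight line and close to 0 when they do not. The sketch below is only an illustration under stated assumptions: `pearson_r` is a hypothetical helper name, both vectors are assumed to have the same length (at least two points), and neither variable is assumed to be constant.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Pearson correlation coefficient between x and y.
// Assumes x.size() == y.size() >= 2 and that neither x nor y is constant
// (otherwise the denominator is zero).
double pearson_r(const std::vector<double>& x, const std::vector<double>& y) {
    const double n = static_cast<double>(x.size());
    double sum_x = 0.0, sum_y = 0.0, sum_x2 = 0.0, sum_y2 = 0.0, sum_xy = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        sum_x  += x[i];
        sum_y  += y[i];
        sum_x2 += x[i] * x[i];
        sum_y2 += y[i] * y[i];
        sum_xy += x[i] * y[i];
    }
    const double cov   = n * sum_xy - sum_x * sum_y;  // scaled covariance
    const double var_x = n * sum_x2 - sum_x * sum_x;  // scaled variance of x
    const double var_y = n * sum_y2 - sum_y * sum_y;  // scaled variance of y
    return cov / std::sqrt(var_x * var_y);
}
```

A value of r near zero suggests that a straight line is unlikely to describe the data well, although it does not rule out a non-linear relationship.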

In this article, we are going to explore Linear Regression Analysis and its implementation using C++.

The linear regression equation has the form Y = c + mX, where Y is the target variable and X is the independent or explanatory variable. m is the slope of the regression line and c is the intercept. Since this is a 2-dimensional regression task, the model tries to find the line of best fit during training. The points do not all have to lie exactly on the line: some data points may lie on it and others may be scattered around it.

The vertical distance between the line and a data point is the residual. It is positive when the point lies above the line and negative when it lies below. Residuals measure how well the line fits the data. The least-squares method used in this article chooses the line that minimizes the sum of the squared residuals.

The residual for each observation is the difference between the observed value of y (the dependent variable) and the predicted value of y:

$$\mathrm{Residual\: =\: actual\: y\: value\: -\: predicted\: y\: value}$$


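A common metric for evaluating the performance of a linear regression model is the root mean squared error, or RMSE. The basic idea is to measure how far the model's predictions are from the actual observed values.

So, a high RMSE is “bad” and a low RMSE is “good”.

RMSE is given as

$$\mathrm{RMSE\: =\: \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}}$$

where n is the number of observations, y_i is the actual value and ŷ_i is the predicted value for observation i.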
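RMSE is the square root of the mean of the squared residuals. The following minimal sketch computes it from a list of actual values and a list of predicted values. The function name `rmse` is hypothetical, and the two vectors are assumed to be non-empty and of equal length.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Root mean squared error between observed and predicted values.
// Assumes both vectors have the same, non-zero length.
double rmse(const std::vector<double>& actual,
            const std::vector<double>& predicted) {
    double sum_sq = 0.0;
    for (std::size_t i = 0; i < actual.size(); ++i) {
        const double residual = actual[i] - predicted[i];
        sum_sq += residual * residual;
    }
    return std::sqrt(sum_sq / static_cast<double>(actual.size()));
}
```

Because the residuals are squared before averaging, a few large errors raise the RMSE more than many small ones.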


Implementation using C++
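The program in this section computes the line with the standard closed-form least-squares estimates. In the program's notation, a is the intercept (c in Y = c + mX) and b is the slope (m), and all sums run over the n data points:

$$b = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{n\sum x_i^2 - \left(\sum x_i\right)^2}, \qquad a = \frac{\sum y_i - b\sum x_i}{n}$$

As a worked check with the five points (2, 5), (5, 7), (2, 6), (8, 9), (2, 7) used in the sample run in this section: Σx = 19, Σy = 34, Σx² = 101 and Σxy = 143, so

$$b = \frac{5 \cdot 143 - 19 \cdot 34}{5 \cdot 101 - 19^2} = \frac{69}{144} \approx 0.479167, \qquad a = \frac{34 - 0.479167 \cdot 19}{5} \approx 4.97917$$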

#include <iostream>
#define N 50
using namespace std;

int main() {
    int n = 0, i;
    // Data is stored at indices 1..n, so the arrays need N + 1 elements.
    float x[N + 1], y[N + 1], sum_x = 0, sum_x2 = 0, sum_y = 0, sum_xy = 0, a, b;

    /* Input */
    cout << "Please enter the number of data points..";
    cin >> n;
    if (n < 2 || n > N) {
        cout << "The number of data points must be between 2 and " << N << endl;
        return 1;
    }
    cout << "Enter data:" << endl;
    for (i = 1; i <= n; i++) {
        cout << "x[" << i << "] = ";
        cin >> x[i];
        cout << "y[" << i << "] = ";
        cin >> y[i];
    }

    /* Calculating Required Sum */
    for (i = 1; i <= n; i++) {
        sum_x = sum_x + x[i];
        sum_x2 = sum_x2 + x[i] * x[i];
        sum_y = sum_y + y[i];
        sum_xy = sum_xy + x[i] * y[i];
    }

    /* Calculating a (intercept) and b (slope) */
    // The denominator is zero when all x values are identical.
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x * sum_x);
    a = (sum_y - b * sum_x) / n;

    /* Displaying value of a and b */
    cout << "Calculated value of a is " << a << " and b is " << b << endl;
    cout << "Equation of best fit line is: y = " << a << " + " << b << "x";
    return 0;
}


Please enter the number of data points..5
Enter data:
x[1] = 2
y[1] = 5
x[2] = 5
y[2] = 7
x[3] = 2
y[3] = 6
x[4] = 8
y[4] = 9
x[5] = 2
y[5] = 7
Calculated value of a is 4.97917 and b is 0.479167
Equation of best fit line is: y = 4.97917 + 0.479167x


Regression analysis is a simple yet powerful technique for predictive analysis in both Machine Learning and Statistics. Its strength lies in its simplicity and in the assumption of a linear relationship between the independent and target variables.