Linear Regression — From EDA to Model Optimization (Part 1)

Alok Pandey
8 min read · Sep 19, 2022


In this blog, we build a model from scratch for estimating the crew size a potential ship buyer will need, using the cruise_ship_info.csv dataset.


This blog dives deep into theoretical and practical concepts in Machine Learning and Data Science to help you thoroughly understand the points below —

Why this model (linear or non-linear)?

How to select features, both by analyzing the data and via feature-selection algorithms?

Why and how to do regression analysis? The significance of residual analysis, and how to interpret the actual vs. predicted graph.

Why and how to do a goodness-of-fit test for model evaluation? R-squared and adjusted R-squared.

How to optimize the model for better estimates? Grid search.

This blog will also highlight common issues like multicollinearity, heteroscedasticity, overfitting, and underfitting, their impact on model performance, and solutions to these problems.

This blog is the 1st part of the series and will cover EDA.

So, let's dive into this journey from Why to How!

Exploratory Data Analysis

EDA is all about understanding the data: basic statistics (mean, standard deviation, range, and so on), its distribution, data types, identifying missing values or NaNs, understanding the relationships between independent variables (IVs) and dependent variables (DVs), etc.

The better we understand the data, the more powerful our model will be.

1. Inspect columns and their data types, and check for null values
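A minimal sketch of this inspection step, assuming pandas is installed and the cruise_ship_info.csv file mentioned above sits in the working directory:

```python
import pandas as pd

# Load the dataset (the file path is an assumption; adjust to your setup)
df = pd.read_csv("cruise_ship_info.csv")

# Column names, non-null counts, and data types in one view
df.info()

# Levels of the categorical columns, with counts in descending order
print(df["Cruise_line"].value_counts())
print(df["Ship_name"].value_counts())
```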

1. data types info, 2. levels of column Cruise_line, 3. levels of Ship_name — Image by Author

Observations from the above images —

The first image displays information about the columns present in the dataset. We can see that the dataset contains no null values. It gives us the data types as well, which highlights that Ship_name and Cruise_line are categorical columns.

The second image gives information about the levels (categories) in Cruise_line, along with the count of data points in each category, in descending order. Cruise_line has 20 levels, and we can see that 70% of the data is distributed among the top 5 levels of this column. Keeping in mind that this column can have business importance, these levels can be included during the model fit, and whether they are statistically significant or not can be identified later during model optimization.

The third image gives information about the levels present in the Ship_name column. Ship_name has 138 categories, which is close to the total number of rows in the dataset. Using these categories would increase the complexity of the model. It even breaks the general rule of thumb of regression that at least 10 data points per independent variable are needed for a good fit. Hence, Ship_name can be left out during the model fit.

2. Inspect statistical characteristics of data
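These statistics come straight out of pandas; a one-line sketch, reusing the df loaded above:

```python
# Count, mean, std, min, quartiles, and max for every numeric column
print(df.describe().transpose())
```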

Data statistics

Observations from basic characteristics —

The count of each variable is 158, which indicates that the dataset contains no missing values.

The ranges (min and max values) of the variables are different, which means we need to standardize or normalize the data before feeding it to the model (a sketch follows below).
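As a hedged sketch of that preprocessing step, using scikit-learn's StandardScaler (which columns to scale, and whether to standardize or min-max normalize, is a modeling choice):

```python
from sklearn.preprocessing import StandardScaler

# Standardize the numeric columns to zero mean and unit variance
numeric_cols = df.select_dtypes(include="number").columns
df_scaled = df.copy()
df_scaled[numeric_cols] = StandardScaler().fit_transform(df[numeric_cols])
```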

3. Pair-plot data to understand distribution and relationship

A pair plot is one of the best techniques to visually analyze the distribution of the data and identify the relationships between variables. Identifying a relationship means finding out whether variables are linearly related to each other or have a non-linear relation. This insight can help in deciding what type of model (linear or non-linear) will fit this task.
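A minimal sketch with seaborn, plotting every numeric column against every other:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Diagonal: distribution of each variable; off-diagonal: pairwise scatter plots
sns.pairplot(df.select_dtypes(include="number"))
plt.show()
```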

Pairplot of variables in Data. — Image by Author

Observations from Pair-plot —

The independent variables Tonnage, passengers, length, and cabins are highly correlated with the response variable crew, and the correlation with crew is positive. The strength of the correlation can be verified with the help of a correlation matrix as well.

The presence of linear correlation indicates that a linear model will be a good fit for this dataset.

The presence of multicollinearity can be sensed among the 4 independent variables Tonnage, passengers, length, and cabins.

In the above observations, I have mentioned terms like correlation and multicollinearity. Let's understand what they mean and whether their presence is a good or bad sign for the model.

Correlation is a statistical relationship between two random variables such that an increase or decrease in one changes the value of the other.

The Pearson correlation coefficient (denoted by r) is a measure of the linear association between two variables. The value of r ranges between −1 and 1.

When,

  • r = 0 means that there is no linear association between the variables
  • r = 1 means a perfect positive linear relationship exists between the variables
  • r > 0 means positive correlation
  • r < 0 means negative correlation
  • |r| > 0.5 means a strong correlation between the variables

As we saw in the pair plot, the variables have a linear association; let's verify the strength of that relationship with the help of a correlation matrix.

4. Plot Pearson Correlation matrix

A correlation matrix is a table that summarizes the Pearson correlation coefficients between all pairs of variables in a dataset.
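A sketch of how such a plot can be produced with pandas and seaborn:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Pearson correlation coefficients between all numeric columns
corr = df.select_dtypes(include="number").corr(method="pearson")

# Annotated heatmap; values near +/-1 indicate strong linear association
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.show()
```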

Correlation Matrix plot — Image by Author

Observations from Correlation Matrix plot:

Tonnage, passengers, length, and cabins have r > 0.9 with crew, which indicates a very strong positive correlation with the response variable (crew).

Strong positive correlation among these predictors themselves can also be seen. This is termed multicollinearity, i.e., correlation present among several independent variables.

Now the point is: why is it important to check for multicollinearity? Is multicollinearity a potential problem? Yes or no; it depends on whether our focus is on interpreting the statistical significance of the regression coefficients or just on prediction.

Let's understand why it's a problem, the cause of multicollinearity, and how to fix it.
Why it's a problem lies in the interpretation of regression coefficients. A regression coefficient is interpreted as the mean change in the dependent variable for a 1-unit change in the corresponding independent variable, holding all other variables constant.
Let's understand this with an example. A simple linear regression equation with 2 variables is:

y = αx1 + βx2 + c + e,

where α and β are the coefficients of x1 and x2, c is the intercept, and e is the error term. Holding x2 fixed, a 1-unit change in x1 produces a mean change of α in y. But if x1 and x2 are correlated, a change in x2 also moves x1, because of which we cannot determine the significance of α. The estimate of y, however, will not get affected by this.
How to detect multicollinearity?
Well, there are several ways to detect multicollinearity, but this post will discuss the most popular ones.
1. By visually analyzing the data using a pair plot.
2. Using a correlation matrix.

These 2 are already explained above.
3. Using the Variance Inflation Factor (VIF)

What is VIF? It's a technique in which each variable is regressed against the remaining variables, and a score is calculated by the formula below:

VIF = 1 / (1 − R²)

Here, R-squared is the coefficient of determination, whose value lies between 0 and 1. It's an evaluation metric for regression models. For now, keep this in an envelope; we will cover it in detail in the 2nd part of this series.
Generally, if a variable has a VIF greater than 5, it is considered highly collinear with the other variables, which means high multicollinearity.
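A sketch of computing VIFs with statsmodels, assuming crew is the response column as described above:

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Predictors only; add a constant so each auxiliary regression has an intercept
X = add_constant(df.select_dtypes(include="number").drop(columns=["crew"]))

# VIF_i = 1 / (1 - R_i^2), where R_i^2 comes from regressing
# column i on all the remaining columns
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif)  # values above 5 flag highly collinear predictors
```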

How to fix multicollinearity?
1. Remove variables having VIF >= 5.
2. Linearly combine 2 or more variables.
3. Use Principal Component Analysis (PCA).
4. The regularization technique LASSO can also be used to remove multicollinearity (see the sketch below).
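As an illustration of fix 4, a hedged sketch with scikit-learn's LassoCV, assuming X and y hold the predictor matrix and the crew target (assembled later in the feature-selection step):

```python
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Scale first: the L1 penalty is sensitive to feature scale
model = make_pipeline(StandardScaler(), LassoCV(cv=5))
model.fit(X, y)

# Coefficients shrunk exactly to zero mark predictors LASSO dropped
print(model.named_steps["lassocv"].coef_)
```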

A question that pops up in mind due to multicollinearity:
Can removing multicollinearity improve model prediction? Well, in general multicollinearity has no effect on prediction, so fixing it is not required. But in cases where we have a high number of predictors (> 10), fixing multicollinearity can improve model performance and precision, because a high number of predictors increases complexity, which results in overfitting, and the model will not be able to generalize well on test data. Hence, in that scenario, fixing multicollinearity can be helpful. But always keep in mind that while fixing multicollinearity we remove variables having high VIF, and this tends to cause a loss of information. So we might lose a variable that has business significance for estimating the response variable.

5. Feature Selection

This is the most important part of model building. Once we have clean, preprocessed data and have understood the relationships among the variables and with the dependent variable, the feature selection part comes in. To build a significant and precise model, it's important to select statistically significant features.

There are several ways to do feature selection:
- With the help of the pair plot and correlation matrix: select variables that have a high correlation with the dependent variable. The correlation can be seen in the pair plot as well.
- Stepwise Regression
- Forward Selection
- Backward Elimination
- Mutual Information Score

The last 4 methods need an understanding of p-value, t-value, R-squared, and adjusted R-squared, so we will cover these methods and terms in the next post of this series.
Now, coming back to the first method (using the correlation matrix), the list of selected features is:
Selected_list = ['Tonnage', 'passengers', 'length', 'cabins', 'Cruise_line_Royal_Caribbean', 'Cruise_line_Carnival', 'Cruise_line_Princess', 'Cruise_line_Holland_American', 'Cruise_line_Costa', 'Cruise_line_Norwegian', 'Cruise_line_Celebrity']
Base_list = ['Tonnage', 'passengers', 'length', 'cabins', 'passenger_density', 'Cruise_line_Royal_Caribbean', 'Cruise_line_Carnival', 'Cruise_line_Princess', 'Cruise_line_Holland_American', 'Cruise_line_Costa', 'Cruise_line_Norwegian', 'Cruise_line_Celebrity']
In the selected list, {'Tonnage', 'passengers', 'length', 'cabins'} are selected because they are highly correlated with crew, and the Cruise_line_* columns (one-hot encoded predictors of Cruise_line) are selected considering business impact (a sketch of assembling these follows below). In the next blog, we will see how the model behaves when fed these sets of features, and we will try to optimize the model's performance.
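A sketch of how these feature lists can be assembled, assuming pandas' get_dummies produces the Cruise_line_* column names shown above (the exact names depend on the raw values in Cruise_line):

```python
import pandas as pd

# One-hot encode Cruise_line and drop Ship_name (138 levels, see above)
df_model = pd.get_dummies(
    df.drop(columns=["Ship_name"]), columns=["Cruise_line"], prefix="Cruise_line"
)

# Features kept for the model fit, per the selected list above
numeric_features = ["Tonnage", "passengers", "length", "cabins"]
cruise_dummies = [c for c in df_model.columns if c.startswith("Cruise_line_")]

X = df_model[numeric_features + cruise_dummies]
y = df_model["crew"]
```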


Thank You !

I hope you enjoyed reading this article. If you liked it, please give it a clap. This is my first blog ever, so your clap will boost my confidence to write more stories. I will keep sharing such stories with an end-to-end approach to practical problems and the theories behind them.

Also, do follow me on LinkedIn and GitHub.

HAPPY LEARNING!
