Multivariate Linear Regression — Luggage Bags Cost Prediction

7 min readAug 11, 2021

--

Problem Statement

We would want to predict the cost of the luggage bags based on the information we have on 160 different bags associated to a particular industry. These bags have certain attributes as described below:

Height
Weight
Weight1 (Weight that can be carried after expansion)
Width
Length

The company now wants to predict the cost they should set for new variant of these kinds of bags. We would want to build a prediction model to achieve this.

Understanding the data

After making the necessary imports (refer to the code sample in the end), lets look at a sample data set —

data = pd.read_csv("Dataset_multivariatelinearregression.csv")

As we see, all of these attributes of the bags are numerical values, understanding this data better:

Exploratory Data Analysis

Exploratory data analysis (EDA) is used by data scientists to analyze and investigate data sets and summarize their main characteristics, often employing data visualization methods. It helps determine how best to manipulate data sources to get the answers you need, making it easier for data scientists to discover patterns, spot anomalies, test a hypothesis, or check assumptions.

EDA is primarily used to see what data can reveal beyond the formal modeling or hypothesis testing task and provides a provides a better understanding of data set variables and the relationships between them. It can also help determine if the statistical techniques you are considering for data analysis are appropriate. Originally developed by American mathematician John Tukey in the 1970s, EDA techniques continue to be a widely used method in the data discovery process today.

Read more here.

Checking for data types and presence of null values:

Removing duplicate values:

Understanding the distribution of cost across the bags:

We can see that the cost mostly lies between 0–500 and the number of bags having cost higher is gradually decreasing as the price is increasing.

Understanding the influence of each of the metrics on cost

Influence of Weight on Cost:

Clearly indicates that cost is dependent on weight. Cost increases as weight increases.

Influence of Length on Cost:

Again, this clearly indicates that cost is dependent on length. Cost increases as length increases.

Influence of Height on Cost:

From this graph we cannot clearly conclude if the Cost depends on the Height of the Bag. This needs to be further analyzed.

Influence of Width on Cost:

This again indicates that there is a relation between cost and width of the bag. Its not exactly progressive, as there are some outliers. However, cost is definitely dependent on width of the bag.

Influence of Weight1 (Weight that can be carried after expansion) on Cost:

This also indicates that there is a relation between cost and expanded weight of the bag. Its not exactly progressive, as there are some outliers. However, cost is definitely dependent on expanded weight of the bag.

Understanding the influence of each of these attributes on the other:

Looking at the pair plots:

Generating the correlation between the attributes:

Plotting the correlation diagram:

Most attributes here show that they directly influence the cost of the bag. Color is some shade of green. However, here, we can again see that Cost is not as dependent on Height as the other features. The next step is to look for the outliers by generating box plots for all attributes.

Outlier Detection and its importance:

Outlier detection has been used for many decades to detect points that are considered “abnormal,” or which don’t fit a particular pattern. Because of its highly practical nature, outlier detection is used in many practical use cases. The most famous examples include the detection of (financial) fraud and the detection of ‘malicious’ chatter by intelligence agencies. Because outlier detection algorithms are so useful for any organization, this article explores common outlier detection techniques and their application to Big Data environments.

Outlier detection is a summary term for a broad spectrum of outlier detection techniques. Over the years, many different terminology has arisen that is similar in nature, such as novelty detection, anomaly detection, noise detection, deviation detection and exception mining.

Building Regression Models

Generating the data required for the Model:

Using Simple Linear Regression Model

Using all the features to determine the cost and building a Linear Regression Model using the LinearRegression available on scikit-learn.

Mean squared error = 9552.62
Root mean squared error = 97.74
Variance score = 0.90

Removing Height as a feature to check if that is a better model:

Mean squared error = 11678.68
Root mean squared error = 108.07
Variance score = 0.88

Comparing the MSE we can see that the Height as a feature determines the value of the cost as holding the value has resulted in a higher variance score and lower RMSE value.

Using Simple Gradient Descent

Let’s try to build the model using the concepts on Gradient Descent and simple Python Code.

Defining the cost function:

Defining the function for Gradient Descent:

Running the algorithm:

Iterations used in the range (1,200000): 60000
w0: 372.087,
w1: 966.231,
w2: -503.842,
w3: -396.592,
w4: 161.433 and
w5: 108.256
Mean Squared Error of training data: 66481685.81
Mean squared error = 9552.83
Root mean squared error = 97.74
Variance score = 0.90

Visualizing the data:

Equation from Gradient Descent Algorithm is Cost = w0+ w1Weight + w2Weight1 + w3Length + w4 Height + w5*Width

Generating predicted y with the model parameters:

gd_y_pred = n[0][0] + X_test[“Weight”]n[1][0] + X_test[“Weight1”]n[2][0] + X_test[“Length”]n[3][0] + X_test[“Height”]n[4][0] +X_test[“Width”]*n[5][0]

Comparing r2 score with increasing number of iterations:

Using Stochastic Gradient Descent

Iterations used in the range (1,5000): 254
w0: 368.499,
w1: 978.475,
w2: -485.503,
w3: -428.245,
w4: 169.184 and
w5: 100.895
Mean Squared Error of training data: 286144.02
Mean squared error = 9489.61
Root mean squared error = 97.41
Variance score = 0.90

Relating r2 score with iterations:

Using Mini Batch Gradient Descent

No of Iterations in the range (1,3000) that yielded the results: 89
w0: 370.914,
w1: 954.466,
w2: -492.885,
w3: -400.046,
w4: 166.143 and
w5: 105.642
Mean squared error = 9521.69
Root mean squared error = 97.58
Variance score = 0.90

Relating r2 score with iterations:

Using SGDRegressor

Mean squared error = 10673.77
Root mean squared error = 103.31
Variance score = 0.89

Summary

Comparing the results across the various algorithms:

Comparing the error scores

Comparing the Accuracy Scores:

Mean Squared Error Comparison

Root Mean Squared Error Comparison

R2 Score Comparison

All of these algorithms have the potential to perform even better provided they are tuned further.

Here is a link to the code and the sample data used.

Additional References and Notes