Multivariate Linear Regression — Luggage Bags Cost Prediction
Problem Statement
We want to predict the cost of luggage bags based on the information we have on 160 different bags associated with a particular industry. These bags have certain attributes, as described below:
- Height
- Weight
- Weight1 (Weight that can be carried after expansion)
- Width
- Length
The company now wants to predict the cost it should set for a new variant of these kinds of bags. We will build a prediction model to achieve this.
Understanding the data
After making the necessary imports (refer to the code sample at the end), let's look at a sample of the data set —
data = pd.read_csv("Dataset_multivariatelinearregression.csv")
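The snippets throughout this walkthrough build on a handful of standard imports; a minimal set would look like this (scikit-learn pieces are imported where they are used):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
```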
As we can see, all of these attributes of the bags are numerical values. Let's understand this data better.
Exploratory Data Analysis
Exploratory data analysis (EDA) is used by data scientists to analyze and investigate data sets and summarize their main characteristics, often employing data visualization methods. It helps determine how best to manipulate data sources to get the answers you need, making it easier for data scientists to discover patterns, spot anomalies, test a hypothesis, or check assumptions.
EDA is primarily used to see what data can reveal beyond the formal modeling or hypothesis testing task, and it provides a better understanding of data set variables and the relationships between them. It can also help determine whether the statistical techniques you are considering for data analysis are appropriate. Originally developed by American mathematician John Tukey in the 1970s, EDA techniques continue to be a widely used method in the data discovery process today.
Read more here.
Checking for data types and presence of null values:
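With pandas, each of these checks is typically a one-liner; a sketch of this step:

```python
# Column data types and non-null counts in one view
data.info()

# Number of missing values per column
print(data.isnull().sum())
```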
Removing duplicate values:
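A sketch of the deduplication step:

```python
# Drop exact duplicate rows, keeping the first occurrence
data = data.drop_duplicates()
print(data.shape)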
Understanding the distribution of cost across the bags:
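Assuming the target column is named "Cost" (the name should be matched to the actual CSV header), a histogram shows the distribution:

```python
# Histogram of the target to see how cost is distributed across the bags
sns.histplot(data["Cost"], bins=20)
plt.xlabel("Cost")
plt.ylabel("Number of bags")
plt.show()
```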
We can see that the cost mostly lies between 0 and 500, and the number of bags decreases gradually as the cost increases.
Understanding the influence of each of the metrics on cost
Influence of Weight on Cost:
Influence of Length on Cost:
Influence of Height on Cost:
Influence of Width on Cost:
Influence of Weight1 (Weight that can be carried after expansion) on Cost:
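Each of the plots above is a simple scatter of one attribute against cost; a minimal sketch to reproduce them (column names assumed to match the attribute list):

```python
# One scatter plot per feature against cost
features = ["Weight", "Length", "Height", "Width", "Weight1"]
for col in features:
    plt.scatter(data[col], data["Cost"])
    plt.xlabel(col)
    plt.ylabel("Cost")
    plt.title(f"Influence of {col} on Cost")
    plt.show()
```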
Understanding the influence of each of these attributes on the other:
Looking at the pair plots:
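A pair plot of the whole frame is a single seaborn call:

```python
# Pairwise scatter plots (and per-attribute distributions) for all attributes
sns.pairplot(data)
plt.show()
```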
Generating the correlation between the attributes:
Plotting the correlation diagram:
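A sketch covering both steps, the correlation matrix and its heatmap:

```python
# Correlation matrix between the attributes
corr = data.corr()
print(corr)

# Heatmap of the correlation matrix
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.show()
```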
Outlier Detection and its importance:
Outlier detection has been used for many decades to detect points that are considered “abnormal,” or which don’t fit a particular pattern. Because of its highly practical nature, outlier detection is used in many real-world applications; the most famous examples include the detection of (financial) fraud and the detection of ‘malicious’ chatter by intelligence agencies.
Outlier detection is an umbrella term for a broad spectrum of techniques. Over the years, many similar terms have arisen, such as novelty detection, anomaly detection, noise detection, deviation detection and exception mining.
Read more here.
Generating the box-plots to identify outliers:
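One way to draw a box plot per attribute (a sketch; the layout in the original plots may differ):

```python
# One box plot per attribute to spot values far outside the quartiles
data.plot(kind="box", subplots=True, layout=(2, 3), figsize=(12, 6))
plt.show()
```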
Identifying the outliers using IQR:
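The standard quartile computation, as a sketch:

```python
# First and third quartiles and the interquartile range for every column
Q1 = data.quantile(0.25)
Q3 = data.quantile(0.75)
IQR = Q3 - Q1
print(IQR)
```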
Using the above IQR scores, we can remove the outliers.
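Filtering with the usual 1.5 × IQR fences, for example:

```python
# Keep only rows where every value lies within 1.5 * IQR of the quartiles
data = data[~((data < (Q1 - 1.5 * IQR)) | (data > (Q3 + 1.5 * IQR))).any(axis=1)]
print(data.shape)
```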
Generating the box plots to check whether any outliers remain after the previous step:
Removing this outlier using the IQR again:
Confirming that there are no more outliers:
Identifying outliers by other means
Looking at the costs of the bags, we can see that there is a bag whose cost is 0. This cannot be the case given its attributes, so this row is an outlier as well. Removing it:
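A sketch of the filter (again assuming the target column is named "Cost"):

```python
# A cost of 0 is not plausible for these attributes, so drop that row
data = data[data["Cost"] > 0]
```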
Building Regression Models
Generating the data required for the Model:
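A typical split into features, target, and train/test sets; the split ratio and seed here are illustrative:

```python
from sklearn.model_selection import train_test_split

# Features and target (column names assumed to match the CSV header)
X = data[["Weight", "Weight1", "Length", "Height", "Width"]]
y = data["Cost"]

# Hold out a test set for evaluating all the models below
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
```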
Using Simple Linear Regression Model
Using all the features to determine the cost, we build a linear regression model with the LinearRegression class available in scikit-learn.
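A sketch of the fit-and-evaluate step that produces the metrics reported below:

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Fit a linear model on all five features
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
print("Mean squared error = %.2f" % mse)
print("Root mean squared error = %.2f" % np.sqrt(mse))
print("Variance score = %.2f" % r2_score(y_test, y_pred))
```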
Mean squared error = 9552.62
Root mean squared error = 97.74
Variance score = 0.90
Removing Height as a feature to check whether that gives a better model:
Mean squared error = 11678.68
Root mean squared error = 108.07
Variance score = 0.88
Comparing the MSE values, we can see that Height is an important feature in determining the cost: keeping it results in a higher variance score and a lower RMSE.
Using Simple Gradient Descent
Let’s try to build the model using the concepts of gradient descent and simple Python code.
Defining the cost function:
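One common formulation is the (half) mean squared error over the training set; a sketch, assuming the design matrix X carries a leading column of ones so that w[0] plays the role of the intercept w0:

```python
def cost_function(X, y, w):
    """Half mean squared error of predictions X @ w against targets y."""
    m = len(y)
    predictions = X.dot(w)
    return np.sum((predictions - y) ** 2) / (2 * m)
```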
Defining the function for Gradient Descent:
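A minimal batch gradient descent, reusing cost_function above; the learning rate and iteration count are tuned experimentally (note that on unscaled features a very small learning rate is usually needed for convergence):

```python
def gradient_descent(X, y, w, learning_rate, iterations):
    """Batch gradient descent: each step uses the full training set."""
    m = len(y)
    cost_history = []
    for _ in range(iterations):
        predictions = X.dot(w)
        gradient = X.T.dot(predictions - y) / m   # derivative of the cost w.r.t. w
        w = w - learning_rate * gradient
        cost_history.append(cost_function(X, y, w))
    return w, cost_history
```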
Running the algorithm:
Iterations used in the range (1,200000): 60000
w0: 372.087,
w1: 966.231,
w2: -503.842,
w3: -396.592,
w4: 161.433 and
w5: 108.256
Mean Squared Error of training data: 66481685.81
Mean squared error = 9552.83
Root mean squared error = 97.74
Variance score = 0.90
Visualizing the data:
The equation from the gradient descent algorithm is: Cost = w0 + w1*Weight + w2*Weight1 + w3*Length + w4*Height + w5*Width
Generating predicted y with the model parameters:
gd_y_pred = n[0][0] + X_test["Weight"]*n[1][0] + X_test["Weight1"]*n[2][0] + X_test["Length"]*n[3][0] + X_test["Height"]*n[4][0] + X_test["Width"]*n[5][0]
Comparing r2 score with increasing number of iterations:
Using Stochastic Gradient Descent
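Stochastic gradient descent differs from the batch version above only in that each weight update uses a single training sample; a sketch reusing cost_function (here one "iteration" is taken to mean one shuffled pass over the training set):

```python
def stochastic_gradient_descent(X, y, w, learning_rate, epochs):
    """Per-sample weight updates over shuffled passes of the training set."""
    m = len(y)
    cost_history = []
    for _ in range(epochs):
        for i in np.random.permutation(m):
            xi, yi = X[i], y[i]
            gradient = xi * (xi.dot(w) - yi)   # gradient from a single sample
            w = w - learning_rate * gradient
        cost_history.append(cost_function(X, y, w))
    return w, cost_history
```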
Iterations used in the range (1,5000): 254
w0: 368.499,
w1: 978.475,
w2: -485.503,
w3: -428.245,
w4: 169.184 and
w5: 100.895
Mean Squared Error of training data: 286144.02
Mean squared error = 9489.61
Root mean squared error = 97.41
Variance score = 0.90
Relating r2 score with iterations:
Using Mini Batch Gradient Descent
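Mini batch gradient descent sits between the two previous variants, updating the weights on small random batches; a sketch (the batch size is illustrative):

```python
def mini_batch_gradient_descent(X, y, w, learning_rate, epochs, batch_size=32):
    """Weight updates on random mini batches of the training set."""
    m = len(y)
    cost_history = []
    for _ in range(epochs):
        idx = np.random.permutation(m)
        for start in range(0, m, batch_size):
            batch = idx[start:start + batch_size]
            Xb, yb = X[batch], y[batch]
            gradient = Xb.T.dot(Xb.dot(w) - yb) / len(batch)
            w = w - learning_rate * gradient
        cost_history.append(cost_function(X, y, w))
    return w, cost_history
```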
Number of iterations in the range (1, 3000) that yielded the results: 89
w0: 370.914,
w1: 954.466,
w2: -492.885,
w3: -400.046,
w4: 166.143 and
w5: 105.642
Mean squared error = 9521.69
Root mean squared error = 97.58
Variance score = 0.90
Relating r2 score with iterations:
Using SGDRegressor
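scikit-learn's SGDRegressor packages the same idea; since it is sensitive to feature scale, a standardization step is usually put in front (a sketch; the original hyperparameters may differ):

```python
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize the features, then fit a linear model by stochastic gradient descent
sgd = make_pipeline(StandardScaler(), SGDRegressor(max_iter=1000, tol=1e-3))
sgd.fit(X_train, y_train)
sgd_pred = sgd.predict(X_test)

mse = mean_squared_error(y_test, sgd_pred)
print("Mean squared error = %.2f" % mse)
print("Root mean squared error = %.2f" % np.sqrt(mse))
print("Variance score = %.2f" % r2_score(y_test, sgd_pred))
```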
Mean squared error = 10673.77
Root mean squared error = 103.31
Variance score = 0.89
Summary
Comparing the results across the various algorithms:

| Model | Mean Squared Error | Root Mean Squared Error | R2 Score |
| --- | --- | --- | --- |
| Linear Regression (all features) | 9552.62 | 97.74 | 0.90 |
| Simple Gradient Descent | 9552.83 | 97.74 | 0.90 |
| Stochastic Gradient Descent | 9489.61 | 97.41 | 0.90 |
| Mini Batch Gradient Descent | 9521.69 | 97.58 | 0.90 |
| SGDRegressor | 10673.77 | 103.31 | 0.89 |
All of these algorithms have the potential to perform even better provided they are tuned further.
Here is a link to the code and the sample data used.