PM ProjectJune - 2021

PREDICTIVE
MODELLING
PGP-DSBA ONLINE JUNE_C 2021
ABHISHEK ROY
Table of Contents
Problem 1: Linear Regression
Data Dictionary
1.1 Exploratory Data analysis
Univariate analysis
Bivariate analysis
Multivariate analysis
1.2 Check for null values and checks for sub level combinations
Check for outliers and outliers’ treatment
1.3 Splitting the Dataset and create multiple models
1.4 Inference
Recommendations
Problem 2: Logistic Regression and LDA
Data Dictionary
2.1 Data Ingestion
Univariate analysis
Bivariate analysis
Multivariate analysis
2.2 and 2.3 Data Split and LR/LDA application along with Performance metrics
Logistic Regression
LDA Model
2.4 Inference
Recommendations
List of Figures
Figure 1: Data types information of the dataset
Figure 2: Categorical Data unique values
Figure 3: Box Plots of all the numeric variables
Figure 4: Distribution plot of all the numeric variables
Figure 5: Count plot of the categorical variables
Figure 6: Categorical variables w.r.t. price
Figure 7: color vs cut crosstab plot
Figure 8: cut vs clarity crosstab plot
Figure 9: Correlation plot of all the numeric variables w.r.t. price (individually)
Figure 10: Correlation matrix
Figure 11: Correlation matrix heatmap
Figure 12: Check for null values
Figure 13: Check for 0s in the dataset
Figure 14: Check for null values after removing them
Figure 15: Box plot to check outliers
Figure 16: Box plot after treating the outliers
Figure 17: Linear regression Model with depth variable
Figure 18: Linear regression Model without depth variable
Figure 19: Datatype information of the dataset
Figure 20: Unique count of the categories
Figure 21: Proportions of the target variable
Figure 22: Box Plot of the numeric variables
Figure 23: Distribution plot of the numeric variables
Figure 24: Count plot for the categorical variables
Figure 25: Bar plot of the numeric variables w.r.t. Holliday_Package
Figure 26: Bar plot b/w Holiday package and Salary
Figure 27: Scatter and LM plot b/w age and salary
Figure 28: Pair plot of all the numeric variables w.r.t Holiday package
Figure 29: Correlation matrix
Figure 30: Box plot of the numeric variables after treating the outliers
Figure 31: Grid Search best parameters
Figure 32: Confusion matrix of the Train and Test dataset
Figure 33: Logistic Regression metrics for Train and test data
Figure 34: AUC Score for train and test data
Figure 35: Values assigned to the categories
Figure 36: Datatype info for the dataset
Figure 37: Confusion Matrix for Train and test data
Figure 38: Accuracy, F1 score and confusion matrix based on the cut-off score
Figure 39: AUC score for the train and test data in LDA model
List of Tables
Table 1: Data Summary
Table 2: 1st 5 rows of the updated dataset after removing Unnamed:0 column
Table 3: color vs cut crosstab
Table 4: cut vs clarity crosstab
Table 5: Scaled dataset
Table 6: Dataset with Dummy variables
Table 7: Predictor variables
Table 8: 1st 5 rows of the dataset
Table 9: Data Summary of the dataset (numeric variables)
Table 10: 1st 5 rows of the dataset with dummy variables
Table 11: 1st 5 rows of the dataset
Table 12: LR and LDA data for Accuracy, AUC, Recall, Precision and F1 score
List of Equations
Equation 1: Linear regression equation of the model
Problem 1: Linear Regression
You are hired by Gem Stones Co Ltd, a cubic zirconia manufacturer. You are provided
with a dataset containing the prices and other attributes of almost 27,000 cubic zirconia stones (cubic
zirconia is an inexpensive diamond alternative with many of the same qualities as a diamond). The company
earns different profits in different price slots. You have to help the company predict the price of a stone
on the basis of the details given in the dataset, so that it can distinguish between more profitable and
less profitable stones and improve its profit share. Also, provide the 5 attributes that
are most important.
Data Dictionary:
Variable Name Description
Carat Carat weight of the cubic zirconia.
Cut Describes the cut quality of the cubic zirconia. Quality in increasing order: Fair, Good, Very Good, Premium, Ideal.
Color Colour of the cubic zirconia, with D being the worst and J the best.
Clarity Refers to the absence of inclusions and blemishes. (In order from worst to best in terms of avg price) IF, VVS1, VVS2, VS1, VS2, SI1, SI2, I1
Depth The height of the cubic zirconia, measured from the culet to the table, divided by its average girdle diameter.
Table The width of the cubic zirconia's table expressed as a percentage of its average diameter.
Price The price of the cubic zirconia.
X Length of the cubic zirconia in mm.
Y Width of the cubic zirconia in mm.
Z Height of the cubic zirconia in mm.
1.1 Exploratory Data analysis

Read the data and do exploratory data analysis. Describe the data briefly. (Check the null
values, Data types, shape, EDA, duplicate values). Perform Univariate and Bivariate
Analysis.
First, we import the necessary libraries and read the CSV file (cubic_zirconia.csv) into Python for
further data analysis.
Figure 1: Data types information of the dataset
Here, we can see that there are 11 variables and 26967 entries. The initial analysis shows that
there are some missing values and that the dataset consists of both numeric and categorical variables.
Categorical: cut, color, clarity
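The initial checks described above can be sketched roughly as follows. The small inline DataFrame is a made-up stand-in for cubic_zirconia.csv (with the real file one would call pd.read_csv("cubic_zirconia.csv") instead), so the values are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for pd.read_csv("cubic_zirconia.csv"); values are illustrative.
df = pd.DataFrame({
    "carat": [0.30, 0.33, 0.90, 0.42],
    "cut": ["Ideal", "Premium", "Very Good", "Ideal"],
    "color": ["E", "G", "E", "F"],
    "clarity": ["SI1", "IF", "VVS2", "VS1"],
    "depth": [62.1, 60.8, np.nan, 61.6],
    "price": [499, 984, 6289, 1082],
})

n_rows, n_cols = df.shape                      # number of entries and variables
null_counts = df.isnull().sum()                # missing values per column
n_duplicates = int(df.duplicated().sum())      # fully duplicated rows
# Columns that are not numeric are treated as categorical here.
categorical = [c for c in df.columns if not pd.api.types.is_numeric_dtype(df[c])]

print(df.dtypes)
print(null_counts)
print(n_duplicates, categorical)
```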
Table 1: Data Summary

The mean and the median of carat, table, x and y are close to each other.
We did a check to see if there are any duplicate records. There are no duplicate records in the dataset.
We will now check for the unique values of the categorical data.
Figure 2: Categorical Data unique values
• There are 5 unique types of cut, with Ideal being the most preferred cut.
• SI1 is the most common clarity grade, which is a good grade.
• With D being the worst colour and J the best, G is in the middle, and its count is also high.
To simplify the analysis, we dropped the Unnamed:0 column, which is not needed for the analysis.
Table 2: 1st 5 rows of the updated dataset after removing Unnamed:0 column
Univariate analysis
Figure 3: Box Plots of all the numeric variables
• All the variables have some outliers. x, y and z have fewer outliers than the others.
Figure 4: Distribution plot of all the numeric variables
• carat, y, z and price are positively skewed. The skewness may arise because the stones are always
made in specific shapes.
• depth is normally distributed and table is almost normally distributed.
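Skewness can also be checked numerically rather than only from the plots. The sketch below uses synthetic columns (the names price_like and depth_like are hypothetical, not the project data) to show how a right-skewed and a roughly symmetric variable differ under pandas' skew().

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic illustration: one right-skewed column and one roughly symmetric column.
sample = pd.DataFrame({
    "price_like": rng.lognormal(mean=0.0, sigma=1.0, size=5000),
    "depth_like": rng.normal(loc=61.7, scale=1.4, size=5000),
})

# Positive values indicate a long right tail; values near 0 indicate symmetry.
skewness = sample.skew()
print(skewness.round(2))
```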
Bivariate analysis
Figure 5: Count plot of the categorical variables
• There are 5 unique types of cut, with Ideal being the most preferred cut.
• SI1 is the clarity grade that seems to be preferred by people.
• With D being the worst colour and J the best, G is in the middle, and its count is also high.
Figure 6: Categorical variables w.r.t. price
• Ideal is probably the most preferred cut because those stones are priced lower than the other
cuts.
• G is priced in the middle of the seven colours, whereas the price of J, the worst colour,
seems too high.
• SI1 seems to be the preferred clarity, since its price is moderate compared to the others,
whereas VVS1 is also priced lower but is the least preferred.
Relationships between the categorical variables
Table 3: color vs cut crosstab
Figure 7: color vs cut crosstab plot
Table 4: cut vs clarity crosstab
Figure 8: cut vs clarity crosstab plot
Correlations
Figure 9: Correlation plot of all the numeric variables w.r.t. price (individually)
Multivariate analysis
Figure 10: Correlation matrix
Figure 11: Correlation matrix heatmap
• x, y, z and price are highly correlated with carat.
• The matrix shows the presence of multicollinearity in the dataset.
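One way to quantify the multicollinearity seen in a correlation matrix is the variance inflation factor (VIF), which equals the diagonal of the inverse correlation matrix of the predictors. The sketch below uses synthetic data in which hypothetical x and y dimensions both grow with carat; it is an illustration, not a computation on the project data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
carat = rng.uniform(0.2, 2.0, size=500)
# Hypothetical dimensions that grow with carat, plus one unrelated column.
data = pd.DataFrame({
    "carat": carat,
    "x": 6.0 * np.cbrt(carat) + rng.normal(0, 0.05, size=500),
    "y": 6.0 * np.cbrt(carat) + rng.normal(0, 0.05, size=500),
    "table": rng.normal(57, 2, size=500),
})

corr = data.corr()
# VIF of each predictor is the matching diagonal entry of the inverse correlation matrix.
vif = pd.Series(np.diag(np.linalg.inv(corr.values)), index=corr.columns)

print(corr.round(2))
print(vif.round(1))
```

A VIF well above 10 is a common rule-of-thumb signal that a predictor is largely explained by the others.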
1.2 Check for null values and checks for sub level combinations
Impute null values if present, also check for the values which are equal to zero. Do they
have any meaning or do we need to change them or drop them? Check for the possibility
of combining the sub levels of a ordinal variables and take actions accordingly. Explain
why you are combining these sub levels with appropriate reasoning.
In the beginning, we saw that there are some null values in the dataset. We will check for the number
of null values in the dataset.
Figure 12: Check for null values
Figure 13: Check for 0s in the dataset
We can drop the 697 null values since this will not affect the analysis.
Certain rows have zero values for x, y or z. These are the dimensions of a stone, so a value of zero
cannot be used in the model.
As there are very few such rows, we can drop them, since they carry no meaning for model building.
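A minimal sketch of this cleaning step, on a small made-up DataFrame (the values are illustrative and not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Illustrative rows: one has a missing depth, one has a zero dimension.
df = pd.DataFrame({
    "carat": [0.30, 0.71, 1.01, 0.50],
    "depth": [62.1, np.nan, 61.0, 60.5],
    "x": [4.27, 5.70, 0.00, 5.10],
    "y": [4.29, 5.74, 6.40, 5.12],
    "z": [2.66, 3.50, 3.90, 3.10],
})

# Count rows where any physical dimension is zero.
n_zero_rows = int((df[["x", "y", "z"]] == 0).any(axis=1).sum())

clean = df.dropna()                                          # drop rows with null values
clean = clean[(clean[["x", "y", "z"]] != 0).all(axis=1)]     # drop rows with a zero dimension
clean = clean.reset_index(drop=True)
print(n_zero_rows, clean.shape)
```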
Figure 14: Check for null values after removing them
There are now 26958 rows and 10 columns after removing the null values as well as the rows having
0s.
Scaling puts all numeric variables on a comparable scale, which makes the coefficients easier to compare.
For ordinary linear regression it does not change the model score (R2), although the coefficients and the
intercept are then expressed in scaled units. We scale this dataset so that most values fall roughly in the
range -3 to +3.
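The effect of scaling on a linear regression can be checked with a small experiment. The sketch below uses synthetic data and scales only the predictors with scikit-learn's StandardScaler (an assumption; the report does not state which scaler was used). The R2 score is unchanged, while the coefficients are not.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
# Two synthetic predictors on very different scales (carat-like and table-like).
X = rng.uniform([0.2, 50.0], [2.0, 70.0], size=(300, 2))
y = 4000.0 * X[:, 0] - 20.0 * X[:, 1] + rng.normal(0, 100, size=300)

X_scaled = StandardScaler().fit_transform(X)   # z-score: (value - mean) / std

raw = LinearRegression().fit(X, y)
scaled = LinearRegression().fit(X_scaled, y)

r2_raw = raw.score(X, y)
r2_scaled = scaled.score(X_scaled, y)
print(round(r2_raw, 4), round(r2_scaled, 4))   # same score
print(raw.coef_, scaled.coef_)                 # different coefficients
```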
Table 5: Scaled dataset
Check for outliers and outliers’ treatment

Previously we saw that there are outliers present in the dataset
Figure 15: Box plot to check outliers

We will now treat the outliers and then check the box plot again.
Figure 16: Box plot after treating the outliers
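The report does not spell out the treatment used. One common approach, assumed here for illustration, is to cap values at the box-plot whiskers (1.5 times the interquartile range beyond the quartiles). The helper name cap_outliers_iqr is hypothetical.

```python
import pandas as pd

def cap_outliers_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR], the usual box-plot whisker limits."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])
capped = cap_outliers_iqr(values)
print(capped.tolist())
```

Capping keeps every row in the dataset, unlike dropping the outlying rows.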
1.3 Splitting the Dataset and create multiple models

Encode the data (having string values) for Modelling. Split the data into train and test
(70:30). Apply Linear regression using scikit learn. Perform checks for significant
variables using appropriate method from statsmodel. Create multiple models and check
the performance of Predictions on Train and Test sets using R-square, RMSE & Adj R-
square. Compare these models and select the best one with appropriate reasoning.
To begin, we need to first create the dummy variables for the categorical variables.
Table 6: Dataset with Dummy variables
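A minimal sketch of dummy encoding with pandas, on a small made-up DataFrame. Using drop_first=True is an assumption, chosen because the final equation has no cut_Fair or color_D term, which is consistent with dropping the first level of each variable.

```python
import pandas as pd

df = pd.DataFrame({
    "carat": [0.30, 0.90, 0.42, 1.10],
    "cut": ["Fair", "Ideal", "Good", "Ideal"],
    "color": ["D", "E", "J", "E"],
})

# One 0/1 column per category level; the first level of each variable is the baseline.
encoded = pd.get_dummies(df, columns=["cut", "color"], drop_first=True)
print(encoded.columns.tolist())
```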

We will now separate the target and the predictor variables from the dataset. Here, price is the target
variable.
Table 7: Predictor variables

We will now be splitting the dataset into train and test dataset taking test size as 30%.
We will now check the coefficient values for train dataset of the predictor variables.
Here, we can see that the coefficient for carat is much larger than the others.
We also checked the regression model score for the train and test data, and the two scores are
similar.
Regression model score (R2) for Train dataset = 0.942
Regression model score (R2) for Test dataset = 0.938
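The split, fit and scoring steps can be sketched as below. The data is synthetic, the random_state is arbitrary, and adjusted_r2 is a hypothetical helper that applies the standard adjusted R2 formula; none of this reproduces the scores reported above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 3))                  # synthetic, already-scaled predictors
y = 1.1 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(0, 0.2, size=1000)

# 70:30 train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)
model = LinearRegression().fit(X_train, y_train)

def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

r2_train = model.score(X_train, y_train)
r2_test = model.score(X_test, y_test)
rmse_test = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
adj_r2_test = adjusted_r2(r2_test, *X_test.shape)
print(model.coef_, r2_train, r2_test, rmse_test, adj_r2_test)
```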
To proceed with the linear regression, we will now join the predictor and target variables of the
train dataset.
Model including the depth variable
Figure 17: Linear regression Model with depth variable

Avg variance between predicted and actual = 0.2164
Model without the depth variable
Figure 18: Linear regression Model without depth variable

MSE (Mean Square Error) for the 1st linear regression model = 0.043
RMSE (Root Mean Square Error) = 0.207
1.4 Inference
Based on these predictions, what are the business insights and recommendations?
• As per the business problem, the EDA shows that the Ideal cut brought profits to the company.
• The most profitable colours are H, I and J.
• The Ideal, Premium and Very Good cuts were bringing profits, whereas Fair and Good
were not.
• The model captures 95% of the variation in price, which is explained by the
predictors in the training set.
• Dropping the depth column in a later iteration gave better results.
price = (-0.76) * Intercept + (1.1) * carat + (-0.01) * table + (-0.32) * x + (0.28) * y + (-0.11) * z
+ (0.1) * cut_Good + (0.15) * cut_Ideal + (0.15) * cut_Premium + (0.13) * cut_Very_Good
+ (-0.05) * color_E + (-0.06) * color_F + (-0.1) * color_G + (-0.21) * color_H + (-0.32) * color_I + (-0.47) * color_J
+ (1.0) * clarity_IF + (0.64) * clarity_SI1 + (0.43) * clarity_SI2 + (0.84) * clarity_VS1 + (0.77) * clarity_VS2
+ (0.94) * clarity_VVS1 + (0.93) * clarity_VVS2
Equation 1: Linear regression equation of the model
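As an illustration of how Equation 1 can be applied, the sketch below evaluates it for one hypothetical stone. The coefficients are copied from the equation; the function name predict_scaled_price and the example inputs are assumptions, and inputs and output are taken to be in the scaled units used for modelling.

```python
INTERCEPT = -0.76
COEFFICIENTS = {
    "carat": 1.1, "table": -0.01, "x": -0.32, "y": 0.28, "z": -0.11,
    "cut_Good": 0.1, "cut_Ideal": 0.15, "cut_Premium": 0.15, "cut_Very_Good": 0.13,
    "color_E": -0.05, "color_F": -0.06, "color_G": -0.1, "color_H": -0.21,
    "color_I": -0.32, "color_J": -0.47,
    "clarity_IF": 1.0, "clarity_SI1": 0.64, "clarity_SI2": 0.43, "clarity_VS1": 0.84,
    "clarity_VS2": 0.77, "clarity_VVS1": 0.94, "clarity_VVS2": 0.93,
}

def predict_scaled_price(features):
    """Evaluate Equation 1; any feature not listed in `features` is treated as 0."""
    return INTERCEPT + sum(COEFFICIENTS[name] * value for name, value in features.items())

# Hypothetical stone: scaled carat of 1.0, Ideal cut, SI1 clarity, baseline colour.
example = {"carat": 1.0, "cut_Ideal": 1, "clarity_SI1": 1}
print(round(predict_scaled_price(example), 2))
```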
Recommendations
• The Ideal, Premium and Very Good cut types are the most profitable. We can use this in marketing
to attract more customers.
• The next important attribute is the clarity of the stone. The clearer the stone, the higher
the profit.
Problem 2: Logistic Regression and LDA
You are hired by a tour and travel agency which deals in selling holiday packages. You are provided
details of 872 employees of a company. Among these employees, some opted for the package and
some didn't. You have to help the company in predicting whether an employee will opt for the package
or not on the basis of the information given in the data set. Also, find out the important factors on the
basis of which the company will focus on particular employees to sell their packages.
Data Dictionary:
Variable Name Description
Holiday_Package Opted for Holiday Package yes/no?
Salary Employee salary
age Age in years
edu Years of formal education
no_young_children The number of young children (younger than 7 years)
no_older_children Number of older children
foreign foreigner Yes/No
2.1 Data Ingestion

Read the dataset. Do the descriptive statistics and do null value condition check, write an
inference on it? Perform Univariate and Bivariate Analysis. Do exploratory data analysis.
First, we import the necessary libraries and read the CSV file (Holiday_Package.csv) into Python for
further data analysis.
Figure 19: Datatype information of the dataset

Here we can see that there are 872 records and 8 columns in total, with no null values. Holiday_Package
and foreign are categorical variables; the rest are numeric.
Table 8: 1st 5 rows of the dataset
Here we can see that Unnamed: 0 only holds the serial number of the responses, which is
not required for our analysis, so we will remove it. We are now left with 2 categorical and 5 numeric
variables.
Table 9: Data Summary of the dataset (numeric variables)

We will now check the unique count of the categories of the categorical variables
Figure 20: Unique count of the categories

We will now check the proportions of the target variable i.e. Holiday_Package
Figure 21: Proportions of the target variable
Univariate analysis
Figure 22: Box Plot of the numeric variables
• Age has no outliers

• educ and no_older_children have very few outliers.
Figure 23: Distribution plot of the numeric variables
• Salary is positively skewed.

• Age is normally distributed as per the plot above and educ is somewhat normally distributed.
Bivariate analysis
Figure 24: Count plot for the categorical variables
Figure 25: Bar plot of the numeric variables w.r.t. Holliday_Package
• People with salary less than 150000 opted for the Holiday packages
Figure 26: Bar plot b/w Holiday package and Salary
Figure 27: Scatter and LM plot b/w age and salary
• Employees aged between 50 and 60 seem not to opt for the holiday package.
• Employees aged 30 to 50 with a salary below 50000 have opted for the holiday package.
Multivariate analysis
Figure 28: Pair plot of all the numeric variables w.r.t Holiday package
• The distributions of Salary, no_young_children and no_older_children overlap between employees who opted for the package and those who did not.

• There seems to be no strong correlation between the variables.
Figure 29: Correlation matrix
Previously we saw that there were some outliers in the variables. We will now treat the outliers and
look at the output.
Figure 30: Box plot of the numeric variables after treating the outliers
2.2 and 2.3 Data Split and LR/LDA application along with Performance metrics
Do not scale the data. Encode the data (having string values) for Modelling. Data Split:
Split the data into train and test (70:30). Apply Logistic Regression and LDA (linear
discriminant analysis).
Performance Metrics: Check the performance of Predictions on Train and Test sets
using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each
model. Final Model: Compare both the models and write inference which model is
best/optimized.
Logistic Regression
Before proceeding with the analysis, we will create the dummy variables for the categorical variables of
the dataset
Table 10: 1st 5 rows of the dataset with dummy variables

We will now separate the predictor and target variables for further analysis. Here, Holliday_Package
is the target variable.
Now, we will split the data into train and test data with test data size as 30%.
Grid search is used to find the optimal solver and its parameters for the logistic
regression.
Figure 31: Grid Search best parameters

We will also calculate y-predict for the analysis.
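A grid search over logistic regression settings might look like the sketch below. The data is synthetic (generated with make_classification), and the parameter grid, scoring metric and number of folds are illustrative assumptions; the report's actual grid is shown only in Figure 31.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the encoded holiday-package data (872 rows, 6 predictors).
X, y = make_classification(n_samples=872, n_features=6, n_informative=4, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

# Hypothetical grid: candidate solvers, regularisation strengths and tolerances.
param_grid = {
    "solver": ["lbfgs", "liblinear"],
    "C": [0.1, 1.0, 10.0],
    "tol": [1e-4, 1e-5],
}
grid = GridSearchCV(LogisticRegression(max_iter=10000), param_grid, cv=3, scoring="f1")
grid.fit(X_train, y_train)

best_model = grid.best_estimator_
y_pred_train = best_model.predict(X_train)
y_pred_test = best_model.predict(X_test)
print(grid.best_params_)
```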
Confusion Matrix
Figure 32: Confusion matrix of the Train and Test dataset
• The accuracy on the test dataset is higher than on the train dataset.
Figure 33: Logistic Regression metrics for Train and test data
• The recall is the same for both, whereas the precision and F1 score are higher for the test data.
Figure 34: AUC Score for train and test data
• The AUC score for the test data is slightly more than that of the train data
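The metrics used above can be computed with scikit-learn as sketched below. The four labels and predicted probabilities are made up purely to show the calls, and a 0.5 cut-off is assumed for turning probabilities into class predictions.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

y_true = np.array([0, 0, 1, 1])               # illustrative actual classes
y_prob = np.array([0.10, 0.40, 0.35, 0.80])   # illustrative predicted probabilities of class 1
y_pred = (y_prob >= 0.5).astype(int)          # class predictions at a 0.5 cut-off

cm = confusion_matrix(y_true, y_pred)         # rows: actual, columns: predicted
accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
auc = roc_auc_score(y_true, y_prob)           # AUC uses the probabilities, not the labels
print(cm, accuracy, precision, recall, f1, auc)
```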
LDA Model
Here we will take the original dataset for the analysis. (Post dropping the Unnamed: 0 column)
Table 11: 1st 5 rows of the dataset

Unlike in LR model where we created the dummy variables for the categorical variables, in LDA model,
we will assign a numeric value to the categories of the categorical variable
Figure 35: Values assigned to the categories
Figure 36: Datatype info for the dataset

Here, we can see that all the variables are now numeric
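Assigning numeric codes to the categories can be done with pandas categorical codes, as sketched below on a made-up DataFrame. Whether the report used this exact call is an assumption; the column name Holliday_Package follows the spelling used in the report's figures.

```python
import pandas as pd

df = pd.DataFrame({
    "Holliday_Package": ["no", "yes", "no", "yes"],
    "foreign": ["no", "no", "yes", "no"],
    "Salary": [48412, 37207, 58022, 66503],   # illustrative values
})

# Replace each category by an integer code (levels are sorted, so no -> 0, yes -> 1).
for col in ["Holliday_Package", "foreign"]:
    df[col] = pd.Categorical(df[col]).codes

print(df.dtypes)
```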
We will now separate the predictor and target variables for further analysis. Here, Holliday_Package
is the target variable.
Now, we will split the data into train and test data with test data size as 30%.
We will now check for the confusion matrix
Figure 37: Confusion Matrix for Train and test data
• The accuracy for train and test data remains the same as that of the LR model
We will now check the accuracy, F1 score and confusion matrix based on the cutoff score
Figure 38: Accuracy, F1 score and confusion matrix based on the cut-off score
• Here, we can see that as the cut-off score increases up to 0.6, the accuracy increases; from a
cut-off score of 0.7 onwards, the accuracy decreases.
• As the cut-off score increases, the F1 score decreases and eventually becomes 0.
• The True Positive count tends to 0 as the cut-off score increases.
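The cut-off analysis can be sketched as a loop over thresholds applied to the LDA class probabilities. The data below is synthetic, so the resulting numbers will not match Figure 38; the range of cut-offs (0.1 to 0.9) is an assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded holiday-package data.
X, y = make_classification(n_samples=872, n_features=6, n_informative=4, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

lda = LinearDiscriminantAnalysis().fit(X_train, y_train)
prob_train = lda.predict_proba(X_train)[:, 1]   # probability of class 1

results = []
for cutoff in np.linspace(0.1, 0.9, 9):
    pred = (prob_train > cutoff).astype(int)    # classify as 1 only above the cut-off
    results.append((
        round(float(cutoff), 1),
        accuracy_score(y_train, pred),
        f1_score(y_train, pred, zero_division=0),
    ))

for cutoff, acc, f1 in results:
    print(cutoff, round(acc, 3), round(f1, 3))
```

Raising the cut-off makes the model predict class 1 less often, which is why the true positive count shrinks at high cut-offs.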
Figure 39: AUC score for the train and test data in LDA model
The AUC score remains the same as that of the LR model.
Table 12: LR and LDA data for Accuracy, AUC, Recall, Precision and F1 score
We see here that the values for the LR and LDA models remain similar.
2.4 Inference
Based on these predictions, what are the insights and recommendations?
• We made predictions with both logistic regression and linear discriminant analysis, and the
results of both are the same.
• As per the EDA, people aged above 50 do not opt for holiday packages.
• People ranging from the age 30 to 50 with salary less than 50000 generally opt for holiday
packages.
• Salary, age and educ are the important factors deciding the predictions.
Recommendations
• To improve the uptake of holiday packages among people aged above 50, we can provide packages for places
that older people can relate to and visit, such as religious places, and foreign trips at low prices.
• A vacation package can be provided to the people earning more than 150000 and for those who
have a greater number of older children.
