
PREDICTIVE MODELLING
PGP-DSBA ONLINE JUNE_C 2021

ABHISHEK ROY
Table of Contents
Problem 1: Linear Regression
Data Dictionary:
1.1 Exploratory Data analysis
Univariate analysis
Bivariate analysis
Multivariate analysis
1.2 Check for null values and checks for sub level combinations
Check for outliers and outliers’ treatment
1.3 Splitting the Dataset and create multiple models
1.4 Inference
Recommendations
Problem 2: Logistic Regression and LDA
Data Dictionary:
2.1 Data Ingestion
Univariate analysis
Bivariate analysis
Multivariate analysis
2.2 and 2.3 Data Split and LR/LDA application along with Performance metrics
Logistic Regression
LDA Model
2.4 Inference
Recommendations

List of Figures
Figure 1: Data types information of the dataset
Figure 2: Categorical Data unique values
Figure 3: Box Plots of all the numeric variables
Figure 4: Distribution plot of all the numeric variables
Figure 5: Count plot of the categorical variables
Figure 6: Categorical variables w.r.t. price
Figure 7: color vs cut crosstab plot
Figure 8: cut vs clarity crosstab plot
Figure 9: Correlation plot of all the numeric variables w.r.t. price (individually)
Figure 10: Correlation matrix
Figure 11: Correlation matrix heatmap
Figure 12: Check for null values
Figure 13: Check for 0s in the dataset
Figure 14: Check for null values after removing them
Figure 15: Box plot to check outliers
Figure 16: Box plot after treating the outliers
Figure 17: Linear regression Model with depth variable
Figure 18: Linear regression Model without depth variable
Figure 19: Datatype information of the dataset
Figure 20: Unique count of the categories
Figure 21: Proportions of the target variable
Figure 22: Box Plot of the numeric variables
Figure 23: Distribution plot of the numeric variables
Figure 24: Count plot for the categorical variables
Figure 25: Bar plot of the numeric variables w.r.t. Holliday_Package
Figure 26: Bar plot b/w Holiday package and Salary
Figure 27: Scatter and LM plot b/w age and salary
Figure 28: Pair plot of all the numeric variables w.r.t Holiday package
Figure 29: Correlation matrix
Figure 30: Box plot of the numeric variables after treating the outliers
Figure 31: Grid Search best parameters
Figure 32: Confusion matrix of the Train and Test dataset
Figure 33: Logistic Regression metrics for Train and test data
Figure 34: AUC Score for train and test data
Figure 35: Values assigned to the categories
Figure 36: Datatype info for the dataset
Figure 37: Confusion Matrix for Train and test data
Figure 38: Accuracy, F1 score and confusion matrix based on the cut-off score
Figure 39: AUC score for the train and test data in LDA model

List of Tables
Table 1: Data Summary
Table 2: 1st 5 rows of the updated dataset after removing Unnamed:0 column
Table 3: color vs cut crosstab
Table 4: cut vs clarity crosstab
Table 5: Scaled dataset
Table 6: Dataset with Dummy variables
Table 7: Predictor variables
Table 8: 1st 5 rows of the dataset
Table 9: Data Summary of the dataset (numeric variables)
Table 10: 1st 5 rows of the dataset with dummy variables
Table 11: 1st 5 rows of the dataset
Table 12: LR and LDA data for Accuracy, AUC, Recall, Precision and F1 score

List of Equations
Equation 1: Linear regression equation of the model

Problem 1: Linear Regression
You are hired by a company, Gem Stones Co Ltd, which is a cubic zirconia manufacturer. You are provided
with a dataset containing the prices and other attributes of almost 27,000 cubic zirconia stones (cubic
zirconia is an inexpensive diamond alternative with many of the same qualities as a diamond). The company
earns different profits in different price slots. You have to help the company predict the price of a stone
on the basis of the details given in the dataset, so that it can distinguish between more profitable and
less profitable stones and improve its profit share. Also, provide the company with the 5 attributes that
are most important.

Data Dictionary:
Variable Name Description
Carat Carat weight of the cubic zirconia.
Cut Describes the cut quality of the cubic zirconia. Quality in increasing order: Fair, Good, Very Good, Premium, Ideal.
Color Colour of the cubic zirconia, with D being the worst and J the best.
Clarity Clarity refers to the absence of inclusions and blemishes. (In order from worst to best in terms of avg price) IF, VVS1, VVS2, VS1, VS2, SI1, SI2, I1.
Depth The height of the cubic zirconia, measured from the culet to the table, divided by its average girdle diameter.
Table The width of the cubic zirconia's table expressed as a percentage of its average diameter.
Price The price of the cubic zirconia.
X Length of the cubic zirconia in mm.
Y Width of the cubic zirconia in mm.
Z Height of the cubic zirconia in mm.

1.1 Exploratory Data analysis


Read the data and do exploratory data analysis. Describe the data briefly. (Check the null
values, Data types, shape, EDA, duplicate values). Perform Univariate and Bivariate
Analysis.
First, we import all the necessary libraries and then read the CSV file (cubic_zirconia.csv) into Python for
further data analysis.
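The initial checks described in this section (shape, data types, null values, duplicates) can be sketched as follows. This is only an illustration: the rows in TOY_CSV are made up to mimic the column layout of cubic_zirconia.csv, and the helper name inspect is hypothetical, not something taken from the original notebook.

```python
import io
import pandas as pd

# Made-up rows that only mimic the column layout of cubic_zirconia.csv.
TOY_CSV = """Unnamed: 0,carat,cut,color,clarity,depth,table,x,y,z,price
1,0.30,Ideal,E,SI1,62.1,58.0,4.27,4.29,2.66,499
2,0.33,Premium,G,IF,60.8,58.0,4.42,4.46,2.70,984
3,0.90,Very Good,E,VVS2,,60.0,6.04,6.12,3.78,6289
4,0.42,Ideal,F,VS1,61.6,56.0,4.82,4.80,2.96,1082
"""

def inspect(df):
    """Collect the basic checks: shape, nulls per column, duplicates, non-numeric columns."""
    return {
        "shape": df.shape,
        "null_counts": df.isnull().sum().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
        "categorical": df.select_dtypes(exclude="number").columns.tolist(),
    }

# With the real file this would be pd.read_csv("cubic_zirconia.csv").
df = pd.read_csv(io.StringIO(TOY_CSV))
report = inspect(df)
print(report)
```

On the real file, df.info() and df.describe() would give the data type listing and the summary table referred to in this section.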

Figure 1: Data types information of the dataset

Here, we can see that there are 11 variables with 26967 entries. The initial analysis shows that
there are some missing values and that the dataset consists of both numeric and categorical variables.
Categorical: cut, color, clarity

Table 1: Data Summary


The mean and the median of carat, table, x and y are close to each other.
We checked for duplicate records. There are no duplicate records in the dataset.
We will now check the unique values of the categorical data.

Figure 2: Categorical Data unique values

• There are 5 unique types of cut, with Ideal being the most preferred cut.
• SI1 is the most common clarity grade among the stones.
• With D being the worst colour and J the best, G sits in the middle and also has a high count.
We have dropped the Unnamed:0 column, since it adds nothing to the analysis and would only get in the
way.

Table 2: 1st 5 rows of the updated dataset after removing Unnamed:0 column

Univariate analysis

Figure 3: Box Plots of all the numeric variables

• All the variables have some outliers. x, y and z have fewer outliers than the others.

Figure 4: Distribution plot of all the numeric variables

• carat, y, z and price are positively skewed. The skewness may be because the stones are always
cut in specific shapes.
• depth is normally distributed and table is almost normally distributed.
Bivariate analysis

Figure 5: Count plot of the categorical variables

• There are 5 unique types of cut, with Ideal being the most preferred cut.
• SI1 is the most frequent clarity grade, which suggests it is preferred by people.
• With D being the worst colour and J the best, G sits in the middle and also has a high count.

Figure 6: Categorical variables w.r.t. price

• Ideal is probably the most preferred cut because those stones are priced lower than the other
cuts.
• G is priced in the middle of the seven colours, whereas the price of J, the worst colour,
seems too high.
• SI1 clarity seems to be preferred by people because its price is moderate compared to the
others, whereas VVS1 is also priced low but is the least
preferred.

Relationships between the categorical variables

Table 3: color vs cut crosstab

Figure 7: color vs cut crosstab plot

Table 4: cut vs clarity crosstab

Figure 8: cut vs clarity crosstab plot

Correlations

Figure 9: Correlation plot of all the numeric variables w.r.t. price (individually)

Multivariate analysis

Figure 10: Correlation matrix

Figure 11: Correlation matrix heatmap

• x, y, z and price are highly correlated with carat.
• The matrix shows the presence of multicollinearity in the dataset.

1.2 Check for null values and checks for sub level combinations
Impute null values if present, also check for the values which are equal to zero. Do they
have any meaning or do we need to change them or drop them? Check for the possibility
of combining the sub levels of a ordinal variables and take actions accordingly. Explain
why you are combining these sub levels with appropriate reasoning.
In the beginning, we saw that there are some null values in the dataset. We will check for the number
of null values in the dataset.

Figure 12: Check for null values

Figure 13: Check for 0s in the dataset

We can drop the 697 null values, since removing them will not materially affect the analysis.
Some rows contain zeros in x, y or z. These are the physical dimensions of a stone, so a value of zero
cannot be valid input for the model.
As there are very few such rows, we can drop them, since they carry no meaning for model building.
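A minimal sketch of this cleaning step, assuming that rows with any null are dropped first and that a zero in x, y or z marks an invalid row. The toy values and the helper name drop_nulls_and_zero_dims are illustrative only.

```python
import numpy as np
import pandas as pd

def drop_nulls_and_zero_dims(df, dim_cols=("x", "y", "z")):
    """Drop rows with any null, then rows where a physical dimension equals 0."""
    cleaned = df.dropna()
    has_zero_dim = (cleaned[list(dim_cols)] == 0).any(axis=1)
    return cleaned.loc[~has_zero_dim].reset_index(drop=True)

# Made-up rows: one with a missing depth, one with zero dimensions.
toy = pd.DataFrame({
    "carat": [0.30, 0.71, 1.01, 0.50],
    "depth": [62.1, np.nan, 61.0, 60.5],
    "x": [4.27, 5.70, 0.0, 5.10],
    "y": [4.29, 5.75, 6.40, 5.12],
    "z": [2.66, 3.50, 0.0, 3.10],
})

cleaned = drop_nulls_and_zero_dims(toy)
print(cleaned)
```

Counting the zeros before dropping, for example with (df[["x", "y", "z"]] == 0).sum(), gives the kind of check shown in Figure 13.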

Figure 14: Check for null values after removing them
There are now 26958 rows and 10 columns after removing the null values as well as the rows having
0s.
Scaling does not change the model score (R2), although it does change the coefficients and the intercept,
which become comparable across variables once every predictor is on the same scale. Scaling also leaves
the correlations between variables unchanged, so it does not by itself remove multicollinearity. We scale
this dataset to put the numeric variables on a uniform scale, ranging roughly from -3 to +3.

Table 5: Scaled dataset
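The report does not say which scaling method was used. The sketch below assumes a plain z-score (mean 0, standard deviation 1) on made-up values, and also checks that the correlation matrix is the same before and after scaling.

```python
import numpy as np
import pandas as pd

def zscore(df):
    """Standardise each column to mean 0 and population standard deviation 1."""
    return (df - df.mean()) / df.std(ddof=0)

# Made-up numeric columns standing in for the real predictors.
toy = pd.DataFrame({
    "carat": [0.3, 0.5, 0.7, 1.0, 1.5],
    "x": [4.3, 5.1, 5.7, 6.4, 7.3],
    "table": [58.0, 56.0, 57.0, 60.0, 59.0],
})

scaled = zscore(toy)
corr_before = toy.corr()
corr_after = scaled.corr()

print(scaled.round(2))
print(np.allclose(corr_before, corr_after))  # correlations are unaffected by scaling
```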

Check for outliers and outliers’ treatment


Previously we saw that there are outliers present in the dataset

Figure 15: Box plot to check outliers


We will now remove the outliers and then check the plot

Figure 16: Box plot after treating the outliers
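The treatment method is not stated in the report. One common choice, shown here purely as an assumption, is to cap values at the box-plot whiskers, i.e. at Q1 - 1.5 * IQR and Q3 + 1.5 * IQR. The helper name cap_outliers_iqr is hypothetical.

```python
import pandas as pd

def cap_outliers_iqr(series, k=1.5):
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR], the usual box-plot whiskers."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Made-up values with one obvious outlier.
toy = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])
capped = cap_outliers_iqr(toy)
print(capped.tolist())
```

Capping keeps every row, whereas deleting outlier rows would shrink the dataset; which of the two the report applied cannot be told from the text.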

1.3 Splitting the Dataset and create multiple models


Encode the data (having string values) for Modelling. Split the data into train and test
(70:30). Apply Linear regression using scikit learn. Perform checks for significant
variables using appropriate method from statsmodel. Create multiple models and check
the performance of Predictions on Train and Test sets using R-square, RMSE & Adj R-
square. Compare these models and select the best one with appropriate reasoning.
To begin, we need to first create the dummy variables for the categorical variables.

Table 6: Dataset with Dummy variables
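Dummy encoding can be sketched with pandas as below. The use of drop_first=True is an assumption, chosen because the final equation of this problem has no term for cut_Fair, color_D or the lowest clarity level; the rows are made up.

```python
import pandas as pd

# Made-up rows with two of the categorical columns.
toy = pd.DataFrame({
    "carat": [0.3, 0.7, 1.0],
    "cut": ["Fair", "Ideal", "Very Good"],
    "color": ["D", "E", "J"],
})

# One 0/1 column per category level, dropping the first level of each variable.
encoded = pd.get_dummies(toy, columns=["cut", "color"], drop_first=True)

# Replace spaces so the names can be used in a statsmodels formula later.
encoded.columns = [c.replace(" ", "_") for c in encoded.columns]
print(encoded.columns.tolist())
```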


We will now separate the target and the predictor variables from the dataset. Here, price is the target
variable.

Table 7: Predictor variables


We will now split the dataset into train and test sets, taking the test size as 30%.
We will then check the coefficients of the predictor variables fitted on the train dataset.

Here, we can see that the coefficient for carat is much higher than the others.
We also checked the regression model score for the train and test data, and the two scores are
similar.
Regression model score (R2) for Train dataset = 0.942
Regression model score (R2) for Test dataset = 0.938
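The split and fit steps can be sketched with scikit-learn as follows. The data here is synthetic and the random_state is an arbitrary choice for reproducibility; neither comes from the report.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the scaled predictors and the scaled price.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 1.1 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(scale=0.2, size=200)

# 70:30 split as described in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

model = LinearRegression().fit(X_train, y_train)
r2_train = model.score(X_train, y_train)  # R2 on the train set
r2_test = model.score(X_test, y_test)     # R2 on the test set

print(model.coef_, model.intercept_)
print(round(r2_train, 3), round(r2_test, 3))
```

Train and test scores that are close to each other, as reported above, suggest that the model is not overfitting.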
To proceed with the linear regression, we will now join the predictor and target variables of the
train dataset.

Taking depth variable for the modelling

Figure 17: Linear regression Model with depth variable


Avg variance between predicted and actual = 0.2164

Removing depth variable for the modelling

Figure 18: Linear regression Model without depth variable
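Comparing a model with and without one predictor can be sketched with the statsmodels formula API. The data below is synthetic and deliberately gives depth no real effect; only the pattern of fitting two formulas and comparing p-values and adjusted R2 is meant to carry over.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic train set: price depends on carat only, depth is pure noise.
rng = np.random.default_rng(0)
train = pd.DataFrame({
    "carat": rng.normal(size=300),
    "depth": rng.normal(size=300),
})
train["price"] = 1.1 * train["carat"] + rng.normal(scale=0.2, size=300)

with_depth = smf.ols("price ~ carat + depth", data=train).fit()
without_depth = smf.ols("price ~ carat", data=train).fit()

# A large p-value for depth is the usual signal that the variable can be dropped.
print(with_depth.pvalues["depth"])
print(with_depth.rsquared_adj, without_depth.rsquared_adj)
```

Calling with_depth.summary() prints the full regression table of the kind shown in Figures 17 and 18.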


MSE (Mean Squared Error) of the 1st linear regression model = 0.043
RMSE (Root Mean Square Error) = 0.207
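The RMSE is the square root of the MSE, and adjusted R2 follows from R2, the number of rows and the number of predictors. The helper names below are illustrative, and the sample inputs are made up apart from the reported MSE of 0.043.

```python
import math

def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def adjusted_r2(r2, n_rows, n_predictors):
    """Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_rows - 1) / (n_rows - n_predictors - 1)

# Consistency check on the reported figures: sqrt(0.043) is about 0.207.
reported_rmse = math.sqrt(0.043)
print(round(reported_rmse, 3))

print(rmse([1, 2, 3], [1, 2, 5]))
print(adjusted_r2(0.9, n_rows=101, n_predictors=10))
```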

1.4 Inference
Based on these predictions, what are the business insights and recommendations?
• In terms of the business problem, the EDA shows that the Ideal cut brought the most profit
to the company.
• The most profitable colours are H, I and J.
• The Ideal, Premium and Very Good cuts were bringing in profits, whereas Fair and Good
were not.
• The predictors in the training set explain 95% of the variation in price, so the predictions
capture most of that variation.
• Dropping the depth column in the second iteration gave better results.
price = (-0.76) * Intercept + (1.1) * carat + (-0.01) * table + (-0.32) * x + (0.28) * y + (-0.11) * z
+ (0.1) * cut_Good + (0.15) * cut_Ideal + (0.15) * cut_Premium + (0.13) * cut_Very_Good
+ (-0.05) * color_E + (-0.06) * color_F + (-0.1) * color_G + (-0.21) * color_H + (-0.32) * color_I + (-0.47) * color_J
+ (1.0) * clarity_IF + (0.64) * clarity_SI1 + (0.43) * clarity_SI2 + (0.84) * clarity_VS1 + (0.77) * clarity_VS2
+ (0.94) * clarity_VVS1 + (0.93) * clarity_VVS2

Equation 1: Linear regression equation of the model
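To make the equation concrete, the sketch below evaluates it for one hypothetical stone and ranks the terms by absolute coefficient. The coefficients are copied from Equation 1; the input row is made up, the inputs are assumed to be in the scaled units used for modelling, and ranking by magnitude is only a rough guide because dummy variables and continuous variables are not on the same footing.

```python
# Coefficients copied from Equation 1.
COEF = {
    "Intercept": -0.76, "carat": 1.1, "table": -0.01, "x": -0.32, "y": 0.28, "z": -0.11,
    "cut_Good": 0.1, "cut_Ideal": 0.15, "cut_Premium": 0.15, "cut_Very_Good": 0.13,
    "color_E": -0.05, "color_F": -0.06, "color_G": -0.1, "color_H": -0.21,
    "color_I": -0.32, "color_J": -0.47,
    "clarity_IF": 1.0, "clarity_SI1": 0.64, "clarity_SI2": 0.43, "clarity_VS1": 0.84,
    "clarity_VS2": 0.77, "clarity_VVS1": 0.94, "clarity_VVS2": 0.93,
}

def predict_scaled_price(row):
    """Evaluate the equation; predictors missing from `row` are treated as 0."""
    return COEF["Intercept"] + sum(COEF[name] * value for name, value in row.items())

def top_by_magnitude(n=5):
    """Names of the n terms with the largest absolute coefficient."""
    names = [name for name in COEF if name != "Intercept"]
    return sorted(names, key=lambda name: abs(COEF[name]), reverse=True)[:n]

# Hypothetical stone: scaled carat of 1.0, Ideal cut, SI1 clarity.
example = {"carat": 1.0, "cut_Ideal": 1, "clarity_SI1": 1}
print(round(predict_scaled_price(example), 2))
print(top_by_magnitude(5))
```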

Recommendations
• The Ideal, Premium and Very Good cut types are the most profitable. We can use this in marketing
to attract more customers.
• The next most important attribute is the clarity of the stone. The clearer the stone, the higher
the profit.

Problem 2: Logistic Regression and LDA
You are hired by a tour and travel agency which deals in selling holiday packages. You are provided
details of 872 employees of a company. Among these employees, some opted for the package and
some didn't. You have to help the company in predicting whether an employee will opt for the package
or not on the basis of the information given in the data set. Also, find out the important factors on the
basis of which the company will focus on particular employees to sell their packages.

Data Dictionary:
Variable Name Description
Holiday_Package Opted for Holiday Package yes/no?
Salary Employee salary
age Age in years
edu Years of formal education
no_young_children The number of young children (younger than 7 years)
no_older_children Number of older children
foreign foreigner Yes/No

2.1 Data Ingestion


Read the dataset. Do the descriptive statistics and do null value condition check, write an
inference on it? Perform Univariate and Bivariate Analysis. Do exploratory data analysis.
First, we import all the necessary libraries and then read the CSV file (Holiday_Package.csv) into Python for
further data analysis.

Figure 19: Datatype information of the dataset


Here we can see that there are 872 records and 8 columns in total, with no null values. Holiday_Package
and foreign are categorical variables; all the rest are numeric.

Table 8: 1st 5 rows of the dataset
Here we can see that Unnamed: 0 simply holds the serial number of each response, which is
not needed for the analysis, so we will remove it. We are now left with 2 categorical and 5 numeric
variables.

Table 9: Data Summary of the dataset (numeric variables)


We will now check the unique count of the categories of the categorical variables

Figure 20: Unique count of the categories


We will now check the proportions of the target variable i.e. Holiday_Package

Figure 21: Proportions of the target variable

Univariate analysis

Figure 22: Box Plot of the numeric variables

• Age has no outliers.
• educ and no_older_children have very few outliers.

Figure 23: Distribution plot of the numeric variables

• Salary is positively skewed.


• Age is normally distributed as per the plot above and educ is somewhat normally distributed.

Bivariate analysis

Figure 24: Count plot for the categorical variables

Figure 25: Bar plot of the numeric variables w.r.t. Holliday_Package

• People with salary less than 150000 opted for the Holiday packages

Figure 26: Bar plot b/w Holiday package and Salary

Figure 27: Scatter and LM plot b/w age and salary

• Employees aged between 50 and 60 seem not to opt for the holiday package.
• Employees aged 30 to 50 with a salary of less than 50000 have opted for the holiday package.

Multivariate analysis

Figure 28: Pair plot of all the numeric variables w.r.t Holiday package

• The distributions of Salary, no_young_children and no_older_children overlap for employees who
opted for the package and those who did not.
• There seems to be no strong correlation among the variables in the dataset.

Figure 29: Correlation matrix
Previously we saw that there were some outliers in the variables. We will now treat the outliers and
look at the output.

Figure 30: Box plot of the numeric variables after treating the outliers

2.2 and 2.3 Data Split and LR/LDA application along with Performance metrics
Do not scale the data. Encode the data (having string values) for Modelling. Data Split:
Split the data into train and test (70:30). Apply Logistic Regression and LDA (linear
discriminant analysis).
Performance Metrics: Check the performance of Predictions on Train and Test sets
using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each
model Final Model: Compare Both the models and write inference which model is
best/optimized.
Logistic Regression
Before proceeding with the analysis, we will create the dummy variables for the categorical variables of
the dataset

Table 10: 1st 5 rows of the dataset with dummy variables


We will now separate the predictor and target variables for further analysis. Here, Holliday_Package
is the target variable.
Now, we will split the data into train and test sets, with a test size of 30%.
A grid search is used for the logistic regression to find the best solver and the best parameters
for that solver.

Figure 31: Grid Search best parameters


We will also calculate y-predict for the analysis.
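A grid search over logistic regression settings can be sketched as below. The parameter grid, cv=3 and the synthetic data are all assumptions made for illustration; the grid actually searched in the report is not recoverable from the text.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary-classification data standing in for the encoded dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=300) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

# Hypothetical grid: two solvers and three regularisation strengths.
param_grid = {"solver": ["lbfgs", "liblinear"], "C": [0.1, 1.0, 10.0]}
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=3)
search.fit(X_train, y_train)

best_model = search.best_estimator_
y_pred_train = best_model.predict(X_train)
y_pred_test = best_model.predict(X_test)

print(search.best_params_)
```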

Confusion Matrix

Figure 32: Confusion matrix of the Train and Test dataset

• The accuracy on the test dataset is higher than that on the train dataset.

Figure 33: Logistic Regression metrics for Train and test data

• The recall is the same for both, whereas the precision and F1 score are higher for the test data.

Figure 34: AUC Score for train and test data

• The AUC score for the test data is slightly more than that of the train data
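The metrics used in this section can be computed with scikit-learn as sketched below. The labels and probabilities are made-up toy values, and the 0.5 cut-off is the usual default rather than something stated in the report.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score, confusion_matrix, f1_score,
    precision_score, recall_score, roc_auc_score,
)

# Made-up true labels and predicted probabilities of the positive class.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_prob = np.array([0.10, 0.40, 0.35, 0.80, 0.45, 0.60, 0.70, 0.90])
y_pred = (y_prob >= 0.5).astype(int)

cm = confusion_matrix(y_true, y_pred)   # rows: actual class, columns: predicted class
accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
auc = roc_auc_score(y_true, y_prob)     # uses the probabilities, not the hard labels

print(cm)
print(accuracy, precision, recall, f1, auc)
```

With a fitted model, y_prob would come from model.predict_proba(X)[:, 1], and the same calls would be repeated for the train and test sets.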

LDA Model
Here we will take the original dataset for the analysis. (Post dropping the Unnamed: 0 column)

Table 11: 1st 5 rows of the dataset


Unlike in LR model where we created the dummy variables for the categorical variables, in LDA model,
we will assign a numeric value to the categories of the categorical variable

Figure 35: Values assigned to the categories

Figure 36: Datatype info for the dataset


Here, we can see that all the variables are now numeric.
We will now separate the predictor and target variables for further analysis. Here, Holliday_Package
is the target variable.
Now, we will split the data into train and test sets, with a test size of 30%.
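The encoding and LDA fit can be sketched as follows. The toy data is randomly generated and only borrows a few column names from the dataset; the rule used to create the target is invented so that the example has something to learn.

```python
import numpy as np
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

# Randomly generated stand-in for the holiday package data.
rng = np.random.default_rng(0)
n = 300
toy = pd.DataFrame({
    "Salary": rng.normal(50000, 15000, n),
    "age": rng.integers(20, 60, n),
    "foreign": rng.choice(["no", "yes"], n),
})
# Invented rule: younger employees tend to opt for the package.
toy["Holliday_Package"] = np.where(toy["age"] + rng.normal(0, 5, n) < 40, "yes", "no")

# Assign a numeric code to each category (alphabetical: no -> 0, yes -> 1).
for col in ["foreign", "Holliday_Package"]:
    toy[col] = pd.Categorical(toy[col]).codes

X = toy.drop(columns="Holliday_Package")
y = toy["Holliday_Package"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

lda = LinearDiscriminantAnalysis().fit(X_train, y_train)
print(lda.score(X_train, y_train), lda.score(X_test, y_test))
```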

We will now check for the confusion matrix

Figure 37: Confusion Matrix for Train and test data

• The accuracy for train and test data remains the same as that of the LR model
We will now check the accuracy, F1 score and confusion matrix based on the cutoff score

Figure 38: Accuracy, F1 score and confusion matrix based on the cut-off score
• Here, we can see that the accuracy increases as the cut-off rises to 0.6; from a cut-off
of 0.7 onwards, the accuracy decreases.
• As the cut-off increases, the F1 score decreases and eventually becomes 0.
• The number of true positives tends to 0 as the cut-off increases.
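The cut-off analysis can be sketched as a simple loop over thresholds. The labels and probabilities below are made up; in the report the probabilities would come from the LDA model, presumably via predict_proba, which is an assumption here.

```python
def sweep_cutoffs(y_true, y_prob, cutoffs):
    """For each cut-off, classify as 1 when the probability exceeds it and score the result."""
    rows = []
    for cutoff in cutoffs:
        y_pred = [1 if p > cutoff else 0 for p in y_prob]
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)
        fp = sum(1 for t, p in pairs if t == 0 and p == 1)
        fn = sum(1 for t, p in pairs if t == 1 and p == 0)
        tn = sum(1 for t, p in pairs if t == 0 and p == 0)
        accuracy = (tp + tn) / len(pairs)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        rows.append({"cutoff": cutoff, "accuracy": accuracy, "f1": f1, "tp": tp})
    return rows

# Made-up labels and positive-class probabilities.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_prob = [0.12, 0.41, 0.36, 0.83, 0.47, 0.62, 0.74, 0.91]
cutoffs = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

results = sweep_cutoffs(y_true, y_prob, cutoffs)
for row in results:
    print(row)
```

As the cut-off rises, fewer cases are predicted positive, so the true positive count can only stay flat or fall, which matches the pattern described in the bullets above.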

Figure 39: AUC score for the train and test data in LDA model
The AUC score remains the same as that of the LR model.

Table 12: LR and LDA data for Accuracy, AUC, Recall, Precision and F1 score
We see here that the values for the LR and LDA models remain similar.

2.4 Inference
Based on these predictions, what are the insights and recommendations?
• We made predictions with both logistic regression and linear discriminant analysis, and the
results of both are the same.
• From the EDA, we saw that people aged above 50 do not opt for holiday packages.
• People aged 30 to 50 with a salary of less than 50000 generally opt for holiday
packages.
• Salary, age and educ are the important factors driving the predictions.

Recommendations
• To improve uptake among people aged above 50, we can offer packages to places that
older people can relate to and like to visit, such as religious places, and foreign trips at cheap prices.
• A vacation package can be offered to people earning more than 150000 and to those who
have a greater number of older children.
