
Capstone-Project-Supplychain-Dataco- Final Report

LATE DELIVERY RISK PREDICTOR FOR


DATA CO. SUPPLY CHAIN
CAPSTONE PROJECT FINAL REPORT

PGPBABI- ONLINE BATCH JANUARY 2020

SUBMITTED BY: K.P.HARIHARAN

Page 1 of 79

ACKNOWLEDGEMENTS
First of all, I wish to express my deepest gratitude to all the faculty members of Great Learning for their
excellent guidance and continuous support to enhance my learning in Business Analytics and Business
Intelligence.
My sincere note of gratitude with many thanks to my mentors Mr. Amit Kulkarni, Mr. Nimesh Marfatia
and coaches Mr. Animesh Tiwari, Mr. Sarabjeet Singh Kochar, Ms Karuna Kumari, for making the
learning experience more profound by discerning the complex subjects into simple explanations that
helped me to understand the subject and most important in its context precisely.
I take this opportunity to Thank the Program Office of Great Learning, Ms. Richa for helping me through
different stages of this curriculum.
Last but not least, to my family for their unconditional support and encouragement.

ABSTRACT & LITERATURE REVIEW


Literature Review: -

Purpose
This report presents the Late Delivery Predictor model that can help Data Co. Supply chain to predict
the risk of late delivery expected in the supply chain delivery

Design/methodology/approach
A review was conducted to identify the classification algorithms that can provide the best results; these are regression-based, frequency-based, decision tree and ensemble methods.

Findings
The final report identifies the impact of late delivery and further provides a model that can predict late delivery. Furthermore, the paper develops a roadmap framework for future research and practice.

Practical implications
The proposed work is useful for both the business and data practitioners of Data Co. Supply Chain, as it outlines the components for every supply chain transformation. It also proposes the collection of additional data to improve the model.

Abstract: -

The main objective of this capstone project is to develop a Late Delivery Predictor model that can
help Data Co. Supply chain to predict the risk of late delivery expected in the supply chain delivery.

The contribution of this project, presented in this final report, is to showcase the various predictive models for Late Delivery that were developed with the data provided by Data Co. Supply Chain, using renowned machine learning and data modelling techniques and algorithms such as:

 Regression based - Logistic Regression


 Frequency based - Naïve Bayes, KNN (K Nearest Neighbours)
 Decision Trees- CART, Random Forest
 Ensemble Methods- Bagging and Boosting (XG Boosting)

R Studio was used as the software tool to build the predictive models, and Tableau software was used for data visualisation in this project.


The output of the various models built using the aforesaid techniques was then evaluated using performance metrics such as the Confusion Matrix, ROC and Gini Index (as applicable); the results from each model were compared to identify the best-performing model, which is recommended to the business in this report.

This report also shares Business Insights and findings from the data provided and recommendations
hence to make business successful using the Late Predictor tool.

Keywords:
Missing data, Outliers, Capping Technique, Central Tendency, Multicollinearity, Clustering (PCA-FA), Feature Selection, Scaling, Sample Split, Overfit, Underfit, Regression, Frequency Based, Decision Trees, Ensemble Methods, Bagging, Boosting, Confusion Matrix, ROC-AUC, Gini Index, Best Model

TABLE OF CONTENTS
ACKNOWLEDGEMENTS................................................................................................................2
ABSTRACT & LITERATURE REVIEW.................................................................................................2
Purpose.............................................................................................................................. 2
Design/methodology/approach...........................................................................................2
Findings.............................................................................................................................. 2
Practical implications.......................................................................................................... 2
TABLE OF CONTENTS.................................................................................................................... 3
LIST OF TABLES............................................................................................................................ 5
LIST OF FIGURES.......................................................................................................................... 6
ABBREVIATIONS.......................................................................................................................... 7
SECTION 1: INTRODUCTION, PROBLEM, OBJECTIVES, SCOPE, DATA SOURCES, METHODOLOGY.........9
1.1 Introduction..................................................................................................................9
1.2 The Problem Statement.............................................................................................10
1.3 Objectives of the study..............................................................................................10
1.4 Scope........................................................................................................................ 11
1.5 Data Source............................................................................................................... 11
1.6 Methodology.............................................................................................................. 12
SECTION 2: EXPLORATORY DATA ANALYSIS INCLUDING DATA PREPARATION, CLEANING AND
IMPUTATION............................................................................................................................. 12
2.1 Variable Identification................................................................................................12
2.2 Univariate and Bivariate analysis...............................................................................13
2.3 Missing Value Treatment...........................................................................................15


2.4 Outlier Treatment.......................................................................................................15


2.5 Check for Multicollinearity...........................................................................................16
2.6 Data Preparation – Feature scaling, Balancing and Clustering..................................18
2.7 Variable transformation..............................................................................................22
2.8 Feature Selection........................................................................................ 22
2.9 EDA – Data Preparation Summary............................................................................23
SECTION 3: ALL MODEL DEVELOPMENT INCLUDING TESTING OF ASSUMPTIONS AND PERFORMANCE
EVALUATION METRICS............................................................................................................... 24
3.1 Applying Logistic regression, Model tuning, Model evaluation & Interpret results......25
3.2 Applying Naïve bayes, Model tuning, Model evaluation & Interpret results................36
3.3 Applying KNN, Model tuning, Model evaluation & interpret results.............................38
3.4 Applying CART, Model tuning, Model evaluation & Interpret results..........................42
3.5 Applying Random forest, Model tuning, Model evaluation & Interpret results.............48
3.6 Applying BAGGING, Model tuning, Model evaluation & Interpret results...................53
3.7 Applying BOOSTING, Model tuning, Model evaluation & Interpret results.................55
3.8 Model Validation to Compare Models and Find The Best Performed Model..............57
SECTION 4: FINDINGS & INSIGHTS, DATA CONSTRAINTS & MODEL INTERPRETATION......................58
4.1 Findings & Business Insights.....................................................................................58
4.2 Data constraints & Model Interpretation.....................................................................59
SECTION 5: CHALLENGES FACED DURING RESEARCH OF PROJECT AND TECHNIQUES USED TO
OVERCOME THE CHALLENGES.....................................................................................................60
SECTION 6: RECOMMENDATIONS, CONCLUSIONS/APPLICATIONS.................................................60
SECTION 7: BIBLIOGRAPHY......................................................................................... 61
APPENDIX A.............................................................................................................................. 62
A1 R-SOURCE CODE............................................................................................................. 62
A2 TABLEAU VISUALISATION SOURCE CODE................................................................62
A3 UNIVARIATE ANALYSIS..................................................................................................66
A4 BIVARIATE ANALYSIS.....................................................................................................74


LIST OF TABLES
Table 2. 1 - Univariate- Bivariate study summary and recommended actions..........................13
Table 2. 2 - Correlation Study Categoric variables- Chi Square Test.......................................17
Table 2. 3- Scaled- Numeric Variables output..........................................................................18
Table 2. 4- Scaled- Numeric Variables output..........................................................................19
Table 2. 5 - Factors interpretation with labels........................................................................21
Table 3. 1– Logistic Regression- Confusion Matrix-Train Data................................................32
Table 3. 2 – Logistic Regression- Confusion Matrix-Test Data................................................32
Table 3. 3 – Logistic Regression Tuned- Confusion Matrix-Train Data....................................33
Table 3. 4 – Logistic Regression Tuned- Confusion Matrix-Test Data.....................................34
Table 3. 5 – Logistic Regression Tuned- Confusion Matrix-Test Data.....................................34
Table 3. 6 – Logistic Regression Tuned- Final Results-Test Data...........................................35
Table 3. 7– Naive Bayes- Confusion Matrix on Test Data........................................................37
Table 3. 8 – Naive Bayes- Confusion Matrix Tuned- Final Results-Test Data..........................37
Table 3. 9 - KNN - Confusion Matrix Test Data- K = 19...........................................................39
Table 3. 10 – KNN - Confusion Matrix Test Data- K = 9...........................................................40
Table 3. 11– KNN - Confusion Matrix Test Data- K = 29..........................................................40
Table 3. 12 – KNN - Confusion Matrix Tuned Model- Test Data- K = 9....................................41
Table 3. 13 – KNN - Confusion Matrix Tuned- Final Results-Test Data...................................41
Table 3. 14– CART - Confusion Matrix Tuned- Results on Train Data.....................................46
Table 3. 15 – CART - Confusion Matrix Tuned- Results on Test Data.....................................47
Table 3. 16 – CART - Confusion Matrix Tuned- Final Results-Test Data.................................47
Table 3. 17 – Random Forest - Confusion Matrix Tuned- Results on Train Data.....................51
Table 3. 18 – Random Forest - Confusion Matrix Tuned- Results on Test Data......................51
Table 3. 19 – Random Forest - Confusion Matrix Tuned- Final Results-Test Data..................52
Table 3. 20 – Bagging - Confusion Matrix Tuned- Results on Test Data..................................54
Table 3. 21 – Bagging - Confusion Matrix Tuned- Final Results-Test Data..............................54
Table 3. 22 – Bias Vs Variance................................................................................................55
Table 3. 23 – Boosting - Confusion Matrix Tuned- Results on Test Data.................................56
Table 3. 24 – Boosting - Confusion Matrix Tuned- Final Results-Test Data.............................56
Table 3. 25 – Model Selection- Comparison Matrix..................................................................57


LIST OF FIGURES
Fig 1. 1- Data Analytics Life Cycle
Fig 1. 2- The Business Problem Understanding........................................................................10
Fig 1. 3 - The Data Report.......................................................................................................11
Fig 2. 1- Box plot BEFORE Outlier treatment............................................................................15
Fig 2. 2- Box plot AFTER Outlier treatment...............................................................................15
Fig 2. 3- Correlation Plot Numeric variables- By Indicators......................................................16
Fig 2. 4- Correlation Plot Numeric variables- By Numbers.......................................................17
Fig 2. 5 - Scree Plot – Eigen Values of Components...............................................................20
Fig 2. 6 - FA Diagram – Rotation None....................................................................................21
Fig 2. 7- EDA- Data Preparation, Cleaning, Imputation- Summary.............................................2
Fig 3. 1 - Logistic Regression- ROC-AUC Charts....................................................................36
Fig 3. 2- KNN- Classification Method:......................................................................................39
Fig 3. 3 - CART Tree Before Pruning.......................................................................................42
Fig 3. 4- CART Complexity Parameter-Visualisation................................................................44
Fig 3. 5 - CART Pruned Tree...................................................................................................44
Fig 3. 6- CART – ROC- AUC Chart..........................................................................................48
Fig 3. 7- Random Forest Train Trees Vs Error.........................................................................49
Fig 3. 8- Random Forest Variable Importance.........................................................................50
Fig 3. 9 - Random Forest TEST- ROC Curve..........................................................................53


ABBREVIATIONS

Term (short form) - Definition (full form) - Description

AUC - Area Under the (ROC) Curve - Diagnostic of classifier efficiency; an AUC of 1.0 indicates a perfect classifier

BDA - Big Data Analytics - Advanced analytics techniques applied to very large, diverse data

CART - Classification & Regression Trees - Tree-based methodology for prediction

CP - Complexity Parameter - Parameter used to control the size of a decision tree

EDA - Exploratory Data Analysis - Approach to data analysis that employs various graphical techniques

GINI - Measure of Inequality - Measure of statistical dispersion

IOT - Internet of Things - Denotes internet-connected objects

KNN - K Nearest Neighbours - Distance-based methodology for prediction

LR - Logistic Regression - Regression-based methodology for prediction

NB - Naive Bayes - Frequency-based methodology for prediction

PCA-FA - Principal Component Analysis / Factor Analysis - Variance-based clustering technique

ROC - Receiver Operating Characteristic - Graphical plot used as a diagnostic of a binary classifier's ability

SCM - Supply Chain Management - Handling of the entire production flow of a good or service

TP/FP - True Positive / False Positive - TP: outcome where the model correctly predicts the positive class; FP: the model incorrectly predicts the positive class

TN/FN - True Negative / False Negative - TN: outcome where the model correctly predicts the negative class; FN: the model incorrectly predicts the negative class

VIF - Variance Inflation Factor - Measure of the amount of multicollinearity
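To make the TP/FP/TN/FN definitions above concrete, here is a minimal sketch of how these counts combine into accuracy, sensitivity and specificity, the metrics used in the model evaluations later in this report. The report's actual computations were done in R; this is a Python illustration, and the counts below are hypothetical, not taken from the report's models.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Derive standard diagnostics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)   # true positive rate (recall)
    specificity = tn / (tn + fp)   # true negative rate
    return accuracy, sensitivity, specificity

# Hypothetical counts for illustration only
acc, sens, spec = confusion_metrics(tp=80, fp=10, tn=90, fn=20)
print(round(acc, 3), round(sens, 3), round(spec, 3))  # → 0.85 0.8 0.9
```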


SECTION 1: INTRODUCTION, PROBLEM, OBJECTIVES, SCOPE, DATA


SOURCES, METHODOLOGY
1.1 INTRODUCTION
Big data analytics (BDA) in Supply Chain Management (SCM) has received growing attention in the recent past. This is due to the fact that BDA has a wide range of applications in SCM, including customer behaviour analysis, trend analysis, and demand prediction.

A variety of statistical analysis techniques have been used in SCM in the areas of demand forecasting,
time series analysis and regression analysis. With advancement in information technologies and
improved computational efficiencies, Big data analytics (BDA) has emerged as a means of arriving at
more precise predictions that better reflect customer needs, facilitate assessment of Supply
Chain performance, improve the efficiency of supply chain, reduce reaction time and support
Supply Chain risk assessment.

With SCM efforts aiming at satisfying customer demand while minimising the total cost of supply,
applying Machine Learning- Data Analytics algorithms could facilitate precise (data driven) demand
forecasts and align supply chain activities with these predictions to improve efficiency and customer
satisfaction.

How to use Data science to solve business problems?

Fig 1. 1- Data Analytics Life Cycle

The above figure (Fig 1.1) explains the steps involved in the data analytics life cycle. The first and foremost step is to identify the problem and understand the business need for the study. This is followed by data collection and visual interpretation of the data, and then EDA (Exploratory Data Analysis), which involves both data cleaning and data exploration. Feature engineering follows, to identify the relevant variables for the model out of the large set of variables in the data set. Once the variables are clear, the next step is to choose the modelling techniques to be used; once a model is built, it is evaluated using model evaluation techniques to find the optimal model and provide the final recommended model.

In this project we have used the approach of Data Analytics life cycle and have simulated supply chain
process of the company – Data Co. using the data set provided by the company.

In this data set the problem identified is late delivery, and a prediction model is needed to identify whether a particular product is going to reach the customer on time or delayed, which is a classification type of problem.

We worked on various classification-oriented modelling techniques such as logistic regression, random forest, CART, Naïve Bayes and KNN; the models were later evaluated using model evaluation methods like the Confusion Matrix, ROC, AUC etc.

1.2 THE PROBLEM STATEMENT


The data provided relates to the delivery activity in the supply chain process of the company, Data Co. The underlying problem associated with the data is that there are late deliveries, which lead to a bad customer experience, affect profitability (both top line and bottom line) and decrease sales, as depicted in Fig 1.2. Hence, to understand the problem, a study was conducted with the data set provided by the company to:
a. Analyse the timelines of deliveries
b. Assess adherence to the stipulated delivery timelines (whether committed timelines are met or not)
c. Identify reasons for delay in the given set of transactions/orders

Fig 1. 2-The Business Problem Understanding

1.3 OBJECTIVES OF THE STUDY


The objectives of this study are to:
1. Analyse the timelines of the delivery
2. Identify adherence to the stipulated timelines of the delivery (Committed timelines are met/ or
not)
3. Analyse reasons for the delays in the given set of transactions/orders
4. Build the model that can predict late delivery using various classification oriented data
modelling techniques like Logistic Regression, Naïve Bayes, KNN, CART, Random Forest,
Ensemble methods- bagging, boosting.
5. Test various performance metrics – Confusion Matrix, AUC, ROC etc. as applicable
6. Fine-tune the model parameters
7. Identify and interpret the best model
8. Share Business Insights and Recommendation.


1.4 SCOPE
The scope of this study is limited to the data set provided by Data Co. Supply Chain and to the models mentioned in the objectives.

1.5 DATA SOURCE


The given dataset contained information about a company called “Data Co. Global” on its activities
related to Provisioning, Production, Sales, Commercial distribution of various consumer goods.

The data gathered had 180,519 rows/records with 53 attributes/variables. The data contained both quantitative (numerical) variables, which are measured on a numeric or quantitative scale, and qualitative (categorical) variables, which are not numerical.

Quantitative variables can be further subgrouped as: a. Discrete - whole numbers, typically counts, e.g. number of visits, number of attendees; b. Continuous - can take on almost any numeric value and can be meaningfully divided into smaller increments (fractions, decimals), e.g. height, weight, temperature. Qualitative variables can be further subgrouped into: a. Nominal - categories with no natural order or ranking, which are mutually exclusive, e.g. zip code, gender; b. Ordinal - ordered, mutually exclusive categories, e.g. socio-economic status ("low income", "middle income", "high income"), education level ("high school", "BS", "MS", "PhD"), income level ("less than 50K", "50K-100K", "over 100K"), satisfaction rating ("extremely dislike", "dislike", "neutral", "like", "extremely like").

The data provided was collected on a daily basis over a period of 3 years, from January 2015 to December 2017, plus January and February of 2018. The data in this context can be categorised (or grouped) into 6 categories. The taxonomy of the data is represented in the diagram below (Fig 1.3) for a better understanding of the underlying data.

Fig 1. 3 - The Data Report

[Fig 1.3 shows the taxonomy of the 53 DataCo. Supply Chain variables, grouped under Stores (Department ID/Name, Latitude, Longitude), Customers (Customer City, Country, Email, Fname/Lname, ID, Password, Segment, State, Street, Zipcode), Products (Category Id/Name, Product Card Id, Product Category Id, Description, Image, Name, Price, Status), Sales (Sales, Sales per Customer, Benefits per order, Order item discount and discount rate, Order item id, product price, profit ratio, quantity, total, Order profit per order), Orders (Market, Type, Order city, country, customer id, date, id, cardprodid, region, state, status, zip code) and Shipping & Delivery (Scheduled and real shipping days, Shipping date, Shipping mode, Delivery Status, Late_delivery_risk).]

1.6 METHODOLOGY
The approach used to resolve the aforestated problem in the case study was machine learning and predictive modelling techniques such as Logistic Regression, Naïve Bayes, KNN, CART, Random Forest and ensemble models, using R Studio as the software tool.

Please refer to Appendix A for the source code, including the R libraries and packages that were used for this case study.

SECTION 2: EXPLORATORY DATA ANALYSIS INCLUDING DATA


PREPARATION, CLEANING AND IMPUTATION
The below exploratory data analysis was conducted with the data set
2.1. Variable Identification
2.2. Univariate and Bivariate Analysis
2.3. Missing Value Treatment
2.4. Outlier Treatment
2.5. Check for Multi collinearity
2.6. Data preparation- Feature Scaling, Balancing, Clustering
2.7. Variable Transformation
2.8. Feature Selection

2.1 VARIABLE IDENTIFICATION


This data set has 53 variables/objects/columns and 180,519 observations/rows; a preliminary study was conducted to understand the variables.
 The variable Product Image contains HTTP links to the images, which could not be read in R; hence all values of this variable are read as NAs in R. This variable was therefore removed.
 The entire Product Description column is null (not a single value is entered in this column). This variable was therefore removed.
 Order Zipcode has a huge number of NAs (155,679, i.e. 86% of the data unavailable). This variable was therefore removed.
 Customer Password is masked (showing XX). This variable was therefore removed.
 The dataset contains geospatial variables, i.e. Latitude and Longitude, and also date variables such as Order date (DateOrders) and Shipping date (DateOrders), though these show as numeric variables in format.
 A few variables, such as Product Status and Late_delivery_risk, are of numeric data type in the dataset but are actually categorical variables; hence they were converted into factors.
 Outliers are present in many numeric variables; further study was conducted and treatment of the outliers was performed.
 Customer Zip code has 3 missing values; these were imputed by finding the nearest neighbours.

Considering the above, we cannot run an ad-hoc analysis; hence there is a need to identify the variables that are important for evaluating the late delivery risk.
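The screening logic described above, dropping any column whose share of missing values breaches a threshold, can be sketched as follows. This is a Python illustration of the R workflow; the toy values are hypothetical, with only the column names drawn from the dataset.

```python
def screen_columns(data, max_missing_share=0.2):
    """Drop columns whose share of missing (None) values exceeds the threshold."""
    n = len(next(iter(data.values())))
    kept = {}
    for name, values in data.items():
        missing = sum(v is None for v in values)
        if missing / n <= max_missing_share:
            kept[name] = values
    return kept

# Toy rows: Product Description is entirely missing, so it is dropped
toy = {
    "Product Price": [327.75, 50.0, 18.25, 327.75],
    "Product Description": [None, None, None, None],
    "Late_delivery_risk": [1, 0, 1, 1],
}
clean = screen_columns(toy)
print(sorted(clean))  # → ['Late_delivery_risk', 'Product Price']
```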

First Level Check of Variables for predicting capabilities


A preliminary check was performed to understand the predicting capabilities of the independent variables, and the following were identified:
 Variables related to customer information, such as Country (the country from where the purchase was made), customer email address details and other customer details such as Customer ID, Customer Name, Customer Password, Customer Segment, customer address, Department ID and Customer Zip code, are less important for the analysis to be done.
 Variables with higher missing values were removed (as represented in the missing value identification)


Further, these variables do not have much relevance in evaluating the late delivery risk and the reasons for late delivery, since they do not contain the location from which the product was shipped (i.e. the store from where the product was shipped).

Customer Segment also has no relevance, since the product may have been ordered for a different customer in a different location, and we do not have customer segment information for the final end user of the product. Hence, product-related information other than Product Price does not possess predicting capability for late delivery risk, and such variables were removed.

From the bi-variate analysis presented in the next chapter (2.2 Univariate and Bivariate analysis), certain order-related variables showed low variability with respect to late delivery risk and were hence removed.

Hence, the following 28 variables were removed for further analysis from the given data set.

Category ID, Category Name, Customer City, Customer Country, Customer Email, Customer Fname,
Customer ID, Customer Lname, Customer Password, Customer Segment, Customer State, Customer
Street, Customer Zipcode, Department Name, Order Customer Id, Order date, Order id, Order Item
Cardprodid, Order Item Id, Order Zip code, Product Card ID, Product Category Id, Product Description,
Product Image, Product Name, Product status, Shipping date.

The remaining 25 variables (including the target variable) were hence taken for further analysis and model building.

2.2 UNIVARIATE AND BIVARIATE ANALYSIS


What is Univariate Analysis: -
Uni means one: it is the method of picking one variable and analysing the data observations pertaining to that variable using descriptive statistics methods such as histograms, density plots and box plots, to understand data patterns and the distribution of the data.

However, univariate analysis does not deal with cause, relationship etc.; its major purpose is to describe the data and summarise the patterns in it. Univariate analysis was conducted for both numeric and categorical variables.

The output of the Univariate study is available in Appendix A
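The skew assessments reported in Appendix A can be illustrated with a quick heuristic: when the mean exceeds the median, the distribution is right skewed, and vice versa. The report's actual plots were produced in R; the Python sketch below, on hypothetical values, is only an illustration of the idea.

```python
import statistics

def skew_direction(values):
    """Rough skew check: compare mean against median."""
    m, med = statistics.mean(values), statistics.median(values)
    if m > med:
        return "right-skewed"
    if m < med:
        return "left-skewed"
    return "symmetric"

# A long right tail (the 50) pulls the mean above the median
print(skew_direction([1, 2, 2, 3, 50]))  # → right-skewed
```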

What is Bivariate Analysis: -


Bi-variate analysis, as the name indicates, involves the simultaneous analysis of two variables for the purpose of determining the empirical relationship between them. Bi-variate analysis helps to test the hypothesis of association, i.e. it explores the relationship between two variables in terms of:
a. Whether an association exists, and the strength of that association
b. Or whether there are differences between the two variables, and the significance of the difference

In this data study there are numerical and categorical variables. The dependent variable is a categorical variable, and the independent variables are both numeric and categorical.

The following bi-variate analyses were performed for this data set.


a. Categorical Vs Numerical – Box plots
b. Categorical Vs Categorical – Bar Plots
c. Numerical Vs Numerical –Scatter Plots and Linear Correlation

The output of the Bivariate study is available in Appendix A
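For the Categorical vs Categorical case, the report applies a chi-square test of association (see Table 2.2). A minimal pure-Python sketch of the Pearson chi-square statistic on a 2x2 contingency table follows; the counts are hypothetical, and the actual tests in the report were run in R.

```python
def chi_square_2x2(table):
    """Pearson chi-square statistic for a 2x2 contingency table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    stat = 0.0
    # Sum (observed - expected)^2 / expected over the four cells
    for obs, r, col in [(a, row1, col1), (b, row1, col2),
                        (c, row2, col1), (d, row2, col2)]:
        exp = r * col / n
        stat += (obs - exp) ** 2 / exp
    return stat

# Hypothetical shipping-mode vs late-delivery counts
stat = chi_square_2x2([[30, 70], [60, 40]])
print(round(stat, 2))  # → 18.18
```

A large statistic relative to the chi-square critical value (here with 1 degree of freedom) indicates that the two categorical variables are associated.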


The assessment from the univariate and bi-variate study of the variables is summarised in the table below (Table 2.1).


Table 2. 1 - Univariate- Bivariate study summary and recommended actions

Numeric Variable(s) | Univariate Study | Bivariate Study | Recommendations
Days for shipping actual | Right skewed, no outliers | Less correlation with other independent variables | Variable can be considered for model building
Days for shipping scheduled | Right skewed, no outliers | Less correlation with other independent variables | Variable can be considered for model building
Benefits per order | Left skewed, many outliers | High correlation with order item profit ratio, order profit per order | Outlier treatment and multicollinearity treatment needed
Sales per customer | Right skewed, many outliers | High correlation with Sales, product price, order item product price | Outlier treatment and multicollinearity treatment needed
Order Item discount | Left skewed, no outliers | Less correlation with other independent variables | Variable can be considered for model building
Order item product price | Right skewed, few outliers | High correlation with Product Price | Outlier treatment and multicollinearity treatment needed
Order item profit ratio | Left skewed, many outliers | High correlation with order profit per order | Outlier treatment and multicollinearity treatment needed
Sales | Right skewed, few outliers | High correlation with Sales per customer and order item total | Outlier treatment and multicollinearity treatment needed
Order item total | Right skewed, many outliers | High correlation with product price | Outlier treatment and multicollinearity treatment needed
Order profit per order | Left skewed, many outliers | High correlation with order item profit ratio | Outlier treatment and multicollinearity treatment needed
Product Price | Right skewed, few outliers | High correlation with order item product price | Outlier treatment and multicollinearity treatment needed

Categorical Variable(s) | Univariate Study | Bivariate Study | Recommendations
Type | Debit 38% highest, Cash 11% lowest | Correlated to dependent variable | Less cash; considered for model building
Delivery status | Late Delivered 55% | Associated with late delivery risk | Not considered for model building
Late Delivery risk | Risk 55% | Is the dependent variable | Risk is high, mitigation needed; is the dependent variable
Product status | Availability 100% | Lesser influence on the dependent variable | Better inventory; product-related, no influence on the dependent variable
Order status | 56% of orders are Open | Correlated to dependent variable | Expect payment delays; considered for model building
Shipping mode | 60% Standard Class, 20% faster delivery | Correlated to dependent variable | Efficient supply chain needed; considered for model building
Customer City, country | - | Customer City and country are highly correlated | Not considered for model building
Order city, country, region | - | Order city, country and region are highly correlated | Not considered for model building


2.3 MISSING VALUE TREATMENT


A few missing values were identified in the data set: Customer Zip Code (3), Order Zip Code
(155,679) and Product Description (180,519). Because Order Zip Code and Product Description were
missing far beyond the 15-20% threshold above which, as a best practice, a feature should be
removed from model building, both variables were dropped. With those removed, no further missing
value treatment was needed for this data set.
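As a minimal base-R sketch of the rule applied above (the columns mirror the three affected variables, but the toy counts are illustrative, not the project's):

```r
# Drop features whose share of missing values exceeds a chosen
# threshold (15% here, per the 15-20% rule of thumb cited above).
df <- data.frame(
  Customer.Zipcode    = c(NA, 2:10),            # 10% missing -> kept
  Order.Zipcode       = c(rep(NA, 8), 9, 10),   # 80% missing -> dropped
  Product.Description = rep(NA_character_, 10)  # 100% missing -> dropped
)

miss_pct <- colMeans(is.na(df))                 # fraction missing per column
keep     <- names(miss_pct)[miss_pct <= 0.15]
df_clean <- df[, keep, drop = FALSE]
```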

2.4 OUTLIER TREATMENT


Outlier identification was conducted for the numeric variables in the dataset: a boxplot review
was performed on the subset of variables identified in the previous step.

The box plots show outliers in most of the numeric variables. Since logistic regression models
are sensitive to outliers, the outliers were treated with a capping technique, using the median
as the measure of central tendency.
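A sketch of this capping step in base R, assuming the usual 1.5×IQR boxplot-whisker rule to flag outliers before replacing them with the median (the example vector is made up):

```r
# Cap outliers: values outside the boxplot whiskers are replaced
# with the column median (the central tendency used in the report).
cap_with_median <- function(x) {
  q  <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  lo <- q[1] - 1.5 * diff(q)
  hi <- q[2] + 1.5 * diff(q)
  x[x < lo | x > hi] <- median(x, na.rm = TRUE)
  x
}

sales <- c(10, 12, 11, 13, 12, 500)   # 500 is an obvious outlier
cap_with_median(sales)                # the 500 becomes the median, 12
```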

Box plots before and after outlier treatment are presented in (Fig 2.1) & (Fig 2.2)

Fig 2. 1- Box plot BEFORE Outlier treatment

Fig 2. 2- Box plot AFTER Outlier treatment


2.5 CHECK FOR MULTICOLLINEARITY


Definition of Multicollinearity:-
Multicollinearity occurs when the independent variables of a regression model are correlated. If
the degree of collinearity between the independent variables is high, it becomes difficult to
estimate the relationship between each independent variable and the dependent variable, and the
overall precision of the estimated coefficients suffers.

Disadvantages of Multicollinearity:-
For regression, multicollinearity is a problem because:
a. If two independent variables contain essentially the same information, either one may appear
significant or insignificant depending on which is included.
b. It produces unstable estimates, as it tends to inflate the variances of the regression
coefficients.

Advantages of Multicollinearity:-
For PCA (Principal Component Analysis) and FA (Factor Analysis) multicollinearity is an advantage as
it helps to reduce the dimension of the variables since the variables are correlated.

How to assess the presence of Multicollinearity?


One way to assess multicollinearity is to compute the Variance Inflation Factor (VIF); a VIF > 5
indicates the presence of multicollinearity. Another way is a graphical correlation study to
identify correlated pairs of variables.
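The project presumably used a packaged VIF routine such as car::vif, but the same quantity can be computed from first principles in base R, since VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing predictor j on all the other predictors. The data below are simulated purely to illustrate:

```r
# Manual VIF: regress each predictor on the rest and invert 1 - R^2.
vif_manual <- function(data) {
  sapply(names(data), function(v) {
    fit <- lm(reformulate(setdiff(names(data), v), response = v), data = data)
    1 / (1 - summary(fit)$r.squared)
  })
}

set.seed(42)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.1)     # nearly collinear with x1
x3 <- rnorm(100)
vif_manual(data.frame(x1, x2, x3))  # x1 and x2 blow past the VIF > 5 rule
```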

Numerical vs Numerical: -
The following multivariate analyses were performed for this data set, represented in (Fig 2.3) & (Fig 2.4):
a. Correlation Study
b. Multicollinearity Checks

Fig 2. 3- Correlation Plot Numeric variables- By Indicators

Page 16 of 79
Capstone-Project-Supplychain-Dataco- Final Report

Fig 2. 4- Correlation Plot Numeric variables- By Numbers

Inference: -
The correlation plot shows the presence of correlated independent variables:
- Benefit per order, Order item profit ratio and Order profit per order are highly correlated.
- Sales is highly correlated with Sales per customer and Order item total.
- Order item product price is highly correlated with Product price.
- Sales per customer and Sales are highly correlated.

Since correlated predictor/independent variables exist in this data set, multicollinearity will
arise, which may impair the accuracy of the predictions and distort the coefficients used to
judge variable importance.

The suggested remedial measure was to treat the multicollinearity; the methods of treatment are: -
- Remove some of the highly correlated variables, guided by VIF
- Standardise the values by subtracting the means
- Perform PCA (Principal Component Analysis) / FA (Factor Analysis) to reduce the dimension of
the correlated independent variables

For this data set, PCA/FA was performed to reduce the dimension of the correlated independent
variables, which is covered in the next section, Data Preparation.

Correlation Study using Chi Square for Categorical Variables: -


An association study was conducted for the categorical variables using the Chi-square test; the
correlated categorical variables found are presented in (Table 2.2)


Table 2. 2 - Correlation Study Categoric variables- Chi Square Test


Order City Vs Order Country
> chisq.test(tab2)

Pearson's Chi-squared test

data: tab2
X-squared = 29060235, df = 586148, p-value < 2.2e-16
p is low, so one of the variables can be dropped
Order Country Vs Order Region
> chisq.test(tab3)

Pearson's Chi-squared test

data: tab3
X-squared = 3429861, df = 3586, p-value < 2.2e-16
p is low, so one of the variables can be dropped
Order Country Vs Order State
> chisq.test(tab4)

Pearson's Chi-squared test

data: tab4
X-squared = 28301214, df = 177344, p-value < 2.2e-16
p is low, so one of the variables can be dropped

2.6 DATA PREPARATION – FEATURE SCALING, BALANCING AND CLUSTERING


Feature Scaling:-
Why Feature Scaling Needed?
Machine learning is like making a mixed fruit juice: to get the best blend, we must mix the
fruits not by their size but in the right proportion. An apple and a strawberry are not
comparable unless we bring them into a common context. Similarly, in many machine learning
algorithms, to put all features on the same footing we need to scale them, so that one feature
does not dominate the model merely because of its large magnitude.

Feature scaling in machine learning is one of the most critical steps during the pre-processing of data
before creating a machine learning model. Scaling can make a difference between a weak machine
learning model and a better one.

The most common techniques of feature scaling are Normalization and Standardization.
Normalization is used when we want to bound our values between two numbers, typically [0,1] or
[-1,1], while Standardization transforms the data to have zero mean and unit variance, making
the values unitless.
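A base-R sketch contrasting the two techniques on an illustrative vector:

```r
x <- c(10, 200, 3000, 40000)

# Normalization (min-max): bounds the values to [0, 1]
x_norm <- (x - min(x)) / (max(x) - min(x))

# Standardization (z-score): zero mean, unit variance
x_std <- as.numeric(scale(x))

round(x_norm, 3)
c(mean = mean(x_std), sd = sd(x_std))   # approximately 0 and exactly 1
```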

A machine learning algorithm just sees numbers. If there is a vast difference in range, say a
few features in the thousands and a few in the tens, it makes the underlying assumption that the
larger numbers have some sort of superiority, so those features start playing a more decisive
role while training the model. The algorithm works on numbers and does not know what the numbers
represent: a weight of 10 grams and a price of 10 dollars represent two completely different
things, a no-brainer for humans, but as features the model treats both the same.

Page 18 of 79
Capstone-Project-Supplychain-Dataco- Final Report

Some examples of algorithms where feature scaling matters are:

1. K-Nearest Neighbours (KNN) with a Euclidean distance measure is sensitive to magnitudes, so
   all features should be scaled to weigh in equally.
2. Scaling is critical while performing Principal Component Analysis (PCA). PCA tries to find
   the features with maximum variance, and variance is larger for high-magnitude features, which
   skews the PCA towards them.
3. Scaling helps speed up gradient descent, because θ descends quickly on small ranges and
   slowly on large ranges, oscillating inefficiently down to the optimum when the variables are
   very uneven.

Algorithms that do not require normalization/scaling are the ones that rely on rules. They would not
be affected by any monotonic transformations of the variables. Scaling is a monotonic transformation.
Examples of algorithms in this category are all the tree-based algorithms — CART, Random Forests,
Gradient Boosted Decision Trees. These algorithms utilize rules (series of inequalities) and do not
require normalization.
Scaling was performed on the numerical data subset; the output of the scaling is shown below (Table 2.3)

Table 2. 3- Scaled- Numeric Variables output

Data Balancing :-
What are Balanced and Imbalanced Datasets?
Balanced dataset:
Take a simple example of a dataset with positive and negative values. If the positive values are
roughly equal in number to the negative values, the dataset is balanced.
Imbalanced dataset:
In the same example, if there is a very large difference between the counts of positive and
negative values, then the dataset is an imbalanced dataset.
In the Data Co. data set, the distribution of the target/dependent variable is: 0 (no risk of
late delivery) 45.16%, 1 (late delivery risk) 54.84%. Hence this is a balanced dataset.
It is noteworthy that this forms the baseline: even without a model, Data Co. already knows from
the existing data that 54.84% of orders carry a late delivery risk.
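The check itself is one line in R; the vector below merely mimics the reported 45.16% / 54.84% split rather than loading the actual file:

```r
# Class distribution of the target variable: the baseline check.
late_delivery_risk <- c(rep(0, 4516), rep(1, 5484))    # stand-in for the real column
round(prop.table(table(late_delivery_risk)) * 100, 2)  # 45.16 / 54.84
```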


Clustering using PCA/FA: -


Clustering is the task of dividing the population (or data points) into a number of groups such
that data points in the same group are more similar to each other than to those in other groups.
In simple words, the aim is to segregate groups with similar traits and assign them to clusters.
Let us understand this with an example: suppose you are the head of a rental store and wish to
understand the preferences of your customers to scale up your business. Is it possible to look
at the details of each customer and devise a unique business strategy for each one of them?
Definitely not. But what you can do is cluster all of your customers into, say, 10 groups based
on their purchasing habits and use a separate strategy for the customers in each of these 10
groups. This is what we call clustering.
For this data set, PCA-FA (strictly a dimensionality-reduction technique, which groups
correlated variables rather than observations) was used to address the multicollinearity
discussed in the previous section by reducing the dimensionality.

Fig 2. 5 - Scree Plot – Eigen Values of Components

In (Fig 2.5) above, the eigenvalues output by PCA-FA are plotted; this plot is called the scree plot.

The Elbow Bend Rule:-


The point where the scree graph levels off to the right, the elbow, is at 6. Hence 6 factors is
a good choice, and 6 components were used to perform the FA (Factor Analysis).
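The report's scree plot came from PCA-FA routines (typically the psych package); the eigenvalues behind such a plot can be sketched with base R's prcomp on stand-in data (15 columns to mirror the 15 numeric variables):

```r
# Eigenvalues for a scree plot: with scaled inputs they sum to the
# number of variables, and the elbow picks the component count.
set.seed(7)
m   <- matrix(rnorm(200 * 15), ncol = 15)     # stand-in for the numeric subset
pca <- prcomp(m, center = TRUE, scale. = TRUE)
eigenvalues <- pca$sdev^2
# plot(eigenvalues, type = "b")               # visual elbow inspection
cumsum(eigenvalues) / sum(eigenvalues)        # cumulative variance explained
```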

Factor analysis using the FA method yields the results below, unrotated, i.e. the factors are
orthogonal to each other, as given in (Table 2.4)

Table 2. 4 - Factor Analysis output (unrotated)


Interpretation:-
The first 6 factors explain 79% of the variance, i.e. we can reduce the dimension from 15 to 6
while losing 21% of the variance. Factor 1 accounts for 33% of the variance, Factor 2 for 17%,
Factor 3 for 12%, Factor 4 for 11%, and Factors 5 and 6 both account for 9%.

Further, the FA can be studied visually through the FA diagram represented below in (Fig 2.6);
the respective labels of the factors are presented in (Table 2.5)

Fig 2. 6 - FA Diagram – Rotation None


Labelling and interpretation of the Factors:-


- MR1 - Sales, Sales per customer, Order item total, Product price, Order item product price are
highly correlated independent variables. All these variables can be combined into one factor
called "Revenue".
- MR2 - Order profit per order, Benefit per order, Order item profit ratio can be combined as
"Profit".
- MR3 - Order item quantity, Order item discount are item-related and can be combined as
"Quantity".
- MR4 - Order item discount rate can be named "Discount".
- MR5 - Latitude and Longitude are geospatial variables and can be combined as "Location".
- MR6 - Days for shipment scheduled, Days for shipment real can be combined as "Schedule".

Table 2. 5 - Factors interpretation with labels

Factor | Variables                                                                                          | Label    | Short Interpretation
MR1    | Sales per customer, Order item total, Sales, Product price, Order item product price (5 variables) | Revenue  | Related to the sales generated, hence labelled Revenue
MR2    | Order item profit ratio, Benefit per order, Order profit per order (3 variables)                   | Profit   | Related to the profits generated, hence labelled Profit
MR3    | Order item quantity, Order item discount (2 variables)                                             | Quantity | Related to item quantity, hence labelled Quantity
MR4    | Order item discount rate (1 variable)                                                              | Discount | Related to the discounts provided, hence labelled Discount
MR5    | Latitude and Longitude (2 variables)                                                               | Location | Geospatial variables, hence labelled Location
MR6    | Days for shipment scheduled, Days for shipment real (2 variables)                                  | Schedule | Both variables are days of shipment, hence labelled Schedule

2.7 VARIABLE TRANSFORMATION


When a categorical variable has more than two categories, it can be represented by a set of
dummy variables, one for each category, for the algorithm to function. We identified the
character variables with more than 2 categories and transformed them into dummies using R's
model.matrix function.
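A minimal sketch of that transformation, with Type values mirroring the dataset's payment categories:

```r
# model.matrix expands a factor into one dummy column per category;
# "- 1" drops the intercept so every level gets its own column.
df      <- data.frame(Type = factor(c("DEBIT", "CASH", "PAYMENT", "TRANSFER")))
dummies <- model.matrix(~ Type - 1, data = df)
colnames(dummies)   # "TypeCASH" "TypeDEBIT" "TypePAYMENT" "TypeTRANSFER"
```

Note that when all such dummies enter a regression together, one of them is redundant (the dummy-variable trap), which is why glm later reports an NA coefficient for the last level.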

2.8 FEATURE SELECTION


The objective of this step is to assess how much variability the continuous features/variables
show with respect to the dependent feature/variable, i.e. late delivery risk. One way of
profiling a continuous feature is through deciling: features spanning fewer than 4-5 deciles can
be omitted, as they show little variability with respect to the dependent variable and hence are
unlikely to affect the model. For a categorical feature, the check is the distribution of the
dependent variable's classes across each level.
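The deciling check can be sketched in base R (the feature vector is simulated; the real factor scores from the previous step would be used in its place):

```r
# Cut a continuous feature at its decile boundaries and count how many
# distinct, populated bins survive; fewer than ~4-5 flags low variability.
set.seed(5)
feature   <- rnorm(1000)
breaks    <- quantile(feature, probs = seq(0, 1, 0.1))
deciles   <- cut(feature, breaks = unique(breaks), include.lowest = TRUE)
n_deciles <- length(levels(deciles))   # 10 here, so the feature is kept
table(deciles)
```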


The deciling method was applied to the features (factors) identified in the previous step to
assess their distributions. All factors showed a good distribution across 10 deciles, hence all
were considered as variables for model building. The outputs of the deciles are shown below.

MR1- Revenue: -

There are 10 deciles, hence the variable shows a significant distribution and is a good predictor.

MR2- Profit: -

There are 10 deciles, hence the variable shows a significant distribution and is a good predictor.

MR3- Quantity: -

There are 10 deciles, hence the variable shows a significant distribution and is a good predictor.

MR4- Discount: -

There are 10 deciles, hence the variable shows a significant distribution and is a good predictor.

MR5- Location: -

There are 10 deciles, hence the variable shows a significant distribution and is a good predictor.

MR6- Schedule: -

There are 10 deciles, hence the variable shows a significant distribution and is a good predictor.

Categorical Features: -
Referring to the bivariate analysis between the categorical independent features and the
dependent categorical feature, differences were observed. However, some of the categorical
features are correlated with one another, hence only the uncorrelated categorical variables were
selected for model building.
The selected features/variables, along with the dependent variable (late delivery risk), were
split into train and test data in a 70/30 ratio.
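A reproducible sketch of that split in base R; `dat` is an illustrative stand-in for the prepared feature set:

```r
# 70/30 train/test split by random row indices, seeded for reproducibility.
set.seed(123)
dat <- data.frame(x = rnorm(1000),
                  Late_delivery_risk = rbinom(1000, 1, 0.55))

idx       <- sample(seq_len(nrow(dat)), size = 0.7 * nrow(dat))
SCM_train <- dat[idx, ]
SCM_test  <- dat[-idx, ]
c(train = nrow(SCM_train), test = nrow(SCM_test))   # 700 / 300
```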

2.9 EDA – DATA PREPARATION SUMMARY


The above steps of EDA, data preparation, data balancing, clustering and feature selection are
summarised in (Fig 2.7) below.

Fig 2. 7- EDA- Data Preparation, Cleaning, Imputation- Summary


Please refer to Appendix A for the source code.

SECTION 3: ALL MODEL DEVELOPMENT INCLUDING TESTING OF ASSUMPTIONS AND PERFORMANCE EVALUATION METRICS

The objective of the model development is to build appropriate prediction models on the train
data and then apply each trained model to the test data to test its robustness, i.e. whether the
correctness of its predictions holds up.

The predictive models built for this case study use the Logistic Regression, Naive Bayes and KNN
predictive modelling techniques. Ensemble methods such as Bagging and Boosting were also used to
create models. After model development and interpretation of the model outputs, necessary
modifications such as parameter tuning were made to find the optimal model outputs.

The outputs/results of all the models were evaluated using model performance validation
techniques such as the Confusion Matrix, ROC, AUC and GINI index (wherever applicable), and the
scores were compared to arrive at the best-performing model for predicting Late Delivery Risk.
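These metrics are typically obtained from packages such as caret and pROC; the two core ones can also be sketched from scratch in base R on illustrative predictions (the AUC here uses the rank/Wilcoxon formulation):

```r
# Confusion matrix at a 0.5 cutoff, plus a rank-based AUC.
actual <- c(0, 0, 1, 1, 1, 0, 1, 0)
prob   <- c(0.2, 0.4, 0.8, 0.6, 0.9, 0.3, 0.7, 0.55)

pred <- ifelse(prob > 0.5, 1, 0)
conf <- table(Predicted = pred, Actual = actual)

auc <- function(p, y) {
  r  <- rank(p)
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

conf
auc(prob, actual)   # 1 here: every positive outranks every negative
```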

Overfitting, its impact & the purpose of the sample split:

In statistical machine learning there is the problem of overfitting, i.e. a condition where a
statistical model begins to describe the random error in the data rather than the relationships
between the variables. This problem occurs when the model is too complex. Overfitting can be
detected, and thereby guarded against, by splitting the data into training and test sets.

To explain overfitting a bit further, consider an example: you want to teach your dog a few
tricks - sit, stay, roll over. You can achieve this by giving the command and showing your dog
what it needs to do when you say that command, i.e. training data. If you provide your dog with
enough clear instructions on what it is supposed to learn, it might reach a point where it obeys
your command almost every time, i.e. high training accuracy. You might then brag at a dog show
that your dog can perform a lot of tricks. However, will your dog do the correct thing in the
show when given the command, i.e. testing data? If your dog rolls over when the instruction in
the show is to sit, it may mean that your dog is only good at performing a trick when you, i.e.
the training data, give the command - low testing accuracy. This is an example of overfitting.

The reasons why your dog only responds correctly when you give the command can vary, but it
comes down to your training data.

If the training accuracy is high but the testing accuracy is low, the model cannot be advertised
as a good model. Testing data allows you to test your model on data that is independent of your
training data. If the model is actually a good model, i.e. it performs the correct command in
this case, it should perform just as well on the testing data as on the training data.


This section covers the model building and evaluation performed with the train and test data
produced in Section 2, in the following sequence:-

3.1 Applying Logistic Regression, Model Tuning, Model Evaluation & Interpret results
3.2 Applying Naive Bayes, Model Tuning, Model Evaluation & Interpret results
3.3 Applying KNN – K Nearest Neighbour Model, Model Tuning, Model Evaluation & Interpret results
3.4 Applying CART, Model Tuning, Model Evaluation & Interpret results
3.5 Applying Random Forest, Model Tuning, Model Evaluation & Interpret results
3.6 Applying Bagging method, Model Tuning, Model Evaluation & Interpret results
3.7 Applying Boosting method Model Tuning, Model Evaluation & Interpret results
3.8 Model Validation to find which above model performed the best

3.1 APPLYING LOGISTIC REGRESSION, MODEL TUNING, MODEL EVALUATION & INTERPRET RESULTS

What is Logistic Regression & Purpose:

Logistic Regression is a statistical model that in its basic form uses a logistic function (in
statistics the logistic model, or logit model, is used to model the probability of a certain
class or event, such as pass/fail or win/lose, via a binary dependent variable). In regression
analysis, logistic regression estimates the parameters of a logistic model (a form of binary
regression). Mathematically, a binary logistic model has a dependent variable with two possible
values, e.g. pass/fail, labelled "0" and "1".

In the logistic model, the log-odds for the value labelled 1 is a linear combination of one or
more independent variables or predictors. The independent variables can be binary or continuous.
The corresponding probability of the value labelled "1" can vary between 0 (certainly the value
"0") and 1 (certainly the value "1"), hence the labelling; the function that converts log-odds
to probability is the logistic function, hence the name.

The unit of measurement for the log-odds scale is called a logit, from logistic unit, hence the
alternative names.

The Algorithm: -

Logistic Regression belongs to a larger class of algorithms called Generalized Linear Models
(glm). It is a classification algorithm used to predict binary outputs. One reason to use
logistic regression is that it yields probabilities of occurrence, i.e. 0 < p < 1; the
probability does not vary linearly with the predictors.
Logistic Regression with given data set:

In the data preparation step of the previous section (Section 2), we split the data into train
and test samples, and the target variable's class proportions were identified to be balanced.

Logistic Regression was applied to the Training data to build the model and the model that was
prepared with Train data was applied to Test data to derive the predictions.

There are multiple approaches in constructing regression models which are: -

a. Forward selection, which involves starting with no variables in the model, testing the
addition of each variable using a chosen model fit criterion, adding the variable (if any) whose
inclusion gives the most statistically significant improvement of the fit, and repeating this
process until none improves the model to a statistically significant extent.

b. Backward elimination, which involves starting with all candidate variables, testing the
deletion of each variable using a chosen model fit criterion, deleting the variable (if any)
whose removal causes the least statistically significant deterioration of the model fit, and
repeating this process until no further variables can be deleted without a statistically
significant loss of fit.

c. Bidirectional elimination, a combination of the above, testing at each step for variables to be
included or excluded.

For this dataset, approach c, the bidirectional approach, was followed to construct the logistic
regression model.
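A hedged sketch of this bidirectional (stepwise) search with base R's step() function, on a toy dataset standing in for SCM_train; AIC is step()'s default fit criterion and the variable names are illustrative:

```r
# Fit the full model, then let step() add/drop terms in both directions.
set.seed(11)
d <- data.frame(x1 = rnorm(200), x2 = rnorm(200), noise = rnorm(200))
d$y <- rbinom(200, 1, plogis(1.5 * d$x1 - d$x2))

full  <- glm(y ~ x1 + x2 + noise, data = d, family = "binomial")
final <- step(full, direction = "both", trace = 0)   # bidirectional search
names(coef(final))   # the pure-noise predictor is typically dropped
```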

Assumptions of Logistic Regression: -

There are certain key assumptions that Logistic regression as a model carries, which were to be
considered for the model building i.e. Logistic regression does not make many of the key assumptions
of linear regression and general linear models that are based on ordinary least squares algorithms –
particularly regarding linearity, normality, homoscedasticity, and measurement level as defined below.

1) Logistic regression does not require a linear relationship between the dependent and
independent variables. 
2) The error terms (residuals) do not need to be normally distributed. 
3) Homoscedasticity is not required. 
4) The dependent variable in logistic regression is not measured on an interval or ratio scale.

However, some assumptions still apply, which are:

1. Binary logistic regression requires the dependent variable to be binary


2. Logistic regression requires the observations to be independent of each other. In other
   words, the observations should not come from repeated measurements or matched data.
3. Logistic regression requires there to be little or no multicollinearity among the
   independent variables. This means that the independent variables should not be too highly
   correlated with each other.
4. Logistic regression assumes linearity of independent variables and log odds. Although this
   analysis does not require the dependent and independent variables to be related linearly, it
   requires that the independent variables are linearly related to the log odds.
5. Logistic regression typically requires a large sample size.

We built various models with the Train data set using the Bidirectional approach, which is detailed
below:

# LR Model 2 – Check the influence of the predictor Type on the dependent variable: -


The model was built using the character predictor Type, which was converted to dummies; a
logistic regression was constructed and the output of this model is shown below.

> LRmodel2 <- glm(Late_delivery_risk ~ TypeCASH +TypeDEBIT + TypePAYMENT +


TypeTRANSFER, data = SCM_train , family= "binomial")
> summary(LRmodel2)

Call:
glm(formula = Late_delivery_risk ~ TypeCASH + TypeDEBIT + TypePAYMENT +
TypeTRANSFER, family = "binomial", data = SCM_train)

Deviance Residuals:
Min 1Q Median 3Q Max
-1.310 -1.302 1.051 1.058 1.203

Coefficients: (1 not defined because of singularities)


Estimate Std. Error z value Pr(>|z|)


(Intercept) -0.06044 0.01068 -5.661 1.51e-08 ***


TypeCASH1 0.34262 0.02031 16.870 < 2e-16 ***
TypeDEBIT1 0.34874 0.01408 24.761 < 2e-16 ***
TypePAYMENT1 0.36589 0.01595 22.941 < 2e-16 ***
TypeTRANSFER1 NA NA NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 173996 on 126362 degrees of freedom


Residual deviance: 173207 on 126359 degrees of freedom
AIC: 173215

Number of Fisher Scoring iterations: 4

Inference: -
Only 3 significant variables were identified that can be considered for the final model; the
variable TypeTRANSFER1 is perfectly collinear with the other dummies (hence the NA coefficient
from the singularity) and can be ignored.

# LR Model 3 - Check the influence of the predictor Market on the dependent variable: -


The model was built using the character predictor Market, which was converted to dummies; a
logistic regression was constructed and the output of this model is shown below.
> LRmodel3 <- glm(Late_delivery_risk ~ MarketAfrica + MarketEurope
+MarketLATAM+MarketPacific.Asia +MarketUSCA, data = SCM_train , family=
"binomial")
> summary(LRmodel3)

Call:
glm(formula = Late_delivery_risk ~ MarketAfrica + MarketEurope +
MarketLATAM + MarketPacific.Asia + MarketUSCA, family = "binomial",
data = SCM_train)

Deviance Residuals:
Min 1Q Median 3Q Max
-1.270 -1.257 1.088 1.100 1.102

Coefficients: (1 not defined because of singularities)


Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.178796 0.014971 11.943 <2e-16 ***
MarketAfrica1 0.002540 0.026775 0.095 0.9244
MarketEurope1 0.035271 0.018424 1.914 0.0556 .
MarketLATAM1 0.006741 0.018323 0.368 0.7129
MarketPacific.Asia1 0.013547 0.019067 0.710 0.4774
MarketUSCA1 NA NA NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 173996 on 126362 degrees of freedom


Residual deviance: 173990 on 126358 degrees of freedom
AIC: 174000

Number of Fisher Scoring iterations: 3

All the Market dummies are insignificant, so this variable can be removed completely.

# LR Model 4 - Check the influence of the predictor Shipping Mode on the dependent variable: -
The model was built using the character predictor Shipping Mode, which was converted to dummies;
a logistic regression was constructed and the output of this model is shown below.
LRmodel4 <- glm(Late_delivery_risk ~ Shipping.ModeFirst.Class +
Shipping.ModeSame.Day +Shipping.ModeSecond.Class
+Shipping.ModeStandard.Class, data = SCM_train , family= "binomial")
> summary(LRmodel4)


Call:
glm(formula = Late_delivery_risk ~ Shipping.ModeFirst.Class +
Shipping.ModeSame.Day + Shipping.ModeSecond.Class +
Shipping.ModeStandard.Class,
family = "binomial", data = SCM_train)

Deviance Residuals:
Min 1Q Median 3Q Max
-2.4734 -0.9796 0.3101 1.2429 1.3890

Coefficients: (1 not defined because of singularities)


Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.484872 0.007493 -64.71 <2e-16 ***
Shipping.ModeFirst.Class1 3.495643 0.034698 100.75 <2e-16 ***
Shipping.ModeSame.Day1 0.332122 0.025371 13.09 <2e-16 ***
Shipping.ModeSecond.Class1 1.670486 0.016841 99.19 <2e-16 ***
Shipping.ModeStandard.Class1 NA NA NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 173996 on 126362 degrees of freedom


Residual deviance: 143899 on 126359 degrees of freedom
AIC: 143907

Number of Fisher Scoring iterations: 5

Inference: -
Only 3 significant variables were identified that can be considered for the final model; the
variable Shipping.ModeStandard.Class1 is perfectly collinear with the other dummies and can be
ignored.

# LR Model 5 - Check the influence of the predictor Order Status on the dependent variable: -
The model was built using the character predictor Order Status, which was converted to dummies;
a logistic regression was constructed and the output of this model is shown below.
> summary(LRmodel5)

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 5.618e+09 7.215e+09 0.779 0.436
Order.StatusCANCELED1 -5.618e+09 7.215e+09 -0.779 0.436
Order.StatusCLOSED1 -5.618e+09 7.215e+09 -0.779 0.436
Order.StatusCOMPLETE1 -5.618e+09 7.215e+09 -0.779 0.436
Order.StatusON_HOLD1 -5.618e+09 7.215e+09 -0.779 0.436
Order.StatusPAYMENT_REVIEW1 -5.618e+09 7.215e+09 -0.779 0.436
Order.StatusPENDING1 -5.618e+09 7.215e+09 -0.779 0.436
Order.StatusPENDING_PAYMENT1 -5.618e+09 7.215e+09 -0.779 0.436
Order.StatusPROCESSING1 -5.618e+09 7.215e+09 -0.779 0.436
Order.StatusSUSPECTED_FRAUD1 -5.618e+09 7.215e+09 -0.779 0.436

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 173996 on 126362 degrees of freedom


Residual deviance: 369502 on 126353 degrees of freedom
AIC: 369522

Number of Fisher Scoring iterations: 25

All the Order Status dummies are insignificant (note the huge coefficients and standard errors
and the 25 Fisher scoring iterations, symptoms of separation/non-convergence), so this variable
can be removed completely.

# LR Model 1 - with the selected predictors: -
This model was constructed with the significant variables/predictors identified in the previous
steps.


> LRmodel <- glm(Late_delivery_risk ~ Revenue + Profit + Quantity +


Discount + Location + Schedule + TypeCASH +TypeDEBIT +
+ TypePAYMENT + Shipping.ModeFirst.Class +
Shipping.ModeSame.Day +Shipping.ModeSecond.Class,
+ data = SCM_train , family= "binomial")
> summary(LRmodel)

Call:
glm(formula = Late_delivery_risk ~ Revenue + Profit + Quantity +
Discount + Location + Schedule + TypeCASH + TypeDEBIT + TypePAYMENT +
Shipping.ModeFirst.Class + Shipping.ModeSame.Day +
Shipping.ModeSecond.Class,
family = "binomial", data = SCM_train)

Deviance Residuals:
Min 1Q Median 3Q Max
-5.3030 -0.0467 0.0049 0.1949 1.2090

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -12.328521 0.092342 -133.509 < 2e-16 ***
Revenue 0.036371 0.013782 2.639 0.00831 **
Profit -0.105957 0.013750 -7.706 1.3e-14 ***
Quantity -0.007623 0.014386 -0.530 0.59618
Discount 0.118884 0.014440 8.233 < 2e-16 ***
Location 5.808677 0.044515 130.488 < 2e-16 ***
Schedule 14.794056 0.106225 139.271 < 2e-16 ***
TypeCASH1 1.952686 0.053587 36.440 < 2e-16 ***
TypeDEBIT1 1.962636 0.036125 54.329 < 2e-16 ***
TypePAYMENT1 1.944444 0.041642 46.694 < 2e-16 ***
Shipping.ModeFirst.Class1 31.867099 0.213845 149.020 < 2e-16 ***
Shipping.ModeSame.Day1 40.673871 0.295945 137.437 < 2e-16 ***
Shipping.ModeSecond.Class1 19.857350 0.146058 135.955 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 173996 on 126362 degrees of freedom


Residual deviance: 37464 on 126350 degrees of freedom
AIC: 37490

Number of Fisher Scoring iterations: 8


Most of the variables were significant. A variable importance test was run to rank their importance.
> varImp(object = LRmodel)
Overall
Revenue 2.6391096
Profit 7.7060740
Quantity 0.5299008
Discount 8.2329316
Location 130.4875722
Schedule 139.2711842
TypeCASH1 36.4398849
TypeDEBIT1 54.3294070
TypePAYMENT1 46.6937709
Shipping.ModeFirst.Class1 149.0196347
Shipping.ModeSame.Day1 137.4371399
Shipping.ModeSecond.Class1 135.9553236
Quantity was found to be insignificant, with an importance of only 0.53, and was therefore removed from the subsequent model.
# LR DRAFT Model: -
This DRAFT model was constructed with the significant and important variables/predictors identified
in the previous steps; a VIF test was then run to check for multicollinearity among the independent
variables.
Summary output of the DRAFT model:


> summary(LRmodel_Draft)

Call:
glm(formula = Late_delivery_risk ~ Revenue + Profit + Discount +
Location + Schedule + TypeCASH + TypeDEBIT + TypePAYMENT +
Shipping.ModeFirst.Class + Shipping.ModeSame.Day +
Shipping.ModeSecond.Class,
family = "binomial", data = SCM_train)

Deviance Residuals:
Min 1Q Median 3Q Max
-5.3041 -0.0467 0.0049 0.1950 1.2027

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -12.32808 0.09233 -133.515 < 2e-16 ***
Revenue 0.03648 0.01378 2.648 0.0081 **
Profit -0.10596 0.01375 -7.705 1.31e-14 ***
Discount 0.11932 0.01442 8.274 < 2e-16 ***
Location 5.80852 0.04451 130.491 < 2e-16 ***
Schedule 14.79351 0.10622 139.278 < 2e-16 ***
TypeCASH1 1.95264 0.05359 36.439 < 2e-16 ***
TypeDEBIT1 1.96259 0.03612 54.328 < 2e-16 ***
TypePAYMENT1 1.94437 0.04164 46.693 < 2e-16 ***
Shipping.ModeFirst.Class1 31.86607 0.21383 149.027 < 2e-16 ***
Shipping.ModeSame.Day1 40.67226 0.29592 137.445 < 2e-16 ***
Shipping.ModeSecond.Class1 19.85662 0.14605 135.962 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 173996 on 126362 degrees of freedom


Residual deviance: 37464 on 126351 degrees of freedom
AIC: 37488

Number of Fisher Scoring iterations: 8


VIF Test
A Variance Inflation Factor (VIF) test for multicollinearity among the draft model's independent
variables showed multicollinearity between Schedule and Shipping Mode. Schedule was therefore
removed and another LR model was built, which became the final model.
# LR FINAL1 Model: -
This is the final model, constructed with the significant and important predictors (not correlated with
one another) identified in the previous steps; a VIF test was again run to check for multicollinearity.
Summary output of the FINAL1 model:
> summary(LRmodel_FINAL1)

Call:
glm(formula = Late_delivery_risk ~ Revenue + Profit + Discount +
Location + TypeCASH + TypeDEBIT + TypePAYMENT +
Shipping.ModeFirst.Class +
Shipping.ModeSame.Day + Shipping.ModeSecond.Class, family = "binomial",
data = SCM_train)

Deviance Residuals:


Min 1Q Median 3Q Max
-2.5175 -0.9596 0.2904 1.0289 1.8217

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.907046 0.013453 -67.423 <2e-16 ***
Revenue -0.009930 0.006464 -1.536 0.1245
Profit -0.010064 0.006466 -1.556 0.1196
Discount 0.013869 0.006735 2.059 0.0395 *
Location 0.389395 0.008287 46.991 <2e-16 ***
TypeCASH1 0.450136 0.023153 19.442 <2e-16 ***
TypeDEBIT1 0.455922 0.016196 28.151 <2e-16 ***
TypePAYMENT1 0.472174 0.018228 25.904 <2e-16 ***
Shipping.ModeFirst.Class1 3.802061 0.035566 106.902 <2e-16 ***
Shipping.ModeSame.Day1 0.695076 0.026930 25.810 <2e-16 ***
Shipping.ModeSecond.Class1 1.819376 0.017467 104.160 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 173996 on 126362 degrees of freedom


Residual deviance: 140630 on 126352 degrees of freedom
AIC: 140652

Number of Fisher Scoring iterations: 5


Analysis of Coefficients and their Signs
Coefficients of the variables can be studied as below:-
 In the output above, the first thing we see is the call: R reminding us what model we ran and
what options we specified.
 Next we see the deviance residuals, which are a measure of model fit. This part of the output
shows the distribution of the deviance residuals for the individual cases used in the model.
 The next part of the output shows the coefficients, their standard errors, the z-statistic
(sometimes called a Wald z-statistic), and the associated p-values.
Location, TypeCASH, TypeDEBIT, TypePAYMENT and the Shipping Mode terms (First Class,
Same Day, Second Class) are statistically significant. Revenue and Profit are not significant at
the 5% level, and Discount is only marginally significant.
The logistic regression coefficients give the change in the log odds of the outcome for a one-
unit increase in the predictor variable. For example:
o For every one-unit increase in Profit, the log odds of late delivery decrease (since the
sign is negative) by 0.010064 (coefficient -0.010064): higher-profit orders are slightly
less likely to be delivered late.
o For a one-unit increase in Location, the log odds of late delivery increase (since the sign
is positive) by 0.389395. Location proximity is key to avoiding late delivery.
o For a one-unit increase in Same Day shipping, the log odds of late delivery increase
(positive sign) by 0.695076. Same-day orders are more likely to be late, hence limit
orders for same-day delivery.
o For a one-unit increase in First Class shipping, the log odds of late delivery increase
(positive sign) by 3.802061. First Class orders are the most likely to be late, hence limit
First Class orders.

Similarly, the other coefficients can be interpreted. Below the table of coefficients, the summary
reports fit indices, including the null and residual deviance and the AIC.
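Exponentiating a coefficient turns the log-odds change into an odds ratio, which is often easier to communicate to business users. A small illustrative sketch (in Python rather than the report's R), using the coefficient estimates from the FINAL1 summary above:

```python
import math

# Selected coefficient estimates from the FINAL1 logistic regression summary.
coefs = {
    "Profit": -0.010064,
    "Location": 0.389395,
    "Shipping.ModeSameDay": 0.695076,
    "Shipping.ModeFirstClass": 3.802061,
}

# exp(beta) is the multiplicative change in the odds of late delivery
# for a one-unit increase in the predictor.
odds_ratios = {name: math.exp(beta) for name, beta in coefs.items()}

for name, oratio in odds_ratios.items():
    print(f"{name}: odds ratio = {oratio:.3f}")
```

For instance, First Class shipping multiplies the odds of late delivery by roughly e^3.80 ≈ 45, consistent with its dominant variable importance.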
VIF Test
A Variance Inflation Factor (VIF) test for multicollinearity was run on the FINAL1 model.


All VIF values are less than 2, hence correlation among the independent variables is low.
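The VIF itself is defined as VIF_j = 1/(1 - R²_j), where R²_j comes from regressing predictor j on all the other predictors. A minimal Python/numpy sketch of that computation (the design matrix X below is synthetic, not the report's data):

```python
import numpy as np

def vif(X):
    """Variance Inflation Factor for each column of design matrix X."""
    n, p = X.shape
    out = []
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        # Auxiliary regression of column j on the remaining columns (with intercept).
        A = np.column_stack([np.ones(n), others])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1.0 - resid.var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return out

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)              # independent of x1 -> VIF near 1
x3 = x1 + 0.1 * rng.normal(size=200)   # nearly collinear with x1 -> large VIF
X = np.column_stack([x1, x2, x3])
print([round(v, 2) for v in vif(X)])
```

A VIF near 1 means a predictor is essentially uncorrelated with the others; values above the chosen cutoff (2 in this report; 5 or 10 in other conventions) flag multicollinearity.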
Profiling: -
We can use the confint function to obtain confidence intervals for the coefficient estimates. Note that
for logistic models, confidence intervals are based on the profiled log-likelihood function.
2.5 % 97.5 %
(Intercept) -0.9334482715 -0.880712656
Revenue -0.0225995839 0.002737279
Profit -0.0227368997 0.002609933
Discount 0.0006689307 0.027071384
Location 0.3731634302 0.405646370
TypeCASH1 0.4047626039 0.495523418
TypeDEBIT1 0.4241919634 0.487678459
TypePAYMENT1 0.4364595470 0.507910977
Shipping.ModeFirst.Class1 3.7329512689 3.872396435
Shipping.ModeSame.Day1 0.6422843537 0.747852560
Shipping.ModeSecond.Class1 1.7852086671 1.853679721

Model Prediction (applying the trained model to the test data)

With the logistic regression model built on the training data, predictions were made first on the
training data and then on the test data: for any randomly picked record, the model returns its
classification with respect to Late Delivery Risk and the probability score associated with that
prediction. The threshold applied to the prediction scores can be adjusted (as a tuning parameter)
to improve prediction accuracy; results are shown in Tables 3.1 and 3.2.
> SCM_Pred_Train = predict(LRmodel_FINAL1, newdata= SCM_train ,
type="response")
> summary(SCM_Pred_Train)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.1668 0.3597 0.4781 0.5483 0.7702 0.9909
> # Confusion Matrix -Train
> CM_SCM_Train = table(SCM_train$Late_delivery_risk, SCM_Pred_Train>0.5)
> CM_SCM_Train

FALSE TRUE
0 48808 8271
1 23523 45761

> # Prediction of Test data using the Train Model


> SCM_Pred_Test = predict(LRmodel_FINAL1, newdata= SCM_test ,
type="response")
> summary(SCM_Pred_Test)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.1705 0.3608 0.4787 0.5500 0.7710 0.9900
> CM_SCM = table(SCM_test$Late_delivery_risk, SCM_Pred_Test>0.5)
> CM_SCM

FALSE TRUE
0 20884 3579
1 10024 19669

Table 3. 1– Logistic Regression- Confusion Matrix-Train Data

Logistic Regression- Confusion Matrix- Train Data


Prediction>0.5          LR-Predict
Actual                  0- No Late Delivery   1- Late Delivery
0- No Late Delivery     48808                 8271
1- Late Delivery        23523                 45761
Accuracy- 74.84%, Sensitivity (or) Recall- 66.05%, Specificity- 85.51%, Precision- 84.69%

Table 3. 2 – Logistic Regression- Confusion Matrix-Test Data


Logistic Regression- Confusion Matrix- Test Data
Prediction>0.5          LR-Predict
Actual                  0- No Late Delivery   1- Late Delivery
0- No Late Delivery     20884                 3579
1- Late Delivery        10024                 19669
Accuracy- 74.88%, Sensitivity (or) Recall- 66.24%, Specificity- 85.37%, Precision- 84.61%
The prediction variation of the model between train and test data is less than 5%, so the model can
be considered stable. There is a possibility of fine-tuning the model further by keeping only the
significant variables.
Model Tuning
The model was fine-tuned by keeping only the significant predictors and removing the non-significant
predictors identified in the final model, i.e. Revenue, Profit and Discount were removed. The outcome
of the tuned model is below.
> summary(LRmodel_FINAL2)

Call:
glm(formula = Late_delivery_risk ~ Location + TypeCASH + TypeDEBIT +
TypePAYMENT + Shipping.ModeFirst.Class + Shipping.ModeSame.Day +
Shipping.ModeSecond.Class, family = "binomial", data = SCM_train)

Deviance Residuals:
Min 1Q Median 3Q Max
-2.5110 -0.9612 0.2909 1.0312 1.8130

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.906838 0.013452 -67.41 <2e-16 ***
Location 0.389347 0.008286 46.99 <2e-16 ***
TypeCASH1 0.449965 0.023153 19.43 <2e-16 ***
TypeDEBIT1 0.455915 0.016195 28.15 <2e-16 ***
TypePAYMENT1 0.472212 0.018227 25.91 <2e-16 ***
Shipping.ModeFirst.Class1 3.801198 0.035563 106.89 <2e-16 ***
Shipping.ModeSame.Day1 0.695046 0.026928 25.81 <2e-16 ***
Shipping.ModeSecond.Class1 1.819270 0.017466 104.16 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 173996 on 126362 degrees of freedom


Residual deviance: 140639 on 126355 degrees of freedom
AIC: 140655

Number of Fisher Scoring iterations: 5


VIF Test:-


Confusion Matrix - Train Data:
> CM_SCM_Train_FT

FALSE TRUE
0 49348 7731
1 22880 46404

Table 3. 3 – Logistic Regression Tuned- Confusion Matrix-Train Data


Logistic Regression- Confusion Matrix- Train Data- Only Significant Variables
Prediction>0.5          LR-Predict
Actual                  0- No Late Delivery   1- Late Delivery
0- No Late Delivery     49348                 7731
1- Late Delivery        22880                 46404
Accuracy- 75.43%, Sensitivity (or) Recall- 85.23%, Specificity- 68.32%, Precision- 85.64%

CM_SCM_FT

FALSE TRUE
0 21117 3346
1 9740 19953

Table 3. 4 – Logistic Regression Tuned- Confusion Matrix-Test Data

Logistic Regression- Confusion Matrix- Test Data-Only Significant


Prediction>0.5          LR-Predict
Actual                  0- No Late Delivery   1- Late Delivery
0- No Late Delivery     21117                 3346
1- Late Delivery        9740                  19953
Accuracy- 75.84%, Sensitivity (or) Recall- 67.20%, Specificity- 86.32%, Precision- 85.64%, F Measure- 75.57%

There is about a 1% improvement in the results, but no major differences were observed. The
remaining option for improving the model is to adjust the prediction threshold, but this leads to a
trade-off between Recall and Precision. Hence, no further fine-tuning was done.
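That trade-off can be made concrete with a small sketch: sweeping the threshold over toy scores (illustrative values in Python rather than the report's R, not the actual predictions) shows recall falling and precision rising as the threshold increases:

```python
def recall_precision(labels, scores, threshold):
    """Recall and precision when scores above threshold are classified as 1."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s > threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s <= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s > threshold)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision

# Toy labels and predicted probabilities (hypothetical values).
labels = [0, 0, 0, 1, 0, 1, 1, 0, 1, 1]
scores = [0.12, 0.25, 0.38, 0.42, 0.47, 0.55, 0.61, 0.68, 0.74, 0.90]

for t in (0.3, 0.5, 0.7):
    r, p = recall_precision(labels, scores, t)
    print(f"threshold={t}: recall={r:.2f}, precision={p:.2f}")
```

A lower threshold catches more late deliveries (higher recall) at the cost of more false alarms (lower precision), and vice versa; the right operating point is a business decision.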
Model Evaluation: -
The performance of the logistic regression model was evaluated through the methods below:
a. Confusion Matrix: for the class output of the model, predicted vs. actual classifications
were tabulated to compute Accuracy (the proportion of classifications that were correct)
and Sensitivity (the proportion of actual positives that were correctly identified).


b. ROC/AUC Curves: from the probability outputs of the prediction, a ROC (Receiver
Operating Characteristic) curve was drawn.
Confusion matrix and Interpretation
When the train model was applied in test data with threshold of 0.5, below were the results.

Table 3. 5 – Logistic Regression Tuned- Confusion Matrix-Test Data

Logistic Regression- Confusion Matrix- Test Data-Only Significant Variables


Prediction>0.5          LR-Predict
Actual                  0- No Late Delivery   1- Late Delivery
0- No Late Delivery     21117                 3346
1- Late Delivery        9740                  19953
Accuracy- 75.84%, Sensitivity (or) Recall- 67.20%, Specificity- 86.32%, Precision- 85.64%, F Measure- 75.57%

The final performance from the results of Logistic regression model presented below: -

Table 3. 6 – Logistic Regression Tuned- Final Results-Test Data


LR-SCM- Test Data Evaluation Parameters
Model                              Accuracy   Sensitivity/Recall   Specificity   Precision   F Measure
Logistic Regression (Fine-Tuned)   75.84%     67.20%               86.32%        85.64%      75.57%

Interpretation: -
The Logistic regression model has given has given accuracy of 75.84% with Recall of 85.64% and
Precision of 68.44%, F-Measure Harmonic Mean- 76.08%

----------------------------------------------------------------------------------------------------------------------------------------
Definition of evaluation parameters: -
Before we jump into the interpretation of the results, it is important to understand what each
measure means, as explained below.

Accuracy = Out of all cases how much did we correctly predict = (TP+TN)/(TP+TN+FP+FN)
Sensitivity/Recall = Out of all positive cases how many you are able to predict correctly i.e. how
good the test is detecting positive cases= TP / (TP +FN)
Specificity = Out of all negative cases predicted, how many are predicted correctly i.e. how good
the test is avoiding false alarms = TN / (TN+FP)
Precision = How many of the positively classified were relevant = TP/(TP+FP)
F Measure = Measure of Precision and Recall at same time = Harmonic Mean =
2*Recall*Precision / (Recall+Precision)
----------------------------------------------------------------------------------------------------------------------------------------
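As a check, these formulas reproduce the metrics reported for the logistic regression test confusion matrix in Table 3.2 (TN = 20884, FP = 3579, FN = 10024, TP = 19669). A small Python sketch:

```python
def classification_metrics(tp, tn, fp, fn):
    """Standard confusion-matrix metrics, with class 1 (late delivery) as positive."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn)          # sensitivity
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f_measure = 2 * recall * precision / (recall + precision)
    return accuracy, recall, specificity, precision, f_measure

# Counts from the logistic regression test confusion matrix (Table 3.2).
acc, rec, spec, prec, f1 = classification_metrics(tp=19669, tn=20884, fp=3579, fn=10024)
print(f"Accuracy={acc:.2%} Recall={rec:.2%} Specificity={spec:.2%} Precision={prec:.2%}")
```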


Sensitivity/Recall shows that the model is able to spot late deliveries in 67% of cases, while
Specificity (correct prediction of non-late deliveries) is 86%. Though the sensitivity is moderate, the
precision with which positive cases are identified is 85%.

Since the objective is to reduce late deliveries, Sensitivity/Recall of predicting True Positives (TP =
outcomes where the model correctly predicts the positive class) is of prime importance. This model
predicts only 67% of them, while the false-alarm rate is acceptable, so the result is satisfactory. The
model can be fine-tuned to improve sensitivity by reducing the threshold, but this will impact
Accuracy and Precision; hence the advice to the business is to evaluate the business situation and
adjust the threshold to improve Sensitivity (or) Specificity.

ROC/AUC/KS Charts
Logistic Regression- ROC/AUC Charts - Test Data
For classification problems with probability outputs, a threshold converts the probability outputs into
classifications. The choice of threshold changes the confusion matrix, and a plot of the false positive
rate vs. the true positive rate as the threshold changes is called the ROC curve (Receiver Operating
Characteristic). AUC (Area Under the Curve), also written AUROC (Area Under the Receiver
Operating Characteristic), is one of the most important evaluation metrics for checking any
classification model's performance. ROC is a probability curve and AUC represents the degree or
measure of separability: the higher the AUC, the better the model is at predicting 0s as 0s and 1s as 1s.
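AUC can equivalently be read as the probability that a randomly chosen positive case scores higher than a randomly chosen negative case, which gives a simple pair-counting computation. An illustrative Python sketch (toy labels and scores, not the report's predictions):

```python
from itertools import product

def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly;
    ties count as half. Equivalent to the area under the ROC curve."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1]
scores = [0.10, 0.40, 0.35, 0.80]
print(auc(labels, scores))  # 3 of the 4 (positive, negative) pairs are ordered correctly
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, so the 0.775 reported below sits comfortably above chance.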
The results from the logistic regression model were reviewed against the ROC-AUC parameters,
and the model evaluation is presented in Fig 3.1.

Fig 3. 1 - Logistic Regression- ROC-AUC Charts

AUC:-
> as.numeric(performance(ROCRpred_LR, "auc")@y.values)
[1] 0.7751119
Interpretation: -
When the trained model was applied to the test data, the ROC curve bowed towards the upper left
(high true positive rate), indicating that a good proportion of the data is expected to be predicted
correctly. The threshold ranges from 0 to 1; we retain the 0.5 threshold as it yields a TPR close to
67%. The threshold could be lowered to 0.4 to improve sensitivity, but this would reduce the
accuracy and specificity of the model; hence the advice to


business is to evaluate the business situation and adjust the threshold to improve Sensitivity (or)
Specificity. The AUC is 77.5%. The model results are satisfactory.

3.2 APPLYING NAÏVE BAYES, MODEL TUNING, MODEL EVALUATION & INTERPRET RESULTS
What is Naïve Bayes & Purpose:
Naïve Bayes is a simple technique for constructing classifiers: models that assign class label to
problem instances, represented as vectors of feature values, where the class labels are drawn from
some finite set.
This is not a single algorithm for training such classifiers, but a family of algorithms based on a
common principle: all naive Bayes classifiers assume that the value of a particular feature
is independent of the value of any other feature, given the class variable.
For example: A fruit may be considered to be an apple if it is red, round, and about 10 cm in diameter.
A naive Bayes classifier considers each of these features to contribute independently to the probability
that this fruit is an apple, regardless of any possible correlations between the colour, roundness, and
diameter features.
For some types of probability models, naive Bayes classifiers can be trained very efficiently in
a supervised learning setting. In many practical applications, parameter estimation for naive Bayes
models uses the method of maximum likelihood; in other words, one can work with the naive Bayes
model without accepting Bayesian probability or using any Bayesian methods.
Despite their naive design and apparently oversimplified assumptions, naive Bayes classifiers
have worked quite well in many complex real-world situations. An advantage of naive Bayes is that it
only requires a small number of training data to estimate the parameters necessary for classification
The Algorithm: -

The exact Bayes procedure underlying Naïve Bayes works on the following logic:

1. For a given new record to be classified, find other records like it (i.e., with the same values for
the predictors)
2. Determine the prevalent class among those records
3. Assign that class to the new record
Exact Bayes relies on finding other records that share the same predictor values as the record to be
classified; the "naïve" independence assumption removes the need for exact matches. Naïve Bayes
requires categorical variables (numerical variables can be binned and converted into categorical
variables) and can be used for very large data sets.
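Under the independence assumption, each class is scored as its prior probability times the product of per-feature conditional probabilities. A minimal Python sketch for categorical features with Laplace smoothing (the toy records and feature values below are hypothetical, not the report's data):

```python
from collections import Counter, defaultdict

def train_nb(records, labels):
    """Fit class priors, per-(feature, class) value counts, and feature vocabularies."""
    priors = Counter(labels)
    cond = defaultdict(Counter)   # (feature index, class) -> value counts
    vocab = defaultdict(set)      # feature index -> observed values
    for rec, y in zip(records, labels):
        for i, v in enumerate(rec):
            cond[(i, y)][v] += 1
            vocab[i].add(v)
    return priors, cond, vocab

def predict_nb(rec, priors, cond, vocab, alpha=1.0):
    """Pick the class maximising prior * product of Laplace-smoothed conditionals."""
    total = sum(priors.values())
    best_cls, best_score = None, -1.0
    for cls, cnt in priors.items():
        score = cnt / total
        for i, v in enumerate(rec):
            score *= (cond[(i, cls)][v] + alpha) / (cnt + alpha * len(vocab[i]))
        if score > best_score:
            best_cls, best_score = cls, score
    return best_cls

# Toy data: (shipping mode, payment type) -> late (1) / on time (0)
records = [("first", "cash"), ("first", "debit"), ("standard", "cash"),
           ("standard", "debit"), ("first", "cash"), ("standard", "debit")]
labels = [1, 1, 0, 0, 1, 0]
priors, cond, vocab = train_nb(records, labels)
print(predict_nb(("first", "debit"), priors, cond, vocab))  # -> 1
```

Laplace smoothing (alpha = 1) keeps unseen feature values from zeroing out a class score, which matters with many categorical levels.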
Naïve Bayes with the given data set:

Since there are more than 10 predictor variables and the required sample size grows exponentially
with the number of predictors, a large sample is needed. As this data set has a good amount of data,
the model was built on the training data and used to predict the test data.

Applying the Naïve Bayes algorithm to the training data and predicting on the test data yielded the
following results.

> NBmodel = naiveBayes(Late_delivery_risk ~ Revenue + Profit + Discount +
+                  Location + Schedule + TypeCASH + TypeDEBIT +
+                  TypePAYMENT + Shipping.ModeFirst.Class +
+                  Shipping.ModeSame.Day + Shipping.ModeSecond.Class,
+                  data = SCM_train)
> NBpredTest = predict(NBmodel, newdata = SCM_test)
> tabNB_test = table(SCM_test$Late_delivery_risk, NBpredTest)
> tabNB_test
NBpredTest
0 1
0 19839 4624
1 8675 21018
Model Evaluation: -


The performance of the Naïve Bayes model was evaluated through a confusion matrix: for the class
output of the model, predicted vs. actual classifications were tabulated to compute Accuracy (the
proportion of classifications that were correct) and Sensitivity (the proportion of actual positives that
were correctly identified).
Confusion matrix and Interpretation

Table 3. 7– Naive Bayes- Confusion Matrix on Test Data


Naïve Bayes - Confusion Matrix- Test Data
                        NB-Predict
Actual                  0- No Late Delivery   1- Late Delivery
0- No Late Delivery     19839                 4624
1- Late Delivery        8675                  21018
Accuracy- 75.44%, Sensitivity (or) Recall- 70.78%, Specificity- 81.10%, Precision- 81.97%, F Measure- 75.59%

Table 3. 8 – Naive Bayes- Confusion Matrix Tuned- Final Results-Test Data


NB-SCM- Test Data Evaluation Parameters
Model Accuracy Sensitivity/Recall Specificity Precision F-Measure

Naïve Bayes Model 75.44% 70.78% 81.10% 81.97% 75.59%

Interpretation: -


Sensitivity/Recall shows that the model is able to spot late deliveries in 71% of cases, while
Specificity (correct prediction of non-late deliveries) is 81%.

Since the objective is to reduce late deliveries, Sensitivity/Recall of predicting True Positives (TP =
outcomes where the model correctly predicts the positive class) is of prime importance. The model
predicts 71% of them, while the false-alarm rate is acceptable. The model result is satisfactory.

3.3 APPLYING KNN- K NEAREST NEIGHBOUR, MODEL TUNING, MODEL EVALUATION & INTERPRET RESULTS
What is KNN & Purpose:


KNN, also called K Nearest Neighbours, is a non-parametric, lazy learning algorithm. The purpose
of KNN is to use a database in which the data points are separated into several classes to predict
the classification of a new sample point.

Saying the technique is non-parametric means that it makes no assumptions about the underlying
data distribution; in other words, the model structure is determined from the data. This is quite
useful, because in the real world most data does not obey the typical theoretical assumptions (as
made in linear regression models, for example). KNN is therefore probably one of the better choices
for a classification study when there is little or no prior knowledge about the data distribution.

KNN requires a training set whose size increases exponentially with the number of predictors p,
because the expected distance to the nearest neighbour increases with p (with a large vector of
predictors, all records end up "far away" from each other). With a large training set, computing all
the distances also takes time. This constitutes the curse of dimensionality.

The Algorithm: -

KNN is a lazy algorithm, which means that it does not use the training data points to do any
generalisation. The KNN algorithm is based on feature similarity, i.e. how closely out-of-sample
features resemble the training set determines how a given data point is classified, as represented in
(Fig 3.2):

Fig 3. 2- KNN- Classification Method:

KNN is used for classification: the output is a class membership (a predicted class or discrete
value). An object is classified by a majority vote of its neighbours, the object being assigned to the
class most common among its K nearest neighbours. KNN can also be used for regression, where
the output is a value for the object (a continuous prediction) computed as the average (or median)
of the values of its k nearest neighbours.
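The majority-vote idea can be sketched in a few lines of Python (toy two-dimensional points, not the report's data; in practice, as in the report's R code, features should be scaled before computing distances):

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k):
    """Classify x by majority vote among its k nearest training points (Euclidean)."""
    dists = sorted((math.dist(p, x), y) for p, y in zip(train_X, train_y))
    votes = Counter(y for _, y in dists[:k])
    return votes.most_common(1)[0][0]

# Two well-separated toy clusters.
train_X = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
train_y = [0, 0, 0, 1, 1, 1]
print(knn_predict(train_X, train_y, (0.15, 0.15), k=3))  # -> 0
print(knn_predict(train_X, train_y, (0.95, 1.00), k=3))  # -> 1
```

An odd k avoids ties in two-class problems; larger k smooths the decision boundary, smaller k follows local noise, which is exactly the trade-off explored in the tuning below.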

KNN with the given data set: Model Building and Model Tuning

The data was split into training and test sets; the KNN model was fitted on the training data and
predictions were tuned by trial and error on the K parameter. The output of KNN with various K
parameters is listed below.
# Model 1: K = 19
> # Model 1
> SCM.KNN = knn (scale(SCM.train.num), scale(SCM.test.num), cl =
SCM_train[,1], k=19, prob = TRUE)#K is 19
> SCM.tabKNN = table(SCM_test$Late_delivery_risk, SCM.KNN)
> SCM.tabKNN
SCM.KNN
0 1
0 21911 2552
1 6542 23151

Table 3. 9 - KNN - Confusion Matrix Test Data- K = 19

KNN - Confusion Matrix- Test Data


K=19                    KNN-Predict
Actual                  0- No Late Delivery   1- Late Delivery
0- No Late Delivery     21911                 2552
1- Late Delivery        6542                  23151
Accuracy- 83.21%, Sensitivity (or) Recall- 77.97%, Specificity- 89.57%, Precision- 90.07%, F Measure- 83.37%

# Model 2: K = 9
> # Model 2
> SCM.KNN2 = knn (scale(SCM.train.num), scale(SCM.test.num), cl =
SCM_train[,1], k=9, prob = TRUE)#K is 9
> SCM.tabKNN2 = table(SCM_test$Late_delivery_risk, SCM.KNN2)
> SCM.tabKNN2
SCM.KNN2
0 1
0 21931 2532
1 5902 23791

Table 3. 10 – KNN - Confusion Matrix Test Data- K = 9

KNN - Confusion Matrix- Test Data


K=9                     KNN-Predict
Actual                  0- No Late Delivery   1- Late Delivery
0- No Late Delivery     21931                 2532
1- Late Delivery        5902                  23791
Accuracy- 84.43%, Sensitivity (or) Recall- 80.12%, Specificity- 89.65%, Precision- 90.38%, F Measure- 84.62%

# Model 3: K = 29
> # Model 3
> SCM.KNN3 = knn (scale(SCM.train.num), scale(SCM.test.num), cl =
SCM_train[,1], k=29, prob = TRUE)#K is 29
> SCM.tabKNN3 = table(SCM_test$Late_delivery_risk, SCM.KNN3)
> SCM.tabKNN3
SCM.KNN3
0 1
0 21738 2725
1 6758 22935

Table 3. 11– KNN - Confusion Matrix Test Data- K = 29


K=29                    KNN-Predict
Actual                  0- No Late Delivery   1- Late Delivery
0- No Late Delivery     21738                 2725
1- Late Delivery        6758                  22935
Accuracy- 82.49%, Sensitivity (or) Recall- 77.24%, Specificity- 88.86%, Precision- 89.38%, F Measure- 82.64%

Model Tuning: -
Increasing K to 29 was found to reduce accuracy, sensitivity and precision; on the contrary,
decreasing K to 9 produced better results than K = 19. Model 2 (K = 9) improved both accuracy and
sensitivity.
Model Prediction (applying the trained model to the test data)
With the KNN model built on the training data, predictions were made on the test data, i.e. for any
randomly picked element, the model returns its classification with respect to Late Delivery Risk.
Various K parameters were tried, leading to the conclusion that reducing K may improve sensitivity.
K could be reduced further, but this would start fitting noise. Hence, the recommendation is to retain
Model 2, with K = 9, as the optimal model.
Model Evaluation: -
The performance of the KNN model was evaluated through a confusion matrix: for the class output
of the model, predicted vs. actual classifications were tabulated to compute Accuracy (the
proportion of classifications that were correct) and Sensitivity (the proportion of actual positives that
were correctly identified).
Confusion matrix and Interpretation

Table 3. 12 – KNN - Confusion Matrix Tuned Model- Test Data- K = 9


KNN - Confusion Matrix- Test Data
K=9                     KNN-Predict
Actual                  0- No Late Delivery   1- Late Delivery
0- No Late Delivery     21931                 2532
1- Late Delivery        5902                  23791
Accuracy- 84.43%, Sensitivity (or) Recall- 80.12%, Specificity- 89.65%, Precision- 90.38%, F Measure- 84.62%
Interpretation: -
When the trained model (K = 9) was applied to the test data, below were the results.

Table 3. 13 – KNN - Confusion Matrix Tuned- Final Results-Test Data


KNN-SCM Evaluation Parameters
Model Accuracy Sensitivity/Recall Specificity Precision F-Measure
KNN Model- K = 9 84.43% 80.12% 89.65% 90.38% 84.62%

Interpretation: -

Sensitivity/Recall shows that the model is able to spot late deliveries in 80% of cases, while
Specificity (correct prediction of non-late deliveries) is 89%.

Since the objective is to reduce late deliveries, Sensitivity/Recall of predicting True Positives (TP =
outcomes where the model correctly predicts the positive class) is of prime importance. The model
predicts 80% of them, while the false-alarm rate is acceptable. The model result is satisfactory. It is
noteworthy that KNN works well with continuous variables.

3.4 APPLYING CART, MODEL TUNING, MODEL EVALUATION & INTERPRET RESULTS
What is CART & Purpose:

CART, short for Classification and Regression Trees, is a supervised machine learning technique
(supervised means the target to be predicted is known) for building prediction models. These are
decision trees that segment the data space into smaller regions; the result can be viewed as a tree
whose end nodes carry a decision, either a classification or a regression value.
The Algorithm:
The algorithm constructs decision trees top-down, choosing at each step the variable that best splits
the set of items in the data. Success is measured by how homogeneous the data within each
resulting node is; hence, the larger the impurity, the lower the accuracy of the prediction.
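For classification trees such as those grown by rpart, node homogeneity is commonly measured by Gini impurity, and a split is chosen to maximise the impurity reduction. An illustrative Python sketch of the measure (toy labels):

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions; 0 means a pure node."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_gain(parent, left, right):
    """Reduction in weighted Gini impurity achieved by a candidate split."""
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted

parent = [1, 1, 1, 0, 0, 0]
print(gini(parent))                              # 0.5: maximally impure two-class node
print(split_gain(parent, [1, 1, 1], [0, 0, 0]))  # 0.5: a perfect split removes all impurity
```

The tree-growing step greedily picks, among all candidate variables and cut points, the split with the largest such gain; the CP table below then governs how the grown tree is pruned back.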

CART with the given data set:

The data was split into training and test sets; the CART model was fitted on the training data and a
full tree was grown (with parameters set by trial and error) before pruning. The resulting tree is
displayed below (Fig 3.3):

Fig 3. 3 - CART Tree Before Pruning


The tree is complex since there are many predictors, hence it does not yield a clear visualisation.

The CP (cost-complexity parameter) table for the tree above is shown below.
Classification tree:
rpart(formula = SCM_train$Late_delivery_risk ~ Revenue + Profit +
Discount + Location + Schedule + TypeCASH + TypeDEBIT + TypePAYMENT +
Shipping.ModeFirst.Class + Shipping.ModeSame.Day +
Shipping.ModeSecond.Class,
data = SCM_train, method = "class", control = r.ctrl)

Variables actually used in tree construction:

[1] Discount  Location  Profit  Revenue  Schedule  Shipping.ModeFirst.Class
[7] Shipping.ModeSecond.Class  TypeCASH  TypeDEBIT  TypePAYMENT

Root node error: 57079/126363 = 0.45171

n= 126363

CP nsplit rel error xerror xstd


1 1.9997e-01 0 1.000000 1.000000 0.00309933
2 9.0331e-02 3 0.390319 0.390564 0.00237389
3 3.8788e-02 4 0.299988 0.300408 0.00213280
4 3.2516e-02 6 0.222411 0.222989 0.00187435
5 2.0402e-02 7 0.189895 0.183763 0.00171820
6 1.7765e-02 9 0.149092 0.150879 0.00156945
7 1.1948e-02 10 0.131327 0.133096 0.00148041
8 3.4514e-03 13 0.088789 0.090857 0.00123549
9 2.9345e-03 15 0.081887 0.084269 0.00119171
10 1.6906e-03 17 0.076017 0.078050 0.00114856
11 1.4541e-03 19 0.072636 0.073039 0.00111238
12 1.4454e-03 21 0.069728 0.071094 0.00109797
13 1.2877e-03 23 0.066837 0.069851 0.00108864
14 1.2176e-03 25 0.064262 0.067118 0.00106781
15 9.9862e-04 27 0.061827 0.064087 0.00104416
16 7.1830e-04 28 0.060828 0.061546 0.00102386
17 5.0807e-04 30 0.059391 0.060215 0.00101304
18 2.9783e-04 31 0.058883 0.059970 0.00101103
19 1.4600e-04 33 0.058288 0.059461 0.00100686
20 9.9278e-05 36 0.057850 0.059041 0.00100339
21 9.6358e-05 39 0.057552 0.058848 0.00100179
22 7.0078e-05 43 0.057166 0.058708 0.00100063
23 5.2559e-05 44 0.057096 0.058515 0.00099903
24 3.5039e-05 45 0.057044 0.058358 0.00099772
25 1.5573e-05 46 0.057009 0.058288 0.00099714
26 1.1680e-05 55 0.056869 0.058218 0.00099656
27 1.0011e-05 58 0.056834 0.058218 0.00099656
28 8.7598e-06 65 0.056763 0.058253 0.00099685
29 7.0078e-06 73 0.056693 0.058393 0.00099801
30 6.3708e-06 78 0.056658 0.058533 0.00099918
31 4.3799e-06 89 0.056588 0.058691 0.00100049
32 1.7520e-06 133 0.056255 0.059058 0.00100353
33 0.0000e+00 143 0.056238 0.059934 0.00101074

Interpretation of the CART model output, including pruning and a plot of the pruned tree
The root node has a total of 126363 observations, of which 57079 did not have late delivery risk. The error rate at the root node is 45%, i.e. the impurity factor is 45%. The objective of CART splitting is to gain purity in the nodes, in other words to reduce the error rate with each split.
The algorithm uses a resampling technique called K-fold cross validation. The cost complexity parameter (CP) determines the level at which the tree should be cut.
The table shows that at the root node the CP is high, the number of splits is 0, the relative error and cross-validation error are both 1, and the standard deviation among the cross-validated groups is 0.00309933.
As the tree grows, the relative error decreases; these are in-sample errors. The cross-validation error and its standard deviation also decrease as the tree is cut at 3, 4, 6 splits and so on. In a CART model there is an inflexion point beyond which cutting the tree further is sub-optimal. In this case 55 splits looks optimal, and the tree is complex because of the high number of splits involved.
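The pruning-point logic described above can be sketched in a few lines. In the Python below (illustrative only; the report's selection is done on rpart's printcp output in R), the rows are a subset of the CP table printed above, and the rule is simply to pick the cp with the smallest cross-validation error (xerror).

```python
# Sketch: picking the pruning point from rpart's CP table.
# The rows below are a subset of the report's own CP table
# as (cp, nsplit, xerror); the selection logic is illustrative.

cp_table = [
    (1.9997e-01,   0, 1.000000),
    (3.4514e-03,  13, 0.090857),
    (1.5573e-05,  46, 0.058288),
    (1.1680e-05,  55, 0.058218),  # minimum cross-validation error
    (8.7598e-06,  65, 0.058253),
    (0.0000e+00, 143, 0.059934),  # xerror rises again: over-grown tree
]

# The usual rule: prune at the cp whose cross-validation error (xerror)
# is smallest; growing the tree beyond that point is sub-optimal.
best_cp, best_nsplit, best_xerror = min(cp_table, key=lambda row: row[2])
print(best_cp, best_nsplit)
```

With these rows the minimum xerror falls at 55 splits, matching the inflexion point the report identifies.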

Pruning the model and plotting the Pruned Tree

Fig 3. 4- CART Complexity Parameter-Visualisation

The tree looks complex, and pruning may not be strictly required since the CP is already of the order of e-05 at 45 splits.
Model Tuning: -
However, we tried a CP value of 0.000011680 to see how the pruned tree looks (Fig 3.5).


Fig 3. 5 - CART Pruned Tree

The pruned tree is also complex since there are many predictors, hence it does not yield a clear visualisation.
The CP (Cost Complexity) table for the pruned tree is shown below.

Classification tree:
rpart(formula = SCM_train$Late_delivery_risk ~ Revenue + Profit +
Discount + Location + Schedule + TypeCASH + TypeDEBIT + TypePAYMENT +
Shipping.ModeFirst.Class + Shipping.ModeSame.Day +
Shipping.ModeSecond.Class,
data = SCM_train, method = "class", control = r.ctrl)

Variables actually used in tree construction:

[1] Discount  Location  Profit  Revenue  Schedule  Shipping.ModeFirst.Class
[7] Shipping.ModeSecond.Class  TypeCASH  TypeDEBIT  TypePAYMENT

Root node error: 57079/126363 = 0.45171

n= 126363

CP nsplit rel error xerror xstd


1 1.9997e-01 0 1.000000 1.000000 0.00309933
2 9.0331e-02 3 0.390319 0.390564 0.00237389
3 3.8788e-02 4 0.299988 0.300408 0.00213280
4 3.2516e-02 6 0.222411 0.222989 0.00187435
5 2.0402e-02 7 0.189895 0.183763 0.00171820
6 1.7765e-02 9 0.149092 0.150879 0.00156945

7 1.1948e-02 10 0.131327 0.133096 0.00148041
8 3.4514e-03 13 0.088789 0.090857 0.00123549
9 2.9345e-03 15 0.081887 0.084269 0.00119171
10 1.6906e-03 17 0.076017 0.078050 0.00114856
11 1.4541e-03 19 0.072636 0.073039 0.00111238
12 1.4454e-03 21 0.069728 0.071094 0.00109797
13 1.2877e-03 23 0.066837 0.069851 0.00108864
14 1.2176e-03 25 0.064262 0.067118 0.00106781
15 9.9862e-04 27 0.061827 0.064087 0.00104416
16 7.1830e-04 28 0.060828 0.061546 0.00102386
17 5.0807e-04 30 0.059391 0.060215 0.00101304
18 2.9783e-04 31 0.058883 0.059970 0.00101103
19 1.4600e-04 33 0.058288 0.059461 0.00100686
20 9.9278e-05 36 0.057850 0.059041 0.00100339
21 9.6358e-05 39 0.057552 0.058848 0.00100179
22 7.0078e-05 43 0.057166 0.058708 0.00100063
23 5.2559e-05 44 0.057096 0.058515 0.00099903
24 3.5039e-05 45 0.057044 0.058358 0.00099772
25 1.5573e-05 46 0.057009 0.058288 0.00099714
26 1.1680e-05 55 0.056869 0.058218 0.00099656
27 1.0011e-05 58 0.056834 0.058218 0.00099656
28 8.7598e-06 65 0.056763 0.058253 0.00099685
29 7.0078e-06 73 0.056693 0.058393 0.00099801
30 6.3708e-06 78 0.056658 0.058533 0.00099918
31 4.3799e-06 89 0.056588 0.058691 0.00100049
32 1.7520e-06 133 0.056255 0.059058 0.00100353
33 0.0000e+00 143 0.056238 0.059934 0.00101074
Variable Importance: -
Ranking the importance of the independent variables shows that Schedule, Location, Shipping Mode Second Class and Shipping Mode First Class are the important variables in determining late delivery risk.

Model Prediction (Train and Test Data)


With the above CART model built on the training data, we ran predictions on both the train and the test data, i.e. if we randomly pick an element in a node, what would its classification be with respect to late delivery risk, and what probability score is associated with that prediction.
Model Evaluation: -
The performance of the CART predictive model was evaluated through the methods below: -
a. Confusion Matrix: - for the class output, the classification errors of predicted vs. actual values were tabulated to understand Accuracy (the ratio of classifications done correctly) and Sensitivity (the proportion of total positives correctly identified by the model).
b. ROC/AUC curves: - with the probability outputs of the prediction, the ROC (Receiver Operating Characteristic) curve was drawn; KS (Kolmogorov-Smirnov) and Lift charts were studied for the test model.
c. GINI Coefficient: - computed as 2*AUC - 1, was also studied for the test model.

Page 46 of 79
Capstone-Project-Supplychain-Dataco- Final Report

Confusion matrix and Interpretation


> with(SCM_train, table(SCM_train$Late_delivery_risk,
SCM_train$predict.class))

0 1
0 54062 3017
1 193 69091
Train Data: -

Table 3. 14 – CART - Confusion Matrix Tuned- Results on Train Data

CART - Confusion Matrix- Train Data
Actual \ Predicted        0- No Late Delivery    1- Late Delivery
0- No Late Delivery             54062                  3017
1- Late Delivery                  193                 69091

Accuracy- 97.46%, Sensitivity (or) Recall- 99.72%, Specificity- 94.71%, Precision- 95.82%, F Measure- 97.15%
Interpretation: -
In the train data, of the actual late deliveries, 69091 were predicted as late and 193 as not late. The wrong predictions were 3017 observations predicted as late delivery that were actually on time, and the 193 predicted as no late delivery that were actually late.
Test Data: -
> with(SCM_test, table(SCM_test$Late_delivery_risk,
SCM_test$predict.class))

0 1
0 23150 1313
1 124 29569
> nrow(SCM_test)
[1] 54156

Table 3. 15 – CART - Confusion Matrix Tuned- Results on Test Data

CART - Confusion Matrix- Test Data
Actual \ Predicted        0- No Late Delivery    1- Late Delivery
0- No Late Delivery             23150                  1313
1- Late Delivery                  124                 29569

Accuracy- 97.35%, Sensitivity (or) Recall- 99.58%, Specificity- 94.63%, Precision- 95.75%, F Measure- 97.04%

Table 3. 16 – CART - Confusion Matrix Tuned- Final Results-Test Data


CART Evaluation Parameters
Model Accuracy Sensitivity/Recall Specificity Precision F-Measure
CART Model 97.35% 99.58% 94.63% 95.75% 97.04%


Interpretation: -
----------------------------------------------------------------------------------------------------------------------------------------
Definition of evaluation parameters: -
Before we jump into the interpretation of the results, it is important to understand what each measure means, as explained below.

Accuracy = Out of all cases how much did we correctly predict = (TP+TN)/(TP+TN+FP+FN)
Sensitivity/Recall = Out of all positive cases how many you are able to predict correctly i.e. how
good the test is detecting positive cases= TP / (TP +FN)
Specificity = Out of all actual negative cases, how many are predicted correctly, i.e. how good
the test is at avoiding false alarms = TN / (TN+FP)
Precision = How many of the positively classified were relevant = TP/(TP+FP)
F Measure = Measure of Precision and Recall at same time = Harmonic Mean =
2*Recall*Precision / (Recall+Precision)
----------------------------------------------------------------------------------------------------------------------------------------
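The measures defined above can be computed directly from the confusion matrix. The Python sketch below (for illustration; the report's models were built in R) reproduces the CART test-data figures from Table 3.15.

```python
# Sketch: computing the evaluation measures above from the CART
# test-data confusion matrix reported in Table 3.15.

TP, TN, FP, FN = 29569, 23150, 1313, 124

accuracy    = (TP + TN) / (TP + TN + FP + FN)
sensitivity = TP / (TP + FN)            # recall
specificity = TN / (TN + FP)
precision   = TP / (TP + FP)
f_measure   = 2 * sensitivity * precision / (sensitivity + precision)

print(f"Accuracy {accuracy:.2%}, Recall {sensitivity:.2%}, "
      f"Specificity {specificity:.2%}, Precision {precision:.2%}")
# Accuracy 97.35%, Recall 99.58%, Specificity 94.63%, Precision 95.75%
```

The same arithmetic applies to every confusion matrix in this chapter; only the four cell counts change.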
In the test data, of the actual late deliveries, 29569 were predicted as late and 124 as not late. The wrong predictions were 1313 observations predicted as late delivery that were actually on time, and the 124 predicted as no late delivery that were actually late.
Since the objective is to reduce late deliveries, Sensitivity/Recall for predicting True Positives (TP = an outcome where the model correctly predicts the positive class) is of prime importance; the model has produced 99.58%.
Test data has performed closer to the train data; hence the conclusion is the CART model is
robust.
ROC- AUC- KS Evaluation: -
For classification problems with probability outputs, a threshold converts the probabilities into classifications. The choice of threshold changes the confusion matrix, and a plot of the false positive rate vs. the true positive rate as the threshold changes is called the ROC curve (Receiver Operating Characteristic). AUC (Area Under the Curve) is one of the most important evaluation metrics for checking any classification model's performance; it is also written as AUROC (Area Under the Receiver Operating Characteristic).
ROC is a probability curve and AUC represents degree or measure of separability. Higher the AUC,
better the model is at predicting 0s as 0s and 1s as 1s.
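The threshold sweep described above can be sketched with a few lines of code. The Python below uses toy labels and scores (not the project data) to trace the ROC points, integrate the AUC, and derive KS and the Gini coefficient (2*AUC - 1) from them; ties in scores are ignored for brevity.

```python
# Sketch: ROC/AUC, KS and the Gini coefficient (2*AUC - 1) from
# probability scores, using toy labels and scores, not the report's data.

def roc_points(labels, scores):
    """(FPR, TPR) after each score threshold, highest scores first."""
    pairs = sorted(zip(scores, labels), reverse=True)
    P = sum(labels)
    N = len(labels) - P
    tp = fp = 0
    pts = [(0.0, 0.0)]
    for score, y in pairs:
        if y == 1:
            tp += 1
        else:
            fp += 1
        pts.append((fp / N, tp / P))
    return pts

def auc(pts):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

labels = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [0.95, 0.90, 0.80, 0.70, 0.65, 0.30, 0.20, 0.10]

pts = roc_points(labels, scores)
area = auc(pts)
ks   = max(tpr - fpr for fpr, tpr in pts)  # KS = max separation of TPR and FPR
gini = 2 * area - 1                        # Gini coefficient from the AUC
print(area, ks, gini)
```

A curve that hugs the top-left corner yields an AUC near 1, a large KS and a Gini coefficient near 1, which is the pattern the CART and Random Forest test results show.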
The results from the CART model were reviewed with ROC-AUC parameters, and the model evaluation is presented in (Fig 3.6).

Fig 3. 6- CART – ROC- AUC Chart

KS & AUC:
> KS.CART.Test
[1] 0.9435323


> auc.CART.Test
[1] 0.9920192
Interpretation: -
The test-data ROC curve bends towards the top-left (high true positive rate), which indicates a good proportion of the data is expected to be predicted correctly.
KS and AUC support the ROC curve with high values of 94.35% and 99.20% respectively, which indicates the CART model is robust on the test data.
GINI Coefficient: -
> gini.CART.Test
[1] 0.4454714
The Gini index is the impurity measure CART uses for classification splits. It quantifies how mixed the classes are within a node, i.e. "the impurity of the data" or how undistributed the data is. It always lies between 0 and 1: 0 means all observations belong to a single class (pure), and values approaching 1 mean the observations are spread over different classes. The index also indicates which variables take a greater or lesser part in the decision-making process, so further attention can be focused on those fields.

Here the Gini is 44.54%, which shows no skewness.

3.5 APPLYING RANDOM FOREST, MODEL TUNING, MODEL EVALUATION & INTERPRET RESULTS
What is Random Forest & Purpose:

Random Forest is also a supervised machine learning technique (supervised meaning the target to be achieved is known) for building prediction models. Since decision trees are very sensitive to even small changes in the data, they are usually unstable. Instead of one CART tree, the big idea is to grow many CART trees, otherwise called a forest of CART trees, which improves the robustness of the prediction model. Individual trees tend to over-fit the training data (refer to the earlier section on over-fitting), and averaging across trees corrects this.

The Algorithm: -

Since multiple CART trees are built, randomness is used to avoid all trees looking similar. This is also called an ensemble technique, since multiple CART models are combined. For sampling, bootstrap aggregating (bagging) is used: the sample data is randomly subset with replacement, so some observations may be repeated in each subset. The bootstrap samples not only rows but also columns (variables); e.g. with 12 variables, each tree may be built using, say, 5 variables selected at random.

The algorithm measures an error rate called the OOB (out-of-bag) error: each bootstrap sample leaves out a share of the observations, and each tree predicts the class of the rows it never saw. The ratio of misclassified out-of-bag rows is the OOB error; for example, if 300 held-out rows are scored and 100 are misclassified, the OOB error is 33%. Pruning is not needed, though the forest can be tuned in the algorithm to get optimal output.
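The out-of-bag idea can be illustrated with a quick simulation. The Python below is illustrative only (the report's forest is built with R's randomForest): it draws one bootstrap sample and measures the share of rows left out of the bag, which on average is roughly 1/e, about 37%.

```python
# Sketch: bootstrap sampling with replacement and the out-of-bag (OOB) rows.
# On average each bootstrap sample omits about 1/e (roughly 37%) of the
# observations; those omitted rows are scored by the tree that never saw them.

import random

random.seed(42)
n = 10_000
rows = range(n)

in_bag = set(random.choices(rows, k=n))      # sample n rows WITH replacement
oob = [r for r in rows if r not in in_bag]   # rows this tree never saw

oob_fraction = len(oob) / n
print(f"OOB fraction: {oob_fraction:.2%}")   # close to 36.8%
```

Because every tree has its own out-of-bag rows, the forest gets a built-in validation estimate (the "OOB estimate of error rate" printed in the randomForest output below) without needing a separate holdout.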

RANDOM FOREST with given data set:


The data was split into training and testing sets, and the RF model was first fitted to the training data; the Random Forest was built by trial and error before tuning. The output of the Random Forest is displayed below:

Call:
randomForest(formula = SCM_train_RF$Late_delivery_risk ~ Revenue +
Profit + Discount + Location + Schedule + TypeCASH + TypeDEBIT +
TypePAYMENT + Shipping.ModeFirst.Class + Shipping.ModeSame.Day +
Shipping.ModeSecond.Class, data = SCM_train_RF, ntree = 101, mtry = 5,
nodesize = 100, importance = TRUE)
Type of random forest: classification
Number of trees: 101
No. of variables tried at each split: 5

OOB estimate of error rate: 2.47%


Confusion matrix:
0 1 class.error
0 53977 3102 0.0543457314
1 24 69260 0.0003464003
The Error Rate reduction can be viewed visually in the below diagram (Fig- 3.7)

Fig 3. 7- Random Forest Train Trees Vs Error

Model Tuning: -

We can further tune the Random Forest using the tuning algorithm; the output of tuning is below.
From the graph it can be seen that the algorithm tried different mtry values (3, 4, 5, ...) and found the errors increasing after 10. So an mtry of 10 and ntree of 81 can give optimal results; trying those parameters produced the output below.
Call:
randomForest(formula = SCM_train_RF$Late_delivery_risk ~ Revenue +
Profit + Discount + Location + Schedule + TypeCASH + TypeDEBIT +
TypePAYMENT + Shipping.ModeFirst.Class + Shipping.ModeSame.Day +
Shipping.ModeSecond.Class, data = SCM_train_RF, ntree = 81, mtry = 10,
nodesize = 100, importance = TRUE)
Type of random forest: classification
Number of trees: 81
No. of variables tried at each split: 10

OOB estimate of error rate: 2.59%


Confusion matrix:
0 1 class.error
0 53927 3152 0.055221710
1 122 69162 0.001760868


Variable Importance: -
Ranking the importance of the independent variables shows that Schedule, Location, Shipping Mode Second Class and Shipping Mode First Class are the important independent variables determining late delivery risk, as shown in (Fig 3.8).

Fig 3. 8- Random Forest Variable Importance

Model Prediction (Train and Test Data)


With the above Random Forest model built on the training data, we ran predictions on both the train and the test data, i.e. if we randomly pick an element in a node, what would its classification be with respect to late delivery risk, and what probability score is associated with that prediction.
Model Evaluation: -
The performance of the Random Forest model was evaluated with the same measures used for CART (Section 3.4): the confusion matrix (Accuracy and Sensitivity), ROC/AUC with KS (Kolmogorov-Smirnov) and Lift charts on the test model, and the Gini coefficient (2*AUC - 1).

Confusion matrix and Interpretation


> tbl.train.rf=table(SCM_train_RF$Late_delivery_risk,
SCM_train_RF$predict.class)
> tbl.train.rf

0 1
0 53950 3129
1 101 69183

Train Data: -

Table 3. 17 – Random Forest - Confusion Matrix Tuned- Results on Train Data

RANDOM FOREST - Confusion Matrix- Train Data
Actual \ Predicted        0- No Late Delivery    1- Late Delivery
0- No Late Delivery             53950                  3129
1- Late Delivery                  101                 69183

Accuracy- 97.44%, Sensitivity (or) Recall- 99.85%, Specificity- 94.52%, Precision- 95.67%, F Measure- 97.11%

Interpretation: -
In the train data, of the actual late deliveries, 69183 were predicted as late and 101 as not late. The wrong predictions were 3129 observations predicted as late delivery that were actually on time, and the 101 predicted as no late delivery that were actually late.
Test Data: -
> tbl.test.rf=table(SCM_test_RF$Late_delivery_risk,
SCM_test_RF$predict.class)
> tbl.test.rf

0 1
0 23122 1341
1 36 29657

Table 3. 18 – Random Forest - Confusion Matrix Tuned- Results on Test Data

RANDOM FOREST - Confusion Matrix- Test Data
Actual \ Predicted        0- No Late Delivery    1- Late Delivery
0- No Late Delivery             23122                  1341
1- Late Delivery                   36                 29657

Accuracy- 97.46%, Sensitivity (or) Recall- 99.88%, Specificity- 94.52%, Precision- 95.67%, F Measure- 97.12%

Table 3. 19 – Random Forest - Confusion Matrix Tuned- Final Results-Test Data


Random Forest Evaluation Parameters
Model Accuracy Sensitivity/Recall Specificity Precision F-Measure
Random Forest Model 97.46% 99.88% 94.52% 95.67% 97.12%

Interpretation: -
(Definitions of Accuracy, Sensitivity/Recall, Specificity, Precision and F Measure are given in Section 3.4.)
In the test data, of the actual late deliveries, 29657 were predicted as late and 36 as not late. The wrong predictions were 1341 observations predicted as late delivery that were actually on time, and the 36 predicted as no late delivery that were actually late.
Since the objective is to reduce late deliveries, Sensitivity/Recall for predicting True Positives (TP = an outcome where the model correctly predicts the positive class) is of prime importance; the model has produced 99.88%.
Test data has performed closer to the train data; hence the conclusion is the Random forest model is
robust.
ROC- AUC- KS evaluation: -
The ROC/AUC evaluation follows the same approach described for CART in Section 3.4: the ROC curve plots the false positive rate against the true positive rate across thresholds, and a higher AUC means the model is better at predicting 0s as 0s and 1s as 1s.
The results from the Random Forest model were reviewed with ROC-AUC parameters, and the model evaluation is presented in (Fig 3.9).

Fig 3. 9 - Random Forest TEST- ROC Curve

> KS.RF.Test


[1] 0.9441506
> auc.RF.Test
[1] 0.9895402
Interpretation: -
The test-data ROC curve bends towards the top-left (high true positive rate), which indicates a good proportion of the data is expected to be predicted correctly.
KS and AUC support the ROC curve with high values of 94.41% and 98.95% respectively, which indicates the Random Forest model is robust on the test data.
GINI Coefficient
> gini.RF.Test
[1] 0.432492
As with CART (Section 3.4), the Gini index measures the impurity of the data: 0 means all observations belong to a single class (pure), and values approaching 1 mean the observations are spread over different classes. It also indicates which variables take a greater or lesser part in the decision-making process, so further attention can be focused on those fields.

Here the Gini is 43.25%, which indicates no skewness.

3.6 APPLYING ENSEMBLE METHODS- BAGGING, MODEL TUNING, MODEL EVALUATION & INTERPRET RESULTS
An ensemble method is the use of multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone.
What exactly is an ensemble method?
It is training multiple models using the same algorithm; you will often hear it described as a way to create a strong learner from weak ones. Ensemble methods can be used to try to minimise both bias and variance.

The two major types of ensemble models used are: -

1. Bootstrap Aggregating (Bagging)
2. Boosting

Bagging (Bootstrap Aggregating): multiple models are built on bootstrap samples drawn with replacement; the m models fitted on the m bootstrap samples are combined (aggregated) by averaging the output (for regression) or voting (for classification). It is a way to decrease the variance of the prediction by generating additional training data from the original dataset using combinations with repetitions, producing multisets of the same cardinality/size as the original data.
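The aggregation step of bagging for classification can be sketched as follows. In the Python below, the per-model predictions are made-up stand-ins for the bootstrap models' outputs (the report's actual models are rpart trees fitted in R); only the majority-vote aggregation is the point.

```python
# Sketch: the aggregation step of bagging for classification -
# m models each cast a class vote per row and the majority wins.
# The per-model predictions here are illustrative stand-ins.

from collections import Counter

def majority_vote(predictions_per_model):
    """predictions_per_model: list of m lists, one predicted class per row."""
    n_rows = len(predictions_per_model[0])
    aggregated = []
    for i in range(n_rows):
        votes = Counter(model[i] for model in predictions_per_model)
        aggregated.append(votes.most_common(1)[0][0])
    return aggregated

# Three bootstrap models voting on four orders (1 = late delivery risk):
model_votes = [
    [1, 0, 1, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
]
print(majority_vote(model_votes))  # [1, 0, 1, 0]
```

Because each model sees a different bootstrap sample, their individual mistakes tend to differ, and the vote averages those mistakes away, which is why bagging lowers variance.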

We applied Bagging method to the train data to build model and applied the model to test data to
predict. The output from the model as below.

> BaggingModel <- bagging(SCM_train$Late_delivery_risk ~ Revenue + Profit +
+     Discount + Location + Schedule + TypeCASH + TypeDEBIT +
+     TypePAYMENT + Shipping.ModeFirst.Class +
+     Shipping.ModeSame.Day + Shipping.ModeSecond.Class,
+     data = SCM_train,
+     control = rpart.control(maxdepth = 5, minsplit = 4))
> BaggingPredict = predict(BaggingModel, newdata = SCM_test)
> tabBagging = table(SCM_test$Late_delivery_risk, BaggingPredict)
> tabBagging
   BaggingPredict
        0     1
  0 21722  2741
  1  1388 28305

Model Evaluation: -
The performance of the Bagging ensemble model was evaluated through the confusion matrix: the classification errors of predicted vs. actual values were tabulated to understand Accuracy (the ratio of classifications done correctly) and Sensitivity (the proportion of total positives correctly identified by the model).
Confusion matrix and Interpretation

Table 3. 20 – Bagging - Confusion Matrix Tuned- Results on Test Data

BAGGING - Confusion Matrix- Test Data
Actual \ Predicted        0- No Late Delivery    1- Late Delivery
0- No Late Delivery             21722                  2741
1- Late Delivery                 1388                 28305

Accuracy- 92.38%, Sensitivity (or) Recall- 95.33%, Specificity- 88.80%, Precision- 91.17%, F Measure- 91.94%

Table 3. 21 – Bagging - Confusion Matrix Tuned- Final Results-Test Data

SCM- Test Data Evaluation Parameters


Model Accuracy Sensitivity/Recall Specificity Precision F-Measure
Bagging Model 92.38% 95.33% 88.80% 91.17% 91.94%

Interpretation: -

(Definitions of Accuracy, Sensitivity/Recall, Specificity, Precision and F Measure are given in Section 3.4.)


Sensitivity/Recall shows the model is able to spot late deliveries up to 95%, and Specificity, the non-late-delivery prediction rate, is 89%.
Since the objective is to reduce late deliveries, Sensitivity/Recall for predicting True Positives (TP = an outcome where the model correctly predicts the positive class) is of prime importance; the model predicts these at 95%, and the false-alarm rate is acceptable. The model result is robust.

Bias-Variance Trade-off

The bias is an error from erroneous assumptions in the learning algorithm. High bias can cause an
algorithm to miss the relevant relations between features and target outputs (under fitting).

The variance is an error from sensitivity to small fluctuations in the training set. High variance can
cause an algorithm to model the random noise in the training data, rather than the intended outputs
(overfitting).

Table 3. 22 – Bias Vs Variance

Models       Low Variance                           High Variance
Low Bias     Accurate and consistent on average     Somewhat accurate but inconsistent on average
High Bias    Consistent but inaccurate on average   Inaccurate and inconsistent on average

It is to be noted that Bagging reduces the variance but retains some of the bias.

3.7 APPLYING ENSEMBLE METHODS- BOOSTING, MODEL TUNING, MODEL EVALUATION & INTERPRET RESULTS
Boosting:
In boosting, models are trained sequentially: each new model learns from the errors of the weaker model before it, reducing the error rate on the chosen evaluation measure. Boosting is the idea of training weak learners sequentially.
The different boosting techniques are listed below.
AdaBoost (Adaptive Boosting) – builds on weak learners by combining decision stumps and up-weighting incorrectly classified observations.
Gradient Boosting – builds model on model, fitting each new model to the residuals of the previous one.
XGBoost (Extreme Gradient Boosting) – a specialised implementation of gradient boosting decision trees designed for performance. Its three main types are gradient boosting, stochastic gradient boosting and regularised gradient boosting.
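The residual-fitting idea behind gradient boosting can be sketched in a few lines. In the Python below the values are toy data and the "weak learner" is simply the mean residual (a depth-0 stump, chosen for brevity, not the trees XGBoost actually uses); the point is the stage-by-stage shrinking of the error.

```python
# Sketch: the core idea of gradient boosting for regression -
# each stage fits a simple learner to the residuals of the ensemble so far.
# Toy targets; the weak learner is just the mean residual (a depth-0 stump).

y = [3.0, 5.0, 9.0, 11.0]
learning_rate = 0.5

prediction = [0.0] * len(y)
errors = []
for stage in range(20):
    residuals = [yi - pi for yi, pi in zip(y, prediction)]
    step = sum(residuals) / len(residuals)          # weak learner: mean residual
    prediction = [pi + learning_rate * step for pi in prediction]
    errors.append(sum(r * r for r in residuals) / len(y))

# The mean squared error falls stage after stage as each weak learner
# corrects what the previous ones got wrong, approaching the irreducible
# variance that a constant learner cannot remove.
print(round(errors[0], 2), round(errors[-1], 2))  # 59.0 10.0
```

Real boosted trees replace the mean-residual step with a small decision tree fitted to the residuals, which lets the ensemble drive the error far lower than any single weak learner could.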
XG Boost Model with Train Data: -
We first applied the XGBoost method to the train data to build the model, then applied the model to the test data to predict. The output from the model is below.

XGBoost works with matrices containing only numeric variables, so all categorical variables were converted to dummies; we also needed to split the training data and its label. The boosting method therefore uses binary categorical (dummy) variables and all numeric variables.


> XGBpredTest = predict(xgb.Fit, features.XGtest)


> tabXGB = table(SCM_test$Late_delivery_risk, XGBpredTest>0.5)
> tabXGB

FALSE TRUE
0 23143 1320
1 17 29676
Model Evaluation: -
The performance of the Boosting ensemble model was evaluated through the confusion matrix: the classification errors of predicted vs. actual values were tabulated to understand Accuracy (the ratio of classifications done correctly) and Sensitivity (the proportion of total positives correctly identified by the model).
Confusion matrix and Interpretation

Table 3. 23 – Boosting - Confusion Matrix Tuned- Results on Test Data

XG-BOOST - Confusion Matrix- Test Data
Actual \ Predicted        0- No Late Delivery    1- Late Delivery
0- No Late Delivery             23143                  1320
1- Late Delivery                   17                 29676

Accuracy- 97.53%, Sensitivity (or) Recall- 99.94%, Specificity- 94.60%, Precision- 95.74%, F Measure- 97.20%

Table 3. 24 – Boosting - Confusion Matrix Tuned- Final Results-Test Data

SCM- Test Data Evaluation Parameters


Model Accuracy Sensitivity/Recall Specificity Precision F-Measure
XG Boosting Model 97.53% 99.94% 94.60% 95.74% 97.20%

Interpretation: -

----------------------------------------------------------------------------------------------------------------------------------------
Definition of evaluation parameters: -
Before we jump into the interpretation of the results, it is important to understand what each measure means, as explained below.

Accuracy = Out of all cases, how many did we predict correctly = (TP+TN)/(TP+TN+FP+FN)
Sensitivity/Recall = Out of all positive cases, how many were predicted correctly, i.e. how good the test is at detecting positive cases = TP / (TP+FN)
Specificity = Out of all negative cases, how many were predicted correctly, i.e. how good the test is at avoiding false alarms = TN / (TN+FP)
Precision = How many of the positively classified cases were relevant = TP / (TP+FP)
F Measure = Harmonic mean of Precision and Recall = 2*Recall*Precision / (Recall+Precision)
----------------------------------------------------------------------------------------------------------------------------------------
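The definitions above can be checked numerically. The sketch below recomputes the reported metrics from the XG-Boost confusion matrix counts (illustrative Python; the report's own pipeline is in R, and class 1, late delivery, is treated as the positive class):

```python
def classification_metrics(tp, tn, fp, fn):
    """Standard confusion-matrix metrics, returned as fractions."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)      # recall
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f_measure": f_measure,
    }

# Counts from the XG-Boost test-data confusion matrix above.
m = classification_metrics(tp=29676, tn=23143, fp=1320, fn=17)
for k in ("accuracy", "sensitivity", "specificity", "precision"):
    print(k, round(m[k] * 100, 2))
# accuracy 97.53, sensitivity 99.94, specificity 94.6, precision 95.74
```

With the counts from Table 3.23 this reproduces the Accuracy, Sensitivity, Specificity and Precision figures quoted in the tables above.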
Sensitivity/Recall is 99.94%, i.e. the model is able to spot late deliveries 99.94% of the time, while Specificity, the correct prediction of non-late deliveries, is 94.60%.

Since the objective is to reduce late deliveries, the Sensitivity/Recall of predicting True Positives (TP = an outcome where the model correctly predicts the positive class) is of prime importance. The model predicts these at 99.94%, and the remaining false-alarm rate is acceptable. The model results are robust, no further fine tuning is needed, and the model has achieved its purpose.

Bias-variance trade-off: -
The bias is an error from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting).

The variance is an error from sensitivity to small fluctuations in the training set. High variance can cause an algorithm to model the random noise in the training data rather than the intended outputs (overfitting).

Boosting helps reduce both bias and variance.

3.8 MODEL VALIDATION TO COMPARE MODELS AND FIND THE BEST PERFORMING MODEL

The matrix below compares the models built in Sections 3.1 to 3.7; model performance was evaluated across the various parameters, and the comparison is reflected in the matrix.

Table 3. 25 – Model Selection- Comparison Matrix

(Colour legend: Green- First Best, Amber- Second Best, Yellow- Third Best)

Comparing all the models, the classification models CART and Random Forest and the Ensemble Model- Boosting performed the best; the colour codes in the table above indicate the ranks of the results. Amongst the models, the ensemble models yielded the best results overall, and amongst the ensemble models XG-Boosting produced the best results in terms of Accuracy, Sensitivity, Specificity, Precision and F-Measure.
In this case, we are predicting whether a delivery will be done on time or not, with the intention of identifying the reasons for late deliveries. Hence, identifying True Positives, i.e. Late Delivery risk, is of utmost importance. For this purpose, Sensitivity, Precision and Accuracy play a vital role, combined with the F-Measure.
Conclusion: -
Amongst the models, the Ensemble Method with XG Boosting stood out on these parameters and is hence considered the best model. Random Forest is the second best.

Please refer Appendix A for Source Code

SECTION 4: FINDINGS & INSIGHTS, DATA CONSTRAINTS & MODEL INTERPRETATION
4.1 FINDINGS & BUSINESS INSIGHTS
The problem statement and the reason for embarking on this model development was that the delivery activity in the supply chain process of the company, Data Co, had problems. The underlying problem associated with the data is that there are late deliveries, which lead to bad customer experience, affect profitability (both top line and bottom line) and decrease sales. Hence, a prediction modelling study was conducted with the provided data set to:
 Analyse the timelines of deliveries
 Check adherence to the stipulated delivery timelines (whether committed timelines are met or not)
 Identify reasons for delay in the given set of transactions/orders
Findings & Business Insights: -
Below are the findings and insights from the analysis of the data conducted in this research project for
Data Co. Supply Chain.
Who are the customers and where are they located?
 There are a total of 164 order destination countries, six market areas and 23 regions
 Top customers are in Puerto Rico, California and New York
 Top customers are located in the cities of Caguas, Chicago, Los Angeles and New York
 The most preferred mode of payment for orders is “Debit”, followed by “Transfer”
Where/What are the customers shipping?
 USA, France and Mexico are the top selling countries
 Orders predominantly move to the LATAM, Europe & Asia Pacific markets
 Orders predominantly go to Santo Domingo, New York, LA, Tegucigalpa & Managua
 Products sold include Fishing, Camping & Hiking, Women's Apparel, Indoor/Outdoor Games, Water Sports, Cardio Equipment, Cleats and Men's Footwear
 Men's Footwear, Water Sports and Indoor/Outdoor Games are loss making
 Field & Stream Sportsman 16 Gun Fire Safe, Perfect Fitness Perfect Rip Deck and Pelican Sunstream Kayak are popular products
What are the preferred modes of shipment?
 The fewest orders are Same Day delivery
 The most preferred mode of shipping is Standard Class, followed by Second Class
 52% of sales comes from the Consumer segment, 30% from Corporate and 17% from Home Office
 Sales show a declining trend in recent years, which is a worrying factor for the company
Which areas delivered late?
 A very high number of orders carry “Late Delivery” in Delivery Status
 About 54.8% of orders in the total dataset carry late delivery risk (late_delivery_risk)
 Late delivery (54%) is a worrying factor for the decline in sales, and it was observed that all order regions run the risk of late delivery
 Late delivery is primarily observed in First Class (15%), Second Class (15%) and Standard Class (23%)
 Major delays are observed for orders to the LATAM (16%) and Europe (15%) markets
 Cleats, Women's Apparel, Indoor/Outdoor Games and Cardio Equipment are the top segments by orders, and they run the risk of late delivery
What are the reasons for the delivery delays?

The distributions of actual shipping days for deliveries both without and with delays depict left skewness, with stronger left skewness for delayed deliveries. This shows that delays have occurred after the product has been shipped.
The reasons for the delay (or late delivery) could not be identified from the given information. Hence, we recommend that the business provide the following additional data on A. Product Flow, B. Information Flow and C. Revenue Flow:
1. Location- Both Origin Place and Destination Place
2. Mode of Shipment- Air, Ocean, Rail, Combined
3. Transhipment involved
4. Idle time – Transhipment, Trucker sleep time, Clearance Paper work etc.
5. Expected Transit time for the mode
6. Parties involved in transportation
7. Parties Schedule reliability measures
8. Parties Communication channels- Information flow
9. Customs Clearance- Involved.
10. Turn Around Time- Customs
11. Payment TAT

Which variables cause delivery delays?

Overall adherence to delivery timelines was only 45%, i.e. 55% of the orders were delivered late. The late deliveries were primarily observed in First Class (15%) and Second Class (15%), and major delays were observed for orders to the LATAM (16%) and Europe (15%) markets.
From the variables provided in the data set, after analysing the parameters that influence Late Delivery, the variables below were identified as significant; any change in them will have an impact on delivery.
1. Location – Latitude and Longitude of the location
2. Schedule- Days of Shipping Scheduled, real
3. Shipping Mode- First Class, Second Class, Same day and Standard Class
4. Type of Payment- Cash, Debit, Payment

4.2 DATA CONSTRAINTS & MODEL INTERPRETATION


The interpretations of the presented models and of the data study conducted above (Sections 2 & 3), of which readers of this report should be aware, are listed below.
 The given data is a mix of continuous and categorical variables
 Many of the variables did not have an impact on the target variable- Late Delivery Risk- and hence could be filtered out at an early stage
 A few variables had missing values; as the proportion of missing values was high, those variables were ignored.
 Many of the continuous variables had outliers, hence outlier treatment was necessary
 The independent variables were highly correlated amongst each other, creating a situation of multicollinearity that would affect the model. PCA-FA clustering was done to combine those variables into factors.
 Independent categorical variables were also correlated, hence correlated categorical variables were dropped.
 The data had scale differences, hence scaling helped to standardise/normalise the data
 For Logistic Regression, it was essential to consider only the important uncorrelated independent variables
 The Naïve Bayes model gives a conditional probability over all the predictor variables, hence the need to remove highly correlated variables does not arise for this model
 KNN works best only for continuous variables with no outliers, hence only numeric independent variables were considered for building that model
 Logistic Regression can derive a confidence level (about its prediction), whereas KNN & Naïve Bayes can only output the labels

 CART trees were big in size; hence visualisation of the tree could not be presented at its best.
 For Random Forest, different mtry combinations may yield different results.
 The decision trees- CART and Random Forest- produced better results compared to Logistic Regression or frequency-based algorithms.
 Ensemble models- XGBoost works with matrices that contain only numeric variables, so all categorical variables had to be converted to dummies
 The learning model XG Boost produced the best results and is hence considered the best model based on the parametric evaluation
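One of the constraints noted above is that XGBoost needs an all-numeric matrix, so categorical predictors were converted to dummy (one-hot) columns. A minimal sketch of such a conversion in plain Python (the column names are illustrative, not the project's actual schema; the report's own pipeline is in R):

```python
def one_hot(rows, column):
    """Replace a categorical column in a list of dict-rows with 0/1 dummies."""
    levels = sorted({r[column] for r in rows})
    for r in rows:
        value = r.pop(column)
        for lev in levels:
            r[f"{column}_{lev}"] = 1 if value == lev else 0
    return rows

# Hypothetical mini-frame with one numeric and one categorical column.
orders = [
    {"days_scheduled": 4, "shipping_mode": "Standard"},
    {"days_scheduled": 1, "shipping_mode": "First"},
]
one_hot(orders, "shipping_mode")
print(orders[0])
# -> {'days_scheduled': 4, 'shipping_mode_First': 0, 'shipping_mode_Standard': 1}
```

After the conversion every column is numeric, which is the form XGBoost's matrix input expects.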

SECTION 5: CHALLENGES FACED DURING RESEARCH OF PROJECT AND TECHNIQUES USED TO OVERCOME THE CHALLENGES
Data Preparation: -
The data had many outliers, many predictors were correlated amongst each other (leading to a situation of multicollinearity) and many predictors did not have predictive capability. Hence, 80% of the time was spent on cleaning and preparing the data to improve its quality, i.e. to make it consistent, before utilising it for analysis.
Getting The Right Data: -
Quality over quantity is the call of the hour in this case. The business problem involves understanding the reasons for delay or late delivery; however, with the given data set such reasons could not be identified. Hence additional data on A. Product Flow, B. Information Flow and C. Revenue Flow is recommended (in the recommendations section).

Thus, to build an accurate model that works well for the business, it is necessary to get the right data with the most meaningful features at the first instance. To overcome this data issue, one would need to communicate with the business to get enough data and then use domain understanding to get rid of the irrelevant features. This is a backward elimination process, but one which often comes in handy on most occasions.
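The backward elimination idea mentioned above can be sketched as a greedy loop that repeatedly drops the feature whose removal hurts a scoring function least, and stops once every removal makes the score worse. The scorer below is a toy stand-in, not the project's actual model:

```python
def backward_eliminate(features, score, min_keep=1):
    """Greedily drop the feature whose removal hurts the score least,
    stopping when every candidate removal lowers the score."""
    kept = list(features)
    while len(kept) > min_keep:
        # Each candidate set removes exactly one of the remaining features.
        candidates = [[f for f in kept if f != drop] for drop in kept]
        best = max(candidates, key=score)
        if score(best) < score(kept):
            break  # every removal hurts: stop
        kept = best
    return kept

# Toy scorer: only "latitude" and "shipping_mode" carry signal; a small
# penalty per feature rewards smaller sets when the signal is equal.
useful = {"latitude", "shipping_mode"}
score = lambda feats: len(useful & set(feats)) - 0.01 * len(feats)
print(backward_eliminate(["latitude", "discount", "shipping_mode", "zip"], score))
# -> ['latitude', 'shipping_mode']
```

In practice the scorer would be a cross-validated model metric rather than this toy function.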

SECTION 6: RECOMMENDATIONS, CONCLUSIONS/APPLICATIONS
The objective of this case study was to find “the best model which can predict Late Delivery Risk” and to determine which variables are significant predictors behind the decision. We developed prediction models by studying the provided data set using the Logistic Regression, KNN and Naïve Bayes predictive methods, applied the decision tree methods CART and Random Forest, and used machine learning techniques like Bagging and Boosting. We found the machine learning XG Boost method to have provided the best model, considering its higher Accuracy, Sensitivity and Precision in identifying Late Delivery.

The following insights were elicited from this study, and hence the recommendations to the business are:

 In the given dataset, we can infer and/or predict late deliveries based on the limited information provided on product price, discount, profitability, sales, quantity sold, shipping timelines (real and scheduled) and the location of the store from which products are shipped.
 It is important to get more information regarding origin-destination, transit time involved, vendor schedule reliability and idle time in transportation, in order to identify the cause and tune the model for better prediction.
 There is no data available on schedule reliability and vendor performance; it is recommended that the business provide data on “Schedule Reliability” and “On Time Delivery”. If no such measures are available, introduce KPI measures for “Staff Performance” and “Vendor Performance” to boost performance.

 For products with higher discounts, there is an increased risk of delay in delivery. Higher discounts bring high volumes of product orders, giving rise to difficulties in on-time delivery with existing logistics plans/resources. The suggestion is to carefully plan logistics when discounts are offered.
 The lower uptake of Same Day (5%) & First Class (15%) shipping is an opportunity to improve delivery performance and charge customers a premium to improve revenue.
Other best practices from the supply chain industry are listed below, only as suggestions to review so as to take advantage of upcoming trends.
 End-to-end flow of information throughout the supply chain is of utmost importance for prompt delivery. Hence, invest in technologies like IoT and Blockchain to develop platforms where all parties can be on one system and exchange information seamlessly
 Create transparency through real-time tracking, publishing delivery results at both the transaction and cumulative levels, so everyone in the supply chain knows the timeliness of delivery.
 Gather feedback from the chain on what caused the delays, so improvements can be sought through crowd sourcing.
 Assessing the traffic situation, embark on the usage of “Drones” and “Robotic Arms” to deliver goods much faster
 Keep inventory under control to maintain stock of fast-moving goods and avoid the “Bull-Whip Effect”, using prioritisation models like ABC analysis etc.

SECTION 7: BIBLIOGRAPHY

Towards Data Science (https://towardsdatascience.com/), Analytics Vidhya (http://www.analyticsvidhya.com/), Data Science Central (http://www.datasciencecentral.com/), Simply Statistics (http://simplystatistics.org/), R-Bloggers (http://www.r-bloggers.com/), Wikipedia, Investopedia (https://www.investopedia.com/terms/s/scm.asp)

Please refer Appendix A for Source Code

----End of Report-----

APPENDIX A
---------------------------------------------------------------------------------------------------------------------------
Appendix A covers the following chapters: -
A1 R-SOURCE CODE
A2 TABLEAU VISUALISATION SOURCE CODE

A3 UNIVARIATE ANALYSIS
A4 BIVARIATE ANALYSIS

---------------------------------------------------------------------------------------------------------------------------

A1 R-SOURCE CODE

SCM_Project_Final_Rcodes_Hariharan.KP.R

A2 TABLEAU VISUALISATION SOURCE CODE


https://public.tableau.com/profile/hariharan3667#!/vizhome/Data_Co_supply_chain/Late-DeliveryRisk-Market?publish=yes


A3 UNIVARIATE ANALYSIS
Univariate analysis: -
# Nominal, Ordinal & Geo spatial Variables#

The variables below, which are nominal, ordinal or geospatial in nature, were not considered for the univariate analysis.

Customer Id, Customer Zip code, Department Id, Latitude, Longitude, Order Customer Id, Order Id,
Order Item Card Prod Id, Order Item Id, Order Zip code, Product Card Id, Product Category ID,
Masked Customer Key


# Numeric Variables#

1. Days for Shipping Actual

Appendix- Fig-1

Inferences:
The minimum number of actual shipping days is 0, while the maximum is 6, and the actual shipping days are spread between these two values. The mean (3.498) is above the median (3.00), so the data is slightly right skewed. No outliers were observed.

2. Days for Shipping Scheduled

Appendix- Fig-2

Inferences:
The minimum number of scheduled shipping days is 0, while the maximum is 4, and the scheduled shipping days are spread between these two values. The mean (2.9) is below the median (4.0), so the data is left skewed. No outliers were observed.
3. Benefits per order

Appendix- Fig-3


Inferences:
The minimum benefit per order is -4274.98, while the maximum is 911.8; between these two values the data is heavily left skewed (mean 21.9, median 31.5). Many outliers are observed in the data. We can see that the benefit per order of many transactions is in the negative region.

4. Sales Per Customer

Appendix- Fig-4

Inferences:
The minimum sales per customer is 7.49, while the maximum is 1939.99; between these two values the data is heavily right skewed (mean 183.1, median 163.9). Many outliers are observed in the data.
5. Category ID

Appendix- Fig-5

Inferences:


The minimum category id is 2, while the maximum is 76; between these two values the data is well distributed (mean 31.8, median 29). No outliers in the data.
6. Order Item Discount

Appendix- Fig-6

Inferences:
The minimum order item discount is 0, while the maximum is 500; between these two values the data is right skewed (mean 20.6, median 14). Many outliers in this data.

7. Order Item Discount Rate (in percentage)

Appendix- Fig-7

Inferences:
The minimum order item discount rate is 0, while the maximum is 0.25; between these two values the data is slightly left skewed (mean 0.10, median 0.1017). No outliers in this data.

8. Order Item Product Price

Appendix- Fig-8


Inferences:
The minimum order item product price is 9.99, while the maximum is 1999.99; between these two values the data is heavily right skewed (mean 141.23, median 59.99). A few outliers in this data.

9. Order Item Profit Ratio (In Percentage)

Appendix- Fig-9

Inferences:
The minimum order item profit ratio is -2.75, while the maximum is 0.50; between these two values the data is heavily left skewed (mean 0.12, median 0.27). Many outliers in this data. We can see that the profit ratio of many transactions is in the negative region.

10. Sales
Appendix- Fig-10

Inferences:
The minimum sales value is 9.99, while the maximum is 1999.99; between these two values the data is heavily right skewed (mean 203.77, median 199.92). A few outliers in this data.

11. Order Item Total


Appendix- Fig-11


Inferences:
The minimum order item total is 7.49, while the maximum is 1939.99; between these two values the data is heavily right skewed (mean 183.11, median 163.99). Many outliers in this data.

12. Order Profit Per order

Appendix- Fig-12

Inferences:
The minimum profit per order is -4274.98, while the maximum is 911.80; between these two values the data is heavily left skewed (mean 21.98, median 31.52). Many outliers in this data.

13. Product Price

Appendix- Fig-13

Inferences:
The minimum product price is 9.99, while the maximum is 1999.99; between these two values the data is heavily right skewed (mean 141.23, median 59.99). A few outliers in this data. The majority of products are priced below 1000; 442 products are priced at 1500 and 15 products at 2000.


# Categorical Variables#

Appendix- Fig-14

Inferences:
Type- Customers who transacted by Debit were the highest at 38%, followed by Transfer at 28% and Payment at 23%; customers who paid by Cash were fewer at 11%
Delivery Status- 55% of shipments were delivered late, 18% were delivered on time and 23% were shipped in advance. 4% of orders were cancelled (possibly due to poor delivery performance)
Late Delivery Risk- 55% of shipments were at risk of late delivery
Customer Segment- 52% of customers were consumers, 30% were corporate customers and 18% home office. The higher proportion of end consumers implies prompt delivery is a must-have for Data Co. Supply Chain
Order Item Quantity- 55% of customers ordered an item quantity of 1, while quantities of 2, 3, 4 and 5 accounted for about 11% each. Lower-quantity orders mean more transactions, hence an efficient supply chain is needed for on-time delivery.
Product Status- Product availability was 100%, which implies good inventory was carried by the company (which also means there is an associated inventory carrying cost)

Appendix- Fig-15

Inferences:
Order Status- Only 44% of orders have completed/closed status; the remaining 56% of orders are at risk in terms of delivery and realisation of payment, and 2% of orders are suspected as fraud. This implies that unless the company improves its supply chain capabilities to deliver on time, it cannot sustain the business.
Shipping Mode- 60% of orders were Standard Class, which has a 4-day window to deliver the goods, while 5% and 15% of orders were Same Day or First Class respectively, i.e. 20% of orders require fast delivery. This implies that an efficient supply chain mechanism is needed for speed of delivery.


A4 BIVARIATE ANALYSIS
Categorical Vs Numerical Variables: -
## Box Plots ##

Appendix Fig-16

Inferences:
Days of shipping (real)- The box plot of late delivery risk against actual shipping days shows the average delivery time for late deliveries is 5 days.
Days of shipping (scheduled)- The box plot of late delivery risk against scheduled shipping days shows an average lead time of 2 days. It is understood from the data that actual delivery takes longer than scheduled delivery, which is causing the risk of late delivery.
Benefits per order- The box plot shows benefits per order are low for timely deliveries, and for late deliveries the benefits get worse.
Sales per customer- The box plot shows sales per customer are low for both timely and late deliveries; however, it is to be noted that the risk of losing the customer is high if late delivery continues.


Appendix- Fig-17

Inferences:
Order Item Discount- The box plot of order item discount shows large discounts are given for both late and on-time deliveries; this variable seems non-significant for late delivery
Order Product Price- The box plot of order product price shows similar prices for both late and on-time deliveries; this variable seems non-significant for late delivery
Order Profit Ratio- The box plot of order profit ratio shows profit ratios are very thin and profit is actually on the negative side. If the company is to command a premium to improve the profit ratio, on-time delivery is a must.
Sales- The box plot of sales shows no significant difference between on-time and late deliveries
Product Price- The box plot of product price shows no significant difference between on-time and late deliveries
Order Item Total- The box plot of order item total does not show a significant difference, as discounts are offered for both late and on-time deliveries.


Categorical Vs Categorical Variables: -


## Bar Plots ##

Appendix- Fig 18

Inferences:
Type- The bar plot shows late delivery risk is higher for all payment types except Transfer
Customer Segment- All customer segments run the risk of late delivery
Order Item Quantity- All order item quantities run the risk of late delivery; however, the proportion is highest for an order quantity of 1
Product Status- Products are available, yet the proportion of late delivery risk is higher.
Shipping Mode- First Class and Second Class run a higher risk of late delivery, while Standard Class and Same Day delivery still have significant late delivery


Appendix- Fig-19

Appendix- Fig-20

Inferences:
Order Status- Higher pending payments are due to late delivery. This variable does not seem to have a significant impact in determining late delivery, as it is just status tracking.
Category Name- Certain categories of goods like Cleats, Women's Apparel, Indoor/Outdoor Games and Cardio Equipment seem to carry a higher risk of late delivery.


----End-----
