Capstone-Project-Supplychain-Dataco- Final Report
ACKNOWLEDGEMENTS
First of all, I wish to express my deepest gratitude to all the faculty members of Great Learning for their excellent guidance and continuous support in enhancing my learning in Business Analytics and Business Intelligence.
My sincere gratitude and many thanks to my mentors Mr. Amit Kulkarni and Mr. Nimesh Marfatia, and coaches Mr. Animesh Tiwari, Mr. Sarabjeet Singh Kochar and Ms. Karuna Kumari, for making the learning experience more profound by distilling complex subjects into simple explanations that helped me understand the subject precisely and, most importantly, in its context.
I take this opportunity to thank the Program Office of Great Learning, Ms. Richa, for helping me through the different stages of this curriculum.
Last but not least, to my family for their unconditional support and encouragement.
Purpose
This report presents the Late Delivery Predictor model that can help Data Co. Supply Chain predict the risk of late delivery expected in its supply chain deliveries.
Design/methodology/approach
A review was conducted to identify the classification algorithms that can provide the best results: regression, frequency-based methods, decision trees and ensemble methods.
Findings
The final report identifies the impact of late delivery and provides a model that can predict late delivery. Furthermore, the paper develops a roadmap framework for future research and practice.
Practical implications
The proposed work is useful for both business and data practitioners at Data Co. Supply Chain as it outlines the components for every supply chain transformation. It also proposes the collection of additional data to improve the model.
Abstract: -
The main objective of this capstone project is to develop a Late Delivery Predictor model that can
help Data Co. Supply chain to predict the risk of late delivery expected in the supply chain delivery.
The contribution of this project, presented in this final report, is to showcase the various predictive models developed to predict Late Delivery using the data provided by Data Co. Supply Chain, applying renowned machine learning and data modelling techniques and algorithms.
R Studio was used as the software tool to build the predictive models, and Tableau was used for data visualisation in this project.
The output of the various models built using the aforesaid techniques was then evaluated using performance metrics such as the Confusion Matrix, ROC and Gini Index (as applicable), and the results from each model were compared to identify the best-performing model, which is recommended to the business and presented in this report.
This report also shares business insights and findings from the data provided, along with recommendations to make the business successful using the Late Delivery Predictor tool.
Keywords:
Missing data, Outliers, Capping Technique, Central Tendency, Multicollinearity, Clustering- PCA-FA, Feature Selection, Scaling, Sample Split, Overfit, Underfit, Regression, Frequency Based, Decision Trees, Ensemble Methods, Bagging, Boosting, Confusion Matrix, ROC-AUC, GINI Index, Best Model
TABLE OF CONTENTS
ACKNOWLEDGEMENTS................................................................................................................2
ABSTRACT & LITERATURE REVIEW.................................................................................................2
Purpose.............................................................................................................................. 2
Design/methodology/approach...........................................................................................2
Findings.............................................................................................................................. 2
Practical implications.......................................................................................................... 2
TABLE OF CONTENTS.................................................................................................................... 3
LIST OF TABLES............................................................................................................................ 5
LIST OF FIGURES.......................................................................................................................... 6
ABBREVIATIONS.......................................................................................................................... 7
SECTION 1: INTRODUCTION, PROBLEM, OBJECTIVES, SCOPE, DATA SOURCES, METHODOLOGY.........9
1.1 Introduction..................................................................................................................9
1.2 The Problem Statement.............................................................................................10
1.3 Objectives of the study..............................................................................................10
1.4 Scope........................................................................................................................ 11
1.5 Data Source............................................................................................................... 11
1.6 Methodology.............................................................................................................. 12
SECTION 2: EXPLORATORY DATA ANALYSIS INCLUDING DATA PREPARATION, CLEANING AND
IMPUTATION............................................................................................................................. 12
2.1 Variable Identification................................................................................................12
2.2 Univariate and Bivariate analysis...............................................................................13
2.3 Missing Value Treatment...........................................................................................15
LIST OF TABLES
Table 2. 1 - Univariate- Bivariate study summary and recommended actions..........................13
Table 2. 2 - Correlation Study Categoric variables- Chi Square Test.......................................17
Table 2. 3- Scaled- Numeric Variables output..........................................................................18
Table 2. 4- Scaled- Numeric Variables output..........................................................................19
Table 2. 5 - Factors interpretation with labels..........................................................................21
Table 3. 1– Logistic Regression- Confusion Matrix-Train Data................................................32
Table 3. 2 – Logistic Regression- Confusion Matrix-Test Data................................................32
Table 3. 3 – Logistic Regression Tuned- Confusion Matrix-Train Data....................................33
Table 3. 4 – Logistic Regression Tuned- Confusion Matrix-Test Data.....................................34
Table 3. 5 – Logistic Regression Tuned- Confusion Matrix-Test Data.....................................34
Table 3. 6 – Logistic Regression Tuned- Final Results-Test Data...........................................35
Table 3. 7– Naive Bayes- Confusion Matrix on Test Data........................................................37
Table 3. 8 – Naive Bayes- Confusion Matrix Tuned- Final Results-Test Data..........................37
Table 3. 9 - KNN - Confusion Matrix Test Data- K = 19...........................................................39
Table 3. 10 – KNN - Confusion Matrix Test Data- K = 9...........................................................40
Table 3. 11– KNN - Confusion Matrix Test Data- K = 29..........................................................40
Table 3. 12 – KNN - Confusion Matrix Tuned Model- Test Data- K = 9....................................41
Table 3. 13 – KNN - Confusion Matrix Tuned- Final Results-Test Data...................................41
Table 3. 14– CART - Confusion Matrix Tuned- Results on Train Data.....................................46
Table 3. 15 – CART - Confusion Matrix Tuned- Results on Test Data.....................................47
Table 3. 16 – CART - Confusion Matrix Tuned- Final Results-Test Data.................................47
Table 3. 17 – Random Forest - Confusion Matrix Tuned- Results on Train Data.....................51
Table 3. 18 – Random Forest - Confusion Matrix Tuned- Results on Test Data......................51
Table 3. 19 – Random Forest - Confusion Matrix Tuned- Final Results-Test Data..................52
Table 3. 20 – Bagging - Confusion Matrix Tuned- Results on Test Data..................................54
Table 3. 21 – Bagging - Confusion Matrix Tuned- Final Results-Test Data..............................54
Table 3. 22 – Bias Vs Variance................................................................................................55
Table 3. 23 – Boosting - Confusion Matrix Tuned- Results on Test Data.................................56
Table 3. 24 – Boosting - Confusion Matrix Tuned- Final Results-Test Data.............................56
Table 3. 25 – Model Selection- Comparison Matrix..................................................................57
LIST OF FIGURES
Fig 1. 1- Data Analytics Life Cycle
Fig 1. 2-The Business Problem Understanding........................................................................10
Fig 1. 3 - The Data Report 11
Fig 2. 1- Box plot BEFORE Outlier treatment............................................................................15
Fig 2. 2- Box plot AFTER Outlier treatment...............................................................................15
Fig 2. 3- Correlation Plot Numeric variables- By Indicators......................................................16
Fig 2. 4- Correlation Plot Numeric variables- By Numbers.......................................................17
Fig 2. 5 - Scree Plot – Eigen Values of Components...............................................................20
Fig 2. 6 - FA Diagram – Rotation None....................................................................................21
Fig 2. 7-EDA- Data Preparation, Cleaning, Imputation- Summary..............................................2
Fig 3. 1 - Logistic Regression- ROC-AUC Charts....................................................................36
Fig 3. 2- KNN- Classification Method:......................................................................................39
Fig 3. 3 - CART Tree Before Pruning.......................................................................................42
Fig 3. 4- CART Complexity Parameter-Visualisation................................................................44
Fig 3. 5 - CART Pruned Tree...................................................................................................44
Fig 3. 6- CART – ROC- AUC Chart..........................................................................................48
Fig 3. 7- Random Forest Train Trees Vs Error.........................................................................49
Fig 3. 8- Random Forest Variable Importance.........................................................................50
Fig 3. 9 - Random Forest TEST- ROC Curve..........................................................................53
ABBREVIATIONS
AUC - Area Under the (ROC) Curve: diagnostic of classifier efficiency; an AUC of 1.0 indicates a perfect classifier
BDA - Big Data Analytics: advanced analytics techniques applied to very large, diverse data
CART - Classification & Regression Trees: tree-based methodology for prediction
EDA - Exploratory Data Analysis: approach to data analysis that employs various graphical techniques
ROC - Receiver Operating Characteristic: graphical plot used as a diagnostic of the ability of a binary classifier
SCM - Supply Chain Management: handling of the entire production flow of a good or service
TP/FP - True Positive/False Positive: TP = outcome where the model correctly predicts the positive class; FP = outcome where the model incorrectly predicts the positive class
TN/FN - True Negative/False Negative: TN = outcome where the model correctly predicts the negative class; FN = outcome where the model incorrectly predicts the negative class
A variety of statistical analysis techniques have been used in SCM in the areas of demand forecasting, time series analysis and regression analysis. With advancements in information technologies and improved computational efficiencies, Big Data Analytics (BDA) has emerged as a means of arriving at more precise predictions that better reflect customer needs, facilitate assessment of supply chain performance, improve the efficiency of the supply chain, reduce reaction time and support supply chain risk assessment.
With SCM efforts aiming at satisfying customer demand while minimising the total cost of supply,
applying Machine Learning- Data Analytics algorithms could facilitate precise (data driven) demand
forecasts and align supply chain activities with these predictions to improve efficiency and customer
satisfaction.
The above figure (Fig 1.1) explains the steps involved in the data analytics life cycle. The first and foremost step is to identify the problem and understand the business need for the study, followed by data collection and visual interpretation of the data. Next comes EDA (Exploratory Data Analysis), which involves both data cleaning and data exploration, followed by feature engineering to identify the relevant variables for the model out of the large set of variables in the data set. Once the variables are clear, the appropriate modelling techniques are chosen; once the models are built, they are evaluated using model evaluation techniques to find the optimal model and provide the final recommended model.
In this project we used the data analytics life cycle approach and simulated the supply chain process of the company, Data Co., using the data set provided by the company.
In this data set the problem identified is late delivery, and a prediction model is needed to identify whether a particular product is going to reach the customer on time or delayed, which is a classification type of problem.
We worked on various classification-oriented modelling techniques such as logistic regression, random forest, CART, Naïve Bayes and KNN; the models were later evaluated using model evaluation methods like the Confusion Matrix, ROC and AUC.
1.4 SCOPE
The scope of this study is limited to the data set provided by Data Co. Supply Chain and to the models mentioned in the objectives.
The data gathered had 180,519 rows/records with 53 attributes/variables. The data contained both quantitative variables (numerical variables measured on a numeric or quantitative scale) and qualitative variables (also called categorical variables), which are not numerical.
The quantitative variables can be further subgrouped into: a. Discrete- whole numbers, typically counts, e.g. number of visits, number of attendees; b. Continuous- can take on almost any numeric value and can be meaningfully divided into smaller increments, fractions or decimals, e.g. height, weight, temperature. The qualitative variables can be further subgrouped into: a. Nominal- categories that do not have a natural order or ranking and are mutually exclusive, e.g. zip code, gender type; b. Ordinal- ordered categories that are mutually exclusive, e.g. socio-economic status ("low income", "middle income", "high income"), education level ("high school", "BS", "MS", "PhD"), income level ("less than 50K", "50K-100K", "over 100K"), satisfaction rating ("extremely dislike", "dislike", "neutral", "like", "extremely like").
The data provided was collected on a daily basis over a period of 3 years, from January 2015 to December 2017, plus January and February of 2018. The data in this context can be categorised (or grouped) into 6 categories. The taxonomy of the data is represented in the diagram below (Fig 1.3) for a better understanding of the underlying data.
1.6 METHODOLOGY
The approach used to resolve the aforestated problem in the case study was machine learning and prediction modelling techniques such as Logistic Regression, Naïve Bayes, KNN, CART, Random Forest and Ensemble Models, using R Studio as the software tool.
Please refer to Appendix A of the source code for the R libraries and packages that were used for this case study.
Considering the above, we cannot run ad-hoc analysis; hence there is a need to identify the variables which are important in order to evaluate the late delivery risk.
Further, these variables do not have much relevance in evaluating the Late Delivery Risk and the reasons for late delivery, since they do not contain information about the location from which the product was shipped (i.e. the store from where the product was shipped).
Customer Segment also has no relevance, since the product has been ordered for a different customer in a different location and we do not have customer segment information related to the final end user of the product. Hence, product-related information except for Product Price does not possess predictive capability for the late delivery risk and was removed.
From the bivariate analysis presented in the next chapter (2.2 Univariate and Bivariate analysis), the variability of certain order-related variables with respect to the Late Delivery Risk was low; these were therefore also removed.
Hence, the following 28 variables were removed for further analysis from the given data set.
Category ID, Category Name, Customer City, Customer Country, Customer Email, Customer Fname,
Customer ID, Customer Lname, Customer Password, Customer Segment, Customer State, Customer
Street, Customer Zipcode, Department Name, Order Customer Id, Order date, Order id, Order Item
Cardprodid, Order Item Id, Order Zip code, Product Card ID, Product Category Id, Product Description,
Product Image, Product Name, Product status, Shipping date.
The remaining 25 variables (including the target variable) were therefore taken forward for further analysis and model building.
However, univariate analysis does not deal with cause, relationship, etc.; its major purpose is to describe the data and summarise the patterns in it. Univariate analysis was conducted for both numeric and categorical variables.
In this data study there are numerical and categorical variables. The dependent variable is a categorical variable and the independent variables are both numeric and categorical.
Variable | Univariate observation | Correlation/association observation | Recommended action
Days for shipping actual | Right skewed, no outliers | Less correlation with other independent variables | Variable can be considered for model building
Days for shipping scheduled | Right skewed, no outliers | Less correlation with other independent variables | Variable can be considered for model building
Benefits per order | Left skewed, many outliers | High correlation with order item profit ratio, order profit per order | Outlier treatment and multicollinearity treatment needed
Sales per customer | Right skewed, many outliers | High correlation with Sales, product price, order item product price | Outlier treatment and multicollinearity treatment needed
Order Item discount | Left skewed, no outliers | Less correlation with other independent variables | Variable can be considered for model building
Order item product price | Right skewed, few outliers | High correlation with Product Price | Outlier treatment and multicollinearity treatment needed
Order item profit ratio | Left skewed, many outliers | High correlation with order profit per order | Outlier treatment and multicollinearity treatment needed
Sales | Right skewed, few outliers | High correlation with Sales per customer and order item total | Outlier treatment and multicollinearity treatment needed
Order item total | Right skewed, many outliers | High correlation with product price | Outlier treatment and multicollinearity treatment needed
Order profit per order | Left skewed, many outliers | High correlation with order item profit ratio | Outlier treatment and multicollinearity treatment needed
Product Price | Right skewed, few outliers | High correlation with order item product price | Outlier treatment and multicollinearity treatment needed
Type | Debit 38% (highest), Cash 11% (lowest) | Correlated to dependent variable | Lesser cash payments; considered for model building
Delivery status | Late Delivered 55% | Associated with late delivery risk | Not considered for model building
Late Delivery risk | Risk 55% | Is the dependent variable | Risk is high, mitigation needed; is the dependent variable
Product status | Availability 100% | Lesser influence on the dependent variable | Better inventory; product related, no influence on dependent variable
Order status | 56% of orders are Open | Correlated to dependent variable | Expect payment delays; considered for model building
Shipping mode | 60% Standard Class, 20% faster delivery | Correlated to dependent variable | Efficient supply chain needed; considered for model building
Customer City, country | - | Customer City and country are highly correlated | Not considered for model building
Order city, country, region | - | Order city, country and region are highly correlated | Not considered for model building
The box plots show there are outliers in most of the numeric variables. Since logistic regression models are sensitive to outliers, the outliers were treated by the capping technique using the median as the measure of central tendency.
Box plots before and after outlier treatment are presented in (Fig 2.1) and (Fig 2.2).
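As an illustration of the capping approach described above, the sketch below is not the project's exact code; the data frame name SCM_data and the use of 1.5 x IQR box-plot fences are assumptions.
cap_with_median <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * IQR(x, na.rm = TRUE)
  x[x < (q[1] - fence) | x > (q[2] + fence)] <- median(x, na.rm = TRUE)  # replace outliers with the median
  x
}
num_cols <- sapply(SCM_data, is.numeric)                       # numeric columns only
SCM_data[num_cols] <- lapply(SCM_data[num_cols], cap_with_median)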
Disadvantages of Multicollinearity: -
For regression, multicollinearity is a problem because
a. If two independent variables contain essentially the same information to a large extent, one may become insignificant (or the other may appear significant)
b. It produces unstable estimates, as it tends to increase the variances of the regression coefficients
Advantages of Multicollinearity: -
For PCA (Principal Component Analysis) and FA (Factor Analysis), multicollinearity is an advantage as it helps to reduce the dimension of the variables, since the variables are correlated.
# Numerical Vs Numerical#
The following multivariate analyses were performed for this data set and are represented in (Fig 2.3) & (Fig 2.4):
a. Correlation Study
b. Multicollinearity Checks
Inference: -
The correlation study using the correlation plot shows the presence of correlated independent variables.
- Benefit per order, order item profit ratio and Order Profit per order are highly correlated.
- Sales is highly correlated with Sales per customer and Order Item total.
- Order item product price is highly correlated with Product price.
- Sales per customer and Sales are highly correlated.
There are correlated predictor/independent variables in this data set, which leads to multicollinearity; this may impact the accuracy of the prediction and the reliability of the coefficients used to identify variable importance.
The suggested remedial measure was to treat multicollinearity; the methods of treatment are: -
- Remove some of the highly correlated variables using VIF
- Standardise the values by subtracting the means
- Perform PCA (Principal Component Analysis) / FA (Factor Analysis) to reduce the dimension of the correlated independent variables
For this data set the clustering technique PCA/FA was performed to reduce the dimension of the correlated independent variables, which is covered in the next section, Data Preparation.
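A minimal sketch of the correlation check described above (assuming the corrplot package and a data frame of the numeric predictors named SCM_num; not the project's exact code):
library(corrplot)
cor_mat <- cor(SCM_num, use = "pairwise.complete.obs")   # pairwise correlation matrix
corrplot(cor_mat, method = "circle")                     # correlation plot by indicators (as in Fig 2.3)
corrplot(cor_mat, method = "number")                     # correlation plot by numbers (as in Fig 2.4)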
data: tab2
X-squared = 29060235, df = 586148, p-value < 2.2e-16
P is low, so one of the variables could be dropped.
Order Country Vs Order Region
> chisq.test(tab3)
data: tab3
X-squared = 3429861, df = 3586, p-value < 2.2e-16
P is low, so one of the variables could be dropped.
Order Country Vs Order State
> chisq.test(tab4)
data: tab4
X-squared = 28301214, df = 177344, p-value < 2.2e-16
P is low, so one of the variables could be dropped.
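For reference, a contingency-table chi-squared test of this kind can be run as sketched below (the column names are assumptions; a very small p-value indicates strong association, so one of the pair can be dropped):
tab3 <- table(SCM_data$Order.Country, SCM_data$Order.Region)  # contingency table of the two categoric variables
chisq.test(tab3)                                              # returns X-squared, df and p-value as reported above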
Feature scaling is one of the most critical steps during the pre-processing of data before creating a machine learning model. Scaling can make the difference between a weak machine learning model and a better one.
The most common techniques of feature scaling are normalization and standardization. Normalization is used when we want to bound our values between two numbers, typically [0,1] or [-1,1], while standardization transforms the data to have zero mean and a variance of 1, making the data unitless.
A machine learning algorithm just sees numbers. If there is a vast difference in ranges, say a few features ranging in the thousands and a few ranging in the tens, it makes the underlying assumption that the higher-ranging numbers have superiority of some sort, so these larger numbers start playing a more decisive role while training the model. The algorithm works on numbers and does not know what those numbers represent: a weight of 10 grams and a price of 10 dollars represent two completely different things, which is a no-brainer for humans, but the model treats both features as the same.
1. K-Nearest Neighbour (KNN) with a Euclidean distance measure is sensitive to magnitudes, and hence all features should be scaled to weigh in equally.
2. Scaling is critical while performing Principal Component Analysis (PCA). PCA tries to get the features with maximum variance, and the variance is high for high-magnitude features, which skews the PCA towards them.
3. Scaling helps to speed up gradient descent, because θ descends quickly on small ranges and slowly on large ranges, and oscillates inefficiently down to the optimum when the variables are very uneven.
Algorithms that do not require normalization/scaling are the ones that rely on rules. They are not affected by any monotonic transformation of the variables, and scaling is a monotonic transformation. Examples of algorithms in this category are all the tree-based algorithms: CART, Random Forests and Gradient Boosted Decision Trees. These algorithms use rules (series of inequalities) and do not require normalization.
Scaling was performed on the numerical data subset, and the output of scaling is reflected below (Table 2.3).
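A minimal sketch of the scaling step (SCM_num is an assumed name for the numeric subset; standardisation via base R scale(), with min-max normalisation shown for comparison):
SCM_num_scaled <- as.data.frame(scale(SCM_num))                # zero mean, unit variance
minmax <- function(x) (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
SCM_num_norm <- as.data.frame(lapply(SCM_num, minmax))         # values bounded to [0, 1]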
Data Balancing: -
What are balanced and imbalanced datasets?
Balanced dataset: As a simple example, suppose a dataset has positive and negative values. If the number of positive values is roughly equal to the number of negative values, we can say the dataset is balanced.
Imbalanced dataset: In the same example, if there is a very high difference between the number of positive and negative values, then the dataset is an imbalanced dataset.
In the Data Co. data set the distribution of the target/dependent variable is: 0's (no risk of late delivery) - 45.16%; 1's (late delivery risk) - 54.84%. Hence this is a balanced dataset.
It is noteworthy that this is the baseline, i.e. without any model/algorithm DataCo. knows from the existing data that 54.84% of orders carry a late delivery risk.
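The balance check itself is a one-liner in R (object names assumed):
prop.table(table(SCM_data$Late_delivery_risk))   # approx. 0: 45.16%, 1: 54.84% -> balanced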
In the above figure (Fig 2.5) the eigenvalues output from PCA-FA is plotted; this is called the scree plot.
Factor analysis using the FA method yields the results below, which are unrotated, i.e. the factors are orthogonal to each other, as given in (Table 2.4).
Interpretation: -
The first 6 factors explain 79% of the variance, i.e. we can reduce the dimension from 15 to 6 while losing 21% of the variance. Factor 1 accounts for 33%, Factor 2 accounts for 17%, Factor 3 for 12%, Factor 4 for 11%, and Factors 5 and 6 both account for 9% of the variance.
Further, the FA can be studied visually through the FA diagram represented below in (Fig 2.6), and the respective labels of the factors are presented in (Table 2.5).
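A sketch of how such an unrotated factor solution can be produced (assuming the psych package and the scaled numeric subset SCM_num_scaled; the factor count of 6 follows the scree-plot interpretation above):
library(psych)
ev <- eigen(cor(SCM_num_scaled))$values
plot(ev, type = "b", xlab = "Component", ylab = "Eigen value")   # scree plot (as in Fig 2.5)
fa_fit <- fa(SCM_num_scaled, nfactors = 6, rotate = "none")      # unrotated factor solution (as in Table 2.4)
fa.diagram(fa_fit)                                               # FA diagram (as in Fig 2.6)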
Factor | Variables | Label | Interpretation
MR1 | Sales per customer, Order Item total, Sales, Product Price, Order Item Product Price (5 variables) | Revenue | These are related to sales generated, hence labelled as Revenue
MR2 | Order item profit ratio, Benefit per order, Order Profit per order (3 variables) | Profit | These are related to profits generated, hence labelled as Profit
MR3 | Order Item Quantity, Order Item discount (2 variables) | Quantity | These are related to item quantity, hence labelled as Quantity
MR4 | Order Item Discount Rate (1 variable) | Discount | These are related to discounts provided, hence labelled as Discount
MR5 | Latitude and Longitude (2 variables) | Location | Geospatial variables, hence labelled as Location
MR6 | Days for shipment scheduled, Days for shipment real (2 variables) | Schedule | Both variables are days of shipment, hence labelled as Schedule
The deciling method was used with the features (or factors) identified in the previous step to assess the distribution of each feature; all factors were found to have a good distribution across 10 deciles and hence were considered as variables for model building. The output of the deciles is shown below.
MR1- Revenue: -
There are 10 deciles, hence the variable shows a significant distribution and is a good predictor.
MR2- Profit: -
There are 10 deciles, hence the variable shows a significant distribution and is a good predictor.
MR3- Quantity: -
There are 10 deciles, hence the variable shows a significant distribution and is a good predictor.
MR4- Discount: -
There are 10 deciles, hence the variable shows a significant distribution and is a good predictor.
MR5- Location: -
MR6- Schedule: -
There are 10 deciles, hence the variable shows a significant distribution and is a good predictor.
Categorical Features: -
Referring to the bivariate analysis between the categorical independent features and the dependent categorical feature, differences were observed. However, some of the categorical features are correlated with each other; hence only uncorrelated categorical variables were selected for model building.
The selected features/variables, along with the dependent variable (Late delivery risk), were split into Train and Test data in a 70/30 ratio, as sketched below.
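A sketch of the 70/30 split (assuming the caTools package and a prepared data frame named SCM_features; the seed value is arbitrary):
library(caTools)
set.seed(123)                                                             # arbitrary seed for reproducibility
split <- sample.split(SCM_features$Late_delivery_risk, SplitRatio = 0.70)
SCM_train <- subset(SCM_features, split == TRUE)                          # 70% for model building
SCM_test  <- subset(SCM_features, split == FALSE)                         # 30% held out for evaluation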
The predictive models built for this case study use Logistic Regression, Naive Bayes and KNN as predictive modelling techniques. Ensemble methods like Bagging and Boosting were also used to create models. After model development and interpretation of the model outputs, necessary modifications such as parameter tuning were made to find the optimal model outputs.
The outputs/results of all the models were evaluated using model performance validation techniques like the Confusion Matrix, ROC, AUC and GINI index (wherever applicable), and the scores were compared to arrive at the best-performing model for predicting Late Delivery Risk.
In statistical machine learning techniques there is the problem of data overfitting: overfitting is a condition where a statistical model begins to describe the random error in the data rather than the relationships between variables. This problem occurs when the model is too complex. The problem of overfitting can be avoided by splitting the data into Training and Test data.
To explain the overfitting bit further, for example- Let us consider that you want to teach your dog a
few tricks - sit, stay, roll over. You can achieve this by giving the command and showing your dog what
the dog needs to do when you say this command i.e. training data. If you provide your dog with
enough clear instructions on what he is supposed to learn, your dog might reach a point where he
obeys your command almost every time, i.e. high training accuracy. You may then brag at a dog show that your dog can perform a lot of tricks. However, will your dog do the correct thing at the show when you give the command, i.e. on testing data? If your dog rolls over when the instruction at the show is to sit, it might mean that your dog is only good at performing a trick when you (i.e. the training data) give the command: low testing accuracy. This is an example of overfitting.
The reasons for why your dog only responds in the correct manner when you give the command can
vary, but it comes down to your training data.
If the training accuracy is high but the testing accuracy is low, the model cannot be advertised as a good model. Testing data allows you to test your model on data that is independent of your training data. If the model is actually a good model, i.e. it performs the correct command in this case, it should perform just as well on the testing data as on the training data.
This report covers the model building and evaluation performed with the Train and Test data produced in Section 2, in the below sequence:-
3.1 Applying Logistic Regression, Model Tuning, Model Evaluation & Interpret results
3.2 Applying Naive Bayes, Model Tuning, Model Evaluation & Interpret results
3.3 Applying KNN – K Nearest Neighbour Model, Model Tuning, Model Evaluation & Interpret results
3.4 Applying CART, Model Tuning, Model Evaluation & Interpret results
3.5 Applying Random Forest, Model Tuning, Model Evaluation & Interpret results
3.6 Applying Bagging method, Model Tuning, Model Evaluation & Interpret results
3.7 Applying Boosting method Model Tuning, Model Evaluation & Interpret results
3.8 Model Validation to find which above model performed the best
Logistic Regression is a statistical model that in its basic form uses a logistic function (in statistics, the logistic model (or logit model) is used to model the probability of a certain class or event, such as pass/fail or win/lose, for a binary dependent variable). In regression analysis, logistic regression estimates the parameters of a logistic model (a form of binary regression). Mathematically, a binary logistic model has a dependent variable with two possible values, e.g. pass/fail, where the two values are labelled as "0" and "1".
In the logistic model, the log odds for the value labelled 1 is a linear combination of one (or) more
independent variables or predictors. The independent variables can be binary (or) continuous
variables. The corresponding probability of the value labelled "1" can vary between 0 (certainly the
value "0") and 1 (certainly the value "1"), hence the labelling; the function that converts log-odds to
probability is the logistic function, hence the name.
The unit of measurement for the log-odds scale is called a logit, from logistic unit, hence the
alternative names.
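In symbols (notation added here for clarity, not taken from the original report), the model can be written as:
$$\ln\!\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k, \qquad p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)}}$$
where p is the probability that the dependent variable takes the value "1" (here, a late delivery).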
The Algorithm: -
Logistic Regression is part of a larger class of algorithms called the Generalized Linear Model (glm). It is a classification algorithm used to predict binary outputs. One of the reasons for using Logistic Regression is to obtain the probabilities of occurrence, meaning 0 < p < 1. The probability does not vary linearly with the predictors.
Logistic Regression with given data set:
In the data preparation step from the previous section (Section 2) we split the data into Train and Test samples, and the proportion of the target variable was identified to be balanced.
Logistic Regression was applied to the training data to build the model, and the model prepared with the Train data was applied to the Test data to derive the predictions.
a. Forward selection, which involves starting with no variables in the model, testing the addition of
each variable using a chosen model fit criterion, adding the variable (if any) whose inclusion gives the
most statistically significant improvement of the fit, and repeating this process until none improves the
model to a statistically significant extent.
b. Backward elimination, which involves starting with all candidate variables, testing the deletion of each variable using a chosen model fit criterion, deleting the variable (if any) whose removal gives the most statistically insignificant deterioration of the model fit, and repeating this process until no further variables can be deleted without a statistically significant loss of fit.
c. Bidirectional elimination, a combination of the above, testing at each step for variables to be
included or excluded.
In this dataset approach c i.e. Bidirectional approach was followed to construct the logistic
regression model.
There are certain key assumptions that Logistic regression as a model carries, which were to be
considered for the model building i.e. Logistic regression does not make many of the key assumptions
of linear regression and general linear models that are based on ordinary least squares algorithms –
particularly regarding linearity, normality, homoscedasticity, and measurement level as defined below.
1) Logistic regression does not require a linear relationship between the dependent and
independent variables.
2) The error terms (residuals) do not need to be normally distributed.
3) Homoscedasticity is not required.
4) The dependent variable in logistic regression is not measured on an interval or ratio scale.
We built various models with the Train data set using the Bidirectional approach, which is detailed
below:
Call:
glm(formula = Late_delivery_risk ~ TypeCASH + TypeDEBIT + TypePAYMENT +
TypeTRANSFER, family = "binomial", data = SCM_train)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.310 -1.302 1.051 1.058 1.203
Inference: -
Only 3 significant variables were identified that can be considered for the final model; the variable TypeTRANSFER1 was correlated with the others, hence it can be ignored.
Call:
glm(formula = Late_delivery_risk ~ MarketAfrica + MarketEurope +
MarketLATAM + MarketPacific.Asia + MarketUSCA, family = "binomial",
data = SCM_train)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.270 -1.257 1.088 1.100 1.102
# LR Model 4- Check the Predictor Shipping Mode influence on the dependent variable: -
The model was built using the character predictor Shipping Mode, which was converted to dummy variables before constructing the logistic regression; the output of this model is shown below.
LRmodel4 <- glm(Late_delivery_risk ~ Shipping.ModeFirst.Class +
Shipping.ModeSame.Day +Shipping.ModeSecond.Class
+Shipping.ModeStandard.Class, data = SCM_train , family= "binomial")
> summary(LRmodel4)
Call:
glm(formula = Late_delivery_risk ~ Shipping.ModeFirst.Class +
Shipping.ModeSame.Day + Shipping.ModeSecond.Class +
Shipping.ModeStandard.Class,
family = "binomial", data = SCM_train)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.4734 -0.9796 0.3101 1.2429 1.3890
Inference: -
Only 3 significant variables were identified that can be considered for the final model; the variable Shipping.ModeStandard.Class was correlated with the others, hence it can be ignored.
# LR Model 5- Check the Predictor Order status influence on the dependent variable: -
The model was built using the character predictor Order Status, which was converted to dummy variables before constructing the logistic regression; the output of this model is shown below.
> summary(LRmodel5)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 5.618e+09 7.215e+09 0.779 0.436
Order.StatusCANCELED1 -5.618e+09 7.215e+09 -0.779 0.436
Order.StatusCLOSED1 -5.618e+09 7.215e+09 -0.779 0.436
Order.StatusCOMPLETE1 -5.618e+09 7.215e+09 -0.779 0.436
Order.StatusON_HOLD1 -5.618e+09 7.215e+09 -0.779 0.436
Order.StatusPAYMENT_REVIEW1 -5.618e+09 7.215e+09 -0.779 0.436
Order.StatusPENDING1 -5.618e+09 7.215e+09 -0.779 0.436
Order.StatusPENDING_PAYMENT1 -5.618e+09 7.215e+09 -0.779 0.436
Order.StatusPROCESSING1 -5.618e+09 7.215e+09 -0.779 0.436
Order.StatusSUSPECTED_FRAUD1 -5.618e+09 7.215e+09 -0.779 0.436
# LR Model: -
This model was constructed with the significant variables/predictors identified in the previous steps.
#LR Model1 with Selected Predictors:-
Call:
glm(formula = Late_delivery_risk ~ Revenue + Profit + Quantity +
Discount + Location + Schedule + TypeCASH + TypeDEBIT + TypePAYMENT +
Shipping.ModeFirst.Class + Shipping.ModeSame.Day +
Shipping.ModeSecond.Class,
family = "binomial", data = SCM_train)
Deviance Residuals:
Min 1Q Median 3Q Max
-5.3030 -0.0467 0.0049 0.1949 1.2090
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -12.328521 0.092342 -133.509 < 2e-16 ***
Revenue 0.036371 0.013782 2.639 0.00831 **
Profit -0.105957 0.013750 -7.706 1.3e-14 ***
Quantity -0.007623 0.014386 -0.530 0.59618
Discount 0.118884 0.014440 8.233 < 2e-16 ***
Location 5.808677 0.044515 130.488 < 2e-16 ***
Schedule 14.794056 0.106225 139.271 < 2e-16 ***
TypeCASH1 1.952686 0.053587 36.440 < 2e-16 ***
TypeDEBIT1 1.962636 0.036125 54.329 < 2e-16 ***
TypePAYMENT1 1.944444 0.041642 46.694 < 2e-16 ***
Shipping.ModeFirst.Class1 31.867099 0.213845 149.020 < 2e-16 ***
Shipping.ModeSame.Day1 40.673871 0.295945 137.437 < 2e-16 ***
Shipping.ModeSecond.Class1 19.857350 0.146058 135.955 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> summary(LRmodel_Draft)
Call:
glm(formula = Late_delivery_risk ~ Revenue + Profit + Discount +
Location + Schedule + TypeCASH + TypeDEBIT + TypePAYMENT +
Shipping.ModeFirst.Class + Shipping.ModeSame.Day +
Shipping.ModeSecond.Class,
family = "binomial", data = SCM_train)
Deviance Residuals:
Min 1Q Median 3Q Max
-5.3041 -0.0467 0.0049 0.1950 1.2027
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -12.32808 0.09233 -133.515 < 2e-16 ***
Revenue 0.03648 0.01378 2.648 0.0081 **
Profit -0.10596 0.01375 -7.705 1.31e-14 ***
Discount 0.11932 0.01442 8.274 < 2e-16 ***
Location 5.80852 0.04451 130.491 < 2e-16 ***
Schedule 14.79351 0.10622 139.278 < 2e-16 ***
TypeCASH1 1.95264 0.05359 36.439 < 2e-16 ***
TypeDEBIT1 1.96259 0.03612 54.328 < 2e-16 ***
TypePAYMENT1 1.94437 0.04164 46.693 < 2e-16 ***
Shipping.ModeFirst.Class1 31.86607 0.21383 149.027 < 2e-16 ***
Shipping.ModeSame.Day1 40.67226 0.29592 137.445 < 2e-16 ***
Shipping.ModeSecond.Class1 19.85662 0.14605 135.962 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
This shows the presence of multicollinearity between Schedule and Shipping Mode. Hence Schedule was removed and another LR model was built, which was the final model.
# LR FINAL1 Model: -
This is the final model, constructed with the significant and important variables/predictors, not correlated with each other, that were identified in the previous steps; a VIF test was then run again to check for the presence of multicollinearity.
Summary output of the FINAL1 Model.
> summary(LRmodel_FINAL1)
Call:
glm(formula = Late_delivery_risk ~ Revenue + Profit + Discount +
Location + TypeCASH + TypeDEBIT + TypePAYMENT +
Shipping.ModeFirst.Class +
Shipping.ModeSame.Day + Shipping.ModeSecond.Class, family = "binomial",
data = SCM_train)
Deviance Residuals:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.907046 0.013453 -67.423 <2e-16 ***
Revenue -0.009930 0.006464 -1.536 0.1245
Profit -0.010064 0.006466 -1.556 0.1196
Discount 0.013869 0.006735 2.059 0.0395 *
Location 0.389395 0.008287 46.991 <2e-16 ***
TypeCASH1 0.450136 0.023153 19.442 <2e-16 ***
TypeDEBIT1 0.455922 0.016196 28.151 <2e-16 ***
TypePAYMENT1 0.472174 0.018228 25.904 <2e-16 ***
Shipping.ModeFirst.Class1 3.802061 0.035566 106.902 <2e-16 ***
Shipping.ModeSame.Day1 0.695076 0.026930 25.810 <2e-16 ***
Shipping.ModeSecond.Class1 1.819376 0.017467 104.160 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Similarly, the other variable coefficients can be interpreted. Below the table of coefficients are fit indices, including the null and residual deviance and the AIC at the bottom of the summary.
VIF Test
A VIF (Variance Inflation Factor) test to check for the presence of multicollinearity for LR Model 2 yielded the result below.
VIF is less than 2, hence the correlation amongst the independent variables is low.
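The VIF check itself can be reproduced as sketched below (assuming the car package; the model object name is taken from the summary shown above):
library(car)
vif(LRmodel_FINAL1)   # values below 2 indicate low multicollinearity among the predictors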
Profiling: -
We can use the confint function to obtain confidence intervals for the coefficient estimates. Note that
for logistic models, confidence intervals are based on the profiled log-likelihood function
2.5 % 97.5 %
(Intercept) -0.9334482715 -0.880712656
Revenue -0.0225995839 0.002737279
Profit -0.0227368997 0.002609933
Discount 0.0006689307 0.027071384
Location 0.3731634302 0.405646370
TypeCASH1 0.4047626039 0.495523418
TypeDEBIT1 0.4241919634 0.487678459
TypePAYMENT1 0.4364595470 0.507910977
Shipping.ModeFirst.Class1 3.7329512689 3.872396435
Shipping.ModeSame.Day1 0.6422843537 0.747852560
Shipping.ModeSecond.Class1 1.7852086671 1.853679721
Confusion matrix - Train data (threshold 0.5):
FALSE TRUE
0 48808 8271
1 23523 45761
Confusion matrix - Test data (threshold 0.5):
FALSE TRUE
0 20884 3579
1 10024 19669
Prediction > 0.5 | LR-Predict: 0- No Late Delivery | LR-Predict: 1- Late Delivery
Actual: 0- No Late Delivery | 48808 | 8271
Actual: 1- Late Delivery | 23523 | 45761
Accuracy- 74.84%, Sensitivity (or Recall)- 66.05%, Specificity- 85.51%, Precision- 84.69%
Call:
glm(formula = Late_delivery_risk ~ Location + TypeCASH + TypeDEBIT +
TypePAYMENT + Shipping.ModeFirst.Class + Shipping.ModeSame.Day +
Shipping.ModeSecond.Class, family = "binomial", data = SCM_train)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.5110 -0.9612 0.2909 1.0312 1.8130
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.906838 0.013452 -67.41 <2e-16 ***
Location 0.389347 0.008286 46.99 <2e-16 ***
TypeCASH1 0.449965 0.023153 19.43 <2e-16 ***
TypeDEBIT1 0.455915 0.016195 28.15 <2e-16 ***
TypePAYMENT1 0.472212 0.018227 25.91 <2e-16 ***
Shipping.ModeFirst.Class1 3.801198 0.035563 106.89 <2e-16 ***
Shipping.ModeSame.Day1 0.695046 0.026928 25.81 <2e-16 ***
Shipping.ModeSecond.Class1 1.819270 0.017466 104.16 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> CM_SCM_Train_FT
FALSE TRUE
0 49348 7731
1 22880 46404
CM_SCM_FT
FALSE TRUE
0 21117 3346
1 9740 19953
There is a 1% improvement in the results; however, no big differences were observed. The remaining option to improve the model from here is to adjust the prediction threshold, but this would lead to a trade-off between Recall and Precision. Hence, no further fine tuning was done.
Model Evaluation: -
The performance of the predictive model (Logistic Regression) was evaluated through the methods below: -
a. Confusion Matrix: for the class output from the model, a classification table of predicted vs actual values was drawn to understand Accuracy (the ratio of classifications that were done correctly) and Sensitivity (the proportion of total positives that were correctly identified by the model).
b. ROC/AUC Curves: with the probability outputs of the prediction, the ROC curve (Receiver Operating Characteristic) was drawn.
Confusion matrix and Interpretation
When the trained model was applied to the test data with a threshold of 0.5, below were the results.
The final performance results of the Logistic Regression model are presented below: -
Interpretation: -
The Logistic Regression model has given an accuracy of 75.84%, with Recall of 85.64%, Precision of 68.44% and F-Measure (Harmonic Mean) of 76.08%.
----------------------------------------------------------------------------------------------------------------------------------------
Definition of evaluation parameters: -
Before we jump into interpretation of the results, it is important to understand what each measure means, as explained below.
Accuracy = Out of all cases how much did we correctly predict = (TP+TN)/(TP+TN+FP+FN)
Sensitivity/Recall = Out of all positive cases how many you are able to predict correctly i.e. how
good the test is detecting positive cases= TP / (TP +FN)
Specificity = Out of all negative cases predicted, how many are predicted correctly i.e. how good
the test is avoiding false alarms = TN / (TN+FP)
Precision = How many of the positively classified were relevant = TP/(TP+FP)
F Measure = Measure of Precision and Recall at same time = Harmonic Mean =
2*Recall*Precision / (Recall+Precision)
----------------------------------------------------------------------------------------------------------------------------------------
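As a sketch, the same measures can be computed directly from a 2x2 confusion table in R (rows = actual class, columns = predicted class; the object pred_class holding the predicted labels is an assumed name):
cm <- table(actual = SCM_test$Late_delivery_risk, predicted = pred_class)
TN <- cm[1, 1]; FP <- cm[1, 2]; FN <- cm[2, 1]; TP <- cm[2, 2]
accuracy    <- (TP + TN) / sum(cm)
sensitivity <- TP / (TP + FN)                                    # recall
specificity <- TN / (TN + FP)
precision   <- TP / (TP + FP)
f_measure   <- 2 * sensitivity * precision / (sensitivity + precision)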
Sensitivity/Recall, i.e. the model's ability to spot late deliveries, is about 67%, and Specificity, i.e. the non-late-delivery prediction rate, is 86%. Though the sensitivity is low, the precision at which positive cases are identified is 85%.
Since the objective is to reduce late deliveries, Sensitivity/Recall of predicting True Positives (TP = outcome where the model correctly predicts the positive class) is of prime importance. This model is able to predict only 67% of them, while the false alarm rate is acceptable. The model result was found to be satisfactory. The model can be fine-tuned to improve the sensitivity by reducing the threshold, but this will impact the Accuracy and Precision; hence the advice to the business is to evaluate the business situation and adjust the threshold to improve Sensitivity (or Specificity).
ROC/AUC/KS Charts
Logistic Regression- ROC/AUC Charts - Test Data
For classification problems with probability outputs, a threshold can convert the probability outputs to classifications. The choice of threshold changes the confusion matrix; as the threshold changes, a plot of the false positive rate vs the true positive rate is traced out, called the ROC curve (Receiver Operating Characteristic). AUC (Area Under the Curve) is one of the most important evaluation metrics for checking any classification model's performance; it is also written as AUROC (Area Under the Receiver Operating Characteristics).
ROC is a probability curve and AUC represents the degree or measure of separability. The higher the AUC, the better the model is at predicting 0s as 0s and 1s as 1s.
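A sketch of how the ROC curve and AUC are obtained with the ROCR package used in this project (the probability vector pred_prob is an assumed name):
library(ROCR)
ROCRpred_LR <- prediction(pred_prob, SCM_test$Late_delivery_risk)   # predicted probabilities vs actuals
perf <- performance(ROCRpred_LR, "tpr", "fpr")                      # true positive rate vs false positive rate
plot(perf, colorize = TRUE)                                         # ROC curve (as in Fig 3.1)
as.numeric(performance(ROCRpred_LR, "auc")@y.values)                # area under the curve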
The results from the Logistic regression model was reviewed with ROC-AUC parameters and the
model evaluation presented in (Fig- 3.1)
AUC:-
> as.numeric(performance(ROCRpred_LR, "auc")@y.values)
[1] 0.7751119
Interpretation: -
When the trained model was applied to the test data, the ROC curve bends towards the upper left (true positive rate), which indicates that a good proportion of the data is expected to be predicted correctly.
The threshold ranges from 0 to 1; the curve indicates that a lower threshold may give a better operating point, but we can retain the 0.5 threshold as it yields a TPR close to 67%. The threshold could be lowered to 0.4 to improve sensitivity; on the contrary, this would impact the accuracy and specificity of the model, hence the advice to the business is to evaluate the business situation and adjust the threshold to improve Sensitivity (or Specificity). AUC is 77.5%. The model results are satisfactory.
Since there are more than 10 predictor variables and the required sample size grows exponentially with the number of predictors, a large number of samples is needed. Since this data set has a good amount of data, the model was built with the training data and predictions were made on the test data.
The Naïve Bayes algorithm was applied to the training data, and predicting on the test data yielded the following results.
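A minimal sketch of this step (assuming the e1071 package; the formula over all selected features is an assumption):
library(e1071)
nb_model <- naiveBayes(Late_delivery_risk ~ ., data = SCM_train)     # fit on the training data
nb_pred  <- predict(nb_model, newdata = SCM_test)                    # class predictions on the test data
table(actual = SCM_test$Late_delivery_risk, predicted = nb_pred)     # confusion matrix (as in Table 3.7)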
The performance of the predictive model (Naïve Bayes) was evaluated through the Confusion Matrix method: for the class output from the model, a classification table of predicted vs actual values was drawn to understand Accuracy (the ratio of classifications that were done correctly) and Sensitivity (the proportion of total positives that were correctly identified by the model).
Confusion matrix and Interpretation
Interpretation: -
----------------------------------------------------------------------------------------------------------------------------------------
Definition of evaluation parameters: -
Before we jump into interpretation of the results, it is important to understand what each measure means, as explained below.
Accuracy = Out of all cases how much did we correctly predict = (TP+TN)/(TP+TN+FP+FN)
Sensitivity/Recall = Out of all positive cases how many you are able to predict correctly i.e. how
good the test is detecting positive cases= TP / (TP +FN)
Specificity = Out of all negative cases predicted, how many are predicted correctly i.e. how good
the test is avoiding false alarms = TN / (TN+FP)
Precision = How many of the positively classified were relevant = TP/(TP+FP)
F Measure = Measure of Precision and Recall at same time = Harmonic Mean =
2*Recall*Precision / (Recall+Precision)
----------------------------------------------------------------------------------------------------------------------------------------
Sensitivity/Recall, i.e. the model's ability to spot late deliveries, is about 71%, and Specificity, i.e. the non-late-delivery prediction rate, is 81%.
Since the objective is to reduce late deliveries, Sensitivity/Recall of predicting True Positives (TP = outcome where the model correctly predicts the positive class) is of prime importance. The model is able to predict 71% of them, while the false alarm rate is acceptable. The model result is satisfactory.
KNN, also called K Nearest Neighbour, is a non-parametric, lazy learning algorithm. The purpose of KNN is to use a database in which the data points are separated into several classes to predict the classification of a new sample point.
When we say the technique is non-parametric, it means that it does not make any assumptions about the underlying data distribution. In other words, the model structure is determined from the data. If you think about it, this is pretty useful, because in the "real world" most data does not obey the typical theoretical assumptions made (as in linear regression models, for example). Therefore, KNN is probably one of the good choices for a classification study when there is little or no prior knowledge about the data distribution.
KNN requires a training set whose size increases exponentially with the number of predictors. This is because the expected distance to the nearest neighbour increases with p (with a large vector of predictors, all records end up "far away" from each other). If the training set is large, it also takes time to find all the distances. This constitutes the curse of dimensionality.
The Algorithm: -
KNN is also a lazy algorithm: it does not use the training data points to do any generalization. The KNN algorithm is based on feature similarity, i.e. how closely out-of-sample features resemble our training set determines how we classify a given data point, as represented in (Fig 3.2).
KNN is used for classification. The output is a class membership (predicting a class or a discrete value). An object is classified by a majority vote of its neighbours, with the object being assigned to the class most common among its K nearest neighbours. KNN can also be used for regression, where the output is the value for the object (predicting continuous values); this value is the average (or median) of the values of its k nearest neighbours.
KNN with the given data set: Model Building and Model Tuning
The data was split into training and testing data; the KNN model was first applied to the training data and predictions were constructed using a trial-and-error method of adjusting the K parameter (i.e. tuning). The output of KNN with various K parameters is listed below.
# Model 1: K = 19
> # Model 1
> SCM.KNN = knn (scale(SCM.train.num), scale(SCM.test.num), cl =
SCM_train[,1], k=19, prob = TRUE)#K is 19
> SCM.tabKNN = table(SCM_test$Late_delivery_risk, SCM.KNN)
> SCM.tabKNN
SCM.KNN
0 1
0 21911 2552
1 6542 23151
# Model 2: K = 9
> # Model 2
> SCM.KNN2 = knn (scale(SCM.train.num), scale(SCM.test.num), cl =
SCM_train[,1], k=9, prob = TRUE)#K is 9
> SCM.tabKNN2 = table(SCM_test$Late_delivery_risk, SCM.KNN2)
> SCM.tabKNN2
SCM.KNN2
0 1
0 21931 2532
1 5902 23791
# Model 3: K = 29
> # Model 3
> SCM.KNN3 = knn (scale(SCM.train.num), scale(SCM.test.num), cl =
SCM_train[,1], k=29, prob = TRUE)#K is 29
> SCM.tabKNN3 = table(SCM_test$Late_delivery_risk, SCM.KNN3)
> SCM.tabKNN3
SCM.KNN3
0 1
0 21738 2725
1 6758 22935
KNN-Predict (K = 29): rows = actual, columns = predicted

                              Predicted: 0- No Late Delivery   Predicted: 1- Late Delivery
Actual 0- No Late Delivery                21738                           2725
Actual 1- Late Delivery                    6758                          22935

Accuracy- 82.49%, Sensitivity (or) Recall- 77.24%, Specificity- 88.86%, Precision- 89.38%, F Measure- 82.64%
Model Tuning: -
Increasing the K to 29 found to reduce the accuracy, sensitivity and precision, on the contrary
decreasing the K to 9 produced better results than K=19. Model 2 found to be improving the
accuracy, Sensitivity
Model Prediction (Train model prediction on Test Data)
With the above KNN model built on the training data, we performed model prediction on the test data, i.e. if we randomly pick an element, what would its classification be with respect to late delivery risk.
Various K parameters were tried, and it was concluded that reducing K may improve the sensitivity; K could be reduced yet further, but this would start to include noise. Hence, the recommendation is to conclude Model 2, with K = 9, as the optimal model.
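A minimal sketch of how such a K sweep could be automated (assuming the same scaled numeric train/test objects used above; the loop itself is not part of the original code):

library(class)
# Sketch: try several K values and report test-set sensitivity for each
for (k in c(5, 9, 19, 29)) {
  pred <- knn(scale(SCM.train.num), scale(SCM.test.num),
              cl = SCM_train[, 1], k = k)
  cm   <- table(SCM_test$Late_delivery_risk, pred)   # rows = actual, cols = predicted
  sens <- cm[2, 2] / (cm[2, 2] + cm[2, 1])           # TP / (TP + FN)
  cat("K =", k, "-> Sensitivity =", round(sens, 4), "\n")
}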
Model Evaluation: -
The performance of the KNN predictive model was evaluated using a Confusion Matrix: for the class output from the model, the classification of predicted vs actual values was drawn up to understand the Accuracy (the ratio of classifications that were done correctly) and the Sensitivity (the proportion of total positives that were correctly identified by the model).
Confusion matrix and Interpretation
Interpretation: -
----------------------------------------------------------------------------------------------------------------------------------------
Accuracy = Out of all cases how much did we correctly predict = (TP+TN)/(TP+TN+FP+FN)
Sensitivity/Recall = Out of all positive cases how many you are able to predict correctly i.e. how
good the test is detecting positive cases= TP / (TP +FN)
Specificity = Out of all actual negative cases, how many are predicted correctly i.e. how good
the test is at avoiding false alarms = TN / (TN+FP)
Precision = How many of the positively classified were relevant = TP/(TP+FP)
F Measure = Measure of Precision and Recall at same time = Harmonic Mean =
2*Recall*Precision / (Recall+Precision)
----------------------------------------------------------------------------------------------------------------------------------------
Sensitivity/Recall shows the model is able to spot late deliveries up to 80%, while Specificity, which reflects non-late-delivery prediction, is 89%.
Since the objective is to reduce late deliveries, Sensitivity/Recall of predicting True Positives (TP = outcome where the model correctly predicts the positive class) is of prime importance. The model is able to predict 80% of these, and the false-alarm rate is acceptable. The model result is satisfactory. It is noteworthy that KNN works well for continuous variables.
CART, an abbreviation of Classification and Regression Trees, is a supervised (supervised means the target to be achieved is known) machine learning technique used to build prediction models. These are decision trees that segment the data space into smaller regions, forming a tree in which each end node carries a decision – either a classification or a regression value.
The Algorithm:
The algorithm constructs decision trees top-down, choosing at each step the variable that best splits the set of items in the data. The success of a split is measured by how similar the data inside the resulting nodes is; the larger the impurity, the lesser the accuracy of the prediction.
The data was split into Training and Testing sets; the CART model was first applied to the training data, and the CART tree was constructed on a trial-and-error basis before pruning. The output of the CART tree is displayed below (Fig 3.3):
The tree is complex since there are many predictors; hence it could not be visualised well. The CP (Cost Complexity) table for the above tree is shown below.
Classification tree:
rpart(formula = SCM_train$Late_delivery_risk ~ Revenue + Profit +
Discount + Location + Schedule + TypeCASH + TypeDEBIT + TypePAYMENT +
Shipping.ModeFirst.Class + Shipping.ModeSame.Day +
Shipping.ModeSecond.Class,
data = SCM_train, method = "class", control = r.ctrl)
n= 126363
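The control object r.ctrl referenced in the call above is not reproduced in this extract. A plausible definition (an assumption, not necessarily the project's actual settings) that allows a deep initial tree before pruning would be:

library(rpart)
# Assumed control settings: grow a deep tree (cp = 0) with 10-fold cross validation
r.ctrl <- rpart.control(minsplit = 100, minbucket = 10, cp = 0, xval = 10)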
Interpretation of the CART model output including pruning, plot of the pruned tree
The Root Node had a total of 126363 observations, of which 57079 did not have late delivery risk. The error rate at the Root Node is therefore 45%, or in other words the impurity factor is 45%. The objective of CART splitting is to increase purity in the nodes, i.e. reduce the error rate with each split.
The algorithm uses a technique called K-Fold Cross Validation, which is a resampling procedure. The Cost Complexity parameter (CP) determines up to what level the tree should be cut. The CP table tells us the root node CP is high, the number of splits is 0, the relative error and cross-validation error are 1 each, and the standard deviation amongst the cross-validated groups is 0.00309933.
As the tree builds, the relative error decreases; these are in-sample errors. The cross-validation error and its standard deviation also decrease as the tree is cut to 3, 4, 6 splits and so on. In a CART model there is an inflexion point beyond which cutting the tree further is sub-optimal. In this case around 55 nodes looks optimal, and the tree is complex because of the higher number of splits involved.
The tree looks complex, and pruning of the tree may not be required, as the CP is already of the order of 1e-05 by the 45th split.
Model Tuning: -
However, we tried a CP value of 0.0000011680 to see what the pruned tree looks like (Fig 3.5).
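A sketch of how this pruning step would look in R (CART.model and ptree are illustrative object names; only the CP value comes from the report):

library(rpart)
# Prune the full tree back at the chosen complexity parameter
ptree <- prune(CART.model, cp = 0.0000011680)
printcp(ptree)        # cost-complexity table of the pruned tree
# plotcp(ptree)       # visual check of cross-validated error vs CP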
The pruned tree is complex as well, since there are many predictors; hence it could not be visualised well. The CP (Cost Complexity) table for the above pruned tree is shown below.
Classification tree:
rpart(formula = SCM_train$Late_delivery_risk ~ Revenue + Profit +
Discount + Location + Schedule + TypeCASH + TypeDEBIT + TypePAYMENT +
Shipping.ModeFirst.Class + Shipping.ModeSame.Day +
Shipping.ModeSecond.Class,
data = SCM_train, method = "class", control = r.ctrl)
n= 126363
Train Data: -
      0      1
  0  54062   3017
  1    193  69091

Test Data: -
      0      1
  0  23150   1313
  1    124  29569

> nrow(SCM_test)
[1] 54156
Interpretation: -
----------------------------------------------------------------------------------------------------------------------------------------
Definition of evaluation parameters: -
Before we jump into the interpretation of the results, it is important to understand what each measure means, as explained below.
Accuracy = Out of all cases how much did we correctly predict = (TP+TN)/(TP+TN+FP+FN)
Sensitivity/Recall = Out of all positive cases how many you are able to predict correctly i.e. how
good the test is detecting positive cases= TP / (TP +FN)
Specificity = Out of all actual negative cases, how many are predicted correctly i.e. how good
the test is at avoiding false alarms = TN / (TN+FP)
Precision = How many of the positively classified were relevant = TP/(TP+FP)
F Measure = Measure of Precision and Recall at same time = Harmonic Mean =
2*Recall*Precision / (Recall+Precision)
----------------------------------------------------------------------------------------------------------------------------------------
In the test data, 29569 actual late deliveries were correctly predicted as Late delivery and 23150 observations were correctly predicted as No late delivery. The wrong predictions were 1313 observations predicted as late delivery that were not actually late, and 124 observations predicted as no late delivery that were actually late.
Since the objective is to reduce late deliveries, Sensitivity/Recall of predicting True Positives (TP = outcome where the model correctly predicts the positive class) is of prime importance, which the model has produced at 99.58%.
Test data has performed closer to the train data; hence the conclusion is the CART model is
robust.
ROC- AUC- KS Evaluation: -
For classification problems with probability outputs, a threshold converts the probability outputs to classifications. The choice of threshold changes the confusion matrix, and as the threshold changes, a plot of the true positive rate against the false positive rate is called the ROC curve (Receiver Operating Characteristic). AUC (Area Under the Curve) is one of the most important evaluation metrics for checking any classification model's performance. It is also written as AUROC (Area Under the Receiver Operating Characteristic).
ROC is a probability curve and AUC represents degree or measure of separability. Higher the AUC,
better the model is at predicting 0s as 0s and 1s as 1s.
The results from the CART model were reviewed with ROC-AUC parameters, and the model evaluation is presented in (Fig-3.6).
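A sketch of how these ROC, AUC and KS values could be obtained in R (assuming predicted probabilities from the CART model on the test set; the ROCR package and the object pred.prob.test are assumptions, not necessarily what the project used):

library(ROCR)
# pred.prob.test <- predict(ptree, SCM_test, type = "prob")[, 2]   # illustrative
pred.obj <- prediction(pred.prob.test, SCM_test$Late_delivery_risk)
perf.roc <- performance(pred.obj, "tpr", "fpr")
plot(perf.roc)                                                     # ROC curve (Fig 3.6)
auc.CART.Test <- performance(pred.obj, "auc")@y.values[[1]]
KS.CART.Test  <- max(perf.roc@y.values[[1]] - perf.roc@x.values[[1]])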
KS & AUC:
> KS.CART.Test
[1] 0.9435323
> auc.CART.Test
[1] 0.9920192
Interpretation: -
The test-data ROC curve bends towards the top-left (high true positive rate), which indicates that a good proportion of the data is expected to be predicted correctly.
The KS and AUC values support the ROC curve with high values of 94.35% and 99.20% respectively, which indicate the CART model is robust on the test data.
GINI Coefficient: -
> gini.CART.Test
[1] 0.4454714
The Gini index is the measure used by the CART algorithm to quantify how a field's values are distributed with respect to the outcome, i.e. how much each field directly affects the resulting classification. In many definitions it is described as 'the impurity of the data', or how mixed the data is. From it we can also gauge which fields play a lesser or greater part in the decision-making process, so that we can focus on those particular variables.
CART (Classification and Regression Trees) uses the Gini index (classification) as its metric. If all the data in a node belong to a single class, the node is pure. The value always lies between 0 and 1: 0 means all the data belongs to a single class, while 1 means the data is spread across different classes.
Here the Gini coefficient is 44.54%, which indicates no strong skewness in the predictions.
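The exact code behind gini.CART.Test is not shown in this extract; one common way such a Gini coefficient is computed in R (an assumption) is from the predicted probabilities with the ineq package:

library(ineq)
# Assumed computation: Gini coefficient of the predicted probabilities on the test set
gini.CART.Test <- ineq(pred.prob.test, type = "Gini")   # pred.prob.test is illustrative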
Random Forest is also Supervised (Supervised means the Target to be achieved is known) Machine
Learning Technique to build Prediction Model. Since the decision trees are very sensitive to even
small changes in the data, usually they are unstable. Instead of one CART tree the big idea is to grow
more CART trees, which are otherwise called a Forest of CART trees, which can improve the
robustness of the prediction model built. The idea is individual trees tend to over-fit training data
(refer to the earlier section on Over-Fit); hence averaging corrects this.
The Algorithm: -
Since multiple CART trees are built, a randomness technique is used to avoid all the trees looking similar. Typically, the model picks random samples with replacement. This is also called an ensemble technique, since multiple CART models are built. For sampling, Bootstrap aggregating (also called Bagging) is used to arrive at population parameters: the sample data is randomly subset with replacement, so some observations may be repeated in each subset. The bootstrap samples not only rows but also columns (variables); e.g. with 12 variables, the bootstrap will build each model using, say, 5 variables selected at random.
The algorithm measures an error rate called OOB (Out of Bag error); e.g. if the total data is 1000 and a model is built on 700 observations, the predicted class is assigned to the remaining 300, and if 200 are classified correctly and 100 are errors, this error ratio is the OOB error. Pruning is not needed; however, tuning of the forest can be done in the algorithm to get an optimal output.
The data was split into Training and Testing sets; the RF model was first applied to the training data, and the Random Forest was constructed on a trial-and-error basis before tuning. The output of the Random Forest is displayed below:
Call:
randomForest(formula = SCM_train_RF$Late_delivery_risk ~ Revenue +
Profit + Discount + Location + Schedule + TypeCASH + TypeDEBIT +
TypePAYMENT + Shipping.ModeFirst.Class + Shipping.ModeSame.Day +
Shipping.ModeSecond.Class, data = SCM_train_RF, ntree = 101, mtry = 5,
nodesize = 100, importance = TRUE)
Type of random forest: classification
Number of trees: 101
No. of variables tried at each split: 5
Model Tuning: -
We can further tune the Random Forest using the tuning algorithm; the output of the tuning is shown below:
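The tuning call itself is not reproduced here; a plausible sketch (an assumption) using the tuneRF helper from the randomForest package would be:

library(randomForest)
# Sketch: search over mtry values, trying 81 trees per forest
x.vars <- setdiff(names(SCM_train_RF), "Late_delivery_risk")   # illustrative predictor set
tuned  <- tuneRF(x = SCM_train_RF[, x.vars],
                 y = SCM_train_RF$Late_delivery_risk,
                 mtryStart = 3, ntreeTry = 81,
                 stepFactor = 1.5, improve = 0.0001,
                 trace = TRUE, plot = TRUE)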
From the graph it can be seen that the algorithm tried different mtry combinations (3, 4, 5 and so on) and found errors increasing beyond 10. So an mtry of 10 and an ntree of 81 can give optimal results; trying those parameters produced the output below.
Call:
randomForest(formula = SCM_train_RF$Late_delivery_risk ~ Revenue +
Profit + Discount + Location + Schedule + TypeCASH + TypeDEBIT +
TypePAYMENT + Shipping.ModeFirst.Class + Shipping.ModeSame.Day +
Shipping.ModeSecond.Class, data = SCM_train_RF, ntree = 81, mtry = 10,
nodesize = 100, importance = TRUE)
Type of random forest: classification
Number of trees: 81
No. of variables tried at each split: 10
Variable Importance: -
Ranking the importance of the independent variables shows that Schedule, Location and Shipping Mode (Second Class, First Class) are the important independent variables that determine late delivery risk, as shown in (Fig 3.8).
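A sketch of how the ranking in Fig 3.8 is typically produced (SCM.RF is an illustrative name for the fitted random forest object):

library(randomForest)
importance(SCM.RF)     # MeanDecreaseAccuracy / MeanDecreaseGini per predictor
varImpPlot(SCM.RF)     # variable importance plot corresponding to Fig 3.8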
Train Data: -
      0      1
  0  53950   3129
  1    101  69183
Random Forest-Predict (train data): rows = actual, columns = predicted

                              Predicted: 0- No Late Delivery   Predicted: 1- Late Delivery
Actual 0- No Late Delivery                53950                           3129
Actual 1- Late Delivery                     101                          69183

Accuracy- 97.44%, Sensitivity (or) Recall- 99.85%, Specificity- 94.52%, Precision- 95.67%, F Measure- 97.11%
Interpretation: -
In the train data, 69183 actual late deliveries were correctly predicted as Late delivery and 53950 observations were correctly predicted as No late delivery. The wrong predictions were 3129 observations predicted as late delivery that were not actually late, and 101 observations predicted as no late delivery that were actually late.
Test Data: -
> tbl.test.rf=table(SCM_test_RF$Late_delivery_risk,
SCM_test_RF$predict.class)
> tbl.test.rf
0 1
0 23122 1341
1 36 29657
Interpretation: -
----------------------------------------------------------------------------------------------------------------------------------------
Definition of evaluation parameters: -
Before we jump into the interpretation of the results, it is important to understand what each measure means, as explained below.
Accuracy = Out of all cases how much did we correctly predict = (TP+TN)/(TP+TN+FP+FN)
Sensitivity/Recall = Out of all positive cases how many you are able to predict correctly i.e. how
good the test is detecting positive cases= TP / (TP +FN)
Specificity = Out of all actual negative cases, how many are predicted correctly i.e. how good
the test is at avoiding false alarms = TN / (TN+FP)
Precision = How many of the positively classified were relevant = TP/(TP+FP)
F Measure = Measure of Precision and Recall at same time = Harmonic Mean =
2*Recall*Precision / (Recall+Precision)
----------------------------------------------------------------------------------------------------------------------------------------
> KS.RF.Test
[1] 0.9441506
> auc.RF.Test
[1] 0.9895402
Interpretation: -
The test-data ROC curve bends towards the top-left (high true positive rate), which indicates that a good proportion of the data is expected to be predicted correctly.
The KS and AUC values support the ROC curve with high values of 94.41% and 98.95% respectively, which indicate the Random Forest model is robust on the test data.
GINI Coefficient
> gini.RF.Test
[1] 0.432492
The Gini index plays the same role here as in CART: it measures how a field's values are distributed with respect to the outcome, i.e. how much each field directly affects the resulting classification. In many definitions it is described as 'the impurity of the data', or how mixed the data is. From it we can also gauge which fields play a lesser or greater part in the decision-making process, so that we can focus on those particular variables.
Random Forest uses the Gini index (classification) as its metric. If all the data in a node belong to a single class, the node is pure. The value always lies between 0 and 1: 0 means all the data belongs to a single class, while 1 means the data is spread across different classes. Here the Gini coefficient is 43.25%, which indicates no strong skewness.
We applied the Bagging method to the train data to build a model and applied that model to the test data to predict. The output from the model is below.
+ data = SCM_train,
+ control=rpart.control(maxdepth=5, minsplit=4))
> BaggingPredict = predict(BaggingModel, newdata = SCM_test)
> tabBagging = table(SCM_test$Late_delivery_risk, BaggingPredict)
> tabBagging
BaggingPredict
0 1
0 21722 2741
1 1388 28305
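The beginning of the bagging call is truncated in the extract above. A plausible full call (an assumption — the formula and package are not visible in the report) using the ipred package, consistent with the control argument shown, would be:

library(ipred)
library(rpart)
# Assumed formula: Late_delivery_risk (as a factor) against the remaining predictors
BaggingModel   <- bagging(Late_delivery_risk ~ .,
                          data = SCM_train,
                          control = rpart.control(maxdepth = 5, minsplit = 4))
BaggingPredict <- predict(BaggingModel, newdata = SCM_test)
table(SCM_test$Late_delivery_risk, BaggingPredict)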
Model Evaluation: -
The performance of the predictive ensemble model was evaluated using a Confusion Matrix: for the class output from the model, the classification of predicted vs actual values was drawn up to understand the Accuracy (the ratio of classifications that were done correctly) and the Sensitivity (the proportion of total positives that were correctly identified by the model).
Confusion matrix and Interpretation
Interpretation: -
----------------------------------------------------------------------------------------------------------------------------------------
Definition of evaluation parameters: -
Before we jump into the interpretation of the results, it is important to understand what each measure means, as explained below.
Accuracy = Out of all cases how much did we correctly predict = (TP+TN)/(TP+TN+FP+FN)
Sensitivity/Recall = Out of all positive cases how many you are able to predict correctly i.e. how
good the test is detecting positive cases= TP / (TP +FN)
Specificity = Out of all actual negative cases, how many are predicted correctly i.e. how good
the test is at avoiding false alarms = TN / (TN+FP)
Precision = How many of the positively classified were relevant = TP/(TP+FP)
F Measure = Measure of Precision and Recall at same time = Harmonic Mean =
2*Recall*Precision / (Recall+Precision)
----------------------------------------------------------------------------------------------------------------------------------------
Sensitivity/Recall shows the model is able to spot late deliveries up to 95%, while Specificity, which reflects non-late-delivery prediction, is 91%.
Since the objective is to reduce late deliveries, Sensitivity/Recall of predicting True Positives (TP = outcome where the model correctly predicts the positive class) is of prime importance, which the model is able to predict at 95%, and the false-alarm rate is acceptable. The model result is robust.
Bias-Variance Trade-off
The bias is an error from erroneous assumptions in the learning algorithm. High bias can cause an
algorithm to miss the relevant relations between features and target outputs (under fitting).
The variance is an error from sensitivity to small fluctuations in the training set. High variance can
cause an algorithm to model the random noise in the training data, rather than the intended outputs
(overfitting).
It is to be noted that Bagging reduces the variance, but retains some of the bias
XGBoost works with matrices that contain only numeric variables, so all categorical variables have to be converted to dummies; we also need to separate the training data from its label. The boosting model was therefore built using the binary (dummy-coded) categorical variables and all numeric variables.
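A minimal sketch of this preparation and a basic XGBoost fit (parameter values and object names are assumptions, not the project's actual settings):

library(xgboost)
# Dummy-code categorical predictors into a purely numeric matrix (no intercept column)
features.train <- model.matrix(Late_delivery_risk ~ . - 1, data = SCM_train)
label.train    <- as.numeric(as.character(SCM_train$Late_delivery_risk))  # 0/1 label
# Fit a binary-logistic boosted model
xgb.model <- xgboost(data = features.train, label = label.train,
                     nrounds = 50, objective = "binary:logistic",
                     eta = 0.3, max_depth = 6, verbose = 0)
# Class predictions on similarly prepared test features at a 0.5 threshold:
# pred <- predict(xgb.model, features.test) > 0.5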
FALSE TRUE
0 23143 1320
1 17 29676
Model Evaluation: -
The performance of the predictive ensemble model was evaluated using a Confusion Matrix: for the class output from the model, the classification of predicted vs actual values was drawn up to understand the Accuracy (the ratio of classifications that were done correctly) and the Sensitivity (the proportion of total positives that were correctly identified by the model).
Confusion matrix and Interpretation
Interpretation: -
----------------------------------------------------------------------------------------------------------------------------------------
Definition of evaluation parameters: -
Before we jump into the interpretation of the results, it is important to understand what each measure means, as explained below.
Accuracy = Out of all cases how much did we correctly predict = (TP+TN)/(TP+TN+FP+FN)
Sensitivity/Recall = Out of all positive cases how many you are able to predict correctly i.e. how
good the test is detecting positive cases= TP / (TP +FN)
Specificity = Out of all actual negative cases, how many are predicted correctly i.e. how good
the test is at avoiding false alarms = TN / (TN+FP)
Precision = How many of the positively classified were relevant = TP/(TP+FP)
F Measure = Measure of Precision and Recall at same time = Harmonic Mean =
2*Recall*Precision / (Recall+Precision)
----------------------------------------------------------------------------------------------------------------------------------------
Sensitivity/Recall shows the model is able to spot late deliveries up to 99.94%, while Specificity, which reflects non-late-delivery prediction, is 94.60%.
Since the objective is to reduce late deliveries, Sensitivity/Recall of predicting True Positives (TP = outcome where the model correctly predicts the positive class) is of prime importance, which the model is able to predict at 99.94%, and the false-alarm rate is acceptable. The model result is robust, no fine tuning is needed, and the model has achieved its purpose.
Bias-variance trade-off: -
The bias is an error from erroneous assumptions in the learning algorithm. High bias can cause an
algorithm to miss the relevant relations between features and target outputs (under fitting).
The variance is an error from sensitivity to small fluctuations in the training set. High variance can
cause an algorithm to model the random noise in the training data, rather than the intended outputs
(overfitting).
Comparing all the classification models, CART, Random Forest and the Ensemble Model (Boosting) performed the best. In the above table the rankings are indicated with colour codes: Green for the best result, Amber for the second best and Yellow for the third best.
Amongst the models, the ensemble models yield the best results overall; amongst the ensemble models, XG-Boosting produced the best results in terms of Accuracy, Sensitivity, Specificity, Precision and F Measure.
In this case, we are predicting whether the delivery will be done on time or not, with the intention of identifying the reasons for late deliveries. Hence, identifying the True Positives, i.e. late delivery risk, is of utmost importance. For this purpose, Sensitivity, Precision and Accuracy play a vital role, combined with the F Measure.
Conclusion: -
Amongst the models, the Ensemble Method with XG Boosting stood out on these parameters and is hence considered the best model. Random Forest is the second best.
The scores for actual days of shipping, for both deliveries without delays and deliveries with delays, depict left skewness, with higher left skewness for delayed deliveries. This shows that delays have occurred after the product has been shipped.
The reasons for the delay (or) late delivery could not be identified with the given information. Hence, we recommend the business provide the following additional data covering A. Product Flow, B. Information Flow and C. Revenue Flow:
1. Location- Both Origin Place and Destination Place
2. Mode of Shipment- Air, Ocean, Rail, Combined
3. Transhipment involved
4. Idle time – Transhipment, Trucker sleep time, Clearance Paper work etc.
5. Expected Transit time for the mode
6. Parties involved in transportation
7. Parties Schedule reliability measures
8. Parties Communication channels- Information flow
9. Customs Clearance- Involved.
10. Turn Around Time- Customs
11. Payment TAT
CART- The trees were big in size; hence the visualisation of the tree could not be presented at its best.
Random Forest- Different mtry combinations may yield different results.
Decision Trees- CART and Random Forest produced better results compared to logistic regression (or) frequency-based algorithms.
Ensemble models- XGBoost works with matrices that contain only numeric variables; all categorical variables have to be converted to dummies.
The XG Boost learning model produced the best results and is hence considered the best model based on the parametric evaluation.
Thus, to build an accurate model which works well for the business, it is necessary to get the right data with the most meaningful features in the first instance. To overcome this data issue, we would need to communicate with the business to get enough data and then use domain understanding to get rid of the irrelevant features. This is a backward elimination process, but one which often comes in handy.
The following insights elucidated from this study and hence the recommendations to the business are:
In the given dataset, we can infer and/or predict late deliveries based on the limited information
provided on the product price, discount, profitability, sales and quantity sold, shipping timelines –
real and scheduled and location of store from where products are shipped.
It is important to get some more information regarding the Origin-Destination, transit time involved,
vendor schedule reliability, Idle time in transportation to identify the cause and tune the model for
better prediction.
There is no data available on Schedule Reliability and Vendor Performance; it is recommended that the business provide data on "Schedule Reliability" and "On-time Delivery". If no such measures are available, the introduction of KPI measures for "Staff Performance" and "Vendor Performance" is recommended to boost performance.
For products with higher discounts, there is an increased risk of delay in delivery. Due to higher
discounts, there are high volumes of product orders giving rise to difficulties in on-time deliveries
with existing logistics plans/ resources. Suggestion is to carefully plan logistics when discounts are
offered.
The lower uptake of Same Day (5%) and First Class (15%) shipping is an opportunity to improve delivery performance and charge customers a premium to improve revenue.
Other best practices from the Supply chain industry listed below only as a suggestion to review and
take advantage of the upcoming trends.
Flow of information throughout the supply chain, end to end, is of utmost importance for prompt delivery. Hence, invest in technology like IoT and Blockchain to develop platforms where all parties can be on one system and exchange information seamlessly.
Creating Transparency for real time tracking, publishing the delivery results both at transaction
and cumulated transaction, so everyone in the supply chain knows timeliness of delivery.
Feedback from the chain on what caused the delays, so improvements could be sought through crowd sourcing.
Assessing the traffic situation, embark on usage of “Drones” and “Robotic Arms” to deliver goods
much faster
Controlled inventory to keep stock of the fast-moving goods and to avoid the "Bull-Whip Effect", using prioritisation models like ABC analysis etc.
SECTION 7: BIBLIOGRAPHY
----End of Report-----
APPENDIX A
---------------------------------------------------------------------------------------------------------------------------
Appendix A covers the following chapters: -
A1 R-SOURCE CODE
A2 TABLEAU VISUALISATION SOURCE CODE
A3 UNIVARIATE ANALYSIS
A4 BIVARIATE ANALYSIS
---------------------------------------------------------------------------------------------------------------------------
A1 R-SOURCE CODE
SCM_Project_Final_Rcodes_Hariharan.KP.R
A3 UNIVARIATE ANALYSIS
Univariate analysis: -
# Nominal, Ordinal & Geo spatial Variables#
The below variables, which are nominal, ordinal or geospatial in nature, were not considered for the univariate analysis.
Customer Id, Customer Zip code, Department Id, Latitude, Longitude, Order Customer Id, Order Id,
Order Item Card Prod Id, Order Item Id, Order Zip code, Product Card Id, Product Category ID,
Masked Customer Key
# Numeric Variables#
Appendix- Fig-1
Inferences:
The minimum actual shipping days is 0, while the maximum is 6, and the actual shipping days are spread between these two values. The Mean is 3.00 and the Median is 3.498; the data is right skewed. No outliers are observed.
Appendix- Fig-2
Inferences:
The minimum scheduled shipping days is 0, while the maximum is 4, and the scheduled shipping days are spread between these two values. The Mean is 2.9 and the Median is 4.0; the data is right skewed. No outliers are observed.
3. Benefits per order
Appendix- Fig-3
Inferences:
The minimum benefit per order is -4274.98, while the maximum is 911.8. Between these two values, the data is heavily left skewed. The Mean is 21.9 and the Median is 31.5. Many outliers are observed in the data; a large number of transactions have negative benefits per order.
Appendix- Fig-4
Inferences:
The minimum sales per customer is 7.49, while the maximum is 1939.99. Between these two values,
the data is heavily right skewed. The Mean (183.1) and Median (163.9). Many outliers observed in the
data.
5. Category ID
Appendix- Fig-5
Inferences:
The minimum category id is 2, while the maximum is 76. Between these two values, the data is
distributed. The Mean (31.8) and Median (29). No outliers in data.
6. Order Item Discount
Appendix- Fig-6
Inferences:
The minimum order item discount is 0, while the maximum is 500. Between these two values, the data
is right skewed. The Mean (20.6) and Median (14). Many outliers in this data
Appendix- Fig-7
Inferences:
The minimum order item discount rate is 0, while the maximum is 0.25. Between these two values, the
data is slightly left skewed. The Mean (0.10) and Median (0.1017). No outliers in this data
Appendix- Fig-8
Inferences:
The minimum order item product price is 9.99, while the maximum is 1999.99. Between these two
values, the data is slightly right skewed. The Mean (141.23) and Median (59.99). Few outliers in this
data
Appendix- Fig-9
Inferences:
The minimum order item profit ratio is -2.75, while the maximum is 0.50. Between these two values, the data is heavily left skewed. The Mean is 0.12 and the Median is 0.27. There are many outliers in this data; a large number of transactions have a negative profit ratio.
10. Sales
Appendix- Fig-10
Inferences:
The minimum sales is 9.99, while the maximum is 1999.99. Between these two values, the data is
heavily right skewed. The Mean (203.77) and Median (199.92). few outliers in this data
Inferences:
The minimum order item total is 7.49, while the maximum is 1939.99. Between these two values, the
data is heavily right skewed. The Mean (183.11) and Median (163.99). Many outliers in this data
Appendix- Fig-12
Inferences:
The minimum profit per order is -4274.98, while the maximum is 911.80. Between these two values,
the data is heavily left skewed. The Mean (21.98) and Median (31.52). Many outliers in this data
Appendix- Fig-13
Inferences:
The minimum product price is 9.99, while the maximum is 1999.99. Between these two values, the data is heavily right skewed. The Mean is 141.23 and the Median is 59.99. There are few outliers in this data. The majority of products are priced below 1000; 442 products have a price of 1500 and 15 products a price of about 2000.
# Categorical Variables#
Appendix- Fig-14
Inferences:
Type- Customers who transacted by Debit were the highest at 38%, followed by Transfer at 28% and Payment at 23%; customers who paid by Cash were the fewest at 11%.
Delivery Status- 55% of shipments were delivered late, 18% were delivered on time and 23% were shipped in advance. 4% of orders were cancelled (possibly due to poor delivery performance).
Late Delivery Risk- 55% of shipments were at late delivery risk.
Customer Segment- 52% of customers were consumers, 30% were corporate customers and 18% home office. The higher proportion of end consumers implies prompt delivery is a must-have for Data Co. Supply chain.
Order Item Quantity- 55% customers ordered item quantity was 1. While 2, 3, 4, 5 quantity orders
were 11% each. Lower quantity orders, means higher transactions, hence efficient supply chain
needed for on time delivery.
Product Status- Product availability was 100%, which implies good inventory was carried by the
company (which also means there is inventory carrying cost associated)
Appendix- Fig-15
Inferences:
Order Status - Only 44% of orders have a completed/closed status. The remaining 56% of orders are at risk with respect to delivery and realisation of payment, and 2% of orders are suspected as fraud. This implies that unless the company improves its supply chain capabilities to deliver on time, it cannot sustain the business.
Shipping Mode- 60% of orders were standard class, which has a 4-day window to deliver the goods, while 5% and 15% of orders were same day or first class respectively, i.e. 20% of orders require expedited (same day or first class) delivery. This implies an efficient supply chain mechanism is needed for speed of delivery.
A4 BIVARIATE ANALYSIS
Categorical Vs Numerical Variables: -
## Box Plots ##
Appendix Fig-16
Inferences:
Days of shipping(real)- Box plot of Late delivery risk against actual shipping days of purchased
product shows average delivery days for late delivery is 5 days.
Days of shipping(scheduled)- Box plot of Late delivery risk against scheduled shipping days of
purchased product shows average lead time of 2 days. It is understood from the data that actual
delivery is higher than the scheduled delivery which is causing risk of late delivery.
Benefits per order- The box plot shows benefits per order are low for both timely and late deliveries, but for late deliveries the benefits get worse.
Sales per customer- The box plot shows sales per customer is low for both timely delivery and late delivery; however, it is to be noted that the risk of losing customers is high if late deliveries continue.
Appendix- Fig-17
Inferences:
Order Item Discount- The box plot of order item discount shows large discounts are given for both late delivery and on-time delivery. The variable seems non-significant for late delivery.
Order Product Price- The box plot of order product price shows similar prices for both late delivery and on-time delivery. The variable seems non-significant for late delivery.
Order Profit Ratio- The box plot of order profit ratio shows profit ratios are very thin, and profit is actually on the negative side. If the company has to command a premium to improve the profit ratio, on-time delivery is a must.
Sales- The box plot of sales shows no significant difference between on-time vs late delivery.
Product Price- The box plot of product price shows no significant difference between on-time vs late delivery.
Order item Total- The box plot of order item total does not show a significant difference, as discounts are offered for both late delivery and on-time delivery.
Appendix- Fig 18
Inferences:
Type- Bar plot shows late delivery risk is higher for all payment types except Transfer
Customer Segment- All customer segments are running the risk of late delivery
Order Item Quantity - All order item quantities run the risk of late delivery; however, the proportion is highest for single-item orders.
Product status – Products are available, yet the proportion of late delivery risk is higher.
Shipping mode – First Class and Second Class run a higher risk of late delivery, while Standard Class and Same Day delivery still show significant late delivery.
Appendix- Fig-19
Appendix- Fig-20
Inferences:
Order Status- Higher pending payments are associated with late delivery. This variable does not seem to have a significant impact in determining late delivery, as it is just status tracking.
Category Name- Certain categories of goods like Cleats, Women's Apparel, Indoor/Outdoor Games and Cardio Equipment seem to carry a higher risk of late delivery.
----End-----