# ggpairs
library(ggplot2)
library(GGally)
library(hrbrthemes)
# %>%
library(tidyverse)

# kfold
library(caret)
kfoldcv = 10

# lda
library(MASS)

# prediction
library(ROCR)

#partimat
library(klaR)
library(gridExtra)
library(yardstick)
library(plotly)
library(kableExtra)

Introduction

In this project, you will use classification methods covered in this course to solve a real historical data mining problem: locating displaced persons living in makeshift shelters following the destruction of the earthquake in Haiti in 2010.

Following that earthquake, rescue workers, mostly from the United States military, needed to get food and water to the displaced persons. But with destroyed communications, impassable roads, and thousands of square miles, actually locating the people who needed help was challenging. As part of the rescue effort, a team from the Rochester Institute of Technology was flying an aircraft to collect high-resolution geo-referenced imagery.

HaitiPixels <- read.csv(file = '/Users/awells/UVA/SYS6018/Disaster_Relief_Project/HaitiPixels.csv')
head(HaitiPixels)
##        Class Red Green Blue
## 1 Vegetation  64    67   50
## 2 Vegetation  64    67   50
## 3 Vegetation  64    66   49
## 4 Vegetation  75    82   53
## 5 Vegetation  74    82   54
## 6 Vegetation  72    76   52
HaitiPixels$Class <- as.factor(HaitiPixels$Class)
#summary(HaitiPixels)

#### Delete these 2 lines when ready for full set
#data <- sample_n(HaitiPixels, 10000)
#HaitiPixels <- data
#### 

dim(HaitiPixels)
## [1] 63241     4
summary(HaitiPixels)
##               Class            Red          Green            Blue      
##  Blue Tarp       : 2022   Min.   : 48   Min.   : 48.0   Min.   : 44.0  
##  Rooftop         : 9903   1st Qu.: 80   1st Qu.: 78.0   1st Qu.: 63.0  
##  Soil            :20566   Median :163   Median :148.0   Median :123.0  
##  Various Non-Tarp: 4744   Mean   :163   Mean   :153.7   Mean   :125.1  
##  Vegetation      :26006   3rd Qu.:255   3rd Qu.:226.0   3rd Qu.:181.0  
##                           Max.   :255   Max.   :255.0   Max.   :255.0
attach(HaitiPixels)

It was known that the people whose homes had been destroyed by the earthquake were creating temporary shelters using blue tarps, and these blue tarps would be good indicators of where the displaced persons were – if only they could be located in time, out of the thousands of images that would be collected every day.

detach(HaitiPixels)
# Blue Tarp
BlueTarp_rest <- rep("Down",length(HaitiPixels$Class))
BlueTarp_rest[HaitiPixels$Class == "Blue Tarp"] <- "Up"

HaitiPixels <- data.frame(HaitiPixels,BlueTarp_rest )

HaitiPixels$BlueTarp_rest <- as.factor(HaitiPixels$BlueTarp_rest)

summary(HaitiPixels)
##               Class            Red          Green            Blue      
##  Blue Tarp       : 2022   Min.   : 48   Min.   : 48.0   Min.   : 44.0  
##  Rooftop         : 9903   1st Qu.: 80   1st Qu.: 78.0   1st Qu.: 63.0  
##  Soil            :20566   Median :163   Median :148.0   Median :123.0  
##  Various Non-Tarp: 4744   Mean   :163   Mean   :153.7   Mean   :125.1  
##  Vegetation      :26006   3rd Qu.:255   3rd Qu.:226.0   3rd Qu.:181.0  
##                           Max.   :255   Max.   :255.0   Max.   :255.0  
##  BlueTarp_rest
##  Down:61219   
##  Up  : 2022   
##               
##               
##               
## 
attach(HaitiPixels)
BlueTarp_rest <- as.factor(BlueTarp_rest)

The problem was that there was no way for aid workers to search the thousands of images in time to find the tarps and communicate the locations back to the rescue workers on the ground in time. The solution would be provided by data-mining algorithms, which could search the images far faster and more thoroughly (and accurately?) than humanly possible.

The goal was to find an algorithm that could effectively search the images in order to locate displaced persons and communicate those locations to rescue workers so they could help those who needed it in time. This disaster relief project is the subject matter for your project in this course, which you will submit in two parts. You will use data from the actual data collection process that was carried out over Haiti. Your goal is to test each of the algorithms you learn in this course on the imagery data collected during the relief efforts made in Haiti in response to the 2010 earthquake, and to determine which method will, as accurately and as quickly as possible, locate as many of the displaced persons identified in the imagery data as possible, so that they can be provided food and water before their situations become unsurvivable.

HaitiPixels %>% ggpairs(upper=list(continuous=wrap("cor", size=3)))

Setup

For the setup of the data set (HaitiPixels, \(n=\) 63241), a natural decision is to use a one-vs-rest approach. Our data is a multi-class classification problem that we decompose into the following binary classifications (a minimal sketch of this decomposition follows the list):

  • Blue Tarp vs [ Rooftop , Soil , Various Non-Tarp, Vegetation ]

  • Rooftop vs [ Blue Tarp, Soil , Various Non-Tarp, Vegetation ]

  • Soil vs [ Blue Tarp, Rooftop , Various Non-Tarp, Vegetation ]

  • Various Non-Tarp vs [ Blue Tarp, Rooftop , Soil , Vegetation ]

  • Vegetation vs [ Blue Tarp, Rooftop , Soil , Various Non-Tarp ]
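
As a minimal sketch of the one-vs-rest decomposition (assuming only the HaitiPixels data frame loaded above; the project itself only carries the Blue Tarp case forward), each class can be recoded against the rest:

# Hypothetical sketch: build a one-vs-rest factor for every class
# ("Target" = the class in question, "Rest" = all other classes)
classes <- levels(HaitiPixels$Class)
one_vs_rest <- lapply(classes, function(cls) {
  factor(ifelse(HaitiPixels$Class == cls, "Target", "Rest"),
         levels = c("Rest", "Target"))
})
names(one_vs_rest) <- classes
sapply(one_vs_rest, summary)   # counts of Rest/Target for each class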

# Get Final Hold - Out 5% of Data ~3000 observations
data.fho = sort(sample(nrow(HaitiPixels), nrow(HaitiPixels)*.05))

fho <- HaitiPixels[data.fho,]
data_remain <- HaitiPixels[-data.fho,]

#80-20 split Train-Test

data_minus_fho = sort(sample(nrow(data_remain), nrow(data_remain)*.8))
train <-  data_remain[data_minus_fho,]
test <- data_remain[-data_minus_fho,]

# shuffle just to be safe
train <- dplyr::sample_n(train, nrow(train))
test <- dplyr::sample_n(test, nrow(test))
fho <- dplyr::sample_n(fho, nrow(fho))

test.BlueTarp <- as.factor(test$BlueTarp_rest)
train$BlueTarp_rest <- as.factor(train$BlueTarp_rest)
levels(train$BlueTarp_rest) <- c("Down","Up")

#paste("Length of Train: ", nrow(train))
#paste("Length of Test: ", nrow(test))
#paste("Length of Final Hold Out: ", nrow(fho))

Since we are looking for temporary shelters made using blue tarps, we generated a variable named “BlueTarp_rest”, a binary response factor with two levels:

  • “Up” - if a blue tarp was visible (or up )

  • “Down” - if a blue tarp was not visible.

We’ll take a random sample of 5% of the HaitiPixels dataset and name it the FHO, or Final Hold-Out dataset, to be used in Part 2 (Hold-Out Test Sample Performance). We then randomly divide the remaining data into an 80% training set and a 20% test set. This leaves a training set of \(n=\) 48063, a test set of \(n=\) 12016, and an FHO of \(n=\) 3162.

Analysis

KNN

The first classification method that we used is K-nearest neighbors (KNN). Given a value for K and a prediction point \(x_0\), KNN first identifies the K training observations that are closest to \(x_0\), represented by \(N_0\). The optimal value for K depends on the bias-variance tradeoff. A small value for K provides the most flexible fit, which will have low bias but high variance; with K = 1, the prediction in a given region depends entirely on a single observation. In contrast, larger values of K provide a smoother and less variable fit; the prediction in a region is an average over several points, so changing one observation has a smaller effect. We’ll use the caret package in R to apply the KNN classification procedure with a 10-fold cross-validation evaluation procedure.
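
As an illustrative sketch of the KNN mechanics just described (not the caret tuning run below), one could call class::knn directly on the centered and scaled color channels, using for example k = 11, the value the tuning below ends up selecting:

library(class)
# center/scale the training predictors, then apply the same transformation to the test set
Xtr <- scale(train[, c("Red", "Green", "Blue")])
Xte <- scale(test[, c("Red", "Green", "Blue")],
             center = attr(Xtr, "scaled:center"),
             scale  = attr(Xtr, "scaled:scale"))
# majority vote among the k = 11 nearest training pixels for each test pixel
knn_pred <- class::knn(train = Xtr, test = Xte, cl = train$BlueTarp_rest, k = 11)
table(Prediction = knn_pred, Reference = test$BlueTarp_rest)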

set.seed(1)

# https://blog.revolutionanalytics.com/2016/05/using-caret-to-compare-models.html
classifier <- caret::train(BlueTarp_rest ~ Red + Green + Blue, data=train, method="knn", 
              preProcess=c("center","scale"), 
              tuneGrid=data.frame(k=seq(1,75,1)),
              #tuneGrid=data.frame(k=seq(1,200,2)),
              #tuneGrid=data.frame(k=seq(1,500,5)),
              trControl = caret::trainControl("cv", number=kfoldcv, 
                          returnResamp='all', savePredictions='final',
                          classProbs=TRUE))
classifier
## k-Nearest Neighbors 
## 
## 48063 samples
##     3 predictor
##     2 classes: 'Down', 'Up' 
## 
## Pre-processing: centered (3), scaled (3) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 43257, 43256, 43257, 43256, 43257, 43257, ... 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa    
##    1  0.9967335  0.9473104
##    2  0.9965670  0.9447763
##    3  0.9971288  0.9538469
##    4  0.9968999  0.9500344
##    5  0.9970872  0.9532107
##    6  0.9971288  0.9540576
##    7  0.9971912  0.9549753
##    8  0.9971704  0.9546549
##    9  0.9971496  0.9542347
##   10  0.9971288  0.9539796
##   11  0.9972120  0.9552741
##   12  0.9971288  0.9538295
##   13  0.9971912  0.9548711
##   14  0.9971912  0.9549441
##   15  0.9971912  0.9549148
##   16  0.9971496  0.9542795
##   17  0.9971288  0.9538579
##   18  0.9971080  0.9534914
##   19  0.9970040  0.9518402
##   20  0.9970456  0.9524746
##   21  0.9970872  0.9531428
##   22  0.9970456  0.9524740
##   23  0.9970040  0.9517660
##   24  0.9969624  0.9510828
##   25  0.9969624  0.9510522
##   26  0.9968792  0.9496469
##   27  0.9969832  0.9513546
##   28  0.9969624  0.9509477
##   29  0.9970040  0.9516051
##   30  0.9969000  0.9499607
##   31  0.9969208  0.9502750
##   32  0.9967543  0.9474787
##   33  0.9967751  0.9477900
##   34  0.9966919  0.9464564
##   35  0.9966711  0.9460519
##   36  0.9966503  0.9457844
##   37  0.9966503  0.9457329
##   38  0.9965046  0.9432600
##   39  0.9965671  0.9443107
##   40  0.9965046  0.9433280
##   41  0.9965046  0.9432968
##   42  0.9965463  0.9439983
##   43  0.9965254  0.9436276
##   44  0.9964422  0.9421373
##   45  0.9964630  0.9425044
##   46  0.9964838  0.9427939
##   47  0.9965046  0.9431580
##   48  0.9964214  0.9417334
##   49  0.9964006  0.9413828
##   50  0.9962966  0.9396722
##   51  0.9963382  0.9403870
##   52  0.9963174  0.9400295
##   53  0.9963174  0.9399937
##   54  0.9962342  0.9386490
##   55  0.9962966  0.9396052
##   56  0.9962550  0.9389496
##   57  0.9962134  0.9382596
##   58  0.9961926  0.9379827
##   59  0.9962342  0.9385854
##   60  0.9961093  0.9365147
##   61  0.9961301  0.9368901
##   62  0.9961301  0.9369300
##   63  0.9961093  0.9366020
##   64  0.9961301  0.9369682
##   65  0.9961717  0.9376369
##   66  0.9961717  0.9376050
##   67  0.9961301  0.9368944
##   68  0.9962134  0.9382634
##   69  0.9961509  0.9372729
##   70  0.9960677  0.9359312
##   71  0.9960885  0.9362566
##   72  0.9961718  0.9376313
##   73  0.9961509  0.9372984
##   74  0.9960885  0.9362384
##   75  0.9960469  0.9354902
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 11.
bestk = classifier$bestTune

Testing across various values of K, we can see that our optimal K is 11. So we’ll proceed with the analysis using this value.

classifier %>% ggplot(aes(x=seq_along(Accuracy), y=Accuracy))

classifier$resample %>% 
  dplyr::group_split(Resample) %>% 
  purrr::map(rowid_to_column) %>%
  dplyr::bind_rows() %>%
  ggplot(aes(rowid, Accuracy)) + geom_point() +
  geom_smooth(formula='y~x', method='loess', span=.03) +
  geom_line(classifier$results, mapping=aes(seq_along(Accuracy), Accuracy),
            size=2, color='red')

Next we’ll look at the confusion matrix from the model. We’ll use the image below as a reference for analyzing the matrix.

Confusion Matrix Reference

# confusionMatrix
THRESHOLD <- 0.5
out_of_folds_CM <- classifier$pred %>% 
  dplyr::mutate(pred2 = ifelse(Up>THRESHOLD, 'Up', 'Down')) %>%
  dplyr::mutate(pred2 = factor(pred2, levels=c('Down','Up'))) %>%
  dplyr::group_split(Resample) %>%
  purrr::map(~ caret::confusionMatrix(data=.x$pred2, reference=.x$obs))

out_of_folds_CM_vis <- out_of_folds_CM %>% 
  purrr::map(~ .x$table %>% broom::tidy() %>%
                 ggplot(aes(x=Reference, y=Prediction, fill=n, label=n)) +
                 geom_tile() + geom_text(aes(label=n), size=5, color='white'))
 
gridExtra::grid.arrange(grobs=out_of_folds_CM_vis, ncol=2)

From the 10-fold cross-validation we can see the true positive identifications of blue tarps in each fold. We can measure the sensitivity, also known as the true positive rate: the fraction of actual blue-tarp pixels that are correctly identified at a given threshold (the initial default threshold is 0.5). The false positive rate is 1-specificity: the fraction of non-tarp pixels that we incorrectly classify as tarps at that same threshold of 0.5. The receiver operating characteristic (ROC) curve plots these two statistics across thresholds; the ideal ROC curve hugs the top-left corner, indicating a high true positive rate and a low false positive rate.
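
As a hedged sketch of how these two rates are computed at the default threshold (assuming the classifier and train objects from the chunks above), both can be read directly off a 2x2 table:

probs <- predict(classifier, type = 'prob')$Up
yhat_05 <- factor(ifelse(probs > 0.5, "Up", "Down"), levels = c("Down", "Up"))
tab <- table(Prediction = yhat_05, Reference = train$BlueTarp_rest)
tab["Up", "Up"] / sum(tab[, "Up"])      # sensitivity = TP / (TP + FN)
tab["Up", "Down"] / sum(tab[, "Down"])  # false positive rate = FP / (FP + TN)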

options(yardstick.event_first = FALSE)
ROC_curve <- predict(classifier, type='prob') %>% 
  yardstick::roc_curve(truth=train$BlueTarp_rest, estimate=Up) %>%
  dplyr::mutate(one_minus_specificity = 1-specificity)
## Warning: The `yardstick.event_first` option has been deprecated as of yardstick 0.0.7 and will be completely ignored in a future version.
## Instead, set the following argument directly in the metric function:
## `options(yardstick.event_first = TRUE)`  -> `event_level = 'first'` (the default)
## `options(yardstick.event_first = FALSE)` -> `event_level = 'second'`
## This warning is displayed once per session.
# Sensitivity: TP/(TP+FN)
# 1-Specificity: FP/(TN+FP)

ROC_curve_plot <- ROC_curve %>%
  ggplot(aes(x=one_minus_specificity, y=sensitivity)) + 
  geom_line() + geom_point() +
  geom_abline(slope=1, intercept=0, linetype='dashed', color='red') +
  xlab("one_minus_specificity\n(false positive rate)") +
  ggtitle('ROC curve')

ggplotly(ROC_curve_plot)
predict(classifier, type='prob') %>% 
  yardstick::roc_auc(truth=train$BlueTarp_rest, Up) 
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary          1.00
ROC_curve %>% head()
## # A tibble: 6 x 4
##   .threshold specificity sensitivity one_minus_specificity
##        <dbl>       <dbl>       <dbl>                 <dbl>
## 1  -Inf            0           1                   1      
## 2     0            0           1                   1      
## 3     0.0672       0.991       1                   0.00939
## 4     0.0826       0.995       0.990               0.00462
## 5     0.0833       0.995       0.988               0.00462
## 6     0.0909       0.995       0.987               0.00460
predict(classifier, type='prob') %>% head()
##   Down Up
## 1    1  0
## 2    1  0
## 3    1  0
## 4    1  0
## 5    1  0
## 6    1  0

We can also look at the ROC curve for each fold of the KNN model.

ROC_curveS <- classifier$pred %>%
  dplyr::group_split(Resample) %>% 
  purrr::map(~ yardstick::roc_curve(.x, truth=obs, Up)) %>%
  purrr::map(~ dplyr::mutate(.x, one_minus_specificity = 1-specificity)) %>%
  purrr::map(~ geom_line(.x, mapping=aes(one_minus_specificity, sensitivity)))

ROC_curveS[[1]] <- ggplot() + ROC_curveS[[1]]
Reduce("+", ROC_curveS) + 
  geom_abline(slope=1, intercept=0, linetype='dashed', color='red') +
  xlab("one_minus_specificity\n(false positive rate)") +
  ggtitle('ROC Curves')

To find the optimal threshold, we’ll look at the \(F_1\) score.

The \(F_1\) score is a more useful measure than accuracy for problems with uneven class distributions, since \(F_1\) is defined as:

\[ F_1 = 2 \cdot \frac{\frac{\Sigma\,\text{True Positive}}{\Sigma\,\text{Predicted Condition Positive}} \cdot \frac{\Sigma\,\text{True Positive}}{\Sigma\,\text{Condition Positive}}}{\frac{\Sigma\,\text{True Positive}}{\Sigma\,\text{Predicted Condition Positive}} + \frac{\Sigma\,\text{True Positive}}{\Sigma\,\text{Condition Positive}}} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \]

It takes into account both false positives and false negatives. The \(F_1\) score lies between 0 and 1, with 1 being optimal.
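
As a small helper mirroring this definition (an illustrative sketch, not part of the original pipeline), precision, recall, and \(F_1\) can be computed directly from the predicted and observed factors:

f1_score <- function(yhat, y, positive = "Up") {
  tp <- sum(yhat == positive & y == positive)
  fp <- sum(yhat == positive & y != positive)
  fn <- sum(yhat != positive & y == positive)
  precision <- tp / (tp + fp)   # TP / predicted condition positive
  recall    <- tp / (tp + fn)   # TP / condition positive
  2 * precision * recall / (precision + recall)
}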

tuned_call <- function(preds, decision_rule_threshold){
  calls <- rep('Down', length(preds))
  calls[preds>decision_rule_threshold]='Up'
  factor(calls, levels=c('Down','Up'))
}


thresholds = seq(0.01,0.99,.01) # Avoid Threshold Tails
F_1_scores <- rep(NA, length(thresholds))
GMeans <- rep(NA, length(thresholds))

preds <- predict(classifier, type='prob')[,2]
y <- train$BlueTarp_rest
for (i in 1:length(thresholds)){
  yhat <- tuned_call(preds, thresholds[i])
  one_minus_specificity <- mean(yhat[y=='Down']=='Up')
  sensitivity <- mean(yhat[y=='Up']=='Up')
  one_minus_specificity
  sensitivity
  results <- tibble::tibble(y=y, yhat=yhat)
  
  caret::confusionMatrix(yhat, y, positive = "Up") 
  cm <- results %>%yardstick::conf_mat(truth=y, yhat)
  
  trueNegative  = cm$table[1,1]
  falseNegative = cm$table[1,2]
  falsePositive = cm$table[2,1]
  truePositve   = cm$table[2,2]
  
  #paste0("True Positive: ", truePositve )
  #paste0("False Negative: ", falseNegative )
  #paste0("True Negative: ", trueNegative )
  #paste0("False Positive: ", falsePositive )
  conditionPositive = truePositve + falseNegative
  conditionNegative = trueNegative + falsePositive 
  predictedConditionPositive = truePositve +  falsePositive 
  predictedConditionNegative = trueNegative + falseNegative
  
  TotalPopulation = truePositve + falsePositive + trueNegative + falseNegative
  
  prevalence = conditionPositive / TotalPopulation
  accuracy = (truePositve + trueNegative) / TotalPopulation
  PPV = truePositve / predictedConditionPositive
  FDR = falsePositive / predictedConditionPositive
  FOR = falseNegative / predictedConditionNegative
  NPV = trueNegative / predictedConditionNegative
  TPR = truePositve / conditionPositive
  FPR = falsePositive / conditionNegative
  FNR = falseNegative / conditionPositive
  SPC = TNR = trueNegative / conditionNegative
  LR.PLUS = TPR / FPR
  LR.MINUS = FNR/TNR
  DOR = LR.PLUS / LR.MINUS
  
  F_1_score = 2 * ((PPV * TPR)/(PPV + TPR))
  F_1_scores[i] <- F_1_score
  GMeans[i] = sqrt(TPR*SPC)
}

opt_thresh = thresholds[tail(which(na.omit(F_1_scores) == min(na.omit(F_1_scores))), n=1)]

if (opt_thresh < 0.1){opt_thresh = 0.1}
if (opt_thresh > 0.5){opt_thresh = 0.5}

#opt_thresh

#points(opt_thresh, F_1_scores[tail(which(na.omit(F_1_scores) == min(na.omit(F_1_scores))), n=1)], col = 'red', cex = 2, pch = 19)

If the optimal threshold is less than 0.1, we set it to 0.1 (and if it is greater than 0.5, we cap it at 0.5). We do this because we do not want the threshold to be too extreme. In a real-world setting we have to live with the fact that we cannot help everyone affected, given limited resources; our goal is to help as many of those affected as possible, so we want a little more flexibility in the threshold. Alongside the \(F_1\) score we also plot the geometric mean, \(\sqrt{TPR \times TNR}\), as a secondary reference.

plot(thresholds, F_1_scores, xlab = 'Thresholds', ylab = 'F1 Score', type = 'l')

plot(thresholds, GMeans, xlab = 'Thresholds', ylab = 'Geometric Mean', type = 'l')

thresholds[tail(which(na.omit(GMeans) == max(na.omit(GMeans))), n=1)]
## [1] 0.06
paste0("Confusion Matrix 0.5")
## [1] "Confusion Matrix 0.5"
y <- train$BlueTarp_rest
yhat <- tuned_call(preds, 0.5)
one_minus_specificity <- mean(yhat[y=='Down']=='Up')
sensitivity <- mean(yhat[y=='Up']=='Up')
results <- tibble::tibble(y=y, yhat=yhat)

caret::confusionMatrix(yhat, y, positive = "Up") 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  Down    Up
##       Down 46457    53
##       Up      63  1490
##                                          
##                Accuracy : 0.9976         
##                  95% CI : (0.9971, 0.998)
##     No Information Rate : 0.9679         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.9613         
##                                          
##  Mcnemar's Test P-Value : 0.4034         
##                                          
##             Sensitivity : 0.96565        
##             Specificity : 0.99865        
##          Pos Pred Value : 0.95943        
##          Neg Pred Value : 0.99886        
##              Prevalence : 0.03210        
##          Detection Rate : 0.03100        
##    Detection Prevalence : 0.03231        
##       Balanced Accuracy : 0.98215        
##                                          
##        'Positive' Class : Up             
## 
cm <- results %>%yardstick::conf_mat(truth=y, yhat)

From the confusion matrix at the initial threshold of 0.5 above, and from our \(F_1\) plot, we selected our optimal threshold as 0.1.

yhat <- tuned_call(preds, opt_thresh)
one_minus_specificity <- mean(yhat[y=='Down']=='Up')
sensitivity <- mean(yhat[y=='Up']=='Up')
one_minus_specificity
## [1] 0.00361135
sensitivity
## [1] 0.985094
results <- tibble::tibble(y=y, yhat=yhat)

caret::confusionMatrix(yhat, y, positive = "Up") 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  Down    Up
##       Down 46352    23
##       Up     168  1520
##                                           
##                Accuracy : 0.996           
##                  95% CI : (0.9954, 0.9966)
##     No Information Rate : 0.9679          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9388          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.98509         
##             Specificity : 0.99639         
##          Pos Pred Value : 0.90047         
##          Neg Pred Value : 0.99950         
##              Prevalence : 0.03210         
##          Detection Rate : 0.03163         
##    Detection Prevalence : 0.03512         
##       Balanced Accuracy : 0.99074         
##                                           
##        'Positive' Class : Up              
## 
cm <- results %>%yardstick::conf_mat(truth=y, yhat)

trueNegative  = cm$table[1,1]
falseNegative = cm$table[1,2]
falsePositive = cm$table[2,1]
truePositve   = cm$table[2,2]

conditionPositive = truePositve + falseNegative
conditionNegative = trueNegative + falsePositive 
predictedConditionPositive = truePositve +  falsePositive 
predictedConditionNegative = trueNegative + falseNegative

TotalPopulation = truePositve + falsePositive + trueNegative + falseNegative

prevalence = conditionPositive / TotalPopulation
accuracy = (truePositve + trueNegative) / TotalPopulation
PPV = truePositve / predictedConditionPositive
FDR = falsePositive / predictedConditionPositive
FOR = falseNegative / predictedConditionNegative
NPV = trueNegative / predictedConditionNegative
TPR = truePositve / conditionPositive
FPR = falsePositive / conditionNegative
FNR = falseNegative / conditionPositive
SPC = TNR = trueNegative / conditionNegative
LR.PLUS = TPR / FPR
LR.MINUS = FNR/TNR
DOR = LR.PLUS / LR.MINUS

F_1_score = 2 * ((PPV * TPR)/(PPV + TPR))

Now using the optimal threshold we get:

# confusionMatrix

THRESHOLD <- opt_thresh
out_of_folds_CM <- classifier$pred %>% 
  dplyr::mutate(pred2 = ifelse(Up>THRESHOLD, 'Up', 'Down')) %>%
  dplyr::mutate(pred2 = factor(pred2, levels=c('Down','Up'))) %>%
  dplyr::group_split(Resample) %>%
  purrr::map(~ caret::confusionMatrix(data=.x$pred2, reference=.x$obs))

out_of_folds_CM_vis <- out_of_folds_CM %>% 
  purrr::map(~ .x$table %>% broom::tidy() %>%
                 ggplot(aes(x=Reference, y=Prediction, fill=n, label=n)) +
                 geom_tile() + geom_text(aes(label=n), size=5, color='white'))
 
gridExtra::grid.arrange(grobs=out_of_folds_CM_vis, ncol=2)

Summary Stats K Nearest Neighbor

options(yardstick.event_first = FALSE)
ROC_curve <- predict(classifier, type='prob') %>% 
  yardstick::roc_curve(truth=train$BlueTarp_rest, estimate=Up) %>%
  dplyr::mutate(one_minus_specificity = 1-specificity)

# Sensitivity: TP/(TP+FN)
# 1-Specificity: FP/(TN+FP)

ROC_curve_plot <- ROC_curve %>%
  ggplot(aes(x=one_minus_specificity, y=sensitivity)) + 
  geom_line() + geom_point() +
  geom_abline(slope=1, intercept=0, linetype='dashed', color='red') +
  xlab("one_minus_specificity\n(false positive rate)") +
  ggtitle('ROC curve')

ggplotly(ROC_curve_plot)
temp = predict(classifier, type='prob') %>% yardstick::roc_auc(truth=train$BlueTarp_rest, Up) 

knn.Accuracy = accuracy
knn.AUC = temp$.estimate

twoclassSum = twoClassSummary( data = data.frame(obs = classifier$pred$obs,
  pred = classifier$pred$pred,
  Down = classifier$pred$Down, 
  Up = classifier$pred$Up), lev = levels(classifier$pred$obs))

knn.ROC = "TRUE" #unname(twoclassSum[1])
knn.Threshold = opt_thresh
knn.Sensitivity = knn.Recall= knn.Power =  TPR
knn.Specificity= 1-FPR
knn.FDR = FDR
knn.Precision=PPV
df = data.frame(c(knn.Accuracy, knn.AUC, knn.ROC, knn.Threshold, knn.Sensitivity, knn.Specificity, knn.FDR , knn.Precision))


colnames(df) <- c(paste0("KNN (k =",bestk,")"))
rownames(df) <- c("Accuracy","AUC", "ROC", "Threshold","Sensitivity=Recall=Power","Specificity=1-FPR","FDR","Precision=PPV")
df %>% kbl() %>% kable_styling()
KNN (k =11)
Accuracy 0.996026049143832
AUC 0.999812734291107
ROC TRUE
Threshold 0.1
Sensitivity=Recall=Power 0.985093972780298
Specificity=1-FPR 0.996388650042992
FDR 0.0995260663507109
Precision=PPV 0.900473933649289

LDA

The next classification method that we used is linear discriminant analysis (LDA). LDA assumes that the observations within each class are drawn from a multivariate Gaussian distribution with a class-specific mean vector and a covariance matrix that is common to all K classes. Again, we’ll use the caret package in R to apply the LDA classification procedure with a 10-fold cross-validation evaluation procedure.
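
As a quick cross-check sketch of the model caret wraps here (assuming the MASS package loaded at the top and the train/test objects from the Setup section), LDA can also be fit directly:

lda_fit <- MASS::lda(BlueTarp_rest ~ Red + Green + Blue, data = train)
lda_fit$means                            # class-specific mean vectors
head(predict(lda_fit, test)$posterior)   # posterior probabilities P(class | Red, Green, Blue)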

set.seed(1)
train$BlueTarp_rest <- as.factor(train$BlueTarp_rest)
levels(train$BlueTarp_rest) <- c("Down","Up")
# https://blog.revolutionanalytics.com/2016/05/using-caret-to-compare-models.html

classifier <- caret::train(BlueTarp_rest ~ Red + Green + Blue, data=train, method="lda", 
              preProcess=c("center","scale"),
              #tuneGrid=data.frame(.dimen = seq(1,10,1)),#seq(1,1000,5)),
              trControl = caret::trainControl("cv", number=kfoldcv, 
              returnResamp='all', savePredictions='final',
              classProbs=TRUE))
classifier
## Linear Discriminant Analysis 
## 
## 48063 samples
##     3 predictor
##     2 classes: 'Down', 'Up' 
## 
## Pre-processing: centered (3), scaled (3) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 43257, 43256, 43257, 43256, 43257, 43257, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.9839793  0.7556508
classifier$resample %>% 
  dplyr::group_split(Resample) %>% 
  purrr::map(rowid_to_column) %>%
  dplyr::bind_rows() %>%
  ggplot(aes(rowid, Accuracy)) + geom_point() +
  geom_smooth(formula='y~x', method='loess', span=.03) +
  geom_line(classifier$results, mapping=aes(seq_along(Accuracy), Accuracy),
            size=2, color='red')

Next we’ll look at the Confusion Matrix from the model.

# confusionMatrix
THRESHOLD <- 0.5
out_of_folds_CM <- classifier$pred %>% 
  dplyr::mutate(pred2 = ifelse(Up>THRESHOLD, 'Up', 'Down')) %>%
  dplyr::mutate(pred2 = factor(pred2, levels=c('Down','Up'))) %>%
  dplyr::group_split(Resample) %>%
  purrr::map(~ caret::confusionMatrix(data=.x$pred2, reference=.x$obs))

out_of_folds_CM_vis <- out_of_folds_CM %>% 
  purrr::map(~ .x$table %>% broom::tidy() %>%
                 ggplot(aes(x=Reference, y=Prediction, fill=n, label=n)) +
                 geom_tile() + geom_text(aes(label=n), size=5, color='white'))
 
gridExtra::grid.arrange(grobs=out_of_folds_CM_vis, ncol=2)

From the 10-fold cross-validation we can see roughly 130 true positive identifications of blue tarps per fold, slightly fewer than with KNN. We can again measure the sensitivity and 1-specificity at the initial default threshold of 0.5.

options(yardstick.event_first = FALSE)
ROC_curve <- predict(classifier, type='prob') %>% 
  yardstick::roc_curve(truth=train$BlueTarp_rest, estimate=Up) %>%
  dplyr::mutate(one_minus_specificity = 1-specificity)

# Sensitivity: TP/(TP+FN)
# 1-Specificity: FP/(TN+FP)

ROC_curve_plot <- ROC_curve %>%
  ggplot(aes(x=one_minus_specificity, y=sensitivity)) + 
  geom_line() + geom_point() +
  geom_abline(slope=1, intercept=0, linetype='dashed', color='red') +
  xlab("one_minus_specificity\n(false positive rate)") +
  ggtitle('ROC curve')

ggplotly(ROC_curve_plot)
predict(classifier, type='prob') %>% 
  yardstick::roc_auc(truth=train$BlueTarp_rest, Up) 
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.990
ROC_curve %>% head()
## # A tibble: 6 x 4
##    .threshold specificity sensitivity one_minus_specificity
##         <dbl>       <dbl>       <dbl>                 <dbl>
## 1 -Inf          0                   1                  1   
## 2    1.61e-16   0                   1                  1   
## 3    4.23e-16   0.0000215           1                  1.00
## 4    3.16e-15   0.0000430           1                  1.00
## 5    9.92e-15   0.0000645           1                  1.00
## 6    1.40e-14   0.0000860           1                  1.00
predict(classifier, type='prob') %>% head()
##        Down           Up
## 1 0.9999983 1.688204e-06
## 2 0.9999990 1.007633e-06
## 3 0.9991272 8.727504e-04
## 4 0.9998474 1.525823e-04
## 5 0.9512878 4.871219e-02
## 6 0.9999993 6.572897e-07

We can also look at the ROC curve for each fold of the LDA model.

ROC_curveS <- classifier$pred %>%
  dplyr::group_split(Resample) %>% 
  purrr::map(~ yardstick::roc_curve(.x, truth=obs, Up)) %>%
  purrr::map(~ dplyr::mutate(.x, one_minus_specificity = 1-specificity)) %>%
  purrr::map(~ geom_line(.x, mapping=aes(one_minus_specificity, sensitivity)))

ROC_curveS[[1]] <- ggplot() + ROC_curveS[[1]]
Reduce("+", ROC_curveS) + 
  geom_abline(slope=1, intercept=0, linetype='dashed', color='red') +
  xlab("one_minus_specificity\n(false positive rate)") +
  ggtitle('ROC Curves')

To find the optimal threshold, we’ll again look at the \(F_1\) score.

tuned_call <- function(preds, decision_rule_threshold){
  calls <- rep('Down', length(preds))
  calls[preds>decision_rule_threshold]='Up'
  factor(calls, levels=c('Down','Up'))
}


thresholds = seq(0.01,0.99,.01) # Avoid Threshold Tails
F_1_scores <- rep(NA, length(thresholds))
GMeans <- rep(NA, length(thresholds))

preds <- predict(classifier, type='prob')[,2]
y <- train$BlueTarp_rest
for (i in 1:length(thresholds)){
  yhat <- tuned_call(preds, thresholds[i])
  one_minus_specificity <- mean(yhat[y=='Down']=='Up')
  sensitivity <- mean(yhat[y=='Up']=='Up')
  one_minus_specificity
  sensitivity
  results <- tibble::tibble(y=y, yhat=yhat)
  
  caret::confusionMatrix(yhat, y, positive = "Up") 
  cm <- results %>%yardstick::conf_mat(truth=y, yhat)
  
  trueNegative  = cm$table[1,1]
  falseNegative = cm$table[1,2]
  falsePositive = cm$table[2,1]
  truePositve   = cm$table[2,2]
  
  #paste0("True Positive: ", truePositve )
  #paste0("False Negative: ", falseNegative )
  #paste0("True Negative: ", trueNegative )
  #paste0("False Positive: ", falsePositive )
  conditionPositive = truePositve + falseNegative
  conditionNegative = trueNegative + falsePositive 
  predictedConditionPositive = truePositve +  falsePositive 
  predictedConditionNegative = trueNegative + falseNegative
  
  TotalPopulation = truePositve + falsePositive + trueNegative + falseNegative
  
  prevalence = conditionPositive / TotalPopulation
  accuracy = (truePositve + trueNegative) / TotalPopulation
  PPV = truePositve / predictedConditionPositive
  FDR = falsePositive / predictedConditionPositive
  FOR = falseNegative / predictedConditionNegative
  NPV = trueNegative / predictedConditionNegative
  TPR = truePositve / conditionPositive
  FPR = falsePositive / conditionNegative
  FNR = falseNegative / conditionPositive
  SPC = TNR = trueNegative / conditionNegative
  LR.PLUS = TPR / FPR
  LR.MINUS = FNR/TNR
  DOR = LR.PLUS / LR.MINUS
  
  F_1_score = 2 * ((PPV * TPR)/(PPV + TPR))
  F_1_scores[i] <- F_1_score
  GMeans[i] = sqrt(TPR*SPC)
}

opt_thresh = thresholds[tail(which(na.omit(F_1_scores) == min(na.omit(F_1_scores))), n=1)]

if (opt_thresh < 0.1){opt_thresh = 0.1}
if (opt_thresh > 0.5){opt_thresh = 0.5}

#opt_thresh

#points(opt_thresh, F_1_scores[tail(which(na.omit(F_1_scores) == min(na.omit(F_1_scores))), n=1)], col = 'red', cex = 2, pch = 19)
plot(thresholds, F_1_scores, xlab = 'Thresholds', ylab = 'F1 Score', type = 'l')

plot(thresholds, GMeans, xlab = 'Thresholds', ylab = 'Geometric Mean', type = 'l')

thresholds[tail(which(na.omit(GMeans) == max(na.omit(GMeans))), n=1)]
## [1] 0.01
paste0("Confusion Matrix 0.5")
## [1] "Confusion Matrix 0.5"
y <- train$BlueTarp_rest
yhat <- tuned_call(preds, 0.5)
one_minus_specificity <- mean(yhat[y=='Down']=='Up')
sensitivity <- mean(yhat[y=='Up']=='Up')
results <- tibble::tibble(y=y, yhat=yhat)

caret::confusionMatrix(yhat, y, positive = "Up") 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  Down    Up
##       Down 46054   296
##       Up     466  1247
##                                          
##                Accuracy : 0.9841         
##                  95% CI : (0.983, 0.9852)
##     No Information Rate : 0.9679         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.7578         
##                                          
##  Mcnemar's Test P-Value : 9.228e-10      
##                                          
##             Sensitivity : 0.80817        
##             Specificity : 0.98998        
##          Pos Pred Value : 0.72796        
##          Neg Pred Value : 0.99361        
##              Prevalence : 0.03210        
##          Detection Rate : 0.02595        
##    Detection Prevalence : 0.03564        
##       Balanced Accuracy : 0.89907        
##                                          
##        'Positive' Class : Up             
## 
cm <- results %>%yardstick::conf_mat(truth=y, yhat)

From the confusion matrix at the initial threshold of 0.5 above, and from our \(F_1\) plot, we selected our optimal threshold as 0.1.

yhat <- tuned_call(preds, opt_thresh)
one_minus_specificity <- mean(yhat[y=='Down']=='Up')
sensitivity <- mean(yhat[y=='Up']=='Up')
one_minus_specificity
## [1] 0.01319862
sensitivity
## [1] 0.8561244
results <- tibble::tibble(y=y, yhat=yhat)

caret::confusionMatrix(yhat, y, positive = "Up") 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  Down    Up
##       Down 45906   222
##       Up     614  1321
##                                           
##                Accuracy : 0.9826          
##                  95% CI : (0.9814, 0.9838)
##     No Information Rate : 0.9679          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7507          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.85612         
##             Specificity : 0.98680         
##          Pos Pred Value : 0.68269         
##          Neg Pred Value : 0.99519         
##              Prevalence : 0.03210         
##          Detection Rate : 0.02748         
##    Detection Prevalence : 0.04026         
##       Balanced Accuracy : 0.92146         
##                                           
##        'Positive' Class : Up              
## 
cm <- results %>%yardstick::conf_mat(truth=y, yhat)

trueNegative  = cm$table[1,1]
falseNegative = cm$table[1,2]
falsePositive = cm$table[2,1]
truePositve   = cm$table[2,2]

conditionPositive = truePositve + falseNegative
conditionNegative = trueNegative + falsePositive 
predictedConditionPositive = truePositve +  falsePositive 
predictedConditionNegative = trueNegative + falseNegative

TotalPopulation = truePositve + falsePositive + trueNegative + falseNegative

prevalence = conditionPositive / TotalPopulation
accuracy = (truePositve + trueNegative) / TotalPopulation
PPV = truePositve / predictedConditionPositive
FDR = falsePositive / predictedConditionPositive
FOR = falseNegative / predictedConditionNegative
NPV = trueNegative / predictedConditionNegative
TPR = truePositve / conditionPositive
FPR = falsePositive / conditionNegative
FNR = falseNegative / conditionPositive
SPC = TNR = trueNegative / conditionNegative
LR.PLUS = TPR / FPR
LR.MINUS = FNR/TNR
DOR = LR.PLUS / LR.MINUS

F_1_score = 2 * ((PPV * TPR)/(PPV + TPR))

Now using the optimal threshold we get:

# confusionMatrix

THRESHOLD <- opt_thresh
out_of_folds_CM <- classifier$pred %>% 
  dplyr::mutate(pred2 = ifelse(Up>THRESHOLD, 'Up', 'Down')) %>%
  dplyr::mutate(pred2 = factor(pred2, levels=c('Down','Up'))) %>%
  dplyr::group_split(Resample) %>%
  purrr::map(~ caret::confusionMatrix(data=.x$pred2, reference=.x$obs))

out_of_folds_CM_vis <- out_of_folds_CM %>% 
  purrr::map(~ .x$table %>% broom::tidy() %>%
                 ggplot(aes(x=Reference, y=Prediction, fill=n, label=n)) +
                 geom_tile() + geom_text(aes(label=n), size=5, color='white'))
 
gridExtra::grid.arrange(grobs=out_of_folds_CM_vis, ncol=2)

Summary Stats LDA

options(yardstick.event_first = FALSE)
ROC_curve <- predict(classifier, type='prob') %>% 
  yardstick::roc_curve(truth=train$BlueTarp_rest, estimate=Up) %>%
  dplyr::mutate(one_minus_specificity = 1-specificity)

# Sensitivity: TP/(TP+FN)
# 1-Specificity: FP/(TN+FP)

ROC_curve_plot <- ROC_curve %>%
  ggplot(aes(x=one_minus_specificity, y=sensitivity)) + 
  geom_line() + geom_point() +
  geom_abline(slope=1, intercept=0, linetype='dashed', color='red') +
  xlab("one_minus_specificity\n(false positive rate)") +
  ggtitle('ROC curve')

ggplotly(ROC_curve_plot)
temp = predict(classifier, type='prob') %>% yardstick::roc_auc(truth=train$BlueTarp_rest, Up) 

lda.Accuracy = accuracy
lda.AUC = temp$.estimate
twoclassSum = twoClassSummary( data = data.frame(obs = classifier$pred$obs,
  pred = classifier$pred$pred,
  Down = classifier$pred$Down, 
  Up = classifier$pred$Up), lev = levels(classifier$pred$obs))

lda.ROC = "TRUE" #unname(twoclassSum[1])
lda.Threshold = opt_thresh
lda.Sensitivity = lda.Recall= lda.Power =  TPR
lda.Specificity= 1-FPR
lda.FDR = FDR
lda.Precision=PPV
df = data.frame(c(lda.Accuracy, lda.AUC, lda.ROC, lda.Threshold, lda.Sensitivity, lda.Specificity, lda.FDR , lda.Precision))

colnames(df) <- c("LDA")
rownames(df) <- c("Accuracy","AUC", "ROC", "Threshold","Sensitivity=Recall=Power","Specificity=1-FPR","FDR","Precision=PPV")
df %>% kbl() %>% kable_styling()
LDA
Accuracy 0.982606162744731
AUC 0.989589213539748
ROC TRUE
Threshold 0.1
Sensitivity=Recall=Power 0.856124432922878
Specificity=1-FPR 0.986801375752365
FDR 0.317312661498708
Precision=PPV 0.682687338501292

QDA

The next classification method that we used is quadratic discriminant analysis (QDA). Like LDA, the QDA classifier results from assuming that the observations from each class are drawn from a Gaussian distribution and plugging estimates for the parameters into Bayes’ theorem in order to perform prediction; unlike LDA, each class is allowed its own covariance matrix. Again, we’ll use the caret package in R to apply the QDA classification procedure with a 10-fold cross-validation evaluation procedure.
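
Analogous to the LDA sketch above (again assuming MASS and the Setup objects), QDA can be fit directly as a quick sanity check:

# fits one covariance matrix per class, giving a quadratic decision boundary
qda_fit <- MASS::qda(BlueTarp_rest ~ Red + Green + Blue, data = train)
head(predict(qda_fit, test)$posterior)   # posterior probabilities P(class | Red, Green, Blue)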

set.seed(1)
train$BlueTarp_rest <- as.factor(train$BlueTarp_rest)
levels(train$BlueTarp_rest) <- c("Down","Up")
# https://blog.revolutionanalytics.com/2016/05/using-caret-to-compare-models.html

classifier <- caret::train(BlueTarp_rest ~ Red + Green + Blue, data=train, method="qda", 
                           preProcess=c("center","scale"),
                           #tuneGrid=data.frame(.dimen = seq(1,10,1)),#seq(1,1000,5)),
                           trControl = caret::trainControl("cv", number=kfoldcv, 
                                                           returnResamp='all', savePredictions='final',
                                                           classProbs=TRUE))
classifier
## Quadratic Discriminant Analysis 
## 
## 48063 samples
##     3 predictor
##     2 classes: 'Down', 'Up' 
## 
## Pre-processing: centered (3), scaled (3) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 43257, 43256, 43257, 43256, 43257, 43257, ... 
## Resampling results:
## 
##   Accuracy  Kappa    
##   0.994861  0.9110695
classifier$resample %>% 
  dplyr::group_split(Resample) %>% 
  purrr::map(rowid_to_column) %>%
  dplyr::bind_rows() %>%
  ggplot(aes(rowid, Accuracy)) + geom_point() +
  geom_smooth(formula='y~x', method='loess', span=.03) +
  geom_line(classifier$results, mapping=aes(seq_along(Accuracy), Accuracy),
            size=2, color='red')

Next we’ll look at the Confusion Matrix from the model.

# confusionMatrix
THRESHOLD <- 0.5
out_of_folds_CM <- classifier$pred %>% 
  dplyr::mutate(pred2 = ifelse(Up>THRESHOLD, 'Up', 'Down')) %>%
  dplyr::mutate(pred2 = factor(pred2, levels=c('Down','Up'))) %>%
  dplyr::group_split(Resample) %>%
  purrr::map(~ caret::confusionMatrix(data=.x$pred2, reference=.x$obs))

out_of_folds_CM_vis <- out_of_folds_CM %>% 
  purrr::map(~ .x$table %>% broom::tidy() %>%
                 ggplot(aes(x=Reference, y=Prediction, fill=n, label=n)) +
                 geom_tile() + geom_text(aes(label=n), size=5, color='white'))
 
gridExtra::grid.arrange(grobs=out_of_folds_CM_vis, ncol=2)

From the 10-fold cross-validation we can see true positive identifications of blue tarps roughly on par with what we saw for LDA. We can again measure the sensitivity and 1-specificity at the initial default threshold of 0.5.

options(yardstick.event_first = FALSE)
ROC_curve <- predict(classifier, type='prob') %>% 
  yardstick::roc_curve(truth=train$BlueTarp_rest, estimate=Up) %>%
  dplyr::mutate(one_minus_specificity = 1-specificity)

# Sensitivity: TP/(TP+FN)
# 1-Specificity: FP/(TN+FP)

ROC_curve_plot <- ROC_curve %>%
  ggplot(aes(x=one_minus_specificity, y=sensitivity)) + 
  geom_line() + geom_point() +
  geom_abline(slope=1, intercept=0, linetype='dashed', color='red') +
  xlab("one_minus_specificity\n(false positive rate)") +
  ggtitle('ROC curve')

ggplotly(ROC_curve_plot)
predict(classifier, type='prob') %>% 
  yardstick::roc_auc(truth=train$BlueTarp_rest, Up) 
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.998
ROC_curve %>% head()
## # A tibble: 6 x 4
##    .threshold specificity sensitivity one_minus_specificity
##         <dbl>       <dbl>       <dbl>                 <dbl>
## 1 -Inf          0                   1                  1   
## 2    6.67e-31   0                   1                  1   
## 3    6.20e-29   0.0000215           1                  1.00
## 4    6.95e-29   0.0000430           1                  1.00
## 5    2.33e-28   0.0000645           1                  1.00
## 6    6.82e-28   0.0000860           1                  1.00
predict(classifier, type='prob') %>% head()
##        Down           Up
## 1 1.0000000 9.445049e-09
## 2 0.9979097 2.090319e-03
## 3 0.9999612 3.875654e-05
## 4 0.9999984 1.558411e-06
## 5 0.9996290 3.710440e-04
## 6 0.9999469 5.309289e-05

We can also look at the ROC curve for each fold of the QDA model.

ROC_curveS <- classifier$pred %>%
  dplyr::group_split(Resample) %>% 
  purrr::map(~ yardstick::roc_curve(.x, truth=obs, Up)) %>%
  purrr::map(~ dplyr::mutate(.x, one_minus_specificity = 1-specificity)) %>%
  purrr::map(~ geom_line(.x, mapping=aes(one_minus_specificity, sensitivity)))

ROC_curveS[[1]] <- ggplot() + ROC_curveS[[1]]
Reduce("+", ROC_curveS) + 
  geom_abline(slope=1, intercept=0, linetype='dashed', color='red') +
  xlab("one_minus_specificity\n(false positive rate)") +
  ggtitle('ROC Curves')

To find the optimal threshold, we’ll again look at the \(F_1\) score.

tuned_call <- function(preds, decision_rule_threshold){
  calls <- rep('Down', length(preds))
  calls[preds>decision_rule_threshold]='Up'
  factor(calls, levels=c('Down','Up'))
}


thresholds = seq(0.01,0.99,.01) # Avoid Threshold Tails
F_1_scores <- rep(NA, length(thresholds))
GMeans <- rep(NA, length(thresholds))

preds <- predict(classifier, type='prob')[,2]
y <- train$BlueTarp_rest
for (i in 1:length(thresholds)){
  yhat <- tuned_call(preds, thresholds[i])
  one_minus_specificity <- mean(yhat[y=='Down']=='Up')
  sensitivity <- mean(yhat[y=='Up']=='Up')
  one_minus_specificity
  sensitivity
  results <- tibble::tibble(y=y, yhat=yhat)
  
  caret::confusionMatrix(yhat, y, positive = "Up") 
  cm <- results %>%yardstick::conf_mat(truth=y, yhat)
  
  trueNegative  = cm$table[1,1]
  falseNegative = cm$table[1,2]
  falsePositive = cm$table[2,1]
  truePositve   = cm$table[2,2]
  
  #paste0("True Positive: ", truePositve )
  #paste0("False Negative: ", falseNegative )
  #paste0("True Negative: ", trueNegative )
  #paste0("False Positive: ", falsePositive )
  conditionPositive = truePositve + falseNegative
  conditionNegative = trueNegative + falsePositive 
  predictedConditionPositive = truePositve +  falsePositive 
  predictedConditionNegative = trueNegative + falseNegative
  
  TotalPopulation = truePositve + falsePositive + trueNegative + falseNegative
  
  prevalence = conditionPositive / TotalPopulation
  accuracy = (truePositve + trueNegative) / TotalPopulation
  PPV = truePositve / predictedConditionPositive
  FDR = falsePositive / predictedConditionPositive
  FOR = falseNegative / predictedConditionNegative
  NPV = trueNegative / predictedConditionNegative
  TPR = truePositve / conditionPositive
  FPR = falsePositive / conditionNegative
  FNR = falseNegative / conditionPositive
  SPC = TNR = trueNegative / conditionNegative
  LR.PLUS = TPR / FPR
  LR.MINUS = FNR/TNR
  DOR = LR.PLUS / LR.MINUS
  
  F_1_score = 2 * ((PPV * TPR)/(PPV + TPR))
  F_1_scores[i] <- F_1_score
  GMeans[i] = sqrt(TPR*SPC)
}

opt_thresh = thresholds[tail(which(na.omit(F_1_scores) == min(na.omit(F_1_scores))), n=1)]

if (opt_thresh < 0.1){opt_thresh = 0.1}
if (opt_thresh > 0.5){opt_thresh = 0.5}

#opt_thresh

#points(opt_thresh, F_1_scores[tail(which(na.omit(F_1_scores) == min(na.omit(F_1_scores))), n=1)], col = 'red', cex = 2, pch = 19)
plot(thresholds, F_1_scores, xlab = 'Thresholds', ylab = 'F1 Score', type = 'l')

plot(thresholds, GMeans, xlab = 'Thresholds', ylab = 'Geometric Mean', type = 'l')

thresholds[tail(which(na.omit(GMeans) == max(na.omit(GMeans))), n=1)]
## [1] 0.02
paste0("Confusion Matrix 0.5")
## [1] "Confusion Matrix 0.5"
y <- train$BlueTarp_rest
yhat <- tuned_call(preds, 0.5)
one_minus_specificity <- mean(yhat[y=='Down']=='Up')
sensitivity <- mean(yhat[y=='Up']=='Up')
results <- tibble::tibble(y=y, yhat=yhat)

caret::confusionMatrix(yhat, y, positive = "Up") 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  Down    Up
##       Down 46506   232
##       Up      14  1311
##                                           
##                Accuracy : 0.9949          
##                  95% CI : (0.9942, 0.9955)
##     No Information Rate : 0.9679          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9116          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.84964         
##             Specificity : 0.99970         
##          Pos Pred Value : 0.98943         
##          Neg Pred Value : 0.99504         
##              Prevalence : 0.03210         
##          Detection Rate : 0.02728         
##    Detection Prevalence : 0.02757         
##       Balanced Accuracy : 0.92467         
##                                           
##        'Positive' Class : Up              
## 
cm <- results %>%yardstick::conf_mat(truth=y, yhat)

From the confusion matrix at the initial threshold of 0.5 above, and from our \(F_1\) plot, we selected our optimal threshold as 0.1.

yhat <- tuned_call(preds, opt_thresh)
one_minus_specificity <- mean(yhat[y=='Down']=='Up')
sensitivity <- mean(yhat[y=='Up']=='Up')
one_minus_specificity
## [1] 0.007050731
sensitivity
## [1] 0.9086196
results <- tibble::tibble(y=y, yhat=yhat)

caret::confusionMatrix(yhat, y, positive = "Up") 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  Down    Up
##       Down 46192   141
##       Up     328  1402
##                                           
##                Accuracy : 0.9902          
##                  95% CI : (0.9893, 0.9911)
##     No Information Rate : 0.9679          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8517          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.90862         
##             Specificity : 0.99295         
##          Pos Pred Value : 0.81040         
##          Neg Pred Value : 0.99696         
##              Prevalence : 0.03210         
##          Detection Rate : 0.02917         
##    Detection Prevalence : 0.03599         
##       Balanced Accuracy : 0.95078         
##                                           
##        'Positive' Class : Up              
## 
cm <- results %>%yardstick::conf_mat(truth=y, yhat)

trueNegative  = cm$table[1,1]
falseNegative = cm$table[1,2]
falsePositive = cm$table[2,1]
truePositve   = cm$table[2,2]

conditionPositive = truePositve + falseNegative
conditionNegative = trueNegative + falsePositive 
predictedConditionPositive = truePositve +  falsePositive 
predictedConditionNegative = trueNegative + falseNegative

TotalPopulation = truePositve + falsePositive + trueNegative + falseNegative

prevalence = conditionPositive / TotalPopulation
accuracy = (truePositve + trueNegative) / TotalPopulation
PPV = truePositve / predictedConditionPositive
FDR = falsePositive / predictedConditionPositive
FOR = falseNegative / predictedConditionNegative
NPV = trueNegative / predictedConditionNegative
TPR = truePositve / conditionPositive
FPR = falsePositive / conditionNegative
FNR = falseNegative / conditionPositive
SPC = TNR = trueNegative / conditionNegative
LR.PLUS = TPR / FPR
LR.MINUS = FNR/TNR
DOR = LR.PLUS / LR.MINUS

F_1_score = 2 * ((PPV * TPR)/(PPV + TPR))

Now using the optimal threshold we get:

# confusionMatrix

THRESHOLD <- opt_thresh
out_of_folds_CM <- classifier$pred %>% 
  dplyr::mutate(pred2 = ifelse(Up>THRESHOLD, 'Up', 'Down')) %>%
  dplyr::mutate(pred2 = factor(pred2, levels=c('Down','Up'))) %>%
  dplyr::group_split(Resample) %>%
  purrr::map(~ caret::confusionMatrix(data=.x$pred2, reference=.x$obs))

out_of_folds_CM_vis <- out_of_folds_CM %>% 
  purrr::map(~ .x$table %>% broom::tidy() %>%
                 ggplot(aes(x=Reference, y=Prediction, fill=n, label=n)) +
                 geom_tile() + geom_text(aes(label=n), size=5, color='white'))
 
gridExtra::grid.arrange(grobs=out_of_folds_CM_vis, ncol=2)

Summary Stats QDA

options(yardstick.event_first = FALSE)
ROC_curve <- predict(classifier, type='prob') %>% 
  yardstick::roc_curve(truth=train$BlueTarp_rest, estimate=Up) %>%
  dplyr::mutate(one_minus_specificity = 1-specificity)

# Sensitivity: TP/(TP+FN)
# 1-Specificity: FP/(TN+FP)

ROC_curve_plot <- ROC_curve %>%
  ggplot(aes(x=one_minus_specificity, y=sensitivity)) + 
  geom_line() + geom_point() +
  geom_abline(slope=1, intercept=0, linetype='dashed', color='red') +
  xlab("one_minus_specificity\n(false positive rate)") +
  ggtitle('ROC curve')

ggplotly(ROC_curve_plot)
temp = predict(classifier, type='prob') %>% yardstick::roc_auc(truth=train$BlueTarp_rest, Up) 

qda.Accuracy = accuracy
qda.AUC = temp$.estimate
twoclassSum = twoClassSummary( data = data.frame(obs = classifier$pred$obs,
  pred = classifier$pred$pred,
  Down = classifier$pred$Down, 
  Up = classifier$pred$Up), lev = levels(classifier$pred$obs))

qda.ROC = "TRUE" #unname(twoclassSum[1])
qda.Threshold = opt_thresh
qda.Sensitivity = qda.Recall= qda.Power =  TPR
qda.Specificity= 1-FPR
qda.FDR = FDR
qda.Precision=PPV
df = data.frame(c(qda.Accuracy, qda.AUC, qda.ROC, qda.Threshold, qda.Sensitivity, qda.Specificity, qda.FDR , qda.Precision))

colnames(df) <- c("QDA")
rownames(df) <- c("Accuracy","AUC", "ROC", "Threshold","Sensitivity=Recall=Power","Specificity=1-FPR","FDR","Precision=PPV")
df %>% kbl() %>% kable_styling()
QDA
Accuracy 0.990241974075692
AUC 0.998273998625808
ROC TRUE
Threshold 0.1
Sensitivity=Recall=Power 0.908619572261828
Specificity=1-FPR 0.992949269131556
FDR 0.189595375722543
Precision=PPV 0.810404624277457

Logistic Regression

Our final classification method is logistic regression, fit as a generalized linear model (glm) with a binomial family. Again, we’ll use the caret package in R to apply the glm classification procedure with a 10-fold cross-validation evaluation procedure.
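
As a minimal sketch of the underlying model caret fits in this section (assuming the same train data frame as above), the logistic regression can be fit with base R's glm():

glm_fit <- glm(BlueTarp_rest ~ Red + Green + Blue, data = train, family = binomial)
summary(glm_fit)$coefficients              # log-odds coefficients for each color channel
head(predict(glm_fit, type = "response"))  # fitted probabilities P(Up) on the training data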

set.seed(1)
train$BlueTarp_rest <- as.factor(train$BlueTarp_rest)
levels(train$BlueTarp_rest) <- c("Down","Up")
# https://blog.revolutionanalytics.com/2016/05/using-caret-to-compare-models.html

classifier <- caret::train(BlueTarp_rest ~ Red + Green + Blue, data=train, method="glm", family=binomial(),
      preProcess=c("center","scale"), trControl = caret::trainControl("cv", number=kfoldcv, 
      returnResamp='all', savePredictions='final',classProbs=TRUE))
classifier
## Generalized Linear Model 
## 
## 48063 samples
##     3 predictor
##     2 classes: 'Down', 'Up' 
## 
## Pre-processing: centered (3), scaled (3) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 43257, 43256, 43257, 43256, 43257, 43257, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.9954435  0.9238951
classifier$resample %>% 
  dplyr::group_split(Resample) %>% 
  purrr::map(rowid_to_column) %>%
  dplyr::bind_rows() %>%
  ggplot(aes(rowid, Accuracy)) + geom_point() +
  geom_smooth(formula='y~x', method='loess', span=.03) +
  geom_line(classifier$results, mapping=aes(seq_along(Accuracy), Accuracy),
            size=2, color='red')

Next we’ll look at the Confusion Matrix from the model.

# confusionMatrix
THRESHOLD <- 0.5
out_of_folds_CM <- classifier$pred %>% 
  dplyr::mutate(pred2 = ifelse(Up>THRESHOLD, 'Up', 'Down')) %>%
  dplyr::mutate(pred2 = factor(pred2, levels=c('Down','Up'))) %>%
  dplyr::group_split(Resample) %>%
  purrr::map(~ caret::confusionMatrix(data=.x$pred2, reference=.x$obs))

out_of_folds_CM_vis <- out_of_folds_CM %>% 
  purrr::map(~ .x$table %>% broom::tidy() %>%
                 ggplot(aes(x=Reference, y=Prediction, fill=n, label=n)) +
                 geom_tile() + geom_text(aes(label=n), size=5, color='white'))
 
gridExtra::grid.arrange(grobs=out_of_folds_CM_vis, ncol=2)

From the 10-fold cross-validation we can see roughly 130-140 true positive identifications of blue tarps per fold. We can again measure the sensitivity and 1-specificity at the initial default threshold of 0.5.

options(yardstick.event_first = FALSE)
ROC_curve <- predict(classifier, type='prob') %>% 
  yardstick::roc_curve(truth=train$BlueTarp_rest, estimate=Up) %>%
  dplyr::mutate(one_minus_specificity = 1-specificity)

# Sensitivity: TP/(TP+FN)
# 1-Specificity: FP/(TN+FP)

ROC_curve_plot <- ROC_curve %>%
  ggplot(aes(x=one_minus_specificity, y=sensitivity)) + 
  geom_line() + geom_point() +
  geom_abline(slope=1, intercept=0, linetype='dashed', color='red') +
  xlab("one_minus_specificity\n(false positive rate)") +
  ggtitle('ROC curve')

ggplotly(ROC_curve_plot)
predict(classifier, type='prob') %>% 
  yardstick::roc_auc(truth=train$BlueTarp_rest, Up) 
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.999
ROC_curve %>% head()
## # A tibble: 6 x 4
##    .threshold specificity sensitivity one_minus_specificity
##         <dbl>       <dbl>       <dbl>                 <dbl>
## 1 -Inf              0               1                 1    
## 2    2.22e-16       0               1                 1    
## 3    9.36e-14       0.220           1                 0.780
## 4    9.36e-14       0.220           1                 0.780
## 5    9.38e-14       0.220           1                 0.780
## 6    9.38e-14       0.220           1                 0.780
predict(classifier, type='prob') %>% head()
##        Down           Up
## 1 1.0000000 2.864604e-12
## 2 0.9992914 7.086498e-04
## 3 0.9999997 2.521973e-07
## 4 1.0000000 1.031507e-09
## 5 0.9999971 2.906881e-06
## 6 0.9944786 5.521443e-03

We can also look at the ROC curve for each fold of the glm model.

ROC_curveS <- classifier$pred %>%
  dplyr::group_split(Resample) %>% 
  purrr::map(~ yardstick::roc_curve(.x, truth=obs, Up)) %>%
  purrr::map(~ dplyr::mutate(.x, one_minus_specificity = 1-specificity)) %>%
  purrr::map(~ geom_line(.x, mapping=aes(one_minus_specificity, sensitivity)))

ROC_curveS[[1]] <- ggplot() + ROC_curveS[[1]]
Reduce("+", ROC_curveS) + 
  geom_abline(slope=1, intercept=0, linetype='dashed', color='red') +
  xlab("one_minus_specificity\n(false positive rate)") +
  ggtitle('ROC Curves')

To find the optimal threshold, we’ll again look at the \(F_1\) score.

tuned_call <- function(preds, decision_rule_threshold){
  calls <- rep('Down', length(preds))
  calls[preds>decision_rule_threshold]='Up'
  factor(calls, levels=c('Down','Up'))
}


thresholds = seq(0.01,0.99,.01) # Avoid Threshold Tails
F_1_scores <- rep(NA, length(thresholds))
GMeans <- rep(NA, length(thresholds))

preds <- predict(classifier, type='prob')[,2]
y <- train$BlueTarp_rest
for (i in 1:length(thresholds)){
  yhat <- tuned_call(preds, thresholds[i])
  one_minus_specificity <- mean(yhat[y=='Down']=='Up')
  sensitivity <- mean(yhat[y=='Up']=='Up')
  results <- tibble::tibble(y=y, yhat=yhat)
  cm <- results %>% yardstick::conf_mat(truth=y, yhat)
  
  trueNegative  = cm$table[1,1]
  falseNegative = cm$table[1,2]
  falsePositive = cm$table[2,1]
  truePositve   = cm$table[2,2]
  
  #paste0("True Positive: ", truePositve )
  #paste0("False Negative: ", falseNegative )
  #paste0("True Negative: ", trueNegative )
  #paste0("False Positive: ", falsePositive )
  conditionPositive = truePositve + falseNegative
  conditionNegative = trueNegative + falsePositive 
  predictedConditionPositive = truePositve +  falsePositive 
  predictedConditionNegative = trueNegative + falseNegative
  
  TotalPopulation = truePositve + falsePositive + trueNegative + falseNegative
  
  prevalence = conditionPositive / TotalPopulation
  accuracy = (truePositve + trueNegative) / TotalPopulation
  PPV = truePositve / predictedConditionPositive
  FDR = falsePositive / predictedConditionPositive
  FOR = falseNegative / predictedConditionNegative
  NPV = trueNegative / predictedConditionNegative
  TPR = truePositve / conditionPositive
  FPR = falsePositive / conditionNegative
  FNR = falseNegative / conditionPositive
  SPC = TNR = trueNegative / conditionNegative
  LR.PLUS = TPR / FPR
  LR.MINUS = FNR/TNR
  DOR = LR.PLUS / LR.MINUS
  
  F_1_score = 2 * ((PPV * TPR)/(PPV + TPR))
  F_1_scores[i] <- F_1_score
  GMeans[i] = sqrt(TPR*SPC)
}

# Candidate threshold from the F1 sweep (last index at the minimum F1 value);
# the clamp below keeps it in [0.1, 0.5], and the plots inform the final choice of 0.1
opt_thresh = thresholds[tail(which(na.omit(F_1_scores) == min(na.omit(F_1_scores))), n=1)]

if (opt_thresh < 0.1){opt_thresh = 0.1}
if (opt_thresh > 0.5){opt_thresh = 0.5}

#opt_thresh

#points(opt_thresh, F_1_scores[tail(which(na.omit(F_1_scores) == min(na.omit(F_1_scores))), n=1)], col = 'red', cex = 2, pch = 19)
plot(thresholds, F_1_scores, xlab = 'Thresholds', ylab = 'F1 Score', type = 'l')

plot(thresholds, GMeans, xlab = 'Thresholds', ylab = 'Geometric Mean', type = 'l')

thresholds[tail(which(na.omit(GMeans) == max(na.omit(GMeans))), n=1)]
## [1] 0.04
paste0("Confusion Matrix 0.5")
## [1] "Confusion Matrix 0.5"
y <- train$BlueTarp_rest
yhat <- tuned_call(preds, 0.5)
one_minus_specificity <- mean(yhat[y=='Down']=='Up')
sensitivity <- mean(yhat[y=='Up']=='Up')
results <- tibble::tibble(y=y, yhat=yhat)

caret::confusionMatrix(yhat, y, positive = "Up") 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  Down    Up
##       Down 46467   166
##       Up      53  1377
##                                          
##                Accuracy : 0.9954         
##                  95% CI : (0.9948, 0.996)
##     No Information Rate : 0.9679         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.924          
##                                          
##  Mcnemar's Test P-Value : 3.783e-14      
##                                          
##             Sensitivity : 0.89242        
##             Specificity : 0.99886        
##          Pos Pred Value : 0.96294        
##          Neg Pred Value : 0.99644        
##              Prevalence : 0.03210        
##          Detection Rate : 0.02865        
##    Detection Prevalence : 0.02975        
##       Balanced Accuracy : 0.94564        
##                                          
##        'Positive' Class : Up             
## 
cm <- results %>% yardstick::conf_mat(truth=y, yhat)

From the initial threshold of 0.5 we can see a sensitivity of about 0.892 and a specificity of about 0.999, meaning 166 true blue-tarp pixels are still classified as Down.

From our plot we selected our optimal threshold as 0.1.

yhat <- tuned_call(preds, opt_thresh)
one_minus_specificity <- mean(yhat[y=='Down']=='Up')
sensitivity <- mean(yhat[y=='Up']=='Up')
one_minus_specificity
## [1] 0.00827601
sensitivity
## [1] 0.9643552
results <- tibble::tibble(y=y, yhat=yhat)

caret::confusionMatrix(yhat, y, positive = "Up") 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  Down    Up
##       Down 46135    55
##       Up     385  1488
##                                         
##                Accuracy : 0.9908        
##                  95% CI : (0.99, 0.9917)
##     No Information Rate : 0.9679        
##     P-Value [Acc > NIR] : < 2.2e-16     
##                                         
##                   Kappa : 0.8665        
##                                         
##  Mcnemar's Test P-Value : < 2.2e-16     
##                                         
##             Sensitivity : 0.96436       
##             Specificity : 0.99172       
##          Pos Pred Value : 0.79445       
##          Neg Pred Value : 0.99881       
##              Prevalence : 0.03210       
##          Detection Rate : 0.03096       
##    Detection Prevalence : 0.03897       
##       Balanced Accuracy : 0.97804       
##                                         
##        'Positive' Class : Up            
## 
cm <- results %>% yardstick::conf_mat(truth=y, yhat)

trueNegative  = cm$table[1,1]
falseNegative = cm$table[1,2]
falsePositive = cm$table[2,1]
truePositve   = cm$table[2,2]

conditionPositive = truePositve + falseNegative
conditionNegative = trueNegative + falsePositive 
predictedConditionPositive = truePositve +  falsePositive 
predictedConditionNegative = trueNegative + falseNegative

TotalPopulation = truePositve + falsePositive + trueNegative + falseNegative

prevalence = conditionPositive / TotalPopulation
accuracy = (truePositve + trueNegative) / TotalPopulation
PPV = truePositve / predictedConditionPositive
FDR = falsePositive / predictedConditionPositive
FOR = falseNegative / predictedConditionNegative
NPV = trueNegative / predictedConditionNegative
TPR = truePositve / conditionPositive
FPR = falsePositive / conditionNegative
FNR = falseNegative / conditionPositive
SPC = TNR = trueNegative / conditionNegative
LR.PLUS = TPR / FPR
LR.MINUS = FNR/TNR
DOR = LR.PLUS / LR.MINUS

F_1_score = 2 * ((PPV * TPR)/(PPV + TPR))

Now using the optimal threshold we get:

# confusionMatrix

THRESHOLD <- opt_thresh
out_of_folds_CM <- classifier$pred %>% 
  dplyr::mutate(pred2 = ifelse(Up>THRESHOLD, 'Up', 'Down')) %>%
  dplyr::mutate(pred2 = factor(pred2, levels=c('Down','Up'))) %>%
  dplyr::group_split(Resample) %>%
  purrr::map(~ caret::confusionMatrix(data=.x$pred2, reference=.x$obs))

out_of_folds_CM_vis <- out_of_folds_CM %>% 
  purrr::map(~ .x$table %>% broom::tidy() %>%
                 ggplot(aes(x=Reference, y=Prediction, fill=n, label=n)) +
                 geom_tile() + geom_text(aes(label=n), size=5, color='white'))
 
gridExtra::grid.arrange(grobs=out_of_folds_CM_vis, ncol=2)

Summary Statistics for Logistic Regression

options(yardstick.event_first = FALSE)
ROC_curve <- predict(classifier, type='prob') %>% 
  yardstick::roc_curve(truth=train$BlueTarp_rest, estimate=Up) %>%
  dplyr::mutate(one_minus_specificity = 1-specificity)

# Sensitivity: TP/(TP+FN)
# 1-Specificity: FP/(TN+FP)

ROC_curve_plot <- ROC_curve %>%
  ggplot(aes(x=one_minus_specificity, y=sensitivity)) + 
  geom_line() + geom_point() +
  geom_abline(slope=1, intercept=0, linetype='dashed', color='red') +
  xlab("one_minus_specificity\n(false positive rate)") +
  ggtitle('ROC curve')

ggplotly(ROC_curve_plot)
temp = predict(classifier, type='prob') %>% yardstick::roc_auc(truth=train$BlueTarp_rest, Up) 

glm.Accuracy = accuracy
glm.AUC = temp$.estimate
twoclassSum = twoClassSummary( data = data.frame(obs = classifier$pred$obs,
  pred = classifier$pred$pred,
  Down = classifier$pred$Down, 
  Up = classifier$pred$Up), lev = levels(classifier$pred$obs))

glm.ROC = "TRUE" #unname(twoclassSum[1])
glm.Threshold = opt_thresh
glm.Sensitivity = glm.Recall= glm.Power =  TPR
glm.Specificity= 1-FPR
glm.FDR = FDR
glm.Precision=PPV
df = data.frame(c(glm.Accuracy, glm.AUC, glm.ROC, glm.Threshold, glm.Sensitivity, glm.Specificity, glm.FDR , glm.Precision))

colnames(df) <- c("Logistic Regression")
rownames(df) <- c("Accuracy","AUC", "ROC", "Threshold","Sensitivity=Recall=Power","Specificity=1-FPR","FDR","Precision=PPV")
df %>% kbl() %>% kable_styling()
                           Logistic Regression
Accuracy                   0.990845348813016
AUC                        0.998727855363222
ROC                        TRUE
Threshold                  0.1
Sensitivity=Recall=Power   0.964355152300713
Specificity=1-FPR          0.991723989681857
FDR                        0.205552589428724
Precision=PPV              0.794447410571276

Conclusions

df = data.frame(c(knn.Accuracy, knn.AUC, knn.ROC, knn.Threshold, knn.Sensitivity, knn.Specificity, knn.FDR , knn.Precision),
                c(lda.Accuracy, lda.AUC, lda.ROC, lda.Threshold, lda.Sensitivity, lda.Specificity, lda.FDR , lda.Precision),
                c(qda.Accuracy, qda.AUC, qda.ROC, qda.Threshold, qda.Sensitivity, qda.Specificity, qda.FDR , qda.Precision), 
                c(glm.Accuracy, glm.AUC, glm.ROC, glm.Threshold, glm.Sensitivity, glm.Specificity, glm.FDR , glm.Precision))


colnames(df) <- c(paste0("KNN (k =",bestk,")"), "LDA","QDA","Logistic Regression")
rownames(df) <- c("Accuracy","AUC", "ROC", "Threshold","Sensitivity=Recall=Power","Specificity=1-FPR","FDR","Precision=PPV")
df %>% kbl() %>% kable_styling()
                           KNN (k = 11)         LDA                  QDA                  Logistic Regression
Accuracy                   0.996026049143832    0.982606162744731    0.990241974075692    0.990845348813016
AUC                        0.999812734291107    0.989589213539748    0.998273998625808    0.998727855363222
ROC                        TRUE                 TRUE                 TRUE                 TRUE
Threshold                  0.1                  0.1                  0.1                  0.1
Sensitivity=Recall=Power   0.985093972780298    0.856124432922878    0.908619572261828    0.964355152300713
Specificity=1-FPR          0.996388650042992    0.986801375752365    0.992949269131556    0.991723989681857
FDR                        0.0995260663507109   0.317312661498708    0.189595375722543    0.205552589428724
Precision=PPV              0.900473933649289    0.682687338501292    0.810404624277457    0.794447410571276

In order to determine the overall optimal classification method, we'll use a scoring system of 5 points for first place, 3 for second, 2 for third, and 1 for fourth to rank each model on several statistics.
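The tallies in the tables below were compiled by hand; as a sketch, the same 5/3/2/1 scheme could also be computed directly from the summary data frame df built above (the metric names and the lower-is-better treatment of FDR are the only assumptions):

# Sketch of the 5/3/2/1 scoring scheme applied to the comparison table df above.
# Higher is better for every metric except FDR; ties are broken in column order.
points_for_rank <- c(5, 3, 2, 1)                 # 1st, 2nd, 3rd, 4th place
metrics <- c("Accuracy", "AUC", "Sensitivity=Recall=Power",
             "Specificity=1-FPR", "FDR", "Precision=PPV")
lower_is_better <- metrics == "FDR"

scores <- sapply(seq_along(metrics), function(i) {
  vals <- as.numeric(unlist(df[metrics[i], ]))   # df stores characters, so coerce
  ord  <- rank(if (lower_is_better[i]) vals else -vals, ties.method = "first")
  points_for_rank[ord]
})
rownames(scores) <- colnames(df)
colnames(scores) <- metrics
rowSums(scores)                                  # total points per model

The row sums should reproduce the final running totals in the tables that follow.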

If we look at the statistics provided, accuracy is a measure of \(\frac{TruePositive + TrueNegative}{TotalPopulation}\), so the optimal model should be the one with the highest accuracy.

                 KNN (k = 11)   LDA   QDA   Logistic Regression
Points                      5     1     2                     3
Running total               5     1     2                     3

AUC, or Area Under the Curve, is a measure of performance across all possible classification thresholds: lowering the classification threshold classifies more items as positive, trading specificity for sensitivity, and a high AUC means the model performs well across that entire range. As such, you'll want a model with a high AUC value.
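One way to read the AUC is as the probability that a randomly chosen blue-tarp pixel receives a higher predicted probability than a randomly chosen non-tarp pixel. A quick Monte Carlo sketch of that reading, reusing the preds and y vectors from the threshold loop above:

# Monte Carlo check of the AUC interpretation:
# P(predicted prob of a random 'Up' pixel > predicted prob of a random 'Down' pixel)
set.seed(1)
p_up   <- sample(preds[y == 'Up'],   10000, replace = TRUE)
p_down <- sample(preds[y == 'Down'], 10000, replace = TRUE)
mean(p_up > p_down) + 0.5 * mean(p_up == p_down)   # should be close to the glm roc_auc (~0.999)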

                 KNN (k = 11)   LDA   QDA   Logistic Regression
Points                      5     1     2                     3
Running total              10     2     4                     6

Using the same threshold of 0.1 across the same training set, we can compare the sensitivities and specificities.

Sensitivity is defined as \(\frac{TruePositive}{TruePositive + FalseNegative}\). Specificity is defined as \(\frac{TrueNegative}{FalsePositive + TrueNegative}\).

You want both of these to be high in order to classify observations correctly.

Sensitivity:

                 KNN (k = 11)   LDA   QDA   Logistic Regression
Points                      5     1     2                     3
Running total              15     3     6                     9

Specificity:

                 KNN (k = 11)   LDA   QDA   Logistic Regression
Points                      5     1     3                     2
Running total              20     4     9                    11

The False Discovery Rate (FDR) is the expected proportion of predicted positives that are type I errors (false positives). The optimal model is the one with the minimum FDR.
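In terms of the confusion-matrix quantities computed earlier, \(FDR = \frac{FalsePositive}{FalsePositive + TruePositive} = 1 - PPV\), which is why the FDR and Precision rankings mirror each other.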

                 KNN (k = 11)   LDA   QDA   Logistic Regression
Points                      5     1     3                     2
Running total              25     5    12                    13

Precision is a measure of \(\frac{TruePositive}{FalsePositive + TruePositive}\). Ideally, you'll want a high value of precision.

                 KNN (k = 11)   LDA   QDA   Logistic Regression
Points                      5     1     3                     2
Running total              30     6    15                    15

In the determination of which algorithm works best, KNN was shown to be the optimal classifier, followed by logistic regression and quadratic discriminant analysis, with linear discriminant analysis last. Although KNN performed the best, the logistic regression and quadratic discriminant analysis models had comparable accuracy and precision. Those models have a higher FDR, but that is largely a consequence of the selected threshold. With QDA and logistic regression performing better than LDA, we can infer that the data has a non-linear boundary between the classes. For more complicated decision boundaries, a non-parametric approach such as KNN can be superior, and KNN's strong performance here supports that inference, since KNN typically performs better for non-linear classification problems. When the true decision boundaries are linear, the LDA and logistic regression approaches will tend to perform well; when the boundaries are moderately non-linear, QDA may give better results.

An additional recommended action to improve results is to select the threshold more precisely and compare it across the classifiers. To that end, I opted to use the \(F_1\) Score, since it can measure performance on an unbalanced classification data set, and cross-referenced it with the geometric mean, since the geometric mean can “indicate a poor performance in the classification of the positive cases even if the negative cases are correctly classified as such” (Akosa). The more finely tuned thresholds appeared to be less than 0.1; however, at such low thresholds you lose some flexibility in applying future data to the generated classifiers.

The variables in this data set are the additive colors of the RGB color spectrum, and the response is an unbalanced binary classification (an unequal distribution of classes in the training data set), which makes the problem well suited to the predictive modeling tools used here. Since the goal was to find an algorithm that could effectively search the images to locate displaced persons and communicate those locations to rescue workers so they could help those who needed it in time, the work shown above can be used to find these persons and administer aid. Due to limited resources it would not be feasible to save everyone, but adjusting the threshold allows us to reach more persons than leaving it at the default.

In regards to software aspects, we expect KNN to take the longest to run: the time required for 10-fold cross validation, combined with tuning over candidate values of k, far exceeds that of LDA, QDA and GLM. It is worth noting, however, that when running KNN for a single value of k, the difference in run time between the classification models is negligible. Furthermore, you may notice that in my confusion matrices I used caret, yardstick, and manual calculations with mean(). All three produced nearly the same number of true positives at a given threshold, yet a few instances vary. This may be due to how the underlying architecture handles floating-point precision and random number generation (especially if any step re-randomizes the training data). My calculations carry a large number of significant digits, but when rounded to fewer significant digits the values are comparable. Also, since the goal of the analysis was to increase the number of true positives by adjusting the threshold, the extra false positives that come with a lower threshold were deemed a fair trade-off for raising the number of true positives in each model.
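As a rough illustration of that run-time comparison (the formula, control settings, and tuneLength below are placeholders rather than the exact values used earlier), the wall-clock cost of each fit can be captured with system.time():

# Rough timing sketch: 10-fold CV for a single glm fit vs. a KNN grid search over k
ctrl <- caret::trainControl(method = 'cv', number = 10,
                            classProbs = TRUE, savePredictions = TRUE)
timings <- list(
  glm = system.time(caret::train(BlueTarp_rest ~ Red + Green + Blue, data = train,
                                 method = 'glm', family = 'binomial', trControl = ctrl)),
  knn = system.time(caret::train(BlueTarp_rest ~ Red + Green + Blue, data = train,
                                 method = 'knn', tuneLength = 10, trControl = ctrl))
)
sapply(timings, `[[`, 'elapsed')                  # elapsed seconds per model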

References