# ggpairs
library(ggplot2)
library(GGally)
library(hrbrthemes)
# %>%
library(tidyverse)

# kfold
library(caret)
kfoldcv = 10

# lda
library(MASS)

# prediction
library(ROCR)

#partimat
library(klaR)
library(gridExtra)
library(yardstick)
library(plotly)
library(kableExtra)

Introduction

In this project, you will use classification methods covered in this course to solve a real historical data mining problem: locating displaced persons living in makeshift shelters following the destruction of the earthquake in Haiti in 2010.

Following that earthquake, rescue workers, mostly from the United States military, needed to get food and water to the displaced persons. But with destroyed communications, impassable roads, and thousands of square miles, actually locating the people who needed help was challenging. As part of the rescue effort, a team from the Rochester Institute of Technology was flying an aircraft to collect high-resolution geo-referenced imagery.

HaitiPixels <- read.csv(file = '/Users/awells/UVA/SYS6018/Disaster_Relief_Project/HaitiPixels.csv')
head(HaitiPixels)
##        Class Red Green Blue
## 1 Vegetation  64    67   50
## 2 Vegetation  64    67   50
## 3 Vegetation  64    66   49
## 4 Vegetation  75    82   53
## 5 Vegetation  74    82   54
## 6 Vegetation  72    76   52
HaitiPixels$Class <- as.factor(HaitiPixels$Class)
#summary(HaitiPixels)

#### Delete these 2 lines when ready for full set
#data <- sample_n(HaitiPixels, 10000)
#HaitiPixels <- data
#### 

dim(HaitiPixels)
## [1] 63241     4
summary(HaitiPixels)
##               Class            Red          Green            Blue      
##  Blue Tarp       : 2022   Min.   : 48   Min.   : 48.0   Min.   : 44.0  
##  Rooftop         : 9903   1st Qu.: 80   1st Qu.: 78.0   1st Qu.: 63.0  
##  Soil            :20566   Median :163   Median :148.0   Median :123.0  
##  Various Non-Tarp: 4744   Mean   :163   Mean   :153.7   Mean   :125.1  
##  Vegetation      :26006   3rd Qu.:255   3rd Qu.:226.0   3rd Qu.:181.0  
##                           Max.   :255   Max.   :255.0   Max.   :255.0
attach(HaitiPixels)

It was known that the people whose homes had been destroyed by the earthquake were creating temporary shelters using blue tarps, and these blue tarps would be good indicators of where the displaced persons were – if only they could be located in time, out of the thousands of images that would be collected every day.

detach(HaitiPixels)
# Blue Tarp
BlueTarp_rest <- rep("Down",length(HaitiPixels$Class))
BlueTarp_rest[HaitiPixels$Class == "Blue Tarp"] <- "Up"

HaitiPixels <- data.frame(HaitiPixels,BlueTarp_rest )

HaitiPixels$BlueTarp_rest <- as.factor(HaitiPixels$BlueTarp_rest)

summary(HaitiPixels)
##               Class            Red          Green            Blue      
##  Blue Tarp       : 2022   Min.   : 48   Min.   : 48.0   Min.   : 44.0  
##  Rooftop         : 9903   1st Qu.: 80   1st Qu.: 78.0   1st Qu.: 63.0  
##  Soil            :20566   Median :163   Median :148.0   Median :123.0  
##  Various Non-Tarp: 4744   Mean   :163   Mean   :153.7   Mean   :125.1  
##  Vegetation      :26006   3rd Qu.:255   3rd Qu.:226.0   3rd Qu.:181.0  
##                           Max.   :255   Max.   :255.0   Max.   :255.0  
##  BlueTarp_rest
##  Down:61219   
##  Up  : 2022   
##               
##               
##               
## 
attach(HaitiPixels)
BlueTarp_rest <- as.factor(BlueTarp_rest)

The problem was that there was no way for aid workers to search the thousands of images in time to find the tarps and communicate the locations back to the rescue workers on the ground in time. The solution would be provided by data-mining algorithms, which could search the images far faster and more thoroughly (and accurately?) than humanly possible.

The goal was to find an algorithm that could effectively search the images in order to locate displaced persons and communicate those locations to rescue workers so they could help those who needed it in time. This disaster relief project is the subject matter for your project in this course, which you will submit in two parts. You will use data from the actual data collection process that was carried out over Haiti. Your goal is to test each of the algorithms you learn in this course on the imagery data collected during the relief efforts made in Haiti in response to the 2010 earthquake, and to determine which method will, as accurately and as quickly as possible, locate as many of the displaced persons identified in the imagery data as possible, so that they can be provided food and water before their situations become unsurvivable.

HaitiPixels %>% ggpairs(upper=list(continuous=wrap("cor", size=3)))

Setup

For the setup of the data set (HaitiPixels, \(n=\) 63241), a natural decision is to use a one-vs-rest approach. Our data is a multi-class classification problem that we decompose into the following binary classifications (a minimal sketch of this decomposition follows the list):

  • Blue Tarp vs [ Rooftop , Soil , Various Non-Tarp, Vegetation ]

  • Rooftop vs [ Blue Tarp, Soil , Various Non-Tarp, Vegetation ]

  • Soil vs [ Blue Tarp, Rooftop , Various Non-Tarp, Vegetation ]

  • Various Non-Tarp vs [ Blue Tarp, Rooftop , Soil , Vegetation ]

  • Vegetation vs [ Blue Tarp, Rooftop , Soil , Various Non-Tarp ]
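
As a minimal sketch of the one-vs-rest decomposition (assuming only the HaitiPixels data frame loaded above; the project itself only carries the Blue Tarp case forward), each class can be recoded against the rest:

# Hypothetical sketch: build a one-vs-rest factor for every class
# ("Target" = the class in question, "Rest" = all other classes)
classes <- levels(HaitiPixels$Class)
one_vs_rest <- lapply(classes, function(cls) {
  factor(ifelse(HaitiPixels$Class == cls, "Target", "Rest"),
         levels = c("Rest", "Target"))
})
names(one_vs_rest) <- classes
sapply(one_vs_rest, summary)   # counts of Rest/Target for each class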

# Get Final Hold - Out 5% of Data ~3000 observations
data.fho = sort(sample(nrow(HaitiPixels), nrow(HaitiPixels)*.05))

fho <- HaitiPixels[data.fho,]
data_remain <- HaitiPixels[-data.fho,]

#80-20 split Train-Test

data_minus_fho = sort(sample(nrow(data_remain), nrow(data_remain)*.8))
train <-  data_remain[data_minus_fho,]
test <- data_remain[-data_minus_fho,]

# shuffle just to be safe
train <- dplyr::sample_n(train, nrow(train))
test <- dplyr::sample_n(test, nrow(test))
fho <- dplyr::sample_n(fho, nrow(fho))

test.BlueTarp <- as.factor(test$BlueTarp_rest)
train$BlueTarp_rest <- as.factor(train$BlueTarp_rest)
levels(train$BlueTarp_rest) <- c("Down","Up")

#paste("Length of Train: ", nrow(train))
#paste("Length of Test: ", nrow(test))
#paste("Length of Final Hold Out: ", nrow(fho))

Since we are looking for temporary shelters made using blue tarps, we generated a variable named “BlueTarp_rest”, a binary response factor with two levels:

  • “Up” - if a blue tarp was visible (or up )

  • “Down” - if a blue tarp was not visible.

We’ll take a random sample of 5% of the HaitiPixels dataset and name it the FHO, or Final Hold-Out dataset, to be used in Part 2 (Hold-Out Test Sample Performance). We then randomly divide the remaining data into an 80% training set and a 20% test set. This leaves a training set of \(n=\) 48063, a test set of \(n=\) 12016, and an FHO of \(n=\) 3162.

Analysis

KNN

The first classification method that we used is K-nearest neighbors (KNN). Given a value for K and a prediction point \(x_0\), KNN first identifies the K training observations that are closest to \(x_0\), represented by \(N_0\). The optimal value for K depends on the bias-variance tradeoff. A small value for K provides the most flexible fit, which will have low bias but high variance; with K = 1, the prediction in a given region depends entirely on a single observation. In contrast, larger values of K provide a smoother and less variable fit; the prediction in a region is an average over several points, so changing one observation has a smaller effect. We’ll use the caret package in R to apply the KNN classification procedure with a 10-fold cross-validation evaluation procedure.
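
As an illustrative sketch of the KNN mechanics just described (not the caret tuning run below), one could call class::knn directly on the centered and scaled color channels, using for example k = 11, the value the tuning below ends up selecting:

library(class)
# center/scale the training predictors, then apply the same transformation to the test set
Xtr <- scale(train[, c("Red", "Green", "Blue")])
Xte <- scale(test[, c("Red", "Green", "Blue")],
             center = attr(Xtr, "scaled:center"),
             scale  = attr(Xtr, "scaled:scale"))
# majority vote among the k = 11 nearest training pixels for each test pixel
knn_pred <- class::knn(train = Xtr, test = Xte, cl = train$BlueTarp_rest, k = 11)
table(Prediction = knn_pred, Reference = test$BlueTarp_rest)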

set.seed(1)

# https://blog.revolutionanalytics.com/2016/05/using-caret-to-compare-models.html
classifier <- caret::train(BlueTarp_rest ~ Red + Green + Blue, data=train, method="knn", 
              preProcess=c("center","scale"), 
              tuneGrid=data.frame(k=seq(1,75,1)),
              #tuneGrid=data.frame(k=seq(1,200,2)),
              #tuneGrid=data.frame(k=seq(1,500,5)),
              trControl = caret::trainControl("cv", number=kfoldcv, 
                          returnResamp='all', savePredictions='final',
                          classProbs=TRUE))
classifier
## k-Nearest Neighbors 
## 
## 48063 samples
##     3 predictor
##     2 classes: 'Down', 'Up' 
## 
## Pre-processing: centered (3), scaled (3) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 43257, 43256, 43257, 43256, 43257, 43257, ... 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa    
##    1  0.9967335  0.9473104
##    2  0.9965670  0.9447763
##    3  0.9971288  0.9538469
##    4  0.9968999  0.9500344
##    5  0.9970872  0.9532107
##    6  0.9971288  0.9540576
##    7  0.9971912  0.9549753
##    8  0.9971704  0.9546549
##    9  0.9971496  0.9542347
##   10  0.9971288  0.9539796
##   11  0.9972120  0.9552741
##   12  0.9971288  0.9538295
##   13  0.9971912  0.9548711
##   14  0.9971912  0.9549441
##   15  0.9971912  0.9549148
##   16  0.9971496  0.9542795
##   17  0.9971288  0.9538579
##   18  0.9971080  0.9534914
##   19  0.9970040  0.9518402
##   20  0.9970456  0.9524746
##   21  0.9970872  0.9531428
##   22  0.9970456  0.9524740
##   23  0.9970040  0.9517660
##   24  0.9969624  0.9510828
##   25  0.9969624  0.9510522
##   26  0.9968792  0.9496469
##   27  0.9969832  0.9513546
##   28  0.9969624  0.9509477
##   29  0.9970040  0.9516051
##   30  0.9969000  0.9499607
##   31  0.9969208  0.9502750
##   32  0.9967543  0.9474787
##   33  0.9967751  0.9477900
##   34  0.9966919  0.9464564
##   35  0.9966711  0.9460519
##   36  0.9966503  0.9457844
##   37  0.9966503  0.9457329
##   38  0.9965046  0.9432600
##   39  0.9965671  0.9443107
##   40  0.9965046  0.9433280
##   41  0.9965046  0.9432968
##   42  0.9965463  0.9439983
##   43  0.9965254  0.9436276
##   44  0.9964422  0.9421373
##   45  0.9964630  0.9425044
##   46  0.9964838  0.9427939
##   47  0.9965046  0.9431580
##   48  0.9964214  0.9417334
##   49  0.9964006  0.9413828
##   50  0.9962966  0.9396722
##   51  0.9963382  0.9403870
##   52  0.9963174  0.9400295
##   53  0.9963174  0.9399937
##   54  0.9962342  0.9386490
##   55  0.9962966  0.9396052
##   56  0.9962550  0.9389496
##   57  0.9962134  0.9382596
##   58  0.9961926  0.9379827
##   59  0.9962342  0.9385854
##   60  0.9961093  0.9365147
##   61  0.9961301  0.9368901
##   62  0.9961301  0.9369300
##   63  0.9961093  0.9366020
##   64  0.9961301  0.9369682
##   65  0.9961717  0.9376369
##   66  0.9961717  0.9376050
##   67  0.9961301  0.9368944
##   68  0.9962134  0.9382634
##   69  0.9961509  0.9372729
##   70  0.9960677  0.9359312
##   71  0.9960885  0.9362566
##   72  0.9961718  0.9376313
##   73  0.9961509  0.9372984
##   74  0.9960885  0.9362384
##   75  0.9960469  0.9354902
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 11.
bestk = classifier$bestTune

Testing across various values of K, we can see that our optimal K is 11. So we’ll proceed with the analysis using this value.

classifier %>% ggplot(aes(x=seq_along(Accuracy), y=Accuracy))

classifier$resample %>% 
  dplyr::group_split(Resample) %>% 
  purrr::map(rowid_to_column) %>%
  dplyr::bind_rows() %>%
  ggplot(aes(rowid, Accuracy)) + geom_point() +
  geom_smooth(formula='y~x', method='loess', span=.03) +
  geom_line(classifier$results, mapping=aes(seq_along(Accuracy), Accuracy),
            size=2, color='red')

Next we’ll look at the confusion matrix from the model. We’ll use the image below as a reference for analyzing the matrix.

Confusion Matrix Reference

# confusionMatrix
THRESHOLD <- 0.5
out_of_folds_CM <- classifier$pred %>% 
  dplyr::mutate(pred2 = ifelse(Up>THRESHOLD, 'Up', 'Down')) %>%
  dplyr::mutate(pred2 = factor(pred2, levels=c('Down','Up'))) %>%
  dplyr::group_split(Resample) %>%
  purrr::map(~ caret::confusionMatrix(data=.x$pred2, reference=.x$obs))

out_of_folds_CM_vis <- out_of_folds_CM %>% 
  purrr::map(~ .x$table %>% broom::tidy() %>%
                 ggplot(aes(x=Reference, y=Prediction, fill=n, label=n)) +
                 geom_tile() + geom_text(aes(label=n), size=5, color='white'))
 
gridExtra::grid.arrange(grobs=out_of_folds_CM_vis, ncol=2)

From the 10-fold cross-validation we can see the true positive identifications of blue tarps in each fold. We can measure the sensitivity, also known as the true positive rate: the fraction of actual blue-tarp pixels that are correctly identified at a given threshold (the initial default threshold is 0.5). The false positive rate is 1-specificity: the fraction of non-tarp pixels that we incorrectly classify as tarps at that same threshold of 0.5. The receiver operating characteristic (ROC) curve plots these two statistics across thresholds; the ideal ROC curve hugs the top-left corner, indicating a high true positive rate and a low false positive rate.
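
As a hedged sketch of how these two rates are computed at the default threshold (assuming the classifier and train objects from the chunks above), both can be read directly off a 2x2 table:

probs <- predict(classifier, type = 'prob')$Up
yhat_05 <- factor(ifelse(probs > 0.5, "Up", "Down"), levels = c("Down", "Up"))
tab <- table(Prediction = yhat_05, Reference = train$BlueTarp_rest)
tab["Up", "Up"] / sum(tab[, "Up"])      # sensitivity = TP / (TP + FN)
tab["Up", "Down"] / sum(tab[, "Down"])  # false positive rate = FP / (FP + TN)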

options(yardstick.event_first = FALSE)
ROC_curve <- predict(classifier, type='prob') %>% 
  yardstick::roc_curve(truth=train$BlueTarp_rest, estimate=Up) %>%
  dplyr::mutate(one_minus_specificity = 1-specificity)
## Warning: The `yardstick.event_first` option has been deprecated as of yardstick 0.0.7 and will be completely ignored in a future version.
## Instead, set the following argument directly in the metric function:
## `options(yardstick.event_first = TRUE)`  -> `event_level = 'first'` (the default)
## `options(yardstick.event_first = FALSE)` -> `event_level = 'second'`
## This warning is displayed once per session.
# Sensitivity: TP/(TP+FN)
# 1-Specificity: FP/(TN+FP)

ROC_curve_plot <- ROC_curve %>%
  ggplot(aes(x=one_minus_specificity, y=sensitivity)) + 
  geom_line() + geom_point() +
  geom_abline(slope=1, intercept=0, linetype='dashed', color='red') +
  xlab("one_minus_specificity\n(false positive rate)") +
  ggtitle('ROC curve')

ggplotly(ROC_curve_plot)
predict(classifier, type='prob') %>% 
  yardstick::roc_auc(truth=train$BlueTarp_rest, Up) 
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary          1.00
ROC_curve %>% head()
## # A tibble: 6 x 4
##   .threshold specificity sensitivity one_minus_specificity
##        <dbl>       <dbl>       <dbl>                 <dbl>
## 1  -Inf            0           1                   1      
## 2     0            0           1                   1      
## 3     0.0672       0.991       1                   0.00939
## 4     0.0826       0.995       0.990               0.00462
## 5     0.0833       0.995       0.988               0.00462
## 6     0.0909       0.995       0.987               0.00460
predict(classifier, type='prob') %>% head()
##   Down Up
## 1    1  0
## 2    1  0
## 3    1  0
## 4    1  0
## 5    1  0
## 6    1  0

We can also look at the ROC curve for each fold of the KNN model.

ROC_curveS <- classifier$pred %>%
  dplyr::group_split(Resample) %>% 
  purrr::map(~ yardstick::roc_curve(.x, truth=obs, Up)) %>%
  purrr::map(~ dplyr::mutate(.x, one_minus_specificity = 1-specificity)) %>%
  purrr::map(~ geom_line(.x, mapping=aes(one_minus_specificity, sensitivity)))

ROC_curveS[[1]] <- ggplot() + ROC_curveS[[1]]
Reduce("+", ROC_curveS) + 
  geom_abline(slope=1, intercept=0, linetype='dashed', color='red') +
  xlab("one_minus_specificity\n(false positive rate)") +
  ggtitle('ROC Curves')

To find the optimal threshold, we’ll look at the \(F_1\) score.

The \(F_1\) score is a more useful measure than accuracy for problems with uneven class distributions, since \(F_1\) is defined as:

\[ F_1 = 2 \cdot \frac{\frac{\Sigma\,\text{True Positive}}{\Sigma\,\text{Predicted Condition Positive}} \cdot \frac{\Sigma\,\text{True Positive}}{\Sigma\,\text{Condition Positive}}}{\frac{\Sigma\,\text{True Positive}}{\Sigma\,\text{Predicted Condition Positive}} + \frac{\Sigma\,\text{True Positive}}{\Sigma\,\text{Condition Positive}}} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \]

It takes into account both false positives and false negatives. The \(F_1\) score lies between 0 and 1, with 1 being optimal.
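
As a small helper mirroring this definition (an illustrative sketch, not part of the original pipeline), precision, recall, and \(F_1\) can be computed directly from the predicted and observed factors:

f1_score <- function(yhat, y, positive = "Up") {
  tp <- sum(yhat == positive & y == positive)
  fp <- sum(yhat == positive & y != positive)
  fn <- sum(yhat != positive & y == positive)
  precision <- tp / (tp + fp)   # TP / predicted condition positive
  recall    <- tp / (tp + fn)   # TP / condition positive
  2 * precision * recall / (precision + recall)
}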

tuned_call <- function(preds, decision_rule_threshold){
  calls <- rep('Down', length(preds))
  calls[preds>decision_rule_threshold]='Up'
  factor(calls, levels=c('Down','Up'))
}


thresholds = seq(0.01,0.99,.01) # Avoid Threshold Tails
F_1_scores <- rep(NA, length(thresholds))
GMeans <- rep(NA, length(thresholds))

preds <- predict(classifier, type='prob')[,2]
y <- train$BlueTarp_rest
for (i in 1:length(thresholds)){
  yhat <- tuned_call(preds, thresholds[i])
  one_minus_specificity <- mean(yhat[y=='Down']=='Up')
  sensitivity <- mean(yhat[y=='Up']=='Up')
  one_minus_specificity
  sensitivity
  results <- tibble::tibble(y=y, yhat=yhat)
  
  caret::confusionMatrix(yhat, y, positive = "Up") 
  cm <- results %>%yardstick::conf_mat(truth=y, yhat)
  
  trueNegative  = cm$table[1,1]
  falseNegative = cm$table[1,2]
  falsePositive = cm$table[2,1]
  truePositve   = cm$table[2,2]
  
  #paste0("True Positive: ", truePositve )
  #paste0("False Negative: ", falseNegative )
  #paste0("True Negative: ", trueNegative )
  #paste0("False Positive: ", falsePositive )
  conditionPositive = truePositve + falseNegative
  conditionNegative = trueNegative + falsePositive 
  predictedConditionPositive = truePositve +  falsePositive 
  predictedConditionNegative = trueNegative + falseNegative
  
  TotalPopulation = truePositve + falsePositive + trueNegative + falseNegative
  
  prevalence = conditionPositive / TotalPopulation
  accuracy = (truePositve + trueNegative) / TotalPopulation
  PPV = truePositve / predictedConditionPositive
  FDR = falsePositive / predictedConditionPositive
  FOR = falseNegative / predictedConditionNegative
  NPV = trueNegative / predictedConditionNegative
  TPR = truePositve / conditionPositive
  FPR = falsePositive / conditionNegative
  FNR = falseNegative / conditionPositive
  SPC = TNR = trueNegative / conditionNegative
  LR.PLUS = TPR / FPR
  LR.MINUS = FNR/TNR
  DOR = LR.PLUS / LR.MINUS
  
  F_1_score = 2 * ((PPV * TPR)/(PPV + TPR))
  F_1_scores[i] <- F_1_score
  GMeans[i] = sqrt(TPR*SPC)
}

opt_thresh = thresholds[tail(which(na.omit(F_1_scores) == min(na.omit(F_1_scores))), n=1)]

if (opt_thresh < 0.1){opt_thresh = 0.1}
if (opt_thresh > 0.5){opt_thresh = 0.5}

#opt_thresh

#points(opt_thresh, F_1_scores[tail(which(na.omit(F_1_scores) == min(na.omit(F_1_scores))), n=1)], col = 'red', cex = 2, pch = 19)

If the optimal threshold is less than 0.1, we set it to 0.1 (and if it is greater than 0.5, we cap it at 0.5). We do this because we do not want the threshold to be too extreme. In a real-world setting we have to live with the fact that we cannot help everyone affected, given limited resources; our goal is to help as many of those affected as possible, so we want a little more flexibility in the threshold. Alongside the \(F_1\) score we also plot the geometric mean, \(\sqrt{TPR \times TNR}\), as a secondary reference.

plot(thresholds, F_1_scores, xlab = 'Thresholds', ylab = 'F1 Score', type = 'l')

plot(thresholds, GMeans, xlab = 'Thresholds', ylab = 'Geometric Mean', type = 'l')

thresholds[tail(which(na.omit(GMeans) == max(na.omit(GMeans))), n=1)]
## [1] 0.06
paste0("Confusion Matrix 0.5")
## [1] "Confusion Matrix 0.5"
y <- train$BlueTarp_rest
yhat <- tuned_call(preds, 0.5)
one_minus_specificity <- mean(yhat[y=='Down']=='Up')
sensitivity <- mean(yhat[y=='Up']=='Up')
results <- tibble::tibble(y=y, yhat=yhat)

caret::confusionMatrix(yhat, y, positive = "Up") 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  Down    Up
##       Down 46457    53
##       Up      63  1490
##                                          
##                Accuracy : 0.9976         
##                  95% CI : (0.9971, 0.998)
##     No Information Rate : 0.9679         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.9613         
##                                          
##  Mcnemar's Test P-Value : 0.4034         
##                                          
##             Sensitivity : 0.96565        
##             Specificity : 0.99865        
##          Pos Pred Value : 0.95943        
##          Neg Pred Value : 0.99886        
##              Prevalence : 0.03210        
##          Detection Rate : 0.03100        
##    Detection Prevalence : 0.03231        
##       Balanced Accuracy : 0.98215        
##                                          
##        'Positive' Class : Up             
## 
cm <- results %>%yardstick::conf_mat(truth=y, yhat)

From the confusion matrix at the initial threshold of 0.5 above, and from our \(F_1\) plot, we selected our optimal threshold as 0.1.

yhat <- tuned_call(preds, opt_thresh)
one_minus_specificity <- mean(yhat[y=='Down']=='Up')
sensitivity <- mean(yhat[y=='Up']=='Up')
one_minus_specificity
## [1] 0.00361135
sensitivity
## [1] 0.985094
results <- tibble::tibble(y=y, yhat=yhat)

caret::confusionMatrix(yhat, y, positive = "Up") 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  Down    Up
##       Down 46352    23
##       Up     168  1520
##                                           
##                Accuracy : 0.996           
##                  95% CI : (0.9954, 0.9966)
##     No Information Rate : 0.9679          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9388          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.98509         
##             Specificity : 0.99639         
##          Pos Pred Value : 0.90047         
##          Neg Pred Value : 0.99950         
##              Prevalence : 0.03210         
##          Detection Rate : 0.03163         
##    Detection Prevalence : 0.03512         
##       Balanced Accuracy : 0.99074         
##                                           
##        'Positive' Class : Up              
## 
cm <- results %>%yardstick::conf_mat(truth=y, yhat)

trueNegative  = cm$table[1,1]
falseNegative = cm$table[1,2]
falsePositive = cm$table[2,1]
truePositve   = cm$table[2,2]

conditionPositive = truePositve + falseNegative
conditionNegative = trueNegative + falsePositive 
predictedConditionPositive = truePositve +  falsePositive 
predictedConditionNegative = trueNegative + falseNegative

TotalPopulation = truePositve + falsePositive + trueNegative + falseNegative

prevalence = conditionPositive / TotalPopulation
accuracy = (truePositve + trueNegative) / TotalPopulation
PPV = truePositve / predictedConditionPositive
FDR = falsePositive / predictedConditionPositive
FOR = falseNegative / predictedConditionNegative
NPV = trueNegative / predictedConditionNegative
TPR = truePositve / conditionPositive
FPR = falsePositive / conditionNegative
FNR = falseNegative / conditionPositive
SPC = TNR = trueNegative / conditionNegative
LR.PLUS = TPR / FPR
LR.MINUS = FNR/TNR
DOR = LR.PLUS / LR.MINUS

F_1_score = 2 * ((PPV * TPR)/(PPV + TPR))

Now using the optimal threshold we get:

# confusionMatrix

THRESHOLD <- opt_thresh
out_of_folds_CM <- classifier$pred %>% 
  dplyr::mutate(pred2 = ifelse(Up>THRESHOLD, 'Up', 'Down')) %>%
  dplyr::mutate(pred2 = factor(pred2, levels=c('Down','Up'))) %>%
  dplyr::group_split(Resample) %>%
  purrr::map(~ caret::confusionMatrix(data=.x$pred2, reference=.x$obs))

out_of_folds_CM_vis <- out_of_folds_CM %>% 
  purrr::map(~ .x$table %>% broom::tidy() %>%
                 ggplot(aes(x=Reference, y=Prediction, fill=n, label=n)) +
                 geom_tile() + geom_text(aes(label=n), size=5, color='white'))
 
gridExtra::grid.arrange(grobs=out_of_folds_CM_vis, ncol=2)

Summary Stats K Nearest Neighbor

options(yardstick.event_first = FALSE)
ROC_curve <- predict(classifier, type='prob') %>% 
  yardstick::roc_curve(truth=train$BlueTarp_rest, estimate=Up) %>%
  dplyr::mutate(one_minus_specificity = 1-specificity)

# Sensitivity: TP/(TP+FN)
# 1-Specificity: FP/(TN+FP)

ROC_curve_plot <- ROC_curve %>%
  ggplot(aes(x=one_minus_specificity, y=sensitivity)) + 
  geom_line() + geom_point() +
  geom_abline(slope=1, intercept=0, linetype='dashed', color='red') +
  xlab("one_minus_specificity\n(false positive rate)") +
  ggtitle('ROC curve')

ggplotly(ROC_curve_plot)
temp = predict(classifier, type='prob') %>% yardstick::roc_auc(truth=train$BlueTarp_rest, Up) 

knn.Accuracy = accuracy
knn.AUC = temp$.estimate

twoclassSum = twoClassSummary( data = data.frame(obs = classifier$pred$obs,
  pred = classifier$pred$pred,
  Down = classifier$pred$Down, 
  Up = classifier$pred$Up), lev = levels(classifier$pred$obs))

knn.ROC = "TRUE" #unname(twoclassSum[1])
knn.Threshold = opt_thresh
knn.Sensitivity = knn.Recall= knn.Power =  TPR
knn.Specificity= 1-FPR
knn.FDR = FDR
knn.Precision=PPV
df = data.frame(c(knn.Accuracy, knn.AUC, knn.ROC, knn.Threshold, knn.Sensitivity, knn.Specificity, knn.FDR , knn.Precision))


colnames(df) <- c(paste0("KNN (k =",bestk,")"))
rownames(df) <- c("Accuracy","AUC", "ROC", "Threshold","Sensitivity=Recall=Power","Specificity=1-FPR","FDR","Precision=PPV")
df %>% kbl() %>% kable_styling()
KNN (k =11)
Accuracy 0.996026049143832
AUC 0.999812734291107
ROC TRUE
Threshold 0.1
Sensitivity=Recall=Power 0.985093972780298
Specificity=1-FPR 0.996388650042992
FDR 0.0995260663507109
Precision=PPV 0.900473933649289

LDA

The next classification method that we used is linear discriminant analysis (LDA). LDA assumes that the observations within each class are drawn from a multivariate Gaussian distribution with a class-specific mean vector and a covariance matrix that is common to all K classes. Again, we’ll use the caret package in R to apply the LDA classification procedure with a 10-fold cross-validation evaluation procedure.
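
As a quick cross-check sketch of the model caret wraps here (assuming the MASS package loaded at the top and the train/test objects from the Setup section), LDA can also be fit directly:

lda_fit <- MASS::lda(BlueTarp_rest ~ Red + Green + Blue, data = train)
lda_fit$means                            # class-specific mean vectors
head(predict(lda_fit, test)$posterior)   # posterior probabilities P(class | Red, Green, Blue)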

set.seed(1)
train$BlueTarp_rest <- as.factor(train$BlueTarp_rest)
levels(train$BlueTarp_rest) <- c("Down","Up")
# https://blog.revolutionanalytics.com/2016/05/using-caret-to-compare-models.html

classifier <- caret::train(BlueTarp_rest ~ Red + Green + Blue, data=train, method="lda", 
              preProcess=c("center","scale"),
              #tuneGrid=data.frame(.dimen = seq(1,10,1)),#seq(1,1000,5)),
              trControl = caret::trainControl("cv", number=kfoldcv, 
              returnResamp='all', savePredictions='final',
              classProbs=TRUE))
classifier
## Linear Discriminant Analysis 
## 
## 48063 samples
##     3 predictor
##     2 classes: 'Down', 'Up' 
## 
## Pre-processing: centered (3), scaled (3) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 43257, 43256, 43257, 43256, 43257, 43257, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.9839793  0.7556508
classifier$resample %>% 
  dplyr::group_split(Resample) %>% 
  purrr::map(rowid_to_column) %>%
  dplyr::bind_rows() %>%
  ggplot(aes(rowid, Accuracy)) + geom_point() +
  geom_smooth(formula='y~x', method='loess', span=.03) +
  geom_line(classifier$results, mapping=aes(seq_along(Accuracy), Accuracy),
            size=2, color='red')

Next we’ll look at the Confusion Matrix from the model.

# confusionMatrix
THRESHOLD <- 0.5
out_of_folds_CM <- classifier$pred %>% 
  dplyr::mutate(pred2 = ifelse(Up>THRESHOLD, 'Up', 'Down')) %>%
  dplyr::mutate(pred2 = factor(pred2, levels=c('Down','Up'))) %>%
  dplyr::group_split(Resample) %>%
  purrr::map(~ caret::confusionMatrix(data=.x$pred2, reference=.x$obs))

out_of_folds_CM_vis <- out_of_folds_CM %>% 
  purrr::map(~ .x$table %>% broom::tidy() %>%
                 ggplot(aes(x=Reference, y=Prediction, fill=n, label=n)) +
                 geom_tile() + geom_text(aes(label=n), size=5, color='white'))
 
gridExtra::grid.arrange(grobs=out_of_folds_CM_vis, ncol=2)

From the 10-fold cross-validation we can see roughly 130 true positive identifications of blue tarps per fold, slightly fewer than with KNN. We can again measure the sensitivity and 1-specificity at the initial default threshold of 0.5.

options(yardstick.event_first = FALSE)
ROC_curve <- predict(classifier, type='prob') %>% 
  yardstick::roc_curve(truth=train$BlueTarp_rest, estimate=Up) %>%
  dplyr::mutate(one_minus_specificity = 1-specificity)

# Sensitivity: TP/(TP+FN)
# 1-Specificity: FP/(TN+FP)

ROC_curve_plot <- ROC_curve %>%
  ggplot(aes(x=one_minus_specificity, y=sensitivity)) + 
  geom_line() + geom_point() +
  geom_abline(slope=1, intercept=0, linetype='dashed', color='red') +
  xlab("one_minus_specificity\n(false positive rate)") +
  ggtitle('ROC curve')

ggplotly(ROC_curve_plot)
predict(classifier, type='prob') %>% 
  yardstick::roc_auc(truth=train$BlueTarp_rest, Up) 
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.990
ROC_curve %>% head()
## # A tibble: 6 x 4
##    .threshold specificity sensitivity one_minus_specificity
##         <dbl>       <dbl>       <dbl>                 <dbl>
## 1 -Inf          0                   1                  1   
## 2    1.61e-16   0                   1                  1   
## 3    4.23e-16   0.0000215           1                  1.00
## 4    3.16e-15   0.0000430           1                  1.00
## 5    9.92e-15   0.0000645           1                  1.00
## 6    1.40e-14   0.0000860           1                  1.00
predict(classifier, type='prob') %>% head()
##        Down           Up
## 1 0.9999983 1.688204e-06
## 2 0.9999990 1.007633e-06
## 3 0.9991272 8.727504e-04
## 4 0.9998474 1.525823e-04
## 5 0.9512878 4.871219e-02
## 6 0.9999993 6.572897e-07

We can also look at the ROC curve for each fold of the LDA model.

ROC_curveS <- classifier$pred %>%
  dplyr::group_split(Resample) %>% 
  purrr::map(~ yardstick::roc_curve(.x, truth=obs, Up)) %>%
  purrr::map(~ dplyr::mutate(.x, one_minus_specificity = 1-specificity)) %>%
  purrr::map(~ geom_line(.x, mapping=aes(one_minus_specificity, sensitivity)))

ROC_curveS[[1]] <- ggplot() + ROC_curveS[[1]]
Reduce("+", ROC_curveS) + 
  geom_abline(slope=1, intercept=0, linetype='dashed', color='red') +
  xlab("one_minus_specificity\n(false positive rate)") +
  ggtitle('ROC Curves')

To find the optimal threshold, we’ll again look at the \(F_1\) score.

tuned_call <- function(preds, decision_rule_threshold){
  calls <- rep('Down', length(preds))
  calls[preds>decision_rule_threshold]='Up'
  factor(calls, levels=c('Down','Up'))
}


thresholds = seq(0.01,0.99,.01) # Avoid Threshold Tails
F_1_scores <- rep(NA, length(thresholds))
GMeans <- rep(NA, length(thresholds))

preds <- predict(classifier, type='prob')[,2]
y <- train$BlueTarp_rest
for (i in 1:length(thresholds)){
  yhat <- tuned_call(preds, thresholds[i])
  one_minus_specificity <- mean(yhat[y=='Down']=='Up')
  sensitivity <- mean(yhat[y=='Up']=='Up')
  one_minus_specificity
  sensitivity
  results <- tibble::tibble(y=y, yhat=yhat)
  
  caret::confusionMatrix(yhat, y, positive = "Up") 
  cm <- results %>%yardstick::conf_mat(truth=y, yhat)
  
  trueNegative  = cm$table[1,1]
  falseNegative = cm$table[1,2]
  falsePositive = cm$table[2,1]
  truePositve   = cm$table[2,2]
  
  #paste0("True Positive: ", truePositve )
  #paste0("False Negative: ", falseNegative )
  #paste0("True Negative: ", trueNegative )
  #paste0("False Positive: ", falsePositive )
  conditionPositive = truePositve + falseNegative
  conditionNegative = trueNegative + falsePositive 
  predictedConditionPositive = truePositve +  falsePositive 
  predictedConditionNegative = trueNegative + falseNegative
  
  TotalPopulation = truePositve + falsePositive + trueNegative + falseNegative
  
  prevalence = conditionPositive / TotalPopulation
  accuracy = (truePositve + trueNegative) / TotalPopulation
  PPV = truePositve / predictedConditionPositive
  FDR = falsePositive / predictedConditionPositive
  FOR = falseNegative / predictedConditionNegative
  NPV = trueNegative / predictedConditionNegative
  TPR = truePositve / conditionPositive
  FPR = falsePositive / conditionNegative
  FNR = falseNegative / conditionPositive
  SPC = TNR = trueNegative / conditionNegative
  LR.PLUS = TPR / FPR
  LR.MINUS = FNR/TNR
  DOR = LR.PLUS / LR.MINUS
  
  F_1_score = 2 * ((PPV * TPR)/(PPV + TPR))
  F_1_scores[i] <- F_1_score
  GMeans[i] = sqrt(TPR*SPC)
}

opt_thresh = thresholds[tail(which(na.omit(F_1_scores) == min(na.omit(F_1_scores))), n=1)]

if (opt_thresh < 0.1){opt_thresh = 0.1}
if (opt_thresh > 0.5){opt_thresh = 0.5}

#opt_thresh

#points(opt_thresh, F_1_scores[tail(which(na.omit(F_1_scores) == min(na.omit(F_1_scores))), n=1)], col = 'red', cex = 2, pch = 19)
plot(thresholds, F_1_scores, xlab = 'Thresholds', ylab = 'F1 Score', type = 'l')

plot(thresholds, GMeans, xlab = 'Thresholds', ylab = 'Geometric Mean', type = 'l')

thresholds[tail(which(na.omit(GMeans) == max(na.omit(GMeans))), n=1)]
## [1] 0.01
paste0("Confusion Matrix 0.5")
## [1] "Confusion Matrix 0.5"
y <- train$BlueTarp_rest
yhat <- tuned_call(preds, 0.5)
one_minus_specificity <- mean(yhat[y=='Down']=='Up')
sensitivity <- mean(yhat[y=='Up']=='Up')
results <- tibble::tibble(y=y, yhat=yhat)

caret::confusionMatrix(yhat, y, positive = "Up") 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  Down    Up
##       Down 46054   296
##       Up     466  1247
##                                          
##                Accuracy : 0.9841         
##                  95% CI : (0.983, 0.9852)
##     No Information Rate : 0.9679         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.7578         
##                                          
##  Mcnemar's Test P-Value : 9.228e-10      
##                                          
##             Sensitivity : 0.80817        
##             Specificity : 0.98998        
##          Pos Pred Value : 0.72796        
##          Neg Pred Value : 0.99361        
##              Prevalence : 0.03210        
##          Detection Rate : 0.02595        
##    Detection Prevalence : 0.03564        
##       Balanced Accuracy : 0.89907        
##                                          
##        'Positive' Class : Up             
## 
cm <- results %>%yardstick::conf_mat(truth=y, yhat)

From the confusion matrix at the initial threshold of 0.5 above, and from our \(F_1\) plot, we selected our optimal threshold as 0.1.

yhat <- tuned_call(preds, opt_thresh)
one_minus_specificity <- mean(yhat[y=='Down']=='Up')
sensitivity <- mean(yhat[y=='Up']=='Up')
one_minus_specificity
## [1] 0.01319862
sensitivity
## [1] 0.8561244
results <- tibble::tibble(y=y, yhat=yhat)

caret::confusionMatrix(yhat, y, positive = "Up") 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  Down    Up
##       Down 45906   222
##       Up     614  1321
##                                           
##                Accuracy : 0.9826          
##                  95% CI : (0.9814, 0.9838)
##     No Information Rate : 0.9679          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7507          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.85612         
##             Specificity : 0.98680         
##          Pos Pred Value : 0.68269         
##          Neg Pred Value : 0.99519         
##              Prevalence : 0.03210         
##          Detection Rate : 0.02748         
##    Detection Prevalence : 0.04026         
##       Balanced Accuracy : 0.92146         
##                                           
##        'Positive' Class : Up              
## 
cm <- results %>%yardstick::conf_mat(truth=y, yhat)

trueNegative  = cm$table[1,1]
falseNegative = cm$table[1,2]
falsePositive = cm$table[2,1]
truePositve   = cm$table[2,2]

conditionPositive = truePositve + falseNegative
conditionNegative = trueNegative + falsePositive 
predictedConditionPositive = truePositve +  falsePositive 
predictedConditionNegative = trueNegative + falseNegative

TotalPopulation = truePositve + falsePositive + trueNegative + falseNegative

prevalence = conditionPositive / TotalPopulation
accuracy = (truePositve + trueNegative) / TotalPopulation
PPV = truePositve / predictedConditionPositive
FDR = falsePositive / predictedConditionPositive
FOR = falseNegative / predictedConditionNegative
NPV = trueNegative / predictedConditionNegative
TPR = truePositve / conditionPositive
FPR = falsePositive / conditionNegative
FNR = falseNegative / conditionPositive
SPC = TNR = trueNegative / conditionNegative
LR.PLUS = TPR / FPR
LR.MINUS = FNR/TNR
DOR = LR.PLUS / LR.MINUS

F_1_score = 2 * ((PPV * TPR)/(PPV + TPR))

Now using the optimal threshold we get:

# confusionMatrix

THRESHOLD <- opt_thresh
out_of_folds_CM <- classifier$pred %>% 
  dplyr::mutate(pred2 = ifelse(Up>THRESHOLD, 'Up', 'Down')) %>%
  dplyr::mutate(pred2 = factor(pred2, levels=c('Down','Up'))) %>%
  dplyr::group_split(Resample) %>%
  purrr::map(~ caret::confusionMatrix(data=.x$pred2, reference=.x$obs))

out_of_folds_CM_vis <- out_of_folds_CM %>% 
  purrr::map(~ .x$table %>% broom::tidy() %>%
                 ggplot(aes(x=Reference, y=Prediction, fill=n, label=n)) +
                 geom_tile() + geom_text(aes(label=n), size=5, color='white'))
 
gridExtra::grid.arrange(grobs=out_of_folds_CM_vis, ncol=2)

Summary Stats LDA

options(yardstick.event_first = FALSE)
ROC_curve <- predict(classifier, type='prob') %>% 
  yardstick::roc_curve(truth=train$BlueTarp_rest, estimate=Up) %>%
  dplyr::mutate(one_minus_specificity = 1-specificity)

# Sensitivity: TP/(TP+FN)
# 1-Specificity: FP/(TN+FP)

ROC_curve_plot <- ROC_curve %>%
  ggplot(aes(x=one_minus_specificity, y=sensitivity)) + 
  geom_line() + geom_point() +
  geom_abline(slope=1, intercept=0, linetype='dashed', color='red') +
  xlab("one_minus_specificity\n(false positive rate)") +
  ggtitle('ROC curve')

ggplotly(ROC_curve_plot)
temp = predict(classifier, type='prob') %>% yardstick::roc_auc(truth=train$BlueTarp_rest, Up) 

lda.Accuracy = accuracy
lda.AUC = temp$.estimate
twoclassSum = twoClassSummary( data = data.frame(obs = classifier$pred$obs,
  pred = classifier$pred$pred,
  Down = classifier$pred$Down, 
  Up = classifier$pred$Up), lev = levels(classifier$pred$obs))

lda.ROC = "TRUE" #unname(twoclassSum[1])
lda.Threshold = opt_thresh
lda.Sensitivity = lda.Recall= lda.Power =  TPR
lda.Specificity= 1-FPR
lda.FDR = FDR
lda.Precision=PPV
df = data.frame(c(lda.Accuracy, lda.AUC, lda.ROC, lda.Threshold, lda.Sensitivity, lda.Specificity, lda.FDR , lda.Precision))

colnames(df) <- c("LDA")
rownames(df) <- c("Accuracy","AUC", "ROC", "Threshold","Sensitivity=Recall=Power","Specificity=1-FPR","FDR","Precision=PPV")
df %>% kbl() %>% kable_styling()
LDA
Accuracy 0.982606162744731
AUC 0.989589213539748
ROC TRUE
Threshold 0.1
Sensitivity=Recall=Power 0.856124432922878
Specificity=1-FPR 0.986801375752365
FDR 0.317312661498708
Precision=PPV 0.682687338501292

QDA

The next classification method that we used is quadratic discriminant analysis (QDA). Like LDA, the QDA classifier results from assuming that the observations from each class are drawn from a Gaussian distribution and plugging estimates for the parameters into Bayes’ theorem in order to perform prediction; unlike LDA, each class is allowed its own covariance matrix. Again, we’ll use the caret package in R to apply the QDA classification procedure with a 10-fold cross-validation evaluation procedure.
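
Analogous to the LDA sketch above (again assuming MASS and the Setup objects), QDA can be fit directly as a quick sanity check:

# fits one covariance matrix per class, giving a quadratic decision boundary
qda_fit <- MASS::qda(BlueTarp_rest ~ Red + Green + Blue, data = train)
head(predict(qda_fit, test)$posterior)   # posterior probabilities P(class | Red, Green, Blue)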

set.seed(1)
train$BlueTarp_rest <- as.factor(train$BlueTarp_rest)
levels(train$BlueTarp_rest) <- c("Down","Up")
# https://blog.revolutionanalytics.com/2016/05/using-caret-to-compare-models.html

classifier <- caret::train(BlueTarp_rest ~ Red + Green + Blue, data=train, method="qda", 
                           preProcess=c("center","scale"),
                           #tuneGrid=data.frame(.dimen = seq(1,10,1)),#seq(1,1000,5)),
                           trControl = caret::trainControl("cv", number=kfoldcv, 
                                                           returnResamp='all', savePredictions='final',
                                                           classProbs=TRUE))
classifier
## Quadratic Discriminant Analysis 
## 
## 48063 samples
##     3 predictor
##     2 classes: 'Down', 'Up' 
## 
## Pre-processing: centered (3), scaled (3) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 43257, 43256, 43257, 43256, 43257, 43257, ... 
## Resampling results:
## 
##   Accuracy  Kappa    
##   0.994861  0.9110695
classifier$resample %>% 
  dplyr::group_split(Resample) %>% 
  purrr::map(rowid_to_column) %>%
  dplyr::bind_rows() %>%
  ggplot(aes(rowid, Accuracy)) + geom_point() +
  geom_smooth(formula='y~x', method='loess', span=.03) +
  geom_line(classifier$results, mapping=aes(seq_along(Accuracy), Accuracy),
            size=2, color='red')

Next we’ll look at the Confusion Matrix from the model.

# confusionMatrix
THRESHOLD <- 0.5
out_of_folds_CM <- classifier$pred %>% 
  dplyr::mutate(pred2 = ifelse(Up>THRESHOLD, 'Up', 'Down')) %>%
  dplyr::mutate(pred2 = factor(pred2, levels=c('Down','Up'))) %>%
  dplyr::group_split(Resample) %>%
  purrr::map(~ caret::confusionMatrix(data=.x$pred2, reference=.x$obs))

out_of_folds_CM_vis <- out_of_folds_CM %>% 
  purrr::map(~ .x$table %>% broom::tidy() %>%
                 ggplot(aes(x=Reference, y=Prediction, fill=n, label=n)) +
                 geom_tile() + geom_text(aes(label=n), size=5, color='white'))
 
gridExtra::grid.arrange(grobs=out_of_folds_CM_vis, ncol=2)

From the 10-fold cross-validation we can see true positive identifications of blue tarps roughly on par with what we saw for LDA. We can again measure the sensitivity and 1-specificity at the initial default threshold of 0.5.

options(yardstick.event_first = FALSE)
ROC_curve <- predict(classifier, type='prob') %>% 
  yardstick::roc_curve(truth=train$BlueTarp_rest, estimate=Up) %>%
  dplyr::mutate(one_minus_specificity = 1-specificity)

# Sensitivity: TP/(TP+FN)
# 1-Specificity: FP/(TN+FP)

ROC_curve_plot <- ROC_curve %>%
  ggplot(aes(x=one_minus_specificity, y=sensitivity)) + 
  geom_line() + geom_point() +
  geom_abline(slope=1, intercept=0, linetype='dashed', color='red') +
  xlab("one_minus_specificity\n(false positive rate)") +
  ggtitle('ROC curve')

ggplotly(ROC_curve_plot)
predict(classifier, type='prob') %>% 
  yardstick::roc_auc(truth=train$BlueTarp_rest, Up) 
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.998
ROC_curve %>% head()
## # A tibble: 6 x 4
##    .threshold specificity sensitivity one_minus_specificity
##         <dbl>       <dbl>       <dbl>                 <dbl>
## 1 -Inf          0                   1                  1   
## 2    6.67e-31   0                   1                  1   
## 3    6.20e-29   0.0000215           1                  1.00
## 4    6.95e-29   0.0000430           1                  1.00
## 5    2.33e-28   0.0000645           1                  1.00
## 6    6.82e-28   0.0000860           1                  1.00
predict(classifier, type='prob') %>% head()
##        Down           Up
## 1 1.0000000 9.445049e-09
## 2 0.9979097 2.090319e-03
## 3 0.9999612 3.875654e-05
## 4 0.9999984 1.558411e-06
## 5 0.9996290 3.710440e-04
## 6 0.9999469 5.309289e-05

We can also look at the ROC curve for each fold of the QDA model.

ROC_curveS <- classifier$pred %>%
  dplyr::group_split(Resample) %>% 
  purrr::map(~ yardstick::roc_curve(.x, truth=obs, Up)) %>%
  purrr::map(~ dplyr::mutate(.x, one_minus_specificity = 1-specificity)) %>%
  purrr::map(~ geom_line(.x, mapping=aes(one_minus_specificity, sensitivity)))

ROC_curveS[[1]] <- ggplot() + ROC_curveS[[1]]
Reduce("+", ROC_curveS) + 
  geom_abline(slope=1, intercept=0, linetype='dashed', color='red') +
  xlab("one_minus_specificity\n(false positive rate)") +
  ggtitle('ROC Curves')

To find the optimal threshold, we’ll again look at the \(F_1\) score.

tuned_call <- function(preds, decision_rule_threshold){
  calls <- rep('Down', length(preds))
  calls[preds>decision_rule_threshold]='Up'
  factor(calls, levels=c('Down','Up'))
}


thresholds = seq(0.01,0.99,.01) # Avoid Threshold Tails
F_1_scores <- rep(NA, length(thresholds))
GMeans <- rep(NA, length(thresholds))

preds <- predict(classifier, type='prob')[,2]
y <- train$BlueTarp_rest
for (i in 1:length(thresholds)){
  yhat <- tuned_call(preds, thresholds[i])
  one_minus_specificity <- mean(yhat[y=='Down']=='Up')
  sensitivity <- mean(yhat[y=='Up']=='Up')
  one_minus_specificity
  sensitivity
  results <- tibble::tibble(y=y, yhat=yhat)
  
  caret::confusionMatrix(yhat, y, positive = "Up") 
  cm <- results %>%yardstick::conf_mat(truth=y, yhat)
  
  trueNegative  = cm$table[1,1]
  falseNegative = cm$table[1,2]
  falsePositive = cm$table[2,1]
  truePositve   = cm$table[2,2]
  
  #paste0("True Positive: ", truePositve )
  #paste0("False Negative: ", falseNegative )
  #paste0("True Negative: ", trueNegative )
  #paste0("False Positive: ", falsePositive )
  conditionPositive = truePositve + falseNegative
  conditionNegative = trueNegative + falsePositive 
  predictedConditionPositive = truePositve +  falsePositive 
  predictedConditionNegative = trueNegative + falseNegative
  
  TotalPopulation = truePositve + falsePositive + trueNegative + falseNegative
  
  prevalence = conditionPositive / TotalPopulation
  accuracy = (truePositve + trueNegative) / TotalPopulation
  PPV = truePositve / predictedConditionPositive
  FDR = falsePositive / predictedConditionPositive
  FOR = falseNegative / predictedConditionNegative
  NPV = trueNegative / predictedConditionNegative
  TPR = truePositve / conditionPositive
  FPR = falsePositive / conditionNegative
  FNR = falseNegative / conditionPositive
  SPC = TNR = trueNegative / conditionNegative
  LR.PLUS = TPR / FPR
  LR.MINUS = FNR/TNR
  DOR = LR.PLUS / LR.MINUS
  
  F_1_score = 2 * ((PPV * TPR)/(PPV + TPR))
  F_1_scores[i] <- F_1_score
  GMeans[i] = sqrt(TPR*SPC)
}

opt_thresh = thresholds[tail(which(na.omit(F_1_scores) == min(na.omit(F_1_scores))), n=1)]

if (opt_thresh < 0.1){opt_thresh = 0.1}
if (opt_thresh > 0.5){opt_thresh = 0.5}

#opt_thresh

#points(opt_thresh, F_1_scores[tail(which(na.omit(F_1_scores) == min(na.omit(F_1_scores))), n=1)], col = 'red', cex = 2, pch = 19)
plot(thresholds, F_1_scores, xlab = 'Thresholds', ylab = 'F1 Score', type = 'l')

plot(thresholds, GMeans, xlab = 'Thresholds', ylab = 'Geometric Mean', type = 'l')

thresholds[tail(which(na.omit(GMeans) == max(na.omit(GMeans))), n=1)]
## [1] 0.02
paste0("Confusion Matrix 0.5")
## [1] "Confusion Matrix 0.5"
y <- train$BlueTarp_rest
yhat <- tuned_call(preds, 0.5)
one_minus_specificity <- mean(yhat[y=='Down']=='Up')
sensitivity <- mean(yhat[y=='Up']=='Up')
results <- tibble::tibble(y=y, yhat=yhat)

caret::confusionMatrix(yhat, y, positive = "Up") 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  Down    Up
##       Down 46506   232
##       Up      14  1311
##                                           
##                Accuracy : 0.9949          
##                  95% CI : (0.9942, 0.9955)
##     No Information Rate : 0.9679          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9116          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.84964         
##             Specificity : 0.99970         
##          Pos Pred Value : 0.98943         
##          Neg Pred Value : 0.99504         
##              Prevalence : 0.03210         
##          Detection Rate : 0.02728         
##    Detection Prevalence : 0.02757         
##       Balanced Accuracy : 0.92467         
##                                           
##        'Positive' Class : Up              
## 
cm <- results %>%yardstick::conf_mat(truth=y, yhat)

From the confusion matrix at the initial threshold of 0.5 above, and from our \(F_1\) plot, we selected our optimal threshold as 0.1.

yhat <- tuned_call(preds, opt_thresh)
one_minus_specificity <- mean(yhat[y=='Down']=='Up')
sensitivity <- mean(yhat[y=='Up']=='Up')
one_minus_specificity
## [1] 0.007050731
sensitivity
## [1] 0.9086196
results <- tibble::tibble(y=y, yhat=yhat)

caret::confusionMatrix(yhat, y, positive = "Up") 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  Down    Up
##       Down 46192   141
##       Up     328  1402
##                                           
##                Accuracy : 0.9902          
##                  95% CI : (0.9893, 0.9911)
##     No Information Rate : 0.9679          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8517          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.90862         
##             Specificity : 0.99295         
##          Pos Pred Value : 0.81040         
##          Neg Pred Value : 0.99696         
##              Prevalence : 0.03210         
##          Detection Rate : 0.02917         
##    Detection Prevalence : 0.03599         
##       Balanced Accuracy : 0.95078         
##                                           
##        'Positive' Class : Up              
## 
cm <- results %>%yardstick::conf_mat(truth=y, yhat)

trueNegative  = cm$table[1,1]
falseNegative = cm$table[1,2]
falsePositive = cm$table[2,1]
truePositve   = cm$table[2,2]

conditionPositive = truePositve + falseNegative
conditionNegative = trueNegative + falsePositive 
predictedConditionPositive = truePositve +  falsePositive 
predictedConditionNegative = trueNegative + falseNegative

TotalPopulation = truePositve + falsePositive + trueNegative + falseNegative

prevalence = conditionPositive / TotalPopulation
accuracy = (truePositve + trueNegative) / TotalPopulation
PPV = truePositve / predictedConditionPositive
FDR = falsePositive / predictedConditionPositive
FOR = falseNegative / predictedConditionNegative
NPV = trueNegative / predictedConditionNegative
TPR = truePositve / conditionPositive
FPR = falsePositive / conditionNegative
FNR = falseNegative / conditionPositive
SPC = TNR = trueNegative / conditionNegative
LR.PLUS = TPR / FPR
LR.MINUS = FNR/TNR
DOR = LR.PLUS / LR.MINUS

F_1_score = 2 * ((PPV * TPR)/(PPV + TPR))

Now using the optimal threshold we get:

# confusionMatrix

THRESHOLD <- opt_thresh
out_of_folds_CM <- classifier$pred %>% 
  dplyr::mutate(pred2 = ifelse(Up>THRESHOLD, 'Up', 'Down')) %>%
  dplyr::mutate(pred2 = factor(pred2, levels=c('Down','Up'))) %>%
  dplyr::group_split(Resample) %>%
  purrr::map(~ caret::confusionMatrix(data=.x$pred2, reference=.x$obs))

out_of_folds_CM_vis <- out_of_folds_CM %>% 
  purrr::map(~ .x$table %>% broom::tidy() %>%
                 ggplot(aes(x=Reference, y=Prediction, fill=n, label=n)) +
                 geom_tile() + geom_text(aes(label=n), size=5, color='white'))
 
gridExtra::grid.arrange(grobs=out_of_folds_CM_vis, ncol=2)

Summary Stats QDA

options(yardstick.event_first = FALSE)
ROC_curve <- predict(classifier, type='prob') %>% 
  yardstick::roc_curve(truth=train$BlueTarp_rest, estimate=Up) %>%
  dplyr::mutate(one_minus_specificity = 1-specificity)

# Sensitivity: TP/(TP+FN)
# 1-Specificity: FP/(TN+FP)

ROC_curve_plot <- ROC_curve %>%
  ggplot(aes(x=one_minus_specificity, y=sensitivity)) + 
  geom_line() + geom_point() +
  geom_abline(slope=1, intercept=0, linetype='dashed', color='red') +
  xlab("one_minus_specificity\n(false positive rate)") +
  ggtitle('ROC curve')

ggplotly(ROC_curve_plot)
temp = predict(classifier, type='prob') %>% yardstick::roc_auc(truth=train$BlueTarp_rest, Up) 

qda.Accuracy = accuracy
qda.AUC = temp$.estimate
twoclassSum = twoClassSummary( data = data.frame(obs = classifier$pred$obs,
  pred = classifier$pred$pred,
  Down = classifier$pred$Down, 
  Up = classifier$pred$Up), lev = levels(classifier$pred$obs))

qda.ROC = "TRUE" #unname(twoclassSum[1])
qda.Threshold = opt_thresh
qda.Sensitivity = qda.Recall= qda.Power =  TPR
qda.Specificity= 1-FPR
qda.FDR = FDR
qda.Precision=PPV
df = data.frame(c(qda.Accuracy, qda.AUC, qda.ROC, qda.Threshold, qda.Sensitivity, qda.Specificity, qda.FDR , qda.Precision))

colnames(df) <- c("QDA")
rownames(df) <- c("Accuracy","AUC", "ROC", "Threshold","Sensitivity=Recall=Power","Specificity=1-FPR","FDR","Precision=PPV")
df %>% kbl() %>% kable_styling()
QDA
Accuracy 0.990241974075692
AUC 0.998273998625808
ROC TRUE
Threshold 0.1
Sensitivity=Recall=Power 0.908619572261828
Specificity=1-FPR 0.992949269131556
FDR 0.189595375722543
Precision=PPV 0.810404624277457

Logistic Regression

Our final classification method is logistic regression, fit as a generalized linear model (glm) with a binomial family. Again, we’ll use the caret package in R to apply the glm classification procedure with a 10-fold cross-validation evaluation procedure.
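
As a minimal sketch of the underlying model caret fits in this section (assuming the same train data frame as above), the logistic regression can be fit with base R's glm():

glm_fit <- glm(BlueTarp_rest ~ Red + Green + Blue, data = train, family = binomial)
summary(glm_fit)$coefficients              # log-odds coefficients for each color channel
head(predict(glm_fit, type = "response"))  # fitted probabilities P(Up) on the training data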

set.seed(1)
train$BlueTarp_rest <- as.factor(train$BlueTarp_rest)
levels(train$BlueTarp_rest) <- c("Down","Up")
# https://blog.revolutionanalytics.com/2016/05/using-caret-to-compare-models.html

classifier <- caret::train(BlueTarp_rest ~ Red + Green + Blue, data=train, method="glm", family=binomial(),
      preProcess=c("center","scale"), trControl = caret::trainControl("cv", number=kfoldcv, 
      returnResamp='all', savePredictions='final',classProbs=TRUE))
classifier
## Generalized Linear Model 
## 
## 48063 samples
##     3 predictor
##     2 classes: 'Down', 'Up' 
## 
## Pre-processing: centered (3), scaled (3) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 43257, 43256, 43257, 43256, 43257, 43257, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.9954435  0.9238951
classifier$resample %>% 
  dplyr::group_split(Resample) %>% 
  purrr::map(rowid_to_column) %>%
  dplyr::bind_rows() %>%
  ggplot(aes(rowid, Accuracy)) + geom_point() +
  geom_smooth(formula='y~x', method='loess', span=.03) +
  geom_line(classifier$results, mapping=aes(seq_along(Accuracy), Accuracy),
            size=2, color='red')

Next we’ll look at the Confusion Matrix from the model.

# confusionMatrix
THRESHOLD <- 0.5
out_of_folds_CM <- classifier$pred %>% 
  dplyr::mutate(pred2 = ifelse(Up>THRESHOLD, 'Up', 'Down')) %>%
  dplyr::mutate(pred2 = factor(pred2, levels=c('Down','Up'))) %>%
  dplyr::group_split(Resample) %>%
  purrr::map(~ caret::confusionMatrix(data=.x$pred2, reference=.x$obs))

out_of_folds_CM_vis <- out_of_folds_CM %>% 
  purrr::map(~ .x$table %>% broom::tidy() %>%
                 ggplot(aes(x=Reference, y=Prediction, fill=n, label=n)) +
                 geom_tile() + geom_text(aes(label=n), size=5, color='white'))
 
gridExtra::grid.arrange(grobs=out_of_folds_CM_vis, ncol=2)

From the 10-fold cross-validation we can see roughly 130-140 true positive identifications of blue tarps per fold. We can again measure the sensitivity and 1-specificity at the initial default threshold of 0.5.

options(yardstick.event_first = FALSE)
ROC_curve <- predict(classifier, type='prob') %>% 
  yardstick::roc_curve(truth=train$BlueTarp_rest, estimate=Up) %>%
  dplyr::mutate(one_minus_specificity = 1-specificity)

# Sensitivity: TP/(TP+FN)
# 1-Specificity: FP/(TN+FP)

ROC_curve_plot <- ROC_curve %>%
  ggplot(aes(x=one_minus_specificity, y=sensitivity)) + 
  geom_line() + geom_point() +
  geom_abline(slope=1, intercept=0, linetype='dashed', color='red') +
  xlab("one_minus_specificity\n(false positive rate)") +
  ggtitle('ROC curve')

ggplotly(ROC_curve_plot)
predict(classifier, type='prob') %>% 
  yardstick::roc_auc(truth=train$BlueTarp_rest, Up) 
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.999
ROC_curve %>% head()
## # A tibble: 6 x 4
##    .threshold specificity sensitivity one_minus_specificity
##         <dbl>       <dbl>       <dbl>                 <dbl>
## 1 -Inf              0               1                 1    
## 2    2.22e-16       0               1                 1    
## 3    9.36e-14       0.220           1                 0.780
## 4    9.36e-14       0.220           1                 0.780
## 5    9.38e-14       0.220           1                 0.780
## 6    9.38e-14       0.220           1                 0.780
predict(classifier, type='prob') %>% head()
##        Down           Up
## 1 1.0000000 2.864604e-12
## 2 0.9992914 7.086498e-04
## 3 0.9999997 2.521973e-07
## 4 1.0000000 1.031507e-09
## 5 0.9999971 2.906881e-06
## 6 0.9944786 5.521443e-03

We can also look at the ROC curve for each fold of the glm model.

ROC_curveS <- classifier$pred %>%
  dplyr::group_split(Resample) %>% 
  purrr::map(~ yardstick::roc_curve(.x, truth=obs, Up)) %>%
  purrr::map(~ dplyr::mutate(.x, one_minus_specificity = 1-specificity)) %>%
  purrr::map(~ geom_line(.x, mapping=aes(one_minus_specificity, sensitivity)))

ROC_curveS[[1]] <- ggplot() + ROC_curveS[[1]]
Reduce("+", ROC_curveS) + 
  geom_abline(slope=1, intercept=0, linetype='dashed', color='red') +
  xlab("one_minus_specificity\n(false positive rate)") +
  ggtitle('ROC Curves')

To find the optimal threshold, we’ll again look at the \(F_1\) score.

tuned_call <- function(preds, decision_rule_threshold){
  calls <- rep('Down', length(preds))
  calls[preds>decision_rule_threshold]='Up'
  factor(calls, levels=c('Down','Up'))
}


thresholds = seq(0.01,0.99,.01) # Avoid Threshold Tails
F_1_scores <- rep(NA, length(thresholds))
GMeans <- rep(NA, length(thresholds))

preds <- predict(classifier, type='prob')[,2]
y <- train$BlueTarp_rest
for (i in 1:length(thresholds)){
  yhat <- tuned_call(preds, thresholds[i])
  one_minus_specificity <- mean(yhat[y=='Down']=='Up')
  sensitivity <- mean(yhat[y=='Up']=='Up')
  results <- tibble::tibble(y=y, yhat=yhat)
  cm <- results %>% yardstick::conf_mat(truth=y, yhat)
  
  trueNegative  = cm$table[1,1]
  falseNegative = cm$table[1,2]
  falsePositive = cm$table[2,1]
  truePositve   = cm$table[2,2]
  
  #paste0("True Positive: ", truePositve )
  #paste0("False Negative: ", falseNegative )
  #paste0("True Negative: ", trueNegative )
  #paste0("False Positive: ", falsePositive )
  conditionPositive = truePositve + falseNegative
  conditionNegative = trueNegative + falsePositive 
  predictedConditionPositive = truePositve +  falsePositive 
  predictedConditionNegative = trueNegative + falseNegative
  
  TotalPopulation = truePositve + falsePositive + trueNegative + falseNegative
  
  prevalence = conditionPositive / TotalPopulation
  accuracy = (truePositve + trueNegative) / TotalPopulation
  PPV = truePositve / predictedConditionPositive
  FDR = falsePositive / predictedConditionPositive
  FOR = falseNegative / predictedConditionNegative
  NPV = trueNegative / predictedConditionNegative
  TPR = truePositve / conditionPositive
  FPR = falsePositive / conditionNegative
  FNR = falseNegative / conditionPositive
  SPC = TNR = trueNegative / conditionNegative
  LR.PLUS = TPR / FPR
  LR.MINUS = FNR/TNR
  DOR = LR.PLUS / LR.MINUS
  
  F_1_score = 2 * ((PPV * TPR)/(PPV + TPR))
  F_1_scores[i] <- F_1_score
  GMeans[i] = sqrt(TPR*SPC)
}

# Candidate threshold from the F1 sweep (last index at the minimum F1 value);
# the clamp below keeps it in [0.1, 0.5], and the plots inform the final choice of 0.1
opt_thresh = thresholds[tail(which(na.omit(F_1_scores) == min(na.omit(F_1_scores))), n=1)]

if (opt_thresh < 0.1){opt_thresh = 0.1}
if (opt_thresh > 0.5){opt_thresh = 0.5}

#opt_thresh

#points(opt_thresh, F_1_scores[tail(which(na.omit(F_1_scores) == min(na.omit(F_1_scores))), n=1)], col = 'red', cex = 2, pch = 19)
plot(thresholds, F_1_scores, xlab = 'Thresholds', ylab = 'F1 Score', type = 'l')

plot(thresholds, GMeans, xlab = 'Thresholds', ylab = 'Geometric Mean', type = 'l')

thresholds[tail(which(na.omit(GMeans) == max(na.omit(GMeans))), n=1)]
## [1] 0.04
paste0("Confusion Matrix 0.5")
## [1] "Confusion Matrix 0.5"
y <- train$BlueTarp_rest
yhat <- tuned_call(preds, 0.5)
one_minus_specificity <- mean(yhat[y=='Down']=='Up')
sensitivity <- mean(yhat[y=='Up']=='Up')
results <- tibble::tibble(y=y, yhat=yhat)

caret::confusionMatrix(yhat, y, positive = "Up") 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  Down    Up
##       Down 46467   166
##       Up      53  1377
##                                          
##                Accuracy : 0.9954         
##                  95% CI : (0.9948, 0.996)
##     No Information Rate : 0.9679         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.924          
##                                          
##  Mcnemar's Test P-Value : 3.783e-14      
##                                          
##             Sensitivity : 0.89242        
##             Specificity : 0.99886        
##          Pos Pred Value : 0.96294        
##          Neg Pred Value : 0.99644        
##              Prevalence : 0.03210        
##          Detection Rate : 0.02865        
##    Detection Prevalence : 0.02975        
##       Balanced Accuracy : 0.94564        
##                                          
##        'Positive' Class : Up             
## 
cm <- results %>% yardstick::conf_mat(truth=y, yhat)

From the initial threshold of 0.5 we can see a sensitivity of about 0.892 and a specificity of about 0.999, meaning 166 true blue-tarp pixels are still classified as Down.

From our plot we selected our optimal threshold as 0.1.

yhat <- tuned_call(preds, opt_thresh)
one_minus_specificity <- mean(yhat[y=='Down']=='Up')
sensitivity <- mean(yhat[y=='Up']=='Up')
one_minus_specificity
## [1] 0.00827601
sensitivity
## [1] 0.9643552
results <- tibble::tibble(y=y, yhat=yhat)

caret::confusionMatrix(yhat, y, positive = "Up") 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  Down    Up
##       Down 46135    55
##       Up     385  1488
##                                         
##                Accuracy : 0.9908        
##                  95% CI : (0.99, 0.9917)
##     No Information Rate : 0.9679        
##     P-Value [Acc > NIR] : < 2.2e-16     
##                                         
##                   Kappa : 0.8665        
##                                         
##  Mcnemar's Test P-Value : < 2.2e-16     
##                                         
##             Sensitivity : 0.96436       
##             Specificity : 0.99172       
##          Pos Pred Value : 0.79445       
##          Neg Pred Value : 0.99881       
##              Prevalence : 0.03210       
##          Detection Rate : 0.03096       
##    Detection Prevalence : 0.03897       
##       Balanced Accuracy : 0.97804       
##                                         
##        'Positive' Class : Up            
## 
cm <- results %>% yardstick::conf_mat(truth=y, yhat)

trueNegative  = cm$table[1,1]
falseNegative = cm$table[1,2]
falsePositive = cm$table[2,1]
truePositve   = cm$table[2,2]

conditionPositive = truePositve + falseNegative
conditionNegative = trueNegative + falsePositive 
predictedConditionPositive = truePositve +  falsePositive 
predictedConditionNegative = trueNegative + falseNegative

TotalPopulation = truePositve + falsePositive + trueNegative + falseNegative

prevalence = conditionPositive / TotalPopulation
accuracy = (truePositve + trueNegative) / TotalPopulation
PPV = truePositve / predictedConditionPositive
FDR = falsePositive / predictedConditionPositive
FOR = falseNegative / predictedConditionNegative
NPV = trueNegative / predictedConditionNegative
TPR = truePositve / conditionPositive
FPR = falsePositive / conditionNegative
FNR = falseNegative / conditionPositive
SPC = TNR = trueNegative / conditionNegative
LR.PLUS = TPR / FPR
LR.MINUS = FNR/TNR
DOR = LR.PLUS / LR.MINUS

F_1_score = 2 * ((PPV * TPR)/(PPV + TPR))

Now using the optimal threshold we get:

# confusionMatrix

THRESHOLD <- opt_thresh
out_of_folds_CM <- classifier$pred %>% 
  dplyr::mutate(pred2 = ifelse(Up>THRESHOLD, 'Up', 'Down')) %>%
  dplyr::mutate(pred2 = factor(pred2, levels=c('Down','Up'))) %>%
  dplyr::group_split(Resample) %>%
  purrr::map(~ caret::confusionMatrix(data=.x$pred2, reference=.x$obs))

out_of_folds_CM_vis <- out_of_folds_CM %>% 
  purrr::map(~ .x$table %>% broom::tidy() %>%
                 ggplot(aes(x=Reference, y=Prediction, fill=n, label=n)) +
                 geom_tile() + geom_text(aes(label=n), size=5, color='white'))
 
gridExtra::grid.arrange(grobs=out_of_folds_CM_vis, ncol=2)

Summary Statistics for Logistic Regression

options(yardstick.event_first = FALSE)
ROC_curve <- predict(classifier, type='prob') %>% 
  yardstick::roc_curve(truth=train$BlueTarp_rest, estimate=Up) %>%
  dplyr::mutate(one_minus_specificity = 1-specificity)

# Sensitivity: TP/(TP+FN)
# 1-Specificity: FP/(TN+FP)

ROC_curve_plot <- ROC_curve %>%
  ggplot(aes(x=one_minus_specificity, y=sensitivity)) + 
  geom_line() + geom_point() +
  geom_abline(slope=1, intercept=0, linetype='dashed', color='red') +
  xlab("one_minus_specificity\n(false positive rate)") +
  ggtitle('ROC curve')

ggplotly(ROC_curve_plot)
temp = predict(classifier, type='prob') %>% yardstick::roc_auc(truth=train$BlueTarp_rest, Up) 

glm.Accuracy = accuracy
glm.AUC = temp$.estimate
twoclassSum = twoClassSummary( data = data.frame(obs = classifier$pred$obs,
  pred = classifier$pred$pred,
  Down = classifier$pred$Down, 
  Up = classifier$pred$Up), lev = levels(classifier$pred$obs))

glm.ROC = "TRUE" #unname(twoclassSum[1])
glm.Threshold = opt_thresh
glm.Sensitivity = glm.Recall= glm.Power =  TPR
glm.Specificity= 1-FPR
glm.FDR = FDR
glm.Precision=PPV
df = data.frame(c(glm.Accuracy, glm.AUC, glm.ROC, glm.Threshold, glm.Sensitivity, glm.Specificity, glm.FDR , glm.Precision))

colnames(df) <- c("Logistic Regression")
rownames(df) <- c("Accuracy","AUC", "ROC", "Threshold","Sensitivity=Recall=Power","Specificity=1-FPR","FDR","Precision=PPV")
df %>% kbl() %>% kable_styling()
                           Logistic Regression
Accuracy                   0.990845348813016
AUC                        0.998727855363222
ROC                        TRUE
Threshold                  0.1
Sensitivity=Recall=Power   0.964355152300713
Specificity=1-FPR          0.991723989681857
FDR                        0.205552589428724
Precision=PPV              0.794447410571276

Conclusions

df = data.frame(c(knn.Accuracy, knn.AUC, knn.ROC, knn.Threshold, knn.Sensitivity, knn.Specificity, knn.FDR , knn.Precision),
                c(lda.Accuracy, lda.AUC, lda.ROC, lda.Threshold, lda.Sensitivity, lda.Specificity, lda.FDR , lda.Precision),
                c(qda.Accuracy, qda.AUC, qda.ROC, qda.Threshold, qda.Sensitivity, qda.Specificity, qda.FDR , qda.Precision), 
                c(glm.Accuracy, glm.AUC, glm.ROC, glm.Threshold, glm.Sensitivity, glm.Specificity, glm.FDR , glm.Precision))


colnames(df) <- c(paste0("KNN (k =",bestk,")"), "LDA","QDA","Logistic Regression")
rownames(df) <- c("Accuracy","AUC", "ROC", "Threshold","Sensitivity=Recall=Power","Specificity=1-FPR","FDR","Precision=PPV")
df %>% kbl() %>% kable_styling()
                           KNN (k = 11)         LDA                  QDA                  Logistic Regression
Accuracy                   0.996026049143832    0.982606162744731    0.990241974075692    0.990845348813016
AUC                        0.999812734291107    0.989589213539748    0.998273998625808    0.998727855363222
ROC                        TRUE                 TRUE                 TRUE                 TRUE
Threshold                  0.1                  0.1                  0.1                  0.1
Sensitivity=Recall=Power   0.985093972780298    0.856124432922878    0.908619572261828    0.964355152300713
Specificity=1-FPR          0.996388650042992    0.986801375752365    0.992949269131556    0.991723989681857
FDR                        0.0995260663507109   0.317312661498708    0.189595375722543    0.205552589428724
Precision=PPV              0.900473933649289    0.682687338501292    0.810404624277457    0.794447410571276

In order to determine the overall optimal classification method, we'll use a scoring system of 5 points for first place, 3 for second, 2 for third, and 1 for fourth to rank each model on several statistics.
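The tallies in the tables below were compiled by hand; as a sketch, the same 5/3/2/1 scheme could also be computed directly from the summary data frame df built above (the metric names and the lower-is-better treatment of FDR are the only assumptions):

# Sketch of the 5/3/2/1 scoring scheme applied to the comparison table df above.
# Higher is better for every metric except FDR; ties are broken in column order.
points_for_rank <- c(5, 3, 2, 1)                 # 1st, 2nd, 3rd, 4th place
metrics <- c("Accuracy", "AUC", "Sensitivity=Recall=Power",
             "Specificity=1-FPR", "FDR", "Precision=PPV")
lower_is_better <- metrics == "FDR"

scores <- sapply(seq_along(metrics), function(i) {
  vals <- as.numeric(unlist(df[metrics[i], ]))   # df stores characters, so coerce
  ord  <- rank(if (lower_is_better[i]) vals else -vals, ties.method = "first")
  points_for_rank[ord]
})
rownames(scores) <- colnames(df)
colnames(scores) <- metrics
rowSums(scores)                                  # total points per model

The row sums should reproduce the final running totals in the tables that follow.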

If we look at the statistics provided, accuracy is a measure of \(\frac{TruePositive + TrueNegative}{TotalPopulation}\), so the optimal model should be the one with the highest accuracy.

                 KNN (k = 11)   LDA   QDA   Logistic Regression
Points                      5     1     2                     3
Running total               5     1     2                     3

AUC, or Area Under the Curve, is a measure of performance across all possible classification thresholds: lowering the classification threshold classifies more items as positive, trading specificity for sensitivity, and a high AUC means the model performs well across that entire range. As such, you'll want a model with a high AUC value.
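One way to read the AUC is as the probability that a randomly chosen blue-tarp pixel receives a higher predicted probability than a randomly chosen non-tarp pixel. A quick Monte Carlo sketch of that reading, reusing the preds and y vectors from the threshold loop above:

# Monte Carlo check of the AUC interpretation:
# P(predicted prob of a random 'Up' pixel > predicted prob of a random 'Down' pixel)
set.seed(1)
p_up   <- sample(preds[y == 'Up'],   10000, replace = TRUE)
p_down <- sample(preds[y == 'Down'], 10000, replace = TRUE)
mean(p_up > p_down) + 0.5 * mean(p_up == p_down)   # should be close to the glm roc_auc (~0.999)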

                 KNN (k = 11)   LDA   QDA   Logistic Regression
Points                      5     1     2                     3
Running total              10     2     4                     6

Using the same threshold of 0.1 across the same training set, we can compare the sensitivities and specificities.

Sensitivity is defined as \(\frac{TruePositive}{TruePositive + FalseNegative}\). Specificity is defined as \(\frac{TrueNegative}{FalsePositive + TrueNegative}\).

You want both of these to be high in order to classify observations correctly.

Sensitivity:

                 KNN (k = 11)   LDA   QDA   Logistic Regression
Points                      5     1     2                     3
Running total              15     3     6                     9

Specificity:

                 KNN (k = 11)   LDA   QDA   Logistic Regression
Points                      5     1     3                     2
Running total              20     4     9                    11

The False Discovery Rate (FDR) is the expected proportion of predicted positives that are type I errors (false positives). The optimal model is the one with the minimum FDR.
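In terms of the confusion-matrix quantities computed earlier, \(FDR = \frac{FalsePositive}{FalsePositive + TruePositive} = 1 - PPV\), which is why the FDR and Precision rankings mirror each other.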

                 KNN (k = 11)   LDA   QDA   Logistic Regression
Points                      5     1     3                     2
Running total              25     5    12                    13

Precision is a measure of \(\frac{TruePositive}{FalsePositive + TruePositive}\). Ideally, you'll want a high value of precision.

                 KNN (k = 11)   LDA   QDA   Logistic Regression
Points                      5     1     3                     2
Running total              30     6    15                    15

In the determination of which algorithm works best, KNN was shown to be the optimal classifier, followed by logistic regression and quadratic discriminant analysis, with linear discriminant analysis last. Although KNN performed the best, the logistic regression and quadratic discriminant analysis models had comparable accuracy and precision. Those models have a higher FDR, but that is largely a consequence of the selected threshold. With QDA and logistic regression performing better than LDA, we can infer that the data has a non-linear boundary between the classes. For more complicated decision boundaries, a non-parametric approach such as KNN can be superior, and KNN's strong performance here supports that inference, since KNN typically performs better for non-linear classification problems. When the true decision boundaries are linear, the LDA and logistic regression approaches will tend to perform well; when the boundaries are moderately non-linear, QDA may give better results.

An additional recommended action to improve results is to select the threshold more precisely and compare it across the classifiers. To that end, I opted to use the \(F_1\) Score, since it can measure performance on an unbalanced classification data set, and cross-referenced it with the geometric mean, since the geometric mean can “indicate a poor performance in the classification of the positive cases even if the negative cases are correctly classified as such” (Akosa). The more finely tuned thresholds appeared to be less than 0.1; however, at such low thresholds you lose some flexibility in applying future data to the generated classifiers.

The variables in this data set are the additive colors of the RGB color spectrum, and the response is an unbalanced binary classification (an unequal distribution of classes in the training data set), which makes the problem well suited to the predictive modeling tools used here. Since the goal was to find an algorithm that could effectively search the images to locate displaced persons and communicate those locations to rescue workers so they could help those who needed it in time, the work shown above can be used to find these persons and administer aid. Due to limited resources it would not be feasible to save everyone, but adjusting the threshold allows us to reach more persons than leaving it at the default.

In regards to software aspects, we expect KNN to take the longest to run: the time required for 10-fold cross validation, combined with tuning over candidate values of k, far exceeds that of LDA, QDA and GLM. It is worth noting, however, that when running KNN for a single value of k, the difference in run time between the classification models is negligible. Furthermore, you may notice that in my confusion matrices I used caret, yardstick, and manual calculations with mean(). All three produced nearly the same number of true positives at a given threshold, yet a few instances vary. This may be due to how the underlying architecture handles floating-point precision and random number generation (especially if any step re-randomizes the training data). My calculations carry a large number of significant digits, but when rounded to fewer significant digits the values are comparable. Also, since the goal of the analysis was to increase the number of true positives by adjusting the threshold, the extra false positives that come with a lower threshold were deemed a fair trade-off for raising the number of true positives in each model.
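As a rough illustration of that run-time comparison (the formula, control settings, and tuneLength below are placeholders rather than the exact values used earlier), the wall-clock cost of each fit can be captured with system.time():

# Rough timing sketch: 10-fold CV for a single glm fit vs. a KNN grid search over k
ctrl <- caret::trainControl(method = 'cv', number = 10,
                            classProbs = TRUE, savePredictions = TRUE)
timings <- list(
  glm = system.time(caret::train(BlueTarp_rest ~ Red + Green + Blue, data = train,
                                 method = 'glm', family = 'binomial', trControl = ctrl)),
  knn = system.time(caret::train(BlueTarp_rest ~ Red + Green + Blue, data = train,
                                 method = 'knn', tuneLength = 10, trControl = ctrl))
)
sapply(timings, `[[`, 'elapsed')                  # elapsed seconds per model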

References