1. Install Essential Packages

This workshop requires several R packages, but most of the activities will center around the Classification And REgression Training (caret) package. You can find detailed documentation about this package here: http://topepo.github.io/caret/index.html

To begin, let's install all the necessary packages we will need for this workshop!
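If any of these are missing from your system, one way to install them (the package names below are simply taken from the library calls in the next section) is:

install.packages(c("RPostgreSQL", "ggplot2", "gplots", "caret", "e1071",
                   "pROC", "MKmisc", "ROCR", "ResourceSelection"))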

2. Import Libraries

Once the packages are installed, R requires us to load them into our session, as shown below:

require("RPostgreSQL")
require("ggplot2")
library("gplots")
library("caret")
library("e1071")
library("pROC")
library("MKmisc")
library("ROCR")
library("ResourceSelection")

3. Import The Arterial Line Data Set

This workshop is about prediction. More specifically, we will be building an arterial-line (Aline) propensity model. A propensity model is a tool that attempts to predict the behaviour of clinicians in the ICU, given access to some relevant features. Propensity models can be used to identify those patients most likely to receive an Aline prospectively, or to figure out which patients may have received a given treatment in retrospect.

To get started, let’s first import the Arterial Line csv:

#Specify the URL that contains the CSV:
url <- "http://physionet.org/physiobank/database/mimic2-iaccd/full_cohort_data.csv"

#And now let's read the data in:
dat <- read.csv(url)

Now that the data is imported, let’s remove any rows with missing data from the dataset.

the_data=na.omit(dat)
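If you are curious how much data was discarded, a quick sanity check is to compare the dimensions of the original and cleaned data frames:

dim(dat)      # rows and columns before removing missing data
dim(the_data) # rows and columns after na.omit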

4. Our First Propensity Model

The goal of our propensity model is to evaluate whether a given patient will receive an Aline, given other information available about the patient upon admission. To build this model, we must first decide on the set of features in our dataset that we think may predict the chances of a patient receiving the Aline. For instance, we may suspect that:

  • patient age,
  • weight on admission (weight_first), and
  • severity of illness on admission (the first SOFA score, sofa_first)

all impact the probability of receiving an Aline.

We may also believe that the effect of each of these variables is well captured by a simple linear relationship on the log-odds scale. For example, each additional year of age might shift the odds of receiving an Aline by a constant multiplicative factor.
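Written out under these assumptions (using the three predictors we will put into the model below, with p denoting the probability of receiving an Aline), the relationship looks like:

$$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 \cdot \text{age} + \beta_2 \cdot \text{weight\_first} + \beta_3 \cdot \text{sofa\_first}$$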

Given these assumptions, a Logistic Regression is a reasonable model to deploy for the propensity score. Let’s specify our model.

the_model = aline_flg ~ age +  weight_first + sofa_first

This formula simply says that we can estimate the probability of an Aline using a linear combination of patient age, weight, and severity of illness.

Now that we have specified the model, we are ready to do some training and identify parameter values using our data. Before we do that, however, we need to convert the outcome variable to a "factor" so that R (and caret) treats this as a classification problem.

the_data$aline_flg  =  factor(the_data$aline_flg)

In general, it is always recommended to convert any binary response variables you want to use in your models into factors.
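It is also worth glancing at how balanced the outcome is, since that context will be useful when we interpret accuracy later on:

table(the_data$aline_flg) # number of patients without (0) and with (1) an Aline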

Having cleaned up the data, specified the model, and converted our binary outcome to a factor, we are finally ready to train the model:

the_data.train = train(the_model,
                       data =the_data, 
                       method ="glm", 
                       family="binomial")

The above code block takes the model we specified, our data, and some information about the type of model we are interested in training: in our case, a Generalized Linear Model with family set to "binomial" to indicate that we are modeling a binary response variable.

We can look at the odds ratios (the exponential of the model coefficients) to understand what the model learned from the data. For example, in the output below, the odds ratio of roughly 1.40 for sofa_first means that each one-point increase in the SOFA score multiplies the odds of receiving an Aline by about 1.4.

exp(coef(the_data.train$finalModel))
##  (Intercept)          age weight_first   sofa_first 
##    0.3046881    0.9975766    0.9991309    1.4049184

If our goal is to actually use this model in the real world, we would want some way of assessing how well it does at predicting Aline placement, given our features. One way to check this is to give the model the same data points we used to generate the model in the first place, and to see what it predicts. We would then compare the predictions of the model to what actually happened, as a way of assessing the model's performance.

pred = predict(the_data.train, newdata=the_data)

Now that we have the predictions, we can quickly check the performance of our model using the confusionMatrix command in R:

confusionMatrix(data=pred, the_data$aline_flg)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 107  71
##          1 264 622
##                                          
##                Accuracy : 0.6852         
##                  95% CI : (0.6563, 0.713)
##     No Information Rate : 0.6513         
##     P-Value [Acc > NIR] : 0.01077        
##                                          
##                   Kappa : 0.2115         
##  Mcnemar's Test P-Value : < 2e-16        
##                                          
##             Sensitivity : 0.2884         
##             Specificity : 0.8975         
##          Pos Pred Value : 0.6011         
##          Neg Pred Value : 0.7020         
##              Prevalence : 0.3487         
##          Detection Rate : 0.1006         
##    Detection Prevalence : 0.1673         
##       Balanced Accuracy : 0.5930         
##                                          
##        'Positive' Class : 0              
## 

Here we are presented with a matrix that tells us how well the classification did, the overall accuracy of the classification, and our model's sensitivity/specificity. Note that the 'Positive' class in this output is 0 (no Aline), which is worth keeping in mind when reading the sensitivity and specificity values.
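If you want to work with these numbers programmatically, one option is to store the object that confusionMatrix returns and pull out individual metrics; the positive = "1" argument in this sketch is optional and simply treats Aline placement, rather than the default first factor level, as the positive class:

cm = confusionMatrix(data=pred, reference=the_data$aline_flg, positive="1")
cm$overall["Accuracy"]                      # overall accuracy
cm$byClass[c("Sensitivity", "Specificity")] # per-class performance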

Student Exercise 1: Building a Better Model

Train a model that includes the following continuous features:

And the following binary features:

#Students, put something here.

5. A Better Way to Assess Our Model

In the previous section, we built a simple logistic regression model that predicted Aline propensity. We then tested the performance of our model on the same dataset we used to identify the model parameters in the first place.

One flaw of our approach is that we don't really know how well the model will do on data points it has never seen before. That is, all of the data points we used to assess the model's performance were data points it had already seen at some point. That's sort of like testing your understanding of concepts in calculus by giving you a problem you had already seen the solution to in a homework assignment. In such a case it would be tough for me to know whether you had actually learned the general concepts, or whether you simply memorized how to solve that particular problem. So, if I want to test how well you learned calculus, I would need to assess how well you were able to generalize your knowledge from the homework (a training set) to another set of problems that you had never seen before (a testing set). I would then use your performance on the testing set to gauge how well you actually knew the material.

The correct way to assess a machine is the same way we would assess a human being. So, let's take our dataset and split it into two pieces: 50% of the data will be used to train the machine, and 50% will be used to test it.

set.seed(988) #set the random seed so that we can regenerate these results.
inTraining = createDataPartition(the_data$aline_flg, p = .5, list = FALSE)

training = the_data[inTraining,]
testing  = the_data[-inTraining,]

Student Exercise 2: Comparing Training and Testing Set Performance

Don’t forget to set the random seed so that we can regenerate your results:

set.seed(998)
#Students, put something here.

6. Moving Beyond Accuracy

In the previous section, we compared the performance of our logistic regression models on the training and testing sets. Unfortunately, there are circumstances where accuracy can be misleading.

Coming back to our classroom analogy, let's suppose that student exams are always structured using true/false questions. Let's also assume that when I write an exam, only 10% of the questions are ever false. If a student somehow learned this about my exam structure, a winning strategy would be to always guess true. Then, without even reading any of the questions, a student could obtain a 90% on the exam and seem like an expert when in fact they knew nothing.

One potential way I could solve this problem is by weighting the importance of questions by the prevalence of their answer. For example, if a false question is 10 times less likely than a true question, then it is also worth ten times as much. You could argue that 10x the credit for a single question is too much! What if some other student always guesses false, answers nearly all the questions incorrectly, but due to my metric ends up with 50% on the exam? Perhaps false questions should only be worth 2x true questions, or maybe 3x? Clearly this situation is one without a straightforward answer.

Returning to the question of Aline prediction, we are faced with the same problem. If we want to assess the performance of our classifier, accuracy may be a flawed metric. Furthermore, accuracy assumes that the cost of misclassifying an Aline case is the same as the cost of misclassifying a non-Aline case. While it would be interesting to debate what the right misclassification costs are, a better approach is simply to test all possible costs!

A common way to do this is to use the Receiver Operating Characteristic (ROC) curve.

# TRAINING SET PERFORMANCE:
# Let's generate the ROC curve
# (mod_fit is the model you trained on the `training` set in Exercise 2)
prob = predict(mod_fit, newdata=training, type="prob")
pred = prediction(prob[,2], training$aline_flg)
perf = performance(pred, measure = "tpr", x.measure = "fpr")
plot(perf)

Each individual point that makes up the ROC curve illustrates the performance of the classifier under a different assumed misclassification cost. At the far left of the x-axis, where the false positive rate is zero, we see the true positive rate of the model when false positives (misclassifying the negative class) are treated as infinitely costly: the classifier only predicts an Aline when it is essentially certain.

If we wanted to estimate how well the model does across all classification thresholds, we can look at the area under the receiver operating characteristic curve (AUROC). Like a perfect calculus student, a perfect classifier is immune to the effects of misclassification costs and will have an area of 1 (because it is never wrong, and therefore there is no opportunity to penalize it!).

# Compute the area under the ROC curve
auc <- performance(pred, measure = "auc")
auc <- auc@y.values[[1]]
auc

Student Exercise 3: Comparing Models using AUC

#Students, put something here.

7. Model Calibration

Up until now, our focus has been to build a model that can effectively discriminate between patients who will and will not receive an Aline. Discrimination is a useful metric for understanding how well our model can sort the data into two classes but, depending on our goal, the ability of our model to perform a binary classification may not be the only thing we care about. What if we want to gradually adjust our behavior depending on how likely it is that an event will occur? In this case, we would care to know whether our model is well calibrated.

The Speedster's Dilemma

Let's assume that a calculus student, George, wakes up late for his exam. In a rush to make it to the examination hall, he jumps into his car and begins speeding. George knows that the cost of a speeding ticket depends on how far over the speed limit he is driving. So, George decides to adjust how far above the speed limit he is willing to drive depending on how likely it is that a police officer is nearby.

  • If George is 100% sure there are no officers, he drives as fast as he can!
  • If George is 75% sure there is no officer, he will still speed, but probably not as fast.
  • If George is 75% sure there is an officer, he will speed, but only a little.
  • If George is 100% sure there is an officer, he drives at precisely the speed limit.

Let's say that George decides to train a logistic regression model that predicts the probability of an officer being nearby, given some features (the weather, proximity to Dunkin' Donuts, etc.). To his great pleasure, the AUC of the model is an impressive 0.95, which means that it does an excellent job of discriminating between when a cop is nearby and when one is not.

As George sits in his car and speeds to the exam, he is slightly confused. The probability output by his model never falls beneath 40% and never goes above 60%. Putting his trust in the AUC, however, George decides to drive at twice the speed limit all the way to the exam hall, and is eventually pulled over by an officer and taken to jail!

Why did George's model fail? The reason is that he was attempting to use a model with excellent discrimination, but very poor calibration. That is, George's model may have consistently given a probability near 60% when a cop was nearby, and a probability near 40% when a cop was definitely not. If we set a discrimination threshold of 50%, it would be easy to use the output of the model to decide whether or not an officer is present, but the probabilities provided by the model are not reliable in and of themselves. When we ask how well a model is calibrated, we are attempting to understand how reliable the probabilities produced by the model actually are!

One way to check model calibration is with the Hosmer-Lemeshow test:

hoslem.test(training$aline_flg, fitted(mod_fit), g=10)
## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  training$aline_flg, fitted(mod_fit)
## X-squared = 533, df = 8, p-value < 2.2e-16

A p-value greater than 0.05 indicates good model calibration. In this case, the very small p-value suggests that our model is not well calibrated.
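To see where the calibration breaks down, a rough manual version of the same idea is to bin the predicted probabilities into deciles and compare the mean predicted probability with the observed Aline rate in each bin. This sketch assumes mod_fit is the model you trained on the training set (as in the Hosmer-Lemeshow call above):

p_hat = predict(mod_fit, newdata=training, type="prob")[,2] # predicted P(aline_flg = 1)
obs   = as.numeric(as.character(training$aline_flg))        # convert the factor back to 0/1
bins  = cut(p_hat, breaks=quantile(p_hat, probs=seq(0, 1, 0.1)), include.lowest=TRUE)
cbind(predicted = tapply(p_hat, bins, mean),  # mean predicted probability per decile
      observed  = tapply(obs, bins, mean))    # observed Aline rate per decile

In a well-calibrated model, the two columns should track each other closely.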

8. Cross Validation

Earlier in this workshop, we explained the merits of separating our data into training and testing sets for model validation. In Section 5, the data was split into two equally sized parts - let's call them datasets A and B. After splitting the data, we used A to train our model, and B to evaluate our performance.

One weakness of our approach was that we didn't have a fair way of evaluating how well the model would have performed on the data points in A, had it never seen them. Fortunately, that's an easy thing to test! We can simply re-train the same model using B, and then evaluate our performance on A. This approach, where we separate the data into two sections, train a model on each section, and test its performance on the other, is called 2-fold cross validation. If we wanted to check the model's discriminative performance, we could report the mean AUC across the two folds.
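As a concrete illustration, here is a minimal sketch of 2-fold cross validation done by hand, reusing the training (A) and testing (B) splits from Section 5 and the model formula the_model:

# Fit a model on one half of the data and report its AUC on the other half
auc_for = function(train_set, test_set) {
  fit  = train(the_model, data=train_set, method="glm", family="binomial")
  prob = predict(fit, newdata=test_set, type="prob")
  pred = prediction(prob[,2], test_set$aline_flg)
  performance(pred, measure = "auc")@y.values[[1]]
}

fold_1 = auc_for(training, testing)  # train on A, test on B
fold_2 = auc_for(testing, training)  # train on B, test on A
mean(c(fold_1, fold_2))              # mean AUC across the two folds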

Cross validation doesn't have to be just 2 folds (A and B). Indeed, we can separate the data into as many segments as we want and, in general, the more segments we separate the data into, the better we can validate our model's generalizability. For instance, we could split our data into 100 segments and, for each individual segment we want to test our model on, train our model on the remaining 99 segments. Then, just as we did in 2-fold validation, we can report the average performance across folds as our result.

Fortunately, doing cross validation in R is very straightforward:

# 100-fold cross validation, keeping the held-out predictions from each fold
ctrl = trainControl(method = "cv", number = 100, savePredictions = TRUE)
mod_fit = train(the_model, data=training, method="glm", family="binomial", trControl = ctrl)
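Once training finishes, printing the fitted object shows the performance averaged across the folds, and because we set savePredictions = TRUE, the held-out predictions from every fold are also stored on the object:

print(mod_fit)     # cross-validated performance, averaged over the folds
head(mod_fit$pred) # the held-out predictions saved from each fold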

Student Exercise 4: Putting it All Together

#Students, put something here.