The objective of this study is to present a machine learning model based on Random Forests that allows predicting the number of deceased and recovered patients who have contracted the influenza A (H7N9) virus in China during 2013. To do this, I used a data warehouse with information about 134 patients, of which 31 are known to have died, 46 have recovered and 57 have an unknown prognosis. For the development of this model, the R programming language, RStudio software, and libraries were used, among which ggplot2 and caret stand out.

Introduction

The influenza A(H7N9) virus is part of a subgroup of influenza viruses that normally circulate in birds. Now it’s producing infections human beings, a phenomenon that until recently had not been observed. Existing information on the scope of the disease caused by this virus and on the source of exposure is scarce. the disease is worrying because it has been serious in most the cases. At the moment there are no indications that can be transmitted from person to person, but they are actively investigating transmission routes both from animals to people and from person to person (World Health Organization, 2017). Data mining is a field of statistics and computer science referred to the process that attempt to discover patterns in large volumes of data sets. Use the methods of artificial intelligence, machine learning, statistics and database systems. The objective overview of the data mining process consists of extract information from a data set transform it into an understandable structure for your later use. Through predictive methods used in mining of data, it is possible to obtain forecasts of viral diseases in population groups defined.

Case study

The general objective of this research is to obtain an optimal model that allows generating forecasts mortality and recovery in carrier patients of the influenza A-H7N9 virus. The specific objectives are the following:

  • Carry out an exploratory analysis of the data based in the study of some trend measures central (mean, median, and quartiles)
  • Develop, train, and test a random forest model.

Methodology

Through the R Studio software, a exploratory data analysis, based on the study of measures of central tendency (mean, median and quartiles). Measures of central tendency are measures statistics that seek to summarize in a single value to a set of values. They represent a center which the data set is located. (Quevedo, 2011).

Subsequently, through the “caret” library of R, build and evaluate models from the random forest predictive methods.

Some parameters generated by the used models are the following:

  • Mtry: the algorithm will select the mtry number of predictors to attempt a split for classification when constructing a classification tree.
  • Accuracy: the accuracy of the predictor refers to how well a given predictor can guess the predicted attribute value for a new data.
  • Kappa: is a metric that compares a observed precision with an expected precision (random probability). It is used not only for evaluate a single classifier, but also to evaluate classifiers against each other. Also has taking into account random chance (according to a random classifier), which is usually means that it is less misleading than simply use precision as a metric. Landis and Koch consider in their investigations the values ​​0-0.20 as mild, 0.21-0.40 as regular, 0.41-0.60 as moderate, 0.61-0.80 as substantial and 0.81-1 like almost perfect.

Kappa = $\frac{observed precision – expected precision}{1 – expected precision}$

The Random Forest method is a combination of predictor trees such that each tree depends on the values of a random vector tested independently and with the same distribution for each one of these. It is a substantial modification of bagging that builds a long collection of trees uncorrelated and then averages them. (Breimann L, 2001).

Application

Exploratory analysis of data

The file to analyze is a .csv with the data of 134 h7n9 virus patients, of whom 31 are known to have died, 46 have recovered and 57 have a unknown prognosis. The variables that make up the dataset are the following:

  • case_id: Patient identifier.
  • outcome: Prognosis if any (recovered or deceased).
  • age: Age of the individual.
  • male: Gender (1 = masculine, 0 = feminine)
  • hospital: Boolean data that indicates if the patient has been hospitalized
  • days_to_hospital: number of days elapsed between the onset of the disease and hospitalization
  • days_to_outcome: number of days elapsed between the beginning and the end of the disease.
  • early_outcome: Indicates if the disease has lasted less time than the average in the dataset.
  • Jiangsu, Shanghai, Zhejiang, Other: Variables booleans that indicate the place of origin of the patient.

For this analysis, we start by importing the libraries that we will be using.

In [ ]:

library(dplyr)
library(readr)
library(tidyr)
library(ggplot2)
library(caret)
library(gbm)
library(rpart)
library(rattle)
library(rpart.plot)
library(RColorBrewer)

Here we can see the full dataset.

In [15]:

h7n9 <- read.csv("../input/chinah7n9/h7n9.csv")
h7n9
case_idoutcomeagemalehospitaldays_to_hospitaldays_to_outcomeearly_outcomeJiangsuOtherShanghaiZhejiang
<chr><chr><int><int><int><int><int><int><int><int><int><int>
case_1Death581041310010
case_2Death71141110010
case_3Death1101103110100
case_4NA180184601000
case_5Recover2001115701000
case_6Death90173611000
case_7Death541192011000
case_8Death1411112010001
case_9NA391101800001
case_10Death20114610010
case_11Death36112610001
case_12Death24006710010
case_13Death390131210010
case_14Recover151041010010
case_15NA3400113811000
case_16NA511032001000
case_17Death461161410010
case_18Recover381142010010
case_19Death311156700010
case_20Recover271142210100
case_21Recover391112310010
case_22NA561141701000
case_23Recover50104601000
case_24Death36106610010
case_25Recover351103500010
case_26Death491141110010
case_27Recover2301273710001
case_28NA51106610001
case_29Recover480143200010
case_30Recover530062300010
case_107Death611072100100
case_108NA551032200001
case_109NA35100800001
case_110NA251143200100
case_111Death281042200100
case_112NA411011100100
case_113NA330071310001
case_114NA220010800001
case_115NA141071000001
case_116Recover371152100100
case_117Recover480042800100
case_118NA211065701000
case_119Recover121062601000
case_120NA331033801000
case_121Death360153000100
case_122NA141051800001
case_123Death261171600100
case_124Recover521071700100
case_125Recover80002000100
case_126NA5211103100100
case_127Recover15105900100
case_128Recover301172200100
case_129Recover411142800100
case_130NA411112000100
case_131Recover60111210100
case_132NA510102400100
case_133Recover32110200100
case_134Recover2111701000
case_135Death340133200100
case_136NA230111300100

Below is a summary of the data categorized by age:

In [16]:

summary(h7n9$age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   2.00   20.25   34.00   32.46   44.75   61.00 

Let’s build some charts to get more insights from the data.

In [17]:

plot(density(h7n9$age), 
     main = "Density histogram",
     xlab = "Age (years old)",
     ylab = "Density")

In [18]:

ggplot(h7n9, aes(age)) + geom_density(aes(fill=outcome), alpha=1/3)

In [19]:

summary(h7n9['age'])
boxplot(h7n9['age'], main = "Distribution of patients by age",
        xlab = "Age",
        ylab = "Patients",
        col = "orange",
        border = "brown",
        horizontal = TRUE,
        notch = TRUE)
      age       
 Min.   : 2.00  
 1st Qu.:20.25  
 Median :34.00  
 Mean   :32.46  
 3rd Qu.:44.75  
 Max.   :61.00  

In [20]:

summary(h7n9[h7n9$outcome=='Death', 'age'])
boxplot(h7n9[h7n9$outcome=='Death', 'age'], main = "Distribution of deceased patients by age",
        xlab = "Age",
        ylab = "Patients",
        col = "orange",
        border = "brown",
        horizontal = TRUE,
        notch = TRUE)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   7.00   27.50   36.00   37.19   49.00   61.00      57 

In [21]:

summary(h7n9[h7n9$outcome=='Recover', 'age'])
boxplot(h7n9[h7n9$outcome=='Recover', 'age'], main = "Distribution of recovered patients by age",
        xlab = "Age",
        ylab = "Patients",
        col = "orange",
        border = "brown",
        horizontal = TRUE,
        notch = TRUE)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   2.00   13.25   26.00   26.89   39.75   60.00      57 

In [22]:

summary(h7n9[is.na(h7n9$outcome), 'age'])
boxplot(h7n9[is.na(h7n9$outcome), 'age'], main = "Distribution of patients without prognosis by age",
        xlab = "Age",
        ylab = "Patients",
        col = "orange",
        border = "brown",
        horizontal = TRUE,
        notch = TRUE)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   6.00   26.00   35.00   34.39   44.00   56.00 

In the distribution of all patients, age minimum is 2 years, the average age is 32.46 years and the maximum age is 61 years.

In the distribution of deceased patients, age minimum is 7 years, the average age is 37.19 years and the maximum age is 61 years.

In the distribution of recovered patients, the minimum age is 2 years, average age is 26.89 years and the maximum age is 60 years.

In the distribution of patients with a prognosis unknown, the minimum age is 6 years, the age average is 34.39 years and the maximum age is 56 years.

We can see that the average age of the deceased patients is older than the average age of recovered patients, so we can infer that age is an important variable when establish predictive models on this set of data. To reach this conclusion, it is not necessary to data transformation because it is not being using no parametric analysis method for this (regression, student’s t, correlation, ANOVA, etc).

Before applying the predictive methods, divide the data in training and test groups. The test data is composed of the 57 cases with unknown prognosis. The training data is split for validation of models: 70% of training data will be kept for the construction of the model and the The remaining 30% will be used for model testing. This ratio was used because of the importance is to use most of the data to train the model, and at the same time leave a proportion significant to run the tests.

In [23]:

unknown_index <- which(is.na(h7n9$outcome))
unknown_data = h7n9[unknown_index, ]
train_data <- h7n9[-unknown_index, ][,-1]

set.seed(1275)
val_index <- createDataPartition(train_data$outcome, p = 0.7, list=FALSE) # training data indices
val_train_data <- train_data[val_index, ] # training data
val_test_data  <- train_data[-val_index, ] # test data

Random Forest

This model has an accuracy of 76.73% with the following parameters:

  • mtry = 10
  • kappa = 0.4970

Obtaining the confusion matrix from prediction made, 14 results recorded hits and 8 wrong results, which represents an accuracy of 63.64%. The most important variable for the model is the age.

In [24]:

model_rf <- caret::train(outcome ~ .,
                         data = val_train_data,
                         method = "rf",
                         preProcess = NULL,
                         trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, verboseIter = FALSE))

model_rf
Random Forest 

55 samples
10 predictors
 2 classes: 'Death', 'Recover' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 10 times) 
Summary of sample sizes: 49, 50, 50, 49, 49, 50, ... 
Resampling results across tuning parameters:

  mtry  Accuracy   Kappa    
   2    0.6804762  0.2994427
   6    0.7564286  0.4744705
  10    0.7585238  0.4872897

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 10.

In [25]:

confusionMatrix(predict(model_rf, val_test_data), as.factor(val_test_data$outcome))
Confusion Matrix and Statistics

          Reference
Prediction Death Recover
   Death       3       2
   Recover     6      11

               Accuracy : 0.6364         
                 95% CI : (0.4066, 0.828)
    No Information Rate : 0.5909         
    P-Value [Acc > NIR] : 0.4195         

                  Kappa : 0.1927         

 Mcnemar's Test P-Value : 0.2888         

            Sensitivity : 0.3333         
            Specificity : 0.8462         
         Pos Pred Value : 0.6000         
         Neg Pred Value : 0.6471         
             Prevalence : 0.4091         
         Detection Rate : 0.1364         
   Detection Prevalence : 0.2273         
      Balanced Accuracy : 0.5897         

       'Positive' Class : Death          
                                         

In [26]:

varImp(model_rf, scale=TRUE) # Importance of the variable
varImp(model_rf, scale=TRUE) %>% plot()
rf variable importance

                  Overall
age              100.0000
days_to_outcome   35.4028
days_to_hospital  32.1660
early_outcome      8.8473
Other              4.9845
male               2.6046
hospital           2.5463
Shanghai           0.7959
Zhejiang           0.1249
Jiangsu            0.0000

In [27]:

predict(model_rf, newdata = unknown_data)  # Prediction

new_h7n9 = unknown_data # Include results in dataset
new_h7n9 %>%
  mutate(outcome=predict(model_rf, newdata=unknown_data))
  1. Recover
  2. Recover
  3. Death
  4. Recover
  5. Death
  6. Death
  7. Recover
  8. Recover
  9. Recover
  10. Recover
  11. Death
  12. Death
  13. Death
  14. Recover
  15. Recover
  16. Death
  17. Recover
  18. Death
  19. Death
  20. Death
  21. Recover
  22. Recover
  23. Recover
  24. Recover
  25. Death
  26. Death
  27. Death
  28. Recover
  29. Recover
  30. Death
  31. Recover
  32. Recover
  33. Recover
  34. Death
  35. Recover
  36. Recover
  37. Recover
  38. Recover
  39. Death
  40. Recover
  41. Recover
  42. Death
  43. Recover
  44. Death
  45. Recover
  46. Death
  47. Recover
  48. Death
  49. Recover
  50. Recover
  51. Recover
  52. Death
  53. Recover
  54. Recover
  55. Recover
  56. Recover
  57. Recover

Levels:

  1. ‘Death’
  2. ‘Recover’
case_idoutcomeagemalehospitaldays_to_hospitaldays_to_outcomeearly_outcomeJiangsuOtherShanghaiZhejiang
<chr><fct><int><int><int><int><int><int><int><int><int><int>
4case_4Recover180184601000
9case_9Recover391101800001
15case_15Death3400113811000
16case_16Recover511032001000
22case_22Death561141701000
28case_28Death51106610001
31case_31Recover431042101000
32case_32Recover461032001000
38case_38Recover28102711000
39case_39Recover381101800001
40case_40Death461151410001
41case_41Death260162800001
42case_42Death251173800010
47case_47Recover441061601000
48case_48Recover371161700001
52case_52Death3600102010001
54case_54Recover47100800001
56case_56Death451161710010
62case_62Death400011810001
63case_63Death33107210100
66case_66Recover44100211000
67case_67Recover281021000001
68case_68Recover29100600001
69case_69Recover351001700001
70case_70Death300083100001
71case_71Death44008800001
78case_80Death46106700001
82case_84Recover60072211000
83case_85Recover520173810010
84case_86Death260081110001
86case_88Recover151041810100
88case_90Recover171131100001
90case_92Recover380172200001
91case_93Death28105700001
93case_95Recover131022200001
94case_96Recover171042201000
97case_99Recover400081700001
98case_100Recover3010111700001
99case_101Death5100111000001
100case_102Recover53100800001
101case_103Recover401163201000
102case_104Death26007800001
103case_105Recover910111710001
106case_108Death551032200001
107case_109Recover35100800001
108case_110Death251143200100
110case_112Recover411011100100
111case_113Death330071310001
112case_114Recover220010800001
113case_115Recover141071000001
116case_118Recover211065701000
118case_120Death331033801000
120case_122Recover141051800001
124case_126Recover5211103100100
128case_130Recover411112000100
130case_132Recover510102400100
134case_136Recover230111300100

In [28]:

summary(predict(model_rf, newdata = unknown_data))

Death21Recover36

Results

From the data obtained, it can be determined that of the 57 patients with unknown prognosis, 21 will die and 36 will recover, indicating a mortality rate of 36.84% for this set of data.

Regarding the patients who will die, there are the following data:

  • 52.38% (11 patients) are of the gender male, and 47.62% (10 patients) are of the Female gender.
  • 27.27% (6 patients) were hospitalized, while 72.73% (15 patients) were not.
  • 14.28% (3 patients) come from Jiangsu, 66.66% (14 patients) come from Zhenjiang, the 9.52% (2 patients) come from Shanghai and the 9.52% (2 patients) come from other locations.
  • 42.85% (9 patients) had the disease less time than usual, which indicates that They died more quickly.

Regarding the patients who will recover, have the following data:

  • 77.77% (28 patients) are of the gender male, and 22.22% (8 patients) are of the Female gender.
  • 33.33% (12 patients) were hospitalized, while 66.66% (24 patients) were not.
  • 30.55% (11 patients) come from Jiangsu, the 50% (18 patients) come from Zhenjiang, the 2.77% (1 patient) comes from Shanghai and the 16.68% (6 patients) come from other locations.
  • 16.66% (6 patients) had the disease less time than usual, which indicates that They overcame the disease more quickly.

In [29]:

unknown_data$outcome = c('Recover', 'Recover', 'Death', 'Recover', 'Death', 'Death', 'Recover', 'Recover', 'Recover', 'Recover', 'Death', 'Death', 'Death', 'Recover', 
                         'Recover', 'Death', 'Recover', 'Death', 'Death', 'Death', 'Recover', 'Recover', 'Recover', 'Recover', 'Death', 'Death', 'Death', 'Recover', 'Recover', 
                         'Death', 'Recover', 'Recover', 'Recover', 'Death', 'Recover', 'Recover', 'Recover', 'Recover', 'Death', 'Recover', 'Recover', 'Death', 'Recover', 
                         'Death', 'Recover', 'Death', 'Recover', 'Death', 'Recover', 'Recover', 'Recover', 'Death', 'Recover', 'Recover', 'Recover', 'Recover', 'Recover')

par(mfrow=c(1,2))
plot(density(unknown_data$age), main = "Final results of the predictive model",
     xlab = "Forecast", ylab = "Frequency")


hist(unknown_data$age, main = "Histogram of frequencies",
     xlab = "Age",
     ylab = "Frequency",
     col = "red",
     border = "black")

death_male = unknown_data[ which(unknown_data$male=='1' & unknown_data$outcome=='Death'),]
death_female = unknown_data[ which(unknown_data$male=='0' & unknown_data$outcome=='Death'),]

death_hospital = unknown_data[ which(unknown_data$hospital=='1' & unknown_data$outcome=='Death'),]
death_not_hospital = unknown_data[ which(unknown_data$hospital=='0' & unknown_data$outcome=='Death'),]

death_jiangsu = unknown_data[ which(unknown_data$Jiangsu=='1' & unknown_data$outcome=='Death'),]
death_zhejiang = unknown_data[ which(unknown_data$Zhejiang =='1' & unknown_data$outcome=='Death'),]
death_shanghai = unknown_data[ which(unknown_data$Shanghai=='1' & unknown_data$outcome=='Death'),]
death_other = unknown_data[ which(unknown_data$Other=='1' & unknown_data$outcome=='Death'),]

death_early_outcome = unknown_data[ which(unknown_data$early_outcome=='1' & unknown_data$outcome=='Death'),]
death_not_early_outcome = unknown_data[ which(unknown_data$early_outcome=='0' & unknown_data$outcome=='Death'),]

recover_male = unknown_data[ which(unknown_data$male=='1' & unknown_data$outcome=='Recover'),]
recover_female = unknown_data[ which(unknown_data$male=='0' & unknown_data$outcome=='Recover'),]

recover_hospital = unknown_data[ which(unknown_data$hospital=='1' & unknown_data$outcome=='Recover'),]
recover_not_hospital = unknown_data[ which(unknown_data$hospital=='0' & unknown_data$outcome=='Recover'),]

recover_jiangsu = unknown_data[ which(unknown_data$Jiangsu=='1' & unknown_data$outcome=='Recover'),]
recover_zhejiang = unknown_data[ which(unknown_data$Zhejiang =='1' & unknown_data$outcome=='Recover'),]
recover_shanghai = unknown_data[ which(unknown_data$Shanghai=='1' & unknown_data$outcome=='Recover'),]
recover_other = unknown_data[ which(unknown_data$Other=='1' & unknown_data$outcome=='Recover'),]

recover_early_outcome = unknown_data[ which(unknown_data$early_outcome=='1' & unknown_data$outcome=='Recover'),]
recover_not_early_outcome = unknown_data[ which(unknown_data$early_outcome=='0' & unknown_data$outcome=='Recover'),]

count(death_early_outcome)
n
<int>
9

In [33]:

# Pie Chart
x <-  c(22, 35)
labels <-  c("Deceased","Recovered")
piepercent<- round(100*x/sum(x), 1)
pie(x, labels = piepercent, main = "Deaths & Recoveries comparison",col = rainbow(length(x)))
legend("topright", c("Deceased","Recovered"), cex = 0.8,
       fill = rainbow(length(x)))

par(mfrow=c(2,2))

hist(death_male$age, main = "Total deaths by male gender",
     xlab = "Age",
     ylab = "Frequency",
     col = "red",
     border = "black")

hist(death_female$age, main = "Total deaths by female gender",
     xlab = "Age",
     ylab = "Frequency",
     col = "red",
     border = "black")

hist(recover_male$age, main = "Total recovered by male gender",
     xlab = "Age",
     ylab = "Frequency",
     col = "red",
     border = "black")

hist(recover_female$age, main = "Total recovered by female gender",
     xlab = "Age",
     ylab = "Frequency",
     col = "red",
     border = "black")

In [31]:

par(mfrow=c(4,2))
hist(death_jiangsu$age, main = "Total deaths in Jiangsu",
     xlab = "Age",
     ylab = "Frequency",
     col = "red",
     border = "black")

hist(recover_jiangsu$age, main = "Total recovered in Jiangsu",
     xlab = "Age",
     ylab = "Frequency",
     col = "red",
     border = "black")

hist(death_shanghai$age, main = "Total deaths in Shanghai",
     xlab = "Age",
     ylab = "Frequency",
     col = "red",
     border = "black")

hist(recover_shanghai$age, main = "Total recovered in Shanghai",
     xlab = "Age",
     ylab = "Frequency",
     col = "red",
     border = "black")

hist(death_zhejiang$age, main = "Total deaths in Zhejiang",
     xlab = "Age",
     ylab = "Frequency",
     col = "red",
     border = "black")

hist(recover_zhejiang$age, main = "Total recovered in Zhejiang",
     xlab = "Age",
     ylab = "Frequency",
     col = "red",
     border = "black")

hist(death_other$age, main = "Total deaths in other location",
     xlab = "Age",
     ylab = "Frequency",
     col = "red",
     border = "black")

hist(recover_other$age, main = "Total recovered in other location
",
     xlab = "Age",
     ylab = "Frequency",
     col = "red",
     border = "black")

In [ ]: