Forecasting model for influenza A cases (H7N9) based on Random Forests
The objective of this study is to present a machine learning model based on Random Forests that predicts the number of deceased and recovered patients among those who contracted the influenza A (H7N9) virus in China during 2013. To do this, I used a dataset with information about 134 patients, of whom 31 are known to have died, 46 have recovered, and 57 have an unknown prognosis. The model was developed with the R programming language, the RStudio software, and several libraries, most notably ggplot2 and caret.
Introduction
The influenza A(H7N9) virus is part of a subgroup of influenza viruses that normally circulate in birds. It is now infecting human beings, a phenomenon that had not been observed until recently. Existing information on the scope of the disease caused by this virus and on the source of exposure is scarce. The disease is worrying because it has been serious in most cases. At the moment there are no indications that it can be transmitted from person to person, but transmission routes both from animals to people and from person to person are being actively investigated (World Health Organization, 2017).
Data mining is a field of statistics and computer science concerned with discovering patterns in large volumes of data. It uses methods from artificial intelligence, machine learning, statistics, and database systems. The overall objective of the data mining process is to extract information from a data set and transform it into an understandable structure for later use. Through the predictive methods used in data mining, it is possible to obtain forecasts of viral diseases in defined population groups.
Case study
The general objective of this research is to obtain a model that generates forecasts of mortality and recovery for patients carrying the influenza A (H7N9) virus. The specific objectives are the following:
- Carry out an exploratory analysis of the data based on some measures of central tendency (mean, median, and quartiles).
- Develop, train, and test a random forest model.
Methodology
Using RStudio, an exploratory data analysis is carried out, based on measures of central tendency (mean, median, and quartiles). Measures of central tendency are statistics that summarize a set of values in a single value; they represent the center around which the data set is located (Quevedo, 2011).
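As a minimal sketch of these measures, the following base R snippet computes them for the ages of the first ten cases shown in the dataset listing of this document. The variable names are illustrative, and the quartiles rely on R's default interpolation rule.

```r
# Ages of the first ten cases in the dataset listing (illustrative subset only)
ages <- c(58, 7, 11, 18, 20, 9, 54, 14, 39, 20)

age_mean <- mean(ages)                                   # arithmetic mean
age_median <- median(ages)                               # middle value of the sorted ages
age_quartiles <- quantile(ages, probs = c(0.25, 0.75))   # 1st and 3rd quartiles (default type 7)

print(age_mean)
print(age_median)
print(age_quartiles)
```

The same numbers are what `summary()` reports for a numeric vector, together with the minimum and maximum.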
Subsequently, random forest models are built and evaluated with R's “caret” library.
Some of the parameters and metrics reported for these models are the following:
- Mtry: the number of predictors that the algorithm randomly selects as candidates for each split when constructing a classification tree.
- Accuracy: how well the model predicts the value of the target attribute for new data, measured as the proportion of correct predictions.
- Kappa: a metric that compares the observed accuracy with the expected accuracy (the agreement expected by chance). It is used not only to evaluate a single classifier, but also to compare classifiers with each other. Because it takes random chance into account (the behavior of a random classifier), it is usually less misleading than using accuracy alone. Landis and Koch describe values of 0-0.20 as slight, 0.21-0.40 as fair, 0.41-0.60 as moderate, 0.61-0.80 as substantial, and 0.81-1 as almost perfect agreement.
$\kappa = \frac{\text{observed accuracy} - \text{expected accuracy}}{1 - \text{expected accuracy}}$
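The formula can be checked by hand. The sketch below applies it to the counts of the hold-out confusion matrix reported in the Random Forest section of this document; the helper name `cohen_kappa` is hypothetical and is not part of caret.

```r
# Counts from the confusion matrix reported later in this document
# (rows = prediction, columns = reference; matrix() fills column by column)
cm <- matrix(c(3, 6, 2, 11), nrow = 2,
             dimnames = list(Prediction = c("Death", "Recover"),
                             Reference  = c("Death", "Recover")))

# Hypothetical helper: kappa from a square confusion matrix
cohen_kappa <- function(cm) {
  n <- sum(cm)
  observed <- sum(diag(cm)) / n                      # share of correct predictions
  expected <- sum(rowSums(cm) * colSums(cm)) / n^2   # agreement expected by chance
  (observed - expected) / (1 - expected)
}

kappa_value <- cohen_kappa(cm)
print(round(kappa_value, 4))
```

The result agrees with the Kappa of 0.1927 that `confusionMatrix()` prints for the same table.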
The Random Forest method is a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees. It is a substantial modification of bagging that builds a large collection of decorrelated trees and then averages them (Breiman, 2001).
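To make the two sources of randomness concrete, here is a small base R sketch of what a single tree sees: a bootstrap sample of the rows and, at a split, a random subset of `mtry` predictors. It only illustrates the sampling steps and does not grow any tree; the seed and the value of `mtry` are arbitrary assumptions.

```r
set.seed(42)  # arbitrary seed, used only to make this sketch reproducible

# Predictor names as listed in the dataset description
predictors <- c("age", "male", "hospital", "days_to_hospital", "days_to_outcome",
                "early_outcome", "Jiangsu", "Other", "Shanghai", "Zhejiang")
n_rows <- 55  # number of training rows used later in this document
mtry <- 2     # hypothetical number of candidate predictors per split

# Bagging: each tree is grown on n_rows rows drawn with replacement
bootstrap_rows <- sample(n_rows, size = n_rows, replace = TRUE)

# Rows that were never drawn are "out of bag" for this tree
oob_rows <- setdiff(seq_len(n_rows), bootstrap_rows)

# Random forest refinement: each split considers only mtry random predictors
split_candidates <- sample(predictors, size = mtry)

print(length(unique(bootstrap_rows)))
print(length(oob_rows))
print(split_candidates)
```

Restricting each split to a random subset of predictors is what makes the trees less correlated than in plain bagging.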
Application
Exploratory analysis of data
The file to analyze is a .csv with the data of 134 H7N9 patients, of whom 31 are known to have died, 46 have recovered, and 57 have an unknown prognosis. The variables that make up the dataset are the following:
- case_id: Patient identifier.
- outcome: Prognosis if any (recovered or deceased).
- age: Age of the individual.
- male: Gender (1 = male, 0 = female).
- hospital: Boolean that indicates whether the patient has been hospitalized.
- days_to_hospital: Number of days elapsed between the onset of the disease and hospitalization.
- days_to_outcome: Number of days elapsed between the beginning and the end of the disease.
- early_outcome: Indicates whether the disease has lasted less time than the average in the dataset.
- Jiangsu, Shanghai, Zhejiang, Other: Boolean variables that indicate the place of origin of the patient.
For this analysis, we start by importing the libraries that we will be using.
In [ ]:
library(dplyr)
library(readr)
library(tidyr)
library(ggplot2)
library(caret)
library(gbm)
library(rpart)
library(rattle)
library(rpart.plot)
library(RColorBrewer)
Here we can see the full dataset.
In [15]:
h7n9 <- read.csv("../input/chinah7n9/h7n9.csv")
h7n9
case_id | outcome | age | male | hospital | days_to_hospital | days_to_outcome | early_outcome | Jiangsu | Other | Shanghai | Zhejiang |
---|---|---|---|---|---|---|---|---|---|---|---|
<chr> | <chr> | <int> | <int> | <int> | <int> | <int> | <int> | <int> | <int> | <int> | <int> |
case_1 | Death | 58 | 1 | 0 | 4 | 13 | 1 | 0 | 0 | 1 | 0 |
case_2 | Death | 7 | 1 | 1 | 4 | 11 | 1 | 0 | 0 | 1 | 0 |
case_3 | Death | 11 | 0 | 1 | 10 | 31 | 1 | 0 | 1 | 0 | 0 |
case_4 | NA | 18 | 0 | 1 | 8 | 46 | 0 | 1 | 0 | 0 | 0 |
case_5 | Recover | 20 | 0 | 1 | 11 | 57 | 0 | 1 | 0 | 0 | 0 |
case_6 | Death | 9 | 0 | 1 | 7 | 36 | 1 | 1 | 0 | 0 | 0 |
case_7 | Death | 54 | 1 | 1 | 9 | 20 | 1 | 1 | 0 | 0 | 0 |
case_8 | Death | 14 | 1 | 1 | 11 | 20 | 1 | 0 | 0 | 0 | 1 |
case_9 | NA | 39 | 1 | 1 | 0 | 18 | 0 | 0 | 0 | 0 | 1 |
case_10 | Death | 20 | 1 | 1 | 4 | 6 | 1 | 0 | 0 | 1 | 0 |
case_11 | Death | 36 | 1 | 1 | 2 | 6 | 1 | 0 | 0 | 0 | 1 |
case_12 | Death | 24 | 0 | 0 | 6 | 7 | 1 | 0 | 0 | 1 | 0 |
case_13 | Death | 39 | 0 | 1 | 3 | 12 | 1 | 0 | 0 | 1 | 0 |
case_14 | Recover | 15 | 1 | 0 | 4 | 10 | 1 | 0 | 0 | 1 | 0 |
case_15 | NA | 34 | 0 | 0 | 11 | 38 | 1 | 1 | 0 | 0 | 0 |
case_16 | NA | 51 | 1 | 0 | 3 | 20 | 0 | 1 | 0 | 0 | 0 |
case_17 | Death | 46 | 1 | 1 | 6 | 14 | 1 | 0 | 0 | 1 | 0 |
case_18 | Recover | 38 | 1 | 1 | 4 | 20 | 1 | 0 | 0 | 1 | 0 |
case_19 | Death | 31 | 1 | 1 | 5 | 67 | 0 | 0 | 0 | 1 | 0 |
case_20 | Recover | 27 | 1 | 1 | 4 | 22 | 1 | 0 | 1 | 0 | 0 |
case_21 | Recover | 39 | 1 | 1 | 1 | 23 | 1 | 0 | 0 | 1 | 0 |
case_22 | NA | 56 | 1 | 1 | 4 | 17 | 0 | 1 | 0 | 0 | 0 |
case_23 | Recover | 5 | 0 | 1 | 0 | 46 | 0 | 1 | 0 | 0 | 0 |
case_24 | Death | 36 | 1 | 0 | 6 | 6 | 1 | 0 | 0 | 1 | 0 |
case_25 | Recover | 35 | 1 | 1 | 0 | 35 | 0 | 0 | 0 | 1 | 0 |
case_26 | Death | 49 | 1 | 1 | 4 | 11 | 1 | 0 | 0 | 1 | 0 |
case_27 | Recover | 23 | 0 | 1 | 27 | 37 | 1 | 0 | 0 | 0 | 1 |
case_28 | NA | 51 | 1 | 0 | 6 | 6 | 1 | 0 | 0 | 0 | 1 |
case_29 | Recover | 48 | 0 | 1 | 4 | 32 | 0 | 0 | 0 | 1 | 0 |
case_30 | Recover | 53 | 0 | 0 | 6 | 23 | 0 | 0 | 0 | 1 | 0 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
case_107 | Death | 61 | 1 | 0 | 7 | 21 | 0 | 0 | 1 | 0 | 0 |
case_108 | NA | 55 | 1 | 0 | 3 | 22 | 0 | 0 | 0 | 0 | 1 |
case_109 | NA | 35 | 1 | 0 | 0 | 8 | 0 | 0 | 0 | 0 | 1 |
case_110 | NA | 25 | 1 | 1 | 4 | 32 | 0 | 0 | 1 | 0 | 0 |
case_111 | Death | 28 | 1 | 0 | 4 | 22 | 0 | 0 | 1 | 0 | 0 |
case_112 | NA | 41 | 1 | 0 | 1 | 11 | 0 | 0 | 1 | 0 | 0 |
case_113 | NA | 33 | 0 | 0 | 7 | 13 | 1 | 0 | 0 | 0 | 1 |
case_114 | NA | 22 | 0 | 0 | 10 | 8 | 0 | 0 | 0 | 0 | 1 |
case_115 | NA | 14 | 1 | 0 | 7 | 10 | 0 | 0 | 0 | 0 | 1 |
case_116 | Recover | 37 | 1 | 1 | 5 | 21 | 0 | 0 | 1 | 0 | 0 |
case_117 | Recover | 48 | 0 | 0 | 4 | 28 | 0 | 0 | 1 | 0 | 0 |
case_118 | NA | 21 | 1 | 0 | 6 | 57 | 0 | 1 | 0 | 0 | 0 |
case_119 | Recover | 12 | 1 | 0 | 6 | 26 | 0 | 1 | 0 | 0 | 0 |
case_120 | NA | 33 | 1 | 0 | 3 | 38 | 0 | 1 | 0 | 0 | 0 |
case_121 | Death | 36 | 0 | 1 | 5 | 30 | 0 | 0 | 1 | 0 | 0 |
case_122 | NA | 14 | 1 | 0 | 5 | 18 | 0 | 0 | 0 | 0 | 1 |
case_123 | Death | 26 | 1 | 1 | 7 | 16 | 0 | 0 | 1 | 0 | 0 |
case_124 | Recover | 52 | 1 | 0 | 7 | 17 | 0 | 0 | 1 | 0 | 0 |
case_125 | Recover | 8 | 0 | 0 | 0 | 20 | 0 | 0 | 1 | 0 | 0 |
case_126 | NA | 52 | 1 | 1 | 10 | 31 | 0 | 0 | 1 | 0 | 0 |
case_127 | Recover | 15 | 1 | 0 | 5 | 9 | 0 | 0 | 1 | 0 | 0 |
case_128 | Recover | 30 | 1 | 1 | 7 | 22 | 0 | 0 | 1 | 0 | 0 |
case_129 | Recover | 41 | 1 | 1 | 4 | 28 | 0 | 0 | 1 | 0 | 0 |
case_130 | NA | 41 | 1 | 1 | 1 | 20 | 0 | 0 | 1 | 0 | 0 |
case_131 | Recover | 60 | 1 | 1 | 1 | 2 | 1 | 0 | 1 | 0 | 0 |
case_132 | NA | 51 | 0 | 1 | 0 | 24 | 0 | 0 | 1 | 0 | 0 |
case_133 | Recover | 32 | 1 | 1 | 0 | 2 | 0 | 0 | 1 | 0 | 0 |
case_134 | Recover | 2 | 1 | 1 | 1 | 7 | 0 | 1 | 0 | 0 | 0 |
case_135 | Death | 34 | 0 | 1 | 3 | 32 | 0 | 0 | 1 | 0 | 0 |
case_136 | NA | 23 | 0 | 1 | 1 | 13 | 0 | 0 | 1 | 0 | 0 |
Below is a summary of the age variable:
In [16]:
summary(h7n9$age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   2.00   20.25   34.00   32.46   44.75   61.00
Let’s build some charts to get more insights from the data.
In [17]:
plot(density(h7n9$age), main = "Density histogram", xlab = "Age (years old)", ylab = "Density")
In [18]:
ggplot(h7n9, aes(age)) + geom_density(aes(fill=outcome), alpha=1/3)
In [19]:
summary(h7n9['age'])
boxplot(h7n9['age'], main = "Distribution of patients by age", xlab = "Age", ylab = "Patients", col = "orange", border = "brown", horizontal = TRUE, notch = TRUE)
      age
 Min.   : 2.00
 1st Qu.:20.25
 Median :34.00
 Mean   :32.46
 3rd Qu.:44.75
 Max.   :61.00
In [20]:
# Rows with a missing outcome produce NA in this subset, hence the 57 NA's in the summary
summary(h7n9[h7n9$outcome=='Death', 'age'])
boxplot(h7n9[h7n9$outcome=='Death', 'age'], main = "Distribution of deceased patients by age", xlab = "Age", ylab = "Patients", col = "orange", border = "brown", horizontal = TRUE, notch = TRUE)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
   7.00   27.50   36.00   37.19   49.00   61.00      57
In [21]:
# Rows with a missing outcome produce NA in this subset, hence the 57 NA's in the summary
summary(h7n9[h7n9$outcome=='Recover', 'age'])
boxplot(h7n9[h7n9$outcome=='Recover', 'age'], main = "Distribution of recovered patients by age", xlab = "Age", ylab = "Patients", col = "orange", border = "brown", horizontal = TRUE, notch = TRUE)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
   2.00   13.25   26.00   26.89   39.75   60.00      57
In [22]:
summary(h7n9[is.na(h7n9$outcome), 'age'])
boxplot(h7n9[is.na(h7n9$outcome), 'age'], main = "Distribution of patients without prognosis by age", xlab = "Age", ylab = "Patients", col = "orange", border = "brown", horizontal = TRUE, notch = TRUE)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   6.00   26.00   35.00   34.39   44.00   56.00
In the distribution of all patients, the minimum age is 2 years, the average age is 32.46 years, and the maximum age is 61 years.
In the distribution of deceased patients, the minimum age is 7 years, the average age is 37.19 years, and the maximum age is 61 years.
In the distribution of recovered patients, the minimum age is 2 years, the average age is 26.89 years, and the maximum age is 60 years.
In the distribution of patients with an unknown prognosis, the minimum age is 6 years, the average age is 34.39 years, and the maximum age is 56 years.
The average age of the deceased patients is higher than the average age of the recovered patients, which suggests that age is an important variable when building predictive models on this data set. Reaching this conclusion does not require any data transformation, because no parametric analysis method (regression, Student's t, correlation, ANOVA, etc.) is being used for it.
Before applying the predictive method, the data is divided into training and test groups. The test data consists of the 57 cases with unknown prognosis. The training data is further split for model validation: 70% is kept for building the model and the remaining 30% is used for testing it. This ratio was chosen because it uses most of the data to train the model while leaving a meaningful proportion to run the tests.
In [23]:
unknown_index <- which(is.na(h7n9$outcome))
unknown_data = h7n9[unknown_index, ]
train_data <- h7n9[-unknown_index, ][,-1]
set.seed(1275)
val_index <- createDataPartition(train_data$outcome, p = 0.7, list=FALSE) # training data indices
val_train_data <- train_data[val_index, ] # training data
val_test_data <- train_data[-val_index, ] # test data
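The sizes produced by this split can be sanity-checked with simple arithmetic. The sketch below assumes that the stratified partition takes 70% of each outcome class and rounds the count up; that assumption is consistent with the 55 training samples and the 9 Death / 13 Recover test cases that appear in the outputs of this document.

```r
# Known outcomes in the dataset: 31 deaths and 46 recoveries
class_counts <- c(Death = 31, Recover = 46)
p <- 0.7

# Assumption: the training share is taken per class and rounded up
train_per_class <- ceiling(class_counts * p)
test_per_class <- class_counts - train_per_class

print(train_per_class)
print(test_per_class)
print(sum(train_per_class))  # rows available for model building
print(sum(test_per_class))   # rows held out for testing
```

Sampling within each class keeps the proportion of deaths and recoveries similar in both subsets, which matters with so few cases.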
Random Forest
In repeated cross-validation, the selected model reaches an accuracy of 75.85% with the following parameters:
- mtry = 10
- kappa = 0.4873
On the held-out test data, the confusion matrix of the predictions shows 14 correct results and 8 wrong results, which represents an accuracy of 63.64%. The most important variable for the model is age.
In [24]:
model_rf <- caret::train(outcome ~ .,
                         data = val_train_data,
                         method = "rf",
                         preProcess = NULL,
                         trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, verboseIter = FALSE))
model_rf
Random Forest
55 samples
10 predictors
 2 classes: 'Death', 'Recover'
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 10 times)
Summary of sample sizes: 49, 50, 50, 49, 49, 50, ...
Resampling results across tuning parameters:
  mtry  Accuracy   Kappa
   2    0.6804762  0.2994427
   6    0.7564286  0.4744705
  10    0.7585238  0.4872897
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 10.
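The sample sizes of 49 and 50 in this output follow from the resampling scheme. As a rough sketch, assuming the 55 training rows are spread as evenly as possible over 10 folds, each fold holds out 5 or 6 rows and trains on the rest; repeating the procedure 10 times gives 100 resamples per candidate value of mtry.

```r
n_samples <- 55  # training rows reported in the model output
k <- 10          # folds (number = 10)
repeats <- 10    # repetitions (repeats = 10)

# Spread the rows as evenly as possible: the first (n %% k) folds get one extra row
held_out <- rep(n_samples %/% k, k) + (seq_len(k) <= n_samples %% k)
training_sizes <- n_samples - held_out
total_resamples <- k * repeats

print(held_out)
print(training_sizes)
print(total_resamples)
```

The reported accuracy and kappa for each mtry are averages over these resamples, so they describe the training data only and can differ from the hold-out results.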
In [25]:
confusionMatrix(predict(model_rf, val_test_data), as.factor(val_test_data$outcome))
Confusion Matrix and Statistics
          Reference
Prediction Death Recover
   Death       3       2
   Recover     6      11
               Accuracy : 0.6364
                 95% CI : (0.4066, 0.828)
    No Information Rate : 0.5909
    P-Value [Acc > NIR] : 0.4195
                  Kappa : 0.1927
 Mcnemar's Test P-Value : 0.2888
            Sensitivity : 0.3333
            Specificity : 0.8462
         Pos Pred Value : 0.6000
         Neg Pred Value : 0.6471
             Prevalence : 0.4091
         Detection Rate : 0.1364
   Detection Prevalence : 0.2273
      Balanced Accuracy : 0.5897
       'Positive' Class : Death
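Most of the statistics in this output can be reproduced from the four cell counts. The sketch below does so by hand, treating 'Death' as the positive class as the output states; the variable names are illustrative.

```r
# Cell counts from the confusion matrix above ('Death' is the positive class)
tp <- 3   # predicted Death, actually Death
fp <- 2   # predicted Death, actually Recover
fn <- 6   # predicted Recover, actually Death
tn <- 11  # predicted Recover, actually Recover
n <- tp + fp + fn + tn

accuracy <- (tp + tn) / n
sensitivity <- tp / (tp + fn)            # share of actual deaths that were detected
specificity <- tn / (tn + fp)            # share of actual recoveries that were detected
pos_pred_value <- tp / (tp + fp)
neg_pred_value <- tn / (tn + fn)
balanced_accuracy <- (sensitivity + specificity) / 2
no_information_rate <- max(tp + fn, tn + fp) / n  # always predicting the majority class

print(round(c(accuracy, sensitivity, specificity, pos_pred_value,
              neg_pred_value, balanced_accuracy, no_information_rate), 4))
```

The low sensitivity shows that the model misses two thirds of the deaths in the hold-out set, and the accuracy is only slightly above the no-information rate.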
In [26]:
varImp(model_rf, scale=TRUE) # Importance of the variable
varImp(model_rf, scale=TRUE) %>% plot()
rf variable importance
                  Overall
age              100.0000
days_to_outcome   35.4028
days_to_hospital  32.1660
early_outcome      8.8473
Other              4.9845
male               2.6046
hospital           2.5463
Shanghai           0.7959
Zhejiang           0.1249
Jiangsu            0.0000
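The 0 to 100 range of these scores comes from the `scale = TRUE` argument. A minimal sketch of that rescaling follows, assuming a min-max transformation, which is consistent with the top predictor scoring exactly 100 and the bottom one exactly 0. The raw scores used here are hypothetical and are not taken from the fitted model.

```r
# Hypothetical raw importance scores (NOT the values of the fitted model)
raw_importance <- c(age = 8.4, days_to_outcome = 3.1, hospital = 0.5, Jiangsu = 0.2)

# Assumed behaviour of scale = TRUE: min-max rescaling to the 0-100 range
scaled_importance <- 100 * (raw_importance - min(raw_importance)) /
  (max(raw_importance) - min(raw_importance))

print(round(scaled_importance, 2))
```

Under this scaling a score of 0 means only that the variable was the least important of the set, not that it contributed nothing.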
In [27]:
predict(model_rf, newdata = unknown_data) # Prediction
new_h7n9 = unknown_data # Include results in dataset
new_h7n9 %>% mutate(outcome=predict(model_rf, newdata=unknown_data))
- Recover
- Recover
- Death
- Recover
- Death
- Death
- Recover
- Recover
- Recover
- Recover
- Death
- Death
- Death
- Recover
- Recover
- Death
- Recover
- Death
- Death
- Death
- Recover
- Recover
- Recover
- Recover
- Death
- Death
- Death
- Recover
- Recover
- Death
- Recover
- Recover
- Recover
- Death
- Recover
- Recover
- Recover
- Recover
- Death
- Recover
- Recover
- Death
- Recover
- Death
- Recover
- Death
- Recover
- Death
- Recover
- Recover
- Recover
- Death
- Recover
- Recover
- Recover
- Recover
- Recover
Levels:
- ‘Death’
- ‘Recover’
 | case_id | outcome | age | male | hospital | days_to_hospital | days_to_outcome | early_outcome | Jiangsu | Other | Shanghai | Zhejiang |
---|---|---|---|---|---|---|---|---|---|---|---|---|
 | <chr> | <fct> | <int> | <int> | <int> | <int> | <int> | <int> | <int> | <int> | <int> | <int> |
4 | case_4 | Recover | 18 | 0 | 1 | 8 | 46 | 0 | 1 | 0 | 0 | 0 |
9 | case_9 | Recover | 39 | 1 | 1 | 0 | 18 | 0 | 0 | 0 | 0 | 1 |
15 | case_15 | Death | 34 | 0 | 0 | 11 | 38 | 1 | 1 | 0 | 0 | 0 |
16 | case_16 | Recover | 51 | 1 | 0 | 3 | 20 | 0 | 1 | 0 | 0 | 0 |
22 | case_22 | Death | 56 | 1 | 1 | 4 | 17 | 0 | 1 | 0 | 0 | 0 |
28 | case_28 | Death | 51 | 1 | 0 | 6 | 6 | 1 | 0 | 0 | 0 | 1 |
31 | case_31 | Recover | 43 | 1 | 0 | 4 | 21 | 0 | 1 | 0 | 0 | 0 |
32 | case_32 | Recover | 46 | 1 | 0 | 3 | 20 | 0 | 1 | 0 | 0 | 0 |
38 | case_38 | Recover | 28 | 1 | 0 | 2 | 7 | 1 | 1 | 0 | 0 | 0 |
39 | case_39 | Recover | 38 | 1 | 1 | 0 | 18 | 0 | 0 | 0 | 0 | 1 |
40 | case_40 | Death | 46 | 1 | 1 | 5 | 14 | 1 | 0 | 0 | 0 | 1 |
41 | case_41 | Death | 26 | 0 | 1 | 6 | 28 | 0 | 0 | 0 | 0 | 1 |
42 | case_42 | Death | 25 | 1 | 1 | 7 | 38 | 0 | 0 | 0 | 1 | 0 |
47 | case_47 | Recover | 44 | 1 | 0 | 6 | 16 | 0 | 1 | 0 | 0 | 0 |
48 | case_48 | Recover | 37 | 1 | 1 | 6 | 17 | 0 | 0 | 0 | 0 | 1 |
52 | case_52 | Death | 36 | 0 | 0 | 10 | 20 | 1 | 0 | 0 | 0 | 1 |
54 | case_54 | Recover | 47 | 1 | 0 | 0 | 8 | 0 | 0 | 0 | 0 | 1 |
56 | case_56 | Death | 45 | 1 | 1 | 6 | 17 | 1 | 0 | 0 | 1 | 0 |
62 | case_62 | Death | 40 | 0 | 0 | 11 | 8 | 1 | 0 | 0 | 0 | 1 |
63 | case_63 | Death | 33 | 1 | 0 | 7 | 2 | 1 | 0 | 1 | 0 | 0 |
66 | case_66 | Recover | 44 | 1 | 0 | 0 | 2 | 1 | 1 | 0 | 0 | 0 |
67 | case_67 | Recover | 28 | 1 | 0 | 2 | 10 | 0 | 0 | 0 | 0 | 1 |
68 | case_68 | Recover | 29 | 1 | 0 | 0 | 6 | 0 | 0 | 0 | 0 | 1 |
69 | case_69 | Recover | 35 | 1 | 0 | 0 | 17 | 0 | 0 | 0 | 0 | 1 |
70 | case_70 | Death | 30 | 0 | 0 | 8 | 31 | 0 | 0 | 0 | 0 | 1 |
71 | case_71 | Death | 44 | 0 | 0 | 8 | 8 | 0 | 0 | 0 | 0 | 1 |
78 | case_80 | Death | 46 | 1 | 0 | 6 | 7 | 0 | 0 | 0 | 0 | 1 |
82 | case_84 | Recover | 6 | 0 | 0 | 7 | 22 | 1 | 1 | 0 | 0 | 0 |
83 | case_85 | Recover | 52 | 0 | 1 | 7 | 38 | 1 | 0 | 0 | 1 | 0 |
84 | case_86 | Death | 26 | 0 | 0 | 8 | 11 | 1 | 0 | 0 | 0 | 1 |
86 | case_88 | Recover | 15 | 1 | 0 | 4 | 18 | 1 | 0 | 1 | 0 | 0 |
88 | case_90 | Recover | 17 | 1 | 1 | 3 | 11 | 0 | 0 | 0 | 0 | 1 |
90 | case_92 | Recover | 38 | 0 | 1 | 7 | 22 | 0 | 0 | 0 | 0 | 1 |
91 | case_93 | Death | 28 | 1 | 0 | 5 | 7 | 0 | 0 | 0 | 0 | 1 |
93 | case_95 | Recover | 13 | 1 | 0 | 2 | 22 | 0 | 0 | 0 | 0 | 1 |
94 | case_96 | Recover | 17 | 1 | 0 | 4 | 22 | 0 | 1 | 0 | 0 | 0 |
97 | case_99 | Recover | 40 | 0 | 0 | 8 | 17 | 0 | 0 | 0 | 0 | 1 |
98 | case_100 | Recover | 30 | 1 | 0 | 11 | 17 | 0 | 0 | 0 | 0 | 1 |
99 | case_101 | Death | 51 | 0 | 0 | 11 | 10 | 0 | 0 | 0 | 0 | 1 |
100 | case_102 | Recover | 53 | 1 | 0 | 0 | 8 | 0 | 0 | 0 | 0 | 1 |
101 | case_103 | Recover | 40 | 1 | 1 | 6 | 32 | 0 | 1 | 0 | 0 | 0 |
102 | case_104 | Death | 26 | 0 | 0 | 7 | 8 | 0 | 0 | 0 | 0 | 1 |
103 | case_105 | Recover | 9 | 1 | 0 | 11 | 17 | 1 | 0 | 0 | 0 | 1 |
106 | case_108 | Death | 55 | 1 | 0 | 3 | 22 | 0 | 0 | 0 | 0 | 1 |
107 | case_109 | Recover | 35 | 1 | 0 | 0 | 8 | 0 | 0 | 0 | 0 | 1 |
108 | case_110 | Death | 25 | 1 | 1 | 4 | 32 | 0 | 0 | 1 | 0 | 0 |
110 | case_112 | Recover | 41 | 1 | 0 | 1 | 11 | 0 | 0 | 1 | 0 | 0 |
111 | case_113 | Death | 33 | 0 | 0 | 7 | 13 | 1 | 0 | 0 | 0 | 1 |
112 | case_114 | Recover | 22 | 0 | 0 | 10 | 8 | 0 | 0 | 0 | 0 | 1 |
113 | case_115 | Recover | 14 | 1 | 0 | 7 | 10 | 0 | 0 | 0 | 0 | 1 |
116 | case_118 | Recover | 21 | 1 | 0 | 6 | 57 | 0 | 1 | 0 | 0 | 0 |
118 | case_120 | Death | 33 | 1 | 0 | 3 | 38 | 0 | 1 | 0 | 0 | 0 |
120 | case_122 | Recover | 14 | 1 | 0 | 5 | 18 | 0 | 0 | 0 | 0 | 1 |
124 | case_126 | Recover | 52 | 1 | 1 | 10 | 31 | 0 | 0 | 1 | 0 | 0 |
128 | case_130 | Recover | 41 | 1 | 1 | 1 | 20 | 0 | 0 | 1 | 0 | 0 |
130 | case_132 | Recover | 51 | 0 | 1 | 0 | 24 | 0 | 0 | 1 | 0 | 0 |
134 | case_136 | Recover | 23 | 0 | 1 | 1 | 13 | 0 | 0 | 1 | 0 | 0 |
In [28]:
summary(predict(model_rf, newdata = unknown_data))
  Death Recover
     21      36
Results
According to the model, of the 57 patients with unknown prognosis, 21 are predicted to die and 36 to recover, which corresponds to a predicted mortality rate of 36.84% for this group.
For the patients predicted to die:
- 52.38% (11 patients) are male and 47.62% (10 patients) are female.
- 28.57% (6 patients) were hospitalized, while 71.43% (15 patients) were not.
- 14.29% (3 patients) come from Jiangsu, 66.67% (14 patients) from Zhejiang, 9.52% (2 patients) from Shanghai, and 9.52% (2 patients) from other locations.
- 42.86% (9 patients) had the disease for less time than the dataset average, which indicates that they died more quickly.
For the patients predicted to recover:
- 77.78% (28 patients) are male and 22.22% (8 patients) are female.
- 33.33% (12 patients) were hospitalized, while 66.67% (24 patients) were not.
- 30.56% (11 patients) come from Jiangsu, 50% (18 patients) from Zhejiang, 2.78% (1 patient) from Shanghai, and 16.67% (6 patients) from other locations.
- 16.67% (6 patients) had the disease for less time than the dataset average, which indicates that they overcame the disease more quickly.
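The percentages in these lists are plain shares of the patient counts. The sketch below recomputes a few of them from the counts given above; the helper `pct` is a hypothetical name introduced only for this illustration.

```r
# Hypothetical helper: percentage share of each count, rounded to two decimals
pct <- function(counts) round(100 * counts / sum(counts), 2)

# Predicted mortality among the 57 patients with unknown prognosis
mortality <- pct(c(Death = 21, Recover = 36))

# Gender and origin of the patients predicted to die
death_gender <- pct(c(male = 11, female = 10))
death_origin <- pct(c(Jiangsu = 3, Zhejiang = 14, Shanghai = 2, Other = 2))

# Origin of the patients predicted to recover
recover_origin <- pct(c(Jiangsu = 11, Zhejiang = 18, Shanghai = 1, Other = 6))

print(mortality)
print(death_gender)
print(death_origin)
print(recover_origin)
```

Deriving the shares from the counts in one place avoids small rounding differences between hand-typed figures.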
In [29]:
unknown_data$outcome = c('Recover', 'Recover', 'Death', 'Recover', 'Death', 'Death', 'Recover', 'Recover', 'Recover', 'Recover', 'Death', 'Death', 'Death', 'Recover', 'Recover', 'Death', 'Recover', 'Death', 'Death', 'Death', 'Recover', 'Recover', 'Recover', 'Recover', 'Death', 'Death', 'Death', 'Recover', 'Recover', 'Death', 'Recover', 'Recover', 'Recover', 'Death', 'Recover', 'Recover', 'Recover', 'Recover', 'Death', 'Recover', 'Recover', 'Death', 'Recover', 'Death', 'Recover', 'Death', 'Recover', 'Death', 'Recover', 'Recover', 'Recover', 'Death', 'Recover', 'Recover', 'Recover', 'Recover', 'Recover')
par(mfrow=c(1,2))
plot(density(unknown_data$age), main = "Final results of the predictive model", xlab = "Forecast", ylab = "Frequency")
hist(unknown_data$age, main = "Histogram of frequencies", xlab = "Age", ylab = "Frequency", col = "red", border = "black")
death_male = unknown_data[ which(unknown_data$male=='1' & unknown_data$outcome=='Death'),]
death_female = unknown_data[ which(unknown_data$male=='0' & unknown_data$outcome=='Death'),]
death_hospital = unknown_data[ which(unknown_data$hospital=='1' & unknown_data$outcome=='Death'),]
death_not_hospital = unknown_data[ which(unknown_data$hospital=='0' & unknown_data$outcome=='Death'),]
death_jiangsu = unknown_data[ which(unknown_data$Jiangsu=='1' & unknown_data$outcome=='Death'),]
death_zhejiang = unknown_data[ which(unknown_data$Zhejiang =='1' & unknown_data$outcome=='Death'),]
death_shanghai = unknown_data[ which(unknown_data$Shanghai=='1' & unknown_data$outcome=='Death'),]
death_other = unknown_data[ which(unknown_data$Other=='1' & unknown_data$outcome=='Death'),]
death_early_outcome = unknown_data[ which(unknown_data$early_outcome=='1' & unknown_data$outcome=='Death'),]
death_not_early_outcome = unknown_data[ which(unknown_data$early_outcome=='0' & unknown_data$outcome=='Death'),]
recover_male = unknown_data[ which(unknown_data$male=='1' & unknown_data$outcome=='Recover'),]
recover_female = unknown_data[ which(unknown_data$male=='0' & unknown_data$outcome=='Recover'),]
recover_hospital = unknown_data[ which(unknown_data$hospital=='1' & unknown_data$outcome=='Recover'),]
recover_not_hospital = unknown_data[ which(unknown_data$hospital=='0' & unknown_data$outcome=='Recover'),]
recover_jiangsu = unknown_data[ which(unknown_data$Jiangsu=='1' & unknown_data$outcome=='Recover'),]
recover_zhejiang = unknown_data[ which(unknown_data$Zhejiang =='1' & unknown_data$outcome=='Recover'),]
recover_shanghai = unknown_data[ which(unknown_data$Shanghai=='1' & unknown_data$outcome=='Recover'),]
recover_other = unknown_data[ which(unknown_data$Other=='1' & unknown_data$outcome=='Recover'),]
recover_early_outcome = unknown_data[ which(unknown_data$early_outcome=='1' & unknown_data$outcome=='Recover'),]
recover_not_early_outcome = unknown_data[ which(unknown_data$early_outcome=='0' & unknown_data$outcome=='Recover'),]
count(death_early_outcome)
n |
---|
<int> |
9 |
In [33]:
# Pie Chart
x <- c(21, 36) # predicted deaths and recoveries, as reported by summary() above
labels <- c("Deceased","Recovered")
piepercent<- round(100*x/sum(x), 1)
pie(x, labels = piepercent, main = "Deaths & Recoveries comparison",col = rainbow(length(x)))
legend("topright", c("Deceased","Recovered"), cex = 0.8, fill = rainbow(length(x)))
par(mfrow=c(2,2))
hist(death_male$age, main = "Total deaths by male gender", xlab = "Age", ylab = "Frequency", col = "red", border = "black")
hist(death_female$age, main = "Total deaths by female gender", xlab = "Age", ylab = "Frequency", col = "red", border = "black")
hist(recover_male$age, main = "Total recovered by male gender", xlab = "Age", ylab = "Frequency", col = "red", border = "black")
hist(recover_female$age, main = "Total recovered by female gender", xlab = "Age", ylab = "Frequency", col = "red", border = "black")
In [31]:
par(mfrow=c(4,2))
hist(death_jiangsu$age, main = "Total deaths in Jiangsu", xlab = "Age", ylab = "Frequency", col = "red", border = "black")
hist(recover_jiangsu$age, main = "Total recovered in Jiangsu", xlab = "Age", ylab = "Frequency", col = "red", border = "black")
hist(death_shanghai$age, main = "Total deaths in Shanghai", xlab = "Age", ylab = "Frequency", col = "red", border = "black")
hist(recover_shanghai$age, main = "Total recovered in Shanghai", xlab = "Age", ylab = "Frequency", col = "red", border = "black")
hist(death_zhejiang$age, main = "Total deaths in Zhejiang", xlab = "Age", ylab = "Frequency", col = "red", border = "black")
hist(recover_zhejiang$age, main = "Total recovered in Zhejiang", xlab = "Age", ylab = "Frequency", col = "red", border = "black")
hist(death_other$age, main = "Total deaths in other location", xlab = "Age", ylab = "Frequency", col = "red", border = "black")
hist(recover_other$age, main = "Total recovered in other location ", xlab = "Age", ylab = "Frequency", col = "red", border = "black")