Forecasting model for influenza A cases (H7N9) based on Random Forests

The objective of this study is to present a machine learning model based on Random Forests that predicts how many patients who contracted the influenza A (H7N9) virus in China during 2013 die and how many recover. To do this, I used a dataset with information about 134 patients, of whom 31 are known to have died, 46 have recovered and 57 have an unknown prognosis. The model was developed with the R programming language, RStudio, and several libraries, most notably ggplot2 and caret.

Introduction

The influenza A(H7N9) virus belongs to a subgroup of influenza viruses that normally circulate in birds. It is now causing infections in humans, a phenomenon that had not been observed until recently. Information on the scope of the disease caused by this virus and on the source of exposure is scarce. The disease is worrying because it has been serious in most cases. At the moment there are no indications that it can be transmitted from person to person, but transmission routes from animals to people and from person to person are being actively investigated (World Health Organization, 2017).

Data mining is a field of statistics and computer science concerned with discovering patterns in large data sets. It draws on methods from artificial intelligence, machine learning, statistics and database systems. The overall objective of the data mining process is to extract information from a data set and transform it into an understandable structure for later use. Through the predictive methods used in data mining, it is possible to obtain forecasts of viral diseases in defined population groups.

Case study

The general objective of this research is to obtain an optimal model that generates forecasts of mortality and recovery in patients carrying the influenza A (H7N9) virus. The specific objectives are the following:

Carry out an exploratory analysis of the data based on some measures of central tendency (mean, median, and quartiles).
Develop, train, and test a random forest model.

Methodology

Using RStudio, an exploratory data analysis was carried out, based on measures of central tendency (mean, median and quartiles). Measures of central tendency are statistics that summarize a set of values in a single value. They represent the center around which the data set is located (Quevedo, 2011).

Subsequently, models were built and evaluated with the random forest method through the “caret” library of R.
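As a small illustration of these measures, the sketch below computes the mean, median and quartiles of a short vector. The ages are hypothetical values chosen for the example, not rows from the dataset.

```r
# Hypothetical ages, used only to illustrate the summary statistics
ages <- c(2, 15, 20, 27, 34, 39, 46, 58, 61)

age_mean      <- mean(ages)                                  # arithmetic mean
age_median    <- median(ages)                                # middle value
age_quartiles <- quantile(ages, probs = c(0.25, 0.50, 0.75)) # 1st, 2nd, 3rd quartile

print(age_mean)
print(age_median)
print(age_quartiles)
```

The second quartile is the median, so both calls agree on that value; `summary()` reports the same statistics in one line.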

Some of the parameters and metrics reported by the models are the following:

Mtry: the number of predictors randomly selected as candidates at each split when a classification tree is built.
Accuracy: the proportion of new observations for which the model predicts the correct class.
Kappa: a metric that compares the observed accuracy with the expected accuracy (the agreement expected by chance). It is used not only to evaluate a single classifier, but also to compare classifiers with each other. Because it takes into account the agreement that a random classifier would reach, it is usually less misleading than using accuracy alone. Landis and Koch consider values of 0-0.20 as slight, 0.21-0.40 as fair, 0.41-0.60 as moderate, 0.61-0.80 as substantial and 0.81-1 as almost perfect.

Kappa = $\frac{\text{observed accuracy} - \text{expected accuracy}}{1 - \text{expected accuracy}}$
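The formula can be checked by hand. The sketch below uses the four counts from the validation confusion matrix reported later in the Random Forest section; the variable names are my own.

```r
# Counts from the validation confusion matrix (rows = prediction, columns = reference)
cm <- matrix(c(3, 6, 2, 11), nrow = 2,
             dimnames = list(Prediction = c("Death", "Recover"),
                             Reference  = c("Death", "Recover")))

n         <- sum(cm)
observed  <- sum(diag(cm)) / n                      # share of correct predictions
expected  <- sum(rowSums(cm) * colSums(cm)) / n^2   # agreement expected by chance
kappa_hat <- (observed - expected) / (1 - expected)

round(c(accuracy = observed, kappa = kappa_hat), 4)
# accuracy    kappa
#   0.6364   0.1927
```

Both values match the Accuracy and Kappa lines printed by `confusionMatrix()` for the same table.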

The Random Forest method is a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees. It is a substantial modification of bagging that builds a large collection of de-correlated trees and then averages them (Breiman, 2001).
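To make the idea concrete, here is a minimal base R sketch of bagging with random predictor selection. It is heavily simplified: each "tree" is a single-split stump on one randomly chosen predictor, whereas a real random forest grows full trees and samples mtry candidate predictors at every split. The data and the helper functions `fit_stump` and `predict_stump` are hypothetical and are not part of the randomForest or caret packages.

```r
set.seed(42)

# Hypothetical toy data: two numeric predictors and a binary outcome
n  <- 60
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
df$y <- factor(ifelse(df$x1 + 0.5 * df$x2 > 0, "A", "B"))

# Fit a one-split stump: cut one predictor at its median and
# predict the majority class on each side of the cut
fit_stump <- function(data, feature) {
  cut   <- median(data[[feature]])
  left  <- data$y[data[[feature]] <= cut]
  right <- data$y[data[[feature]] >  cut]
  list(feature = feature, cut = cut,
       left  = names(which.max(table(left))),
       right = names(which.max(table(right))))
}

predict_stump <- function(stump, newdata) {
  ifelse(newdata[[stump$feature]] <= stump$cut, stump$left, stump$right)
}

# Grow the ensemble: each stump sees a bootstrap sample
# and one randomly chosen predictor
n_trees <- 25
forest <- lapply(seq_len(n_trees), function(i) {
  boot    <- df[sample(n, replace = TRUE), ]
  feature <- sample(c("x1", "x2"), 1)
  fit_stump(boot, feature)
})

# Aggregate by majority vote across the stumps
votes <- sapply(forest, predict_stump, newdata = df)  # n x n_trees matrix
pred  <- apply(votes, 1, function(v) names(which.max(table(v))))

mean(pred == df$y)  # training accuracy of the ensemble
```

The two sources of randomness (bootstrap rows and random predictors) are what keep the individual trees from being strongly correlated.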

Application

Exploratory analysis of data

The file to analyze is a .csv with the data of 134 H7N9 virus patients, of whom 31 are known to have died, 46 have recovered and 57 have an unknown prognosis. The variables that make up the dataset are the following:

case_id: Patient identifier.
outcome: Prognosis, if known (Recover or Death).
age: Age of the patient.
male: Gender (1 = male, 0 = female).
hospital: Boolean that indicates whether the patient was hospitalized.
days_to_hospital: Number of days elapsed between the onset of the disease and hospitalization.
days_to_outcome: Number of days elapsed between the onset and the end of the disease.
early_outcome: Indicates whether the disease lasted less time than the average in the dataset.
Jiangsu, Shanghai, Zhejiang, Other: Boolean variables that indicate the patient's place of origin.

For this analysis, we start by importing the libraries that we will be using.

In [ ]:

library(dplyr)
library(readr)
library(tidyr)
library(ggplot2)
library(caret)
library(gbm)
library(rpart)
library(rattle)
library(rpart.plot)
library(RColorBrewer)

Here we can see the full dataset.

In [15]:

h7n9 <- read.csv("../input/chinah7n9/h7n9.csv")
h7n9

case_id	outcome	age	male	hospital	days_to_hospital	days_to_outcome	early_outcome	Jiangsu	Other	Shanghai	Zhejiang
<chr>	<chr>	<int>	<int>	<int>	<int>	<int>	<int>	<int>	<int>	<int>	<int>
case_1	Death	58	1	0	4	13	1	0	0	1	0
case_2	Death	7	1	1	4	11	1	0	0	1	0
case_3	Death	11	0	1	10	31	1	0	1	0	0
case_4	NA	18	0	1	8	46	0	1	0	0	0
case_5	Recover	20	0	1	11	57	0	1	0	0	0
case_6	Death	9	0	1	7	36	1	1	0	0	0
case_7	Death	54	1	1	9	20	1	1	0	0	0
case_8	Death	14	1	1	11	20	1	0	0	0	1
case_9	NA	39	1	1	0	18	0	0	0	0	1
case_10	Death	20	1	1	4	6	1	0	0	1	0
case_11	Death	36	1	1	2	6	1	0	0	0	1
case_12	Death	24	0	0	6	7	1	0	0	1	0
case_13	Death	39	0	1	3	12	1	0	0	1	0
case_14	Recover	15	1	0	4	10	1	0	0	1	0
case_15	NA	34	0	0	11	38	1	1	0	0	0
case_16	NA	51	1	0	3	20	0	1	0	0	0
case_17	Death	46	1	1	6	14	1	0	0	1	0
case_18	Recover	38	1	1	4	20	1	0	0	1	0
case_19	Death	31	1	1	5	67	0	0	0	1	0
case_20	Recover	27	1	1	4	22	1	0	1	0	0
case_21	Recover	39	1	1	1	23	1	0	0	1	0
case_22	NA	56	1	1	4	17	0	1	0	0	0
case_23	Recover	5	0	1	0	46	0	1	0	0	0
case_24	Death	36	1	0	6	6	1	0	0	1	0
case_25	Recover	35	1	1	0	35	0	0	0	1	0
case_26	Death	49	1	1	4	11	1	0	0	1	0
case_27	Recover	23	0	1	27	37	1	0	0	0	1
case_28	NA	51	1	0	6	6	1	0	0	0	1
case_29	Recover	48	0	1	4	32	0	0	0	1	0
case_30	Recover	53	0	0	6	23	0	0	0	1	0
⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮
case_107	Death	61	1	0	7	21	0	0	1	0	0
case_108	NA	55	1	0	3	22	0	0	0	0	1
case_109	NA	35	1	0	0	8	0	0	0	0	1
case_110	NA	25	1	1	4	32	0	0	1	0	0
case_111	Death	28	1	0	4	22	0	0	1	0	0
case_112	NA	41	1	0	1	11	0	0	1	0	0
case_113	NA	33	0	0	7	13	1	0	0	0	1
case_114	NA	22	0	0	10	8	0	0	0	0	1
case_115	NA	14	1	0	7	10	0	0	0	0	1
case_116	Recover	37	1	1	5	21	0	0	1	0	0
case_117	Recover	48	0	0	4	28	0	0	1	0	0
case_118	NA	21	1	0	6	57	0	1	0	0	0
case_119	Recover	12	1	0	6	26	0	1	0	0	0
case_120	NA	33	1	0	3	38	0	1	0	0	0
case_121	Death	36	0	1	5	30	0	0	1	0	0
case_122	NA	14	1	0	5	18	0	0	0	0	1
case_123	Death	26	1	1	7	16	0	0	1	0	0
case_124	Recover	52	1	0	7	17	0	0	1	0	0
case_125	Recover	8	0	0	0	20	0	0	1	0	0
case_126	NA	52	1	1	10	31	0	0	1	0	0
case_127	Recover	15	1	0	5	9	0	0	1	0	0
case_128	Recover	30	1	1	7	22	0	0	1	0	0
case_129	Recover	41	1	1	4	28	0	0	1	0	0
case_130	NA	41	1	1	1	20	0	0	1	0	0
case_131	Recover	60	1	1	1	2	1	0	1	0	0
case_132	NA	51	0	1	0	24	0	0	1	0	0
case_133	Recover	32	1	1	0	2	0	0	1	0	0
case_134	Recover	2	1	1	1	7	0	1	0	0	0
case_135	Death	34	0	1	3	32	0	0	1	0	0
case_136	NA	23	0	1	1	13	0	0	1	0	0

Below is a summary of the age variable:

In [16]:

summary(h7n9$age)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   2.00   20.25   34.00   32.46   44.75   61.00

Let’s build some charts to get more insights from the data.

In [17]:

plot(density(h7n9$age), 
     main = "Density histogram",
     xlab = "Age (years old)",
     ylab = "Density")

In [18]:

ggplot(h7n9, aes(age)) + geom_density(aes(fill=outcome), alpha=1/3)

In [19]:

summary(h7n9['age'])
boxplot(h7n9['age'], main = "Distribution of patients by age",
        xlab = "Age",
        ylab = "Patients",
        col = "orange",
        border = "brown",
        horizontal = TRUE,
        notch = TRUE)

      age       
 Min.   : 2.00  
 1st Qu.:20.25  
 Median :34.00  
 Mean   :32.46  
 3rd Qu.:44.75  
 Max.   :61.00

In [20]:

# Rows with an unknown (NA) outcome yield NA ages here, hence the 57 NA's in the summary
summary(h7n9[h7n9$outcome=='Death', 'age'])
boxplot(h7n9[h7n9$outcome=='Death', 'age'], main = "Distribution of deceased patients by age",
        xlab = "Age",
        ylab = "Patients",
        col = "orange",
        border = "brown",
        horizontal = TRUE,
        notch = TRUE)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   7.00   27.50   36.00   37.19   49.00   61.00      57

In [21]:

# Rows with an unknown (NA) outcome yield NA ages here, hence the 57 NA's in the summary
summary(h7n9[h7n9$outcome=='Recover', 'age'])
boxplot(h7n9[h7n9$outcome=='Recover', 'age'], main = "Distribution of recovered patients by age",
        xlab = "Age",
        ylab = "Patients",
        col = "orange",
        border = "brown",
        horizontal = TRUE,
        notch = TRUE)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   2.00   13.25   26.00   26.89   39.75   60.00      57

In [22]:

summary(h7n9[is.na(h7n9$outcome), 'age'])
boxplot(h7n9[is.na(h7n9$outcome), 'age'], main = "Distribution of patients without prognosis by age",
        xlab = "Age",
        ylab = "Patients",
        col = "orange",
        border = "brown",
        horizontal = TRUE,
        notch = TRUE)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   6.00   26.00   35.00   34.39   44.00   56.00

In the distribution of all patients, the minimum age is 2 years, the average age is 32.46 years and the maximum age is 61 years.

In the distribution of deceased patients, the minimum age is 7 years, the average age is 37.19 years and the maximum age is 61 years.

In the distribution of recovered patients, the minimum age is 2 years, the average age is 26.89 years and the maximum age is 60 years.

In the distribution of patients with an unknown prognosis, the minimum age is 6 years, the average age is 34.39 years and the maximum age is 56 years.

We can see that the average age of the deceased patients is higher than that of the recovered patients, which suggests that age is an important variable when building predictive models on this data set. Reaching this conclusion does not require any data transformation, because no parametric analysis method (regression, Student's t, correlation, ANOVA, etc.) is being used.

Before applying the predictive method, the data is divided into training and test groups. The test data consists of the 57 cases with unknown prognosis. The remaining labeled data is split again to validate the model: 70% is kept for building the model and the remaining 30% is used for testing it. This ratio was chosen because it uses most of the data to train the model while leaving a significant proportion for running the tests.

In [23]:

unknown_index <- which(is.na(h7n9$outcome))
unknown_data = h7n9[unknown_index, ]
train_data <- h7n9[-unknown_index, ][,-1]

set.seed(1275)
val_index <- createDataPartition(train_data$outcome, p = 0.7, list=FALSE) # training data indices
val_train_data <- train_data[val_index, ] # training data
val_test_data  <- train_data[-val_index, ] # test data
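The split above is stratified by outcome. The base R sketch below imitates that idea on the class counts given in the text (31 deaths, 46 recoveries). It assumes that ceiling(p * n) rows are drawn within each class; that rounding rule is my assumption about the behaviour, chosen because it reproduces the 55 training and 22 held-out rows reported in the next section, and caret's actual sampling may differ in detail.

```r
set.seed(1275)

# Outcome labels with the class counts reported in the text
outcome <- c(rep("Death", 31), rep("Recover", 46))

# Draw ceiling(p * n) row indices within each class (assumed rounding rule)
p <- 0.7
train_idx <- unlist(lapply(split(seq_along(outcome), outcome), function(idx) {
  sample(idx, size = ceiling(p * length(idx)))
}))

table(outcome[train_idx])    # Death 22, Recover 33 -> 55 training rows
table(outcome[-train_idx])   # Death  9, Recover 13 -> 22 held-out rows
```

Sampling within each class keeps the Death/Recover ratio roughly the same in both parts, which matters with so few cases.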

Random Forest

In repeated cross-validation, this model reaches an accuracy of 75.85% with the following parameters:

mtry = 10
kappa = 0.4873

In the confusion matrix obtained from the predictions on the held-out 30%, 14 predictions are correct and 8 are wrong, which represents an accuracy of 63.64% (kappa = 0.1927). The most important variable for the model is age.

In [24]:

model_rf <- caret::train(outcome ~ .,
                         data = val_train_data,
                         method = "rf",
                         preProcess = NULL,
                         trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, verboseIter = FALSE))

model_rf

Random Forest 

55 samples
10 predictors
 2 classes: 'Death', 'Recover' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 10 times) 
Summary of sample sizes: 49, 50, 50, 49, 49, 50, ... 
Resampling results across tuning parameters:

  mtry  Accuracy   Kappa    
   2    0.6804762  0.2994427
   6    0.7564286  0.4744705
  10    0.7585238  0.4872897

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 10.
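The "Summary of sample sizes" line follows from the resampling scheme. The sketch below assigns 55 rows to 10 folds in base R and derives the training size of each resample. It is unstratified for simplicity, while caret stratifies folds by class, so only the sizes are comparable.

```r
set.seed(1)

n_samples <- 55   # training rows reported in the model summary
k         <- 10   # folds per repeat
repeats   <- 10   # number of repeats

# One repeat: shuffle the fold labels 1..k over the rows
fold_id    <- sample(rep(seq_len(k), length.out = n_samples))
fold_sizes <- as.integer(table(fold_id))

# Each resample trains on every row outside the held-out fold
train_sizes <- n_samples - fold_sizes

sort(unique(train_sizes))   # 49 50
k * repeats                 # 100 resamples in total
```

The accuracy and kappa reported for each mtry value are averages over these 100 resamples.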

In [25]:

confusionMatrix(predict(model_rf, val_test_data), as.factor(val_test_data$outcome))

Confusion Matrix and Statistics

          Reference
Prediction Death Recover
   Death       3       2
   Recover     6      11

               Accuracy : 0.6364         
                 95% CI : (0.4066, 0.828)
    No Information Rate : 0.5909         
    P-Value [Acc > NIR] : 0.4195         

                  Kappa : 0.1927         

 Mcnemar's Test P-Value : 0.2888         

            Sensitivity : 0.3333         
            Specificity : 0.8462         
         Pos Pred Value : 0.6000         
         Neg Pred Value : 0.6471         
             Prevalence : 0.4091         
         Detection Rate : 0.1364         
   Detection Prevalence : 0.2273         
      Balanced Accuracy : 0.5897         

       'Positive' Class : Death
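The class-wise statistics in this output can be recomputed from the four cells of the matrix, with Death as the positive class. The variable names below are my own.

```r
# Cells of the confusion matrix, with Death as the positive class
tp <- 3    # predicted Death,   actually Death
fp <- 2    # predicted Death,   actually Recover
fn <- 6    # predicted Recover, actually Death
tn <- 11   # predicted Recover, actually Recover

sensitivity <- tp / (tp + fn)   # share of actual deaths that were detected
specificity <- tn / (tn + fp)   # share of actual recoveries that were detected
ppv         <- tp / (tp + fp)   # positive predictive value
npv         <- tn / (tn + fn)   # negative predictive value
balanced    <- (sensitivity + specificity) / 2

round(c(sensitivity = sensitivity, specificity = specificity,
        ppv = ppv, npv = npv, balanced_accuracy = balanced), 4)
# sensitivity specificity ppv npv balanced_accuracy
#      0.3333      0.8462 0.6 0.6471          0.5897
```

The low sensitivity shows that the model misses two thirds of the deaths in the validation data, which the overall accuracy alone does not reveal.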

In [26]:

varImp(model_rf, scale=TRUE) # Importance of the variable
varImp(model_rf, scale=TRUE) %>% plot()

rf variable importance

                  Overall
age              100.0000
days_to_outcome   35.4028
days_to_hospital  32.1660
early_outcome      8.8473
Other              4.9845
male               2.6046
hospital           2.5463
Shanghai           0.7959
Zhejiang           0.1249
Jiangsu            0.0000
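With scale=TRUE the importance scores are shown on a 0 to 100 scale, which is why age is exactly 100 and Jiangsu exactly 0. The sketch below shows a min-max rescaling that produces this pattern. The raw scores are hypothetical and the formula is my understanding of the scaling, not output from the fitted model.

```r
# Hypothetical raw importance scores (not taken from the fitted model)
raw <- c(age = 8.4, days_to_outcome = 3.1, days_to_hospital = 2.9, Jiangsu = 0.2)

# Min-max rescaling to the 0-100 range
scaled <- (raw - min(raw)) / (max(raw) - min(raw)) * 100

round(scaled, 2)
```

A scaled value of 0 therefore means "least important among these predictors", not "no contribution at all".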

In [27]:

pred_unknown <- predict(model_rf, newdata = unknown_data)  # Prediction
pred_unknown

# Include results in dataset (assign the result so the new column is kept)
new_h7n9 <- unknown_data %>%
  mutate(outcome = pred_unknown)
new_h7n9

Recover
Recover
Death
Recover
Death
Death
Recover
Recover
Recover
Recover
Death
Death
Death
Recover
Recover
Death
Recover
Death
Death
Death
Recover
Recover
Recover
Recover
Death
Death
Death
Recover
Recover
Death
Recover
Recover
Recover
Death
Recover
Recover
Recover
Recover
Death
Recover
Recover
Death
Recover
Death
Recover
Death
Recover
Death
Recover
Recover
Recover
Death
Recover
Recover
Recover
Recover
Recover

Levels:

‘Death’
‘Recover’

	case_id	outcome	age	male	hospital	days_to_hospital	days_to_outcome	early_outcome	Jiangsu	Other	Shanghai	Zhejiang
	<chr>	<fct>	<int>	<int>	<int>	<int>	<int>	<int>	<int>	<int>	<int>	<int>
4	case_4	Recover	18	0	1	8	46	0	1	0	0	0
9	case_9	Recover	39	1	1	0	18	0	0	0	0	1
15	case_15	Death	34	0	0	11	38	1	1	0	0	0
16	case_16	Recover	51	1	0	3	20	0	1	0	0	0
22	case_22	Death	56	1	1	4	17	0	1	0	0	0
28	case_28	Death	51	1	0	6	6	1	0	0	0	1
31	case_31	Recover	43	1	0	4	21	0	1	0	0	0
32	case_32	Recover	46	1	0	3	20	0	1	0	0	0
38	case_38	Recover	28	1	0	2	7	1	1	0	0	0
39	case_39	Recover	38	1	1	0	18	0	0	0	0	1
40	case_40	Death	46	1	1	5	14	1	0	0	0	1
41	case_41	Death	26	0	1	6	28	0	0	0	0	1
42	case_42	Death	25	1	1	7	38	0	0	0	1	0
47	case_47	Recover	44	1	0	6	16	0	1	0	0	0
48	case_48	Recover	37	1	1	6	17	0	0	0	0	1
52	case_52	Death	36	0	0	10	20	1	0	0	0	1
54	case_54	Recover	47	1	0	0	8	0	0	0	0	1
56	case_56	Death	45	1	1	6	17	1	0	0	1	0
62	case_62	Death	40	0	0	11	8	1	0	0	0	1
63	case_63	Death	33	1	0	7	2	1	0	1	0	0
66	case_66	Recover	44	1	0	0	2	1	1	0	0	0
67	case_67	Recover	28	1	0	2	10	0	0	0	0	1
68	case_68	Recover	29	1	0	0	6	0	0	0	0	1
69	case_69	Recover	35	1	0	0	17	0	0	0	0	1
70	case_70	Death	30	0	0	8	31	0	0	0	0	1
71	case_71	Death	44	0	0	8	8	0	0	0	0	1
78	case_80	Death	46	1	0	6	7	0	0	0	0	1
82	case_84	Recover	6	0	0	7	22	1	1	0	0	0
83	case_85	Recover	52	0	1	7	38	1	0	0	1	0
84	case_86	Death	26	0	0	8	11	1	0	0	0	1
86	case_88	Recover	15	1	0	4	18	1	0	1	0	0
88	case_90	Recover	17	1	1	3	11	0	0	0	0	1
90	case_92	Recover	38	0	1	7	22	0	0	0	0	1
91	case_93	Death	28	1	0	5	7	0	0	0	0	1
93	case_95	Recover	13	1	0	2	22	0	0	0	0	1
94	case_96	Recover	17	1	0	4	22	0	1	0	0	0
97	case_99	Recover	40	0	0	8	17	0	0	0	0	1
98	case_100	Recover	30	1	0	11	17	0	0	0	0	1
99	case_101	Death	51	0	0	11	10	0	0	0	0	1
100	case_102	Recover	53	1	0	0	8	0	0	0	0	1
101	case_103	Recover	40	1	1	6	32	0	1	0	0	0
102	case_104	Death	26	0	0	7	8	0	0	0	0	1
103	case_105	Recover	9	1	0	11	17	1	0	0	0	1
106	case_108	Death	55	1	0	3	22	0	0	0	0	1
107	case_109	Recover	35	1	0	0	8	0	0	0	0	1
108	case_110	Death	25	1	1	4	32	0	0	1	0	0
110	case_112	Recover	41	1	0	1	11	0	0	1	0	0
111	case_113	Death	33	0	0	7	13	1	0	0	0	1
112	case_114	Recover	22	0	0	10	8	0	0	0	0	1
113	case_115	Recover	14	1	0	7	10	0	0	0	0	1
116	case_118	Recover	21	1	0	6	57	0	1	0	0	0
118	case_120	Death	33	1	0	3	38	0	1	0	0	0
120	case_122	Recover	14	1	0	5	18	0	0	0	0	1
124	case_126	Recover	52	1	1	10	31	0	0	1	0	0
128	case_130	Recover	41	1	1	1	20	0	0	1	0	0
130	case_132	Recover	51	0	1	0	24	0	0	1	0	0
134	case_136	Recover	23	0	1	1	13	0	0	1	0	0

In [28]:

summary(predict(model_rf, newdata = unknown_data))

  Death Recover 
     21      36 

Results

From the data obtained, the model predicts that of the 57 patients with unknown prognosis, 21 will die and 36 will recover, which corresponds to a predicted mortality rate of 36.84% for this set of data.

Regarding the patients predicted to die, the data show the following:

52.38% (11 patients) are male and 47.62% (10 patients) are female.
28.57% (6 patients) were hospitalized, while 71.43% (15 patients) were not.
14.29% (3 patients) come from Jiangsu, 66.67% (14 patients) come from Zhejiang, 9.52% (2 patients) come from Shanghai and 9.52% (2 patients) come from other locations.
42.86% (9 patients) had the disease for less time than usual, which indicates that they died more quickly.

Regarding the patients predicted to recover, the data show the following:

77.78% (28 patients) are male and 22.22% (8 patients) are female.
33.33% (12 patients) were hospitalized, while 66.67% (24 patients) were not.
30.56% (11 patients) come from Jiangsu, 50% (18 patients) come from Zhejiang, 2.78% (1 patient) comes from Shanghai and 16.67% (6 patients) come from other locations.
16.67% (6 patients) had the disease for less time than usual, which indicates that they overcame the disease more quickly.

In [29]:

unknown_data$outcome = c('Recover', 'Recover', 'Death', 'Recover', 'Death', 'Death', 'Recover', 'Recover', 'Recover', 'Recover', 'Death', 'Death', 'Death', 'Recover', 
                         'Recover', 'Death', 'Recover', 'Death', 'Death', 'Death', 'Recover', 'Recover', 'Recover', 'Recover', 'Death', 'Death', 'Death', 'Recover', 'Recover', 
                         'Death', 'Recover', 'Recover', 'Recover', 'Death', 'Recover', 'Recover', 'Recover', 'Recover', 'Death', 'Recover', 'Recover', 'Death', 'Recover', 
                         'Death', 'Recover', 'Death', 'Recover', 'Death', 'Recover', 'Recover', 'Recover', 'Death', 'Recover', 'Recover', 'Recover', 'Recover', 'Recover')

par(mfrow=c(1,2))
plot(density(unknown_data$age), main = "Final results of the predictive model",
     xlab = "Age", ylab = "Density")


hist(unknown_data$age, main = "Histogram of frequencies",
     xlab = "Age",
     ylab = "Frequency",
     col = "red",
     border = "black")

death_male = unknown_data[ which(unknown_data$male=='1' & unknown_data$outcome=='Death'),]
death_female = unknown_data[ which(unknown_data$male=='0' & unknown_data$outcome=='Death'),]

death_hospital = unknown_data[ which(unknown_data$hospital=='1' & unknown_data$outcome=='Death'),]
death_not_hospital = unknown_data[ which(unknown_data$hospital=='0' & unknown_data$outcome=='Death'),]

death_jiangsu = unknown_data[ which(unknown_data$Jiangsu=='1' & unknown_data$outcome=='Death'),]
death_zhejiang = unknown_data[ which(unknown_data$Zhejiang =='1' & unknown_data$outcome=='Death'),]
death_shanghai = unknown_data[ which(unknown_data$Shanghai=='1' & unknown_data$outcome=='Death'),]
death_other = unknown_data[ which(unknown_data$Other=='1' & unknown_data$outcome=='Death'),]

death_early_outcome = unknown_data[ which(unknown_data$early_outcome=='1' & unknown_data$outcome=='Death'),]
death_not_early_outcome = unknown_data[ which(unknown_data$early_outcome=='0' & unknown_data$outcome=='Death'),]

recover_male = unknown_data[ which(unknown_data$male=='1' & unknown_data$outcome=='Recover'),]
recover_female = unknown_data[ which(unknown_data$male=='0' & unknown_data$outcome=='Recover'),]

recover_hospital = unknown_data[ which(unknown_data$hospital=='1' & unknown_data$outcome=='Recover'),]
recover_not_hospital = unknown_data[ which(unknown_data$hospital=='0' & unknown_data$outcome=='Recover'),]

recover_jiangsu = unknown_data[ which(unknown_data$Jiangsu=='1' & unknown_data$outcome=='Recover'),]
recover_zhejiang = unknown_data[ which(unknown_data$Zhejiang =='1' & unknown_data$outcome=='Recover'),]
recover_shanghai = unknown_data[ which(unknown_data$Shanghai=='1' & unknown_data$outcome=='Recover'),]
recover_other = unknown_data[ which(unknown_data$Other=='1' & unknown_data$outcome=='Recover'),]

recover_early_outcome = unknown_data[ which(unknown_data$early_outcome=='1' & unknown_data$outcome=='Recover'),]
recover_not_early_outcome = unknown_data[ which(unknown_data$early_outcome=='0' & unknown_data$outcome=='Recover'),]

count(death_early_outcome)

n
<int>
9
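The same counts and shares can be obtained without one subset per group by cross-tabulating outcome against a variable. The sketch below uses a hypothetical mini data frame with the same column names as unknown_data; its values are invented for the example.

```r
# Hypothetical mini data set with the same column names as unknown_data
toy <- data.frame(
  outcome = c("Death", "Death", "Death", "Recover", "Recover", "Recover", "Recover"),
  male    = c(1, 0, 1, 1, 1, 0, 1)
)

# Counts per outcome and gender, then row-wise percentages
counts <- table(outcome = toy$outcome, male = toy$male)
shares <- round(100 * prop.table(counts, margin = 1), 2)

counts
shares
```

Replacing `toy$male` with another column such as hospital or early_outcome gives the corresponding breakdown, and `margin = 1` makes each outcome row sum to 100%.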

In [33]:

# Pie Chart
x <-  c(21, 36)  # predicted deaths and recoveries among the 57 unknown cases
labels <-  c("Deceased","Recovered")
piepercent<- round(100*x/sum(x), 1)
pie(x, labels = piepercent, main = "Deaths & Recoveries comparison",col = rainbow(length(x)))
legend("topright", c("Deceased","Recovered"), cex = 0.8,
       fill = rainbow(length(x)))

par(mfrow=c(2,2))

hist(death_male$age, main = "Total deaths by male gender",
     xlab = "Age",
     ylab = "Frequency",
     col = "red",
     border = "black")

hist(death_female$age, main = "Total deaths by female gender",
     xlab = "Age",
     ylab = "Frequency",
     col = "red",
     border = "black")

hist(recover_male$age, main = "Total recovered by male gender",
     xlab = "Age",
     ylab = "Frequency",
     col = "red",
     border = "black")

hist(recover_female$age, main = "Total recovered by female gender",
     xlab = "Age",
     ylab = "Frequency",
     col = "red",
     border = "black")

In [31]:

par(mfrow=c(4,2))
hist(death_jiangsu$age, main = "Total deaths in Jiangsu",
     xlab = "Age",
     ylab = "Frequency",
     col = "red",
     border = "black")

hist(recover_jiangsu$age, main = "Total recovered in Jiangsu",
     xlab = "Age",
     ylab = "Frequency",
     col = "red",
     border = "black")

hist(death_shanghai$age, main = "Total deaths in Shanghai",
     xlab = "Age",
     ylab = "Frequency",
     col = "red",
     border = "black")

hist(recover_shanghai$age, main = "Total recovered in Shanghai",
     xlab = "Age",
     ylab = "Frequency",
     col = "red",
     border = "black")

hist(death_zhejiang$age, main = "Total deaths in Zhejiang",
     xlab = "Age",
     ylab = "Frequency",
     col = "red",
     border = "black")

hist(recover_zhejiang$age, main = "Total recovered in Zhejiang",
     xlab = "Age",
     ylab = "Frequency",
     col = "red",
     border = "black")

hist(death_other$age, main = "Total deaths in other location",
     xlab = "Age",
     ylab = "Frequency",
     col = "red",
     border = "black")

hist(recover_other$age, main = "Total recovered in other location",
     xlab = "Age",
     ylab = "Frequency",
     col = "red",
     border = "black")

Sergio Alves
