Influenza activity prediction using meteorological factors in a warm temperate to subtropical transitional zone, Eastern China

Wendong Liu; Qigang Dai; Jing Bao; Wenqi Shen; Ying Wu; Yingying Shi; Ke Xu; Jianli Hu; Changjun Bao; Xiang Huo

doi:10.1017/S0950268819002140

Published online by Cambridge University Press: 20 December 2019

Wendong Liu*: Jiangsu Provincial Center for Disease Control and Prevention, Nanjing, China
Qigang Dai: Jiangsu Provincial Center for Disease Control and Prevention, Nanjing, China
Jing Bao: Jiangsu Meteorological Service Center, Nanjing, China
Wenqi Shen: Jiangsu Provincial Center for Disease Control and Prevention, Nanjing, China
Ying Wu: Jiangsu Provincial Center for Disease Control and Prevention, Nanjing, China
Yingying Shi: Jiangsu Provincial Center for Disease Control and Prevention, Nanjing, China
Ke Xu: Jiangsu Provincial Center for Disease Control and Prevention, Nanjing, China
Jianli Hu: Jiangsu Provincial Center for Disease Control and Prevention, Nanjing, China
Changjun Bao: Jiangsu Provincial Center for Disease Control and Prevention, Nanjing, China
Xiang Huo*: Jiangsu Provincial Center for Disease Control and Prevention, Nanjing, China
*: Author for correspondence: Wendong Liu, E-mail: jscdclwd@sina.cn; Xiang Huo, E-mail: Huox@foxmail.com

Abstract

Influenza activity is subject to environmental factors. Accurate forecasting of influenza epidemics would permit timely and effective implementation of public health interventions, but it remains challenging. In this study, we aimed to develop random forest (RF) regression models including meteorological factors to predict seasonal influenza activity in Jiangsu province, China. The coefficient of determination (R²) and mean absolute percentage error (MAPE) were employed to evaluate the models' performance. Three RF models with optimum parameters were constructed to predict influenza-like illness (ILI) activity and the positive rates of influenza A and B (Flu-A and Flu-B) in Jiangsu. The models for Flu-B and ILI performed excellently, with MAPEs <10%. The predicted values of the Flu-A model also matched the observed trend very well, although its MAPE reached 19.49% in the test set. The lagged dependent variables were vital predictors in each model. Seasonality was more pronounced in the models for ILI and Flu-A. The effects of the meteorological factors and their lagged terms on prediction accuracy differed across the three models, while temperature always played an important role. Notably, atmospheric pressure made a major contribution to ILI and Flu-B forecasting. In brief, RF models performed well in influenza activity prediction. The impacts of meteorological factors on the predictive models for influenza activity are type-specific.

Keywords

Forecast; influenza activity; meteorological factor; random forest model

Type: Original Paper
Information: Epidemiology & Infection, Volume 147, 2019, e325

DOI: https://doi.org/10.1017/S0950268819002140
Creative Commons: This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright: Copyright © Jiangsu Provincial Center for Disease Control and Prevention 2019

Introduction

Seasonal influenza has always been a major public health problem [1, 2]. Each year it causes tens of millions of respiratory illnesses and hundreds of thousands of deaths worldwide [3]. An accurate advance forecast of influenza activity based on predictive models is crucial for public health authorities to anticipate seasonal fluctuations and to facilitate key response actions [4, 5], such as public health surveillance, deployment of emergency supplies and hospital resource management. However, accurate prediction remains a great challenge. A number of statistical approaches have been employed and evaluated. In the context of animal influenza activity prediction, the random forest (RF) regression model has been suggested to predict better than the autoregressive integrated moving average (ARIMA) and generalized linear autoregressive moving average (GLARMA) time series models [6, 7]. It also performed better than boosted regression trees and conventional and penalised logistic regression in identifying independent factors associated with H1N1pdm influenza infections [8].

Meteorology plays an important role in the varied seasonal patterns of influenza in temperate, subtropical and tropical regions. Influenza activity has been reported to peak during rainy seasons in tropical climates and during the dry, cold months of winter in temperate climates. The impact of climate conditions on influenza A and B may differ [9].

Influenza-like illness (ILI) has been commonly used as the index of influenza activity worldwide [3–5]. However, a number of other respiratory pathogens, including parainfluenza virus, adenovirus and rhinovirus, can cause ILI and thus influence fluctuations in ILI activity [10]. Recently, the positive rate of influenza virus in surveillance samples has been considered a more reliable indicator of influenza activity [11, 12].

Jiangsu Province is situated on the central eastern coast of China, in the transitional zone between the warm temperate and subtropical zones. Research conducted in this region could deliver a more comprehensive understanding of the impact of climate on influenza activity. In this study, we aimed to develop RF models to predict ILI activity and the positive rates of Flu-A and Flu-B separately, an approach that has been rare in published studies.

Materials and methods

Data sources

Surveillance of ILI and influenza virus in China is conducted through a national sentinel network [13], with sentinel sites covering 2.5% of all hospitals across the country. Data for patients meeting the definition of ILI (i.e. body temperature ⩾38 °C with a cough and/or a sore throat) are reported to the China Influenza Surveillance Information System (CISIS) on a weekly basis. In Jiangsu, each sentinel site collects at least 20 nasopharyngeal swabs per week by convenience sampling of ILI cases before antiviral therapy. These specimens are routinely tested for influenza virus subtypes using a real-time fluorescent quantitative polymerase chain reaction (PCR) assay and the results are reported to CISIS within 48 h.

In this study, weekly data on the ILI percentage in outpatients (ILI%) and the influenza virus positive rate in Jiangsu during 2011–2016 were obtained from CISIS. Daily meteorological data were downloaded from the China Meteorological Data Sharing Service System (http://cdc.cma.gov.cn) and aggregated into weekly data. The meteorological variables were precipitation (PR), sunshine duration (SD), relative humidity (RH), atmospheric pressure (AP), minimum temperature (MIN_T), mean temperature (MEAN_T) and maximum temperature (MAX_T).
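The text does not say how the daily values were aggregated. The following Python sketch shows one plausible approach, assuming ISO calendar weeks, weekly totals for PR and SD and weekly means for the other variables. The function name `aggregate_weekly`, the `SUM_VARS` set and the demo values are all illustrative assumptions, not the authors' procedure.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

# Assumption: accumulating quantities are summed, everything else is averaged.
SUM_VARS = {"PR", "SD"}

def aggregate_weekly(daily_records):
    """Group daily records by ISO (year, week) and aggregate each variable.

    daily_records: iterable of (date, {variable_name: value}) pairs.
    Returns {(iso_year, iso_week): {variable_name: weekly_value}}.
    """
    buckets = defaultdict(lambda: defaultdict(list))
    for day, values in daily_records:
        iso = day.isocalendar()
        key = (iso[0], iso[1])  # ISO year and ISO week number
        for name, value in values.items():
            buckets[key][name].append(value)
    return {
        key: {
            name: (sum(vals) if name in SUM_VARS else mean(vals))
            for name, vals in series.items()
        }
        for key, series in buckets.items()
    }

# Hypothetical daily values for Monday 4 January to Sunday 10 January 2016.
demo = [(date(2016, 1, 4 + i), {"PR": 1.0, "MEAN_T": float(i)}) for i in range(7)]
weekly_demo = aggregate_weekly(demo)
```

Whatever week definition is used for the weather data should match the surveillance weeks reported by CISIS, otherwise lagged terms would be shifted by a few days.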

Random forest

RF is an ensemble machine learning method proposed by Breiman [14]. RF creates multiple classification and regression trees, each trained on a bootstrap sample of the original training data, with a randomly selected subset of input variables considered at each split. There are two parameters to choose when running an RF algorithm: the number of trees and the number of randomly selected variables. In regression, the tree predictor takes numerical values, as opposed to the class labels used by the RF classifier. RF regression models take the average of the outputs produced by the trees as the final prediction.
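To make the mechanics concrete, here is a minimal pure-Python sketch of RF regression: bootstrap sampling, a random subset of `mtry` variables at each split, and averaging over `ntree` trees. It is a toy illustration under simplifying assumptions (exhaustive threshold search, a hypothetical `min_split` stopping rule, made-up data), not the `randomForest` implementation used in the study; the parameter names `ntree` and `mtry` are borrowed from that package.

```python
import random
from statistics import mean

def best_split(rows, targets, features):
    """Return the (feature, threshold) pair with the lowest squared error, or None."""
    best, best_sse = None, float("inf")
    for f in features:
        # Dropping the largest value keeps both sides of every split non-empty.
        for threshold in sorted({r[f] for r in rows})[:-1]:
            left = [t for r, t in zip(rows, targets) if r[f] <= threshold]
            right = [t for r, t in zip(rows, targets) if r[f] > threshold]
            m_left, m_right = mean(left), mean(right)
            sse = (sum((t - m_left) ** 2 for t in left)
                   + sum((t - m_right) ** 2 for t in right))
            if sse < best_sse:
                best, best_sse = (f, threshold), sse
    return best

def grow_tree(rows, targets, mtry, rng, min_split=5):
    """Grow a regression tree; every split looks at `mtry` randomly chosen features."""
    if len(rows) < min_split or len(set(targets)) == 1:
        return mean(targets)  # leaf: average of the training targets that reach it
    split = best_split(rows, targets, rng.sample(range(len(rows[0])), mtry))
    if split is None:  # all sampled features are constant in this node
        return mean(targets)
    f, threshold = split
    go_left = [r[f] <= threshold for r in rows]
    left = grow_tree([r for r, g in zip(rows, go_left) if g],
                     [t for t, g in zip(targets, go_left) if g], mtry, rng, min_split)
    right = grow_tree([r for r, g in zip(rows, go_left) if not g],
                      [t for t, g in zip(targets, go_left) if not g], mtry, rng, min_split)
    return (f, threshold, left, right)

def predict_tree(tree, row):
    while isinstance(tree, tuple):  # internal nodes are tuples, leaves are numbers
        f, threshold, left, right = tree
        tree = left if row[f] <= threshold else right
    return tree

def fit_forest(rows, targets, ntree=25, mtry=2, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(ntree):
        idx = [rng.randrange(len(rows)) for _ in rows]  # bootstrap: draw with replacement
        forest.append(grow_tree([rows[i] for i in idx], [targets[i] for i in idx], mtry, rng))
    return forest

def predict_forest(forest, row):
    return mean([predict_tree(tree, row) for tree in forest])  # average over trees

# Hypothetical toy data: columns 0 and 1 carry the signal, column 2 is noise.
rows = [[float(i), 0.5 * i + 1.0, float(i % 7)] for i in range(40)]
targets = [1.0 if i < 20 else 5.0 for i in range(40)]
forest = fit_forest(rows, targets)
low = predict_forest(forest, [5.0, 3.5, 2.0])
high = predict_forest(forest, [35.0, 18.5, 2.0])
```

On this step-shaped toy target, `low` and `high` should come out close to 1 and 5. In practice the two tuning parameters would be chosen by comparing prediction error across candidate values.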

One of the most important features of RF is its ability to calculate variable importance, which measures the association between a given variable and the prediction accuracy. The RF regression approach used in this study assesses variable importance by the decrease in accuracy. Because previous studies have reported good prediction capacity for RF, we applied it to human seasonal influenza activity and tested its performance both in forecasting and in identifying independent influencing factors.
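The decrease-in-accuracy idea can be sketched as follows: shuffle one predictor at a time and record how much the mean squared error rises. This Python sketch works on whatever rows are passed in, whereas the `randomForest` package computes the measure on out-of-bag samples; the function names and the toy model are assumptions made for illustration.

```python
import random

def mse(predict, rows, targets):
    """Mean squared error of `predict` over the given rows."""
    return sum((predict(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)

def permutation_importance(predict, rows, targets, n_repeats=20, seed=0):
    """Mean increase in MSE when each column is shuffled, one column at a time."""
    rng = random.Random(seed)
    baseline = mse(predict, rows, targets)
    importances = []
    for col in range(len(rows[0])):
        increases = []
        for _ in range(n_repeats):
            shuffled = [r[col] for r in rows]
            rng.shuffle(shuffled)  # break the link between this column and the target
            permuted = [r[:col] + [v] + r[col + 1:] for r, v in zip(rows, shuffled)]
            increases.append(mse(predict, permuted, targets) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances

# Hypothetical fitted model that depends only on column 0.
rows = [[float(i), float(i % 3)] for i in range(30)]
targets = [2.0 * r[0] for r in rows]
scores = permutation_importance(lambda r: 2.0 * r[0], rows, targets)
```

A variable the model never uses, such as column 1 here, gets an importance of zero, while shuffling a variable the model relies on raises the error.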

Model evaluation

Data from 2011–2015 were used as the training set to fit the RF models, and data from 2016 were reserved as the test set to evaluate predictive accuracy. The coefficient of determination (R²) and mean absolute percentage error (MAPE) were employed to evaluate the models' performance in both the model fitting stage and the prospective forecasting stage. They were calculated as follows:

$$R^2 = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

$$\mathrm{MAPE} = \frac{1}{n}\sum_{i=1}^{n} \frac{|\hat{y}_i - y_i|}{y_i} \times 100\%$$

where $y_i$ is the $i$th observation, $\hat y_i$ is the $i$th prediction, $\bar y$ is the mean of the observations and $n$ is the number of observations.
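As a worked illustration, the two metrics can be computed directly from their definitions. This Python sketch follows the formulas exactly as written here; the function names and the four data points are made up for the example.

```python
def r_squared(observed, predicted):
    """R² as defined in the text: explained sum of squares over total sum of squares."""
    mean_obs = sum(observed) / len(observed)
    explained = sum((p - mean_obs) ** 2 for p in predicted)
    total = sum((o - mean_obs) ** 2 for o in observed)
    return explained / total

def mape(observed, predicted):
    """Mean absolute percentage error, in percent. Observations must be non-zero."""
    errors = [abs(p - o) / o for o, p in zip(observed, predicted)]
    return 100.0 * sum(errors) / len(errors)

# Hypothetical weekly values.
observed = [4.0, 5.0, 2.0, 5.0]
predicted = [4.4, 4.5, 2.2, 5.0]
r2_demo = r_squared(observed, predicted)   # 4.65 / 6 = 0.775
mape_demo = mape(observed, predicted)      # (10% + 10% + 10% + 0%) / 4 = 7.5%
```

Two properties are worth keeping in mind. MAPE is undefined for weeks with an observed value of zero, which matters for positive rates in low season. The explained-over-total form of R² equals one minus the residual share only for least-squares fits with an intercept, so for other models it can exceed 1.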

Statistical analysis

Descriptive statistics were used to illustrate the temporal patterns of ILI% and the influenza virus positive rates. Time series analysis methods were employed to identify the autoregressive order [15] of the dependent variables (i.e. ILI%, positive rate of Flu-A and positive rate of Flu-B). Cross-correlation measures the association between one time series and another at different lags [16] and is essentially a univariate correlation method. In this study, cross-correlation was used to determine the lag of each climate variable that was most significantly associated with the dependent variables. All analyses were completed using R version 3.5.0. Cross-correlation analyses used the R package ‘TSA’, and RF model fitting and forecasting used the R package ‘randomForest’ [17].
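The lag-selection step can be illustrated with a small Python sketch that correlates the dependent variable at week t with a meteorological variable at week t − k and picks the lag with the largest absolute correlation. It is a simplified stand-in for the ‘TSA’ routines: it uses plain Pearson correlation on the overlapping weeks, no prewhitening and no significance test, and the series are synthetic.

```python
import random
from math import sqrt

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

def lagged_correlations(driver, response, max_lag=4):
    """Correlate response[t] with driver[t - lag] for lag = 0..max_lag."""
    return {
        lag: pearson(driver[: len(driver) - lag], response[lag:])
        for lag in range(max_lag + 1)
    }

def strongest_lag(correlations):
    """Lag whose correlation has the largest absolute value."""
    return max(correlations, key=lambda lag: abs(correlations[lag]))

# Synthetic example: the response simply repeats the driver two weeks later.
rng = random.Random(0)
driver = [rng.gauss(0.0, 1.0) for _ in range(60)]
response = [0.0, 0.0] + driver[:-2]
correlations = lagged_correlations(driver, response)
best = strongest_lag(correlations)
```

Here the two-week delay built into the synthetic data is recovered as the strongest lag. With real, strongly seasonal series, neighbouring lags tend to show similar correlations, so the choice is less clear-cut.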

Results

General description

More than 2 million ILI cases were reported to CISIS from the sentinel sites in Jiangsu province during the study period, with an average weekly ILI% of 3.92%. A total of 146 236 throat swabs were sampled from the ILI cases. Influenza viruses were detected in 16 197 swabs by real-time RT-PCR, giving an overall positive rate of 11.08%. According to the typing results, Flu-A and Flu-B accounted for 64.27% and 35.73% of all influenza-positive samples, with average positive rates of 7.12% and 3.96%, respectively. Two peaks were observed each year in ILI activity and in the positive rate of Flu-A, one in winter and the other in summer, whereas the positive rate of Flu-B showed only a winter peak (Fig. 1). The features of the meteorological variables are summarised in Table 1.

Fig. 1. Temporal patterns of ILI activity and influenza virus positive rates in Jiangsu province, 2011–2016.

Table 1. Summary of weekly meteorological variables in Jiangsu province, 2011–2016

Correlation analysis

As shown in Table 2, AP and PR were significantly correlated with ILI% at lags 0–4. MEAN_T and MIN_T were also correlated with ILI%, but with no lag effect. MAX_T, RH and SD showed no relationship with ILI%. For Flu-A, AP showed correlations at lags 0–3 and the three temperature variables at lags 0–4. All the meteorological factors were significantly correlated with Flu-B at lags 0–4. The results of the autocorrelation analysis are displayed in Fig. 2. ILI% presented autocorrelation at lag 1, while both Flu-A and Flu-B presented autocorrelation at lag 3.

Fig. 2. Partial autocorrelation function of time series ILI percentage, positive rate of Flu A and positive rate of Flu B.

Table 2. Cross correlation between dependent variable and meteorological factors

*statistically significant at 0.05.

RF model fitting and forecasting

Three RF models with optimum parameters were finally constructed to predict ILI activity and the Flu-A and Flu-B positive rates in Jiangsu province, with 13, 23 and 39 predictors, respectively (see Table 3). The dependent variable of the Flu-A model was natural-log transformed before model fitting.
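The predictor sets combine lagged values of the dependent variable, lagged meteorological terms and a time variable. The following Python sketch shows how such a design matrix and the 2011–2015 versus 2016 split might be assembled. The week-of-year form of the time variable, the function names and the toy series are assumptions; the exact predictors are those listed in Table 3.

```python
def build_lagged_rows(target, drivers, target_lags, driver_lags, week_of_year):
    """Return (rows, y, index): one row of lagged predictors per usable week t.

    target: weekly values of the dependent variable.
    drivers: {name: weekly values}, each the same length as target.
    target_lags: lags of the dependent variable to include, e.g. [1].
    driver_lags: {name: [lags]} for each meteorological variable.
    week_of_year: week numbers, used here as the (assumed) time variable.
    """
    all_lags = list(target_lags) + [l for lags in driver_lags.values() for l in lags]
    max_lag = max(all_lags)
    rows, y, index = [], [], []
    for t in range(max_lag, len(target)):  # skip weeks without a full lag history
        row = [float(week_of_year[t])]
        row += [target[t - lag] for lag in target_lags]
        for name in sorted(driver_lags):
            row += [drivers[name][t - lag] for lag in driver_lags[name]]
        rows.append(row)
        y.append(target[t])
        index.append(t)
    return rows, y, index

def split_by_year(rows, y, index, years, last_train_year=2015):
    """Split (row, target) pairs into training and test sets by calendar year."""
    pairs = list(zip(rows, y, index))
    train = [(r, v) for r, v, t in pairs if years[t] <= last_train_year]
    test = [(r, v) for r, v, t in pairs if years[t] > last_train_year]
    return train, test

# Hypothetical ten-week series spanning the end of 2015 and the start of 2016.
target = [float(i) for i in range(10)]
drivers = {"AP": [100.0 + i for i in range(10)]}
years = [2015] * 6 + [2016] * 4
rows, y, index = build_lagged_rows(target, drivers, [1], {"AP": [0, 1]}, list(range(1, 11)))
train, test = split_by_year(rows, y, index, years)
```

Because every row uses only values from week t or earlier, the first test weeks draw their lagged terms from the end of the training period, which mirrors how a prospective forecast would be produced.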

Table 3. Predictors in different models

The performance of the models is summarised in Table 4 and the prediction results are displayed in Fig. 3. The models for Flu-B and ILI% performed excellently in both the model fitting stage and the prospective forecasting stage, with MAPEs below 10%. The model for Flu-A performed considerably worse than the other two, with a MAPE of up to 19.49% in the test set. Nevertheless, its predicted values matched the observed trend very well.

Table 4. Performance evaluation of different random forest models

Fig. 3. Plot of observed and predicted values via different models.

Variable importance

In each model, the lagged dependent variable was the most important of all predictors. The time variable was important in the models for ILI and Flu-A. Most of the meteorological factors and their lagged terms improved the accuracy of the models to some degree, but their effects differed across the three models. For ILI forecasting, the weekly MEAN_T, AP and first-order lagged AP were more important than the rest. For Flu-A, the lagged temperature variables were relatively important. For Flu-B, the lagged AP and MAX_T contributed more to model accuracy than the other meteorological variables (see Fig. 4).

Fig. 4. Variable importance in random forest regression models (just displaying the top 10 variables).

Discussion

Forecasting of influenza activity in human populations is crucial for influenza prevention and control [4]. Many methods have been introduced for this purpose. As a conventional univariate model, the ARIMA technique has been commonly used to forecast seasonal influenza from surveillance data at national, regional and local levels [18–20]. The ARIMA model is essentially a linear method. It can achieve good prediction when the variation contained in the data is relatively stable. In practice, however, the long-term trend and seasonality of influenza activity change over time, so the ARIMA model cannot always reach a satisfactory result.

Numerous studies have proposed that influenza activity is climate-sensitive [21–23]. Climatic factors may influence the survival and spread of influenza viruses in the environment, host susceptibility and the probability of exposure [24–26]. The effects of meteorological factors on epidemics of ILI have attracted considerable interest recently. Chadsuthi et al. [27] fitted an ARIMA model with temperature and RH as covariates to forecast the incidence of influenza in Thailand. N'Gattia et al. [28] also developed an ARIMA model with rainfall as a meteorological variable to forecast influenza transmission. However, the prediction accuracy of these models was not good enough and the climate variables did not clearly improve them. In this study, we used the RF algorithm to fit models that predict influenza activity from meteorological factors in Jiangsu province, China. In contrast with previous studies, we constructed predictive models not only for ILI but also for the positive rates of influenza virus (i.e. Flu-A and Flu-B). All the models performed very well on our dataset. Based on them, future influenza activity can be evaluated comprehensively and systematically, which has practical value for influenza prevention and control. Given the good performance of RF in influenza prediction, the models we established could be used for influenza (sub)type-specific early warning and to prompt early intervention. The key meteorological factors identified could be used in public communication, to raise the general population's awareness of and engagement in influenza prevention.

Like many other members of the machine learning family, such as artificial neural networks, the RF model cannot explain the association between risk factors and influenza activity. However, RF can assess the importance of each variable for the accuracy of prediction [14, 29], which is essential for optimising the model and may provide clues for further study of influenza risk factors. In this study, we found that the lagged dependent variables (i.e. the proportion of ILI in outpatients and the positive rates of Flu-A and Flu-B in the previous weeks) were more important than meteorological factors in the models. This suggests that the models took advantage of the autocorrelation of the dependent variables. Influenza activity in Jiangsu province showed obvious seasonality, which is a critical feature for fitting a predictive model. However, RF cannot learn seasonal patterns by itself, because each tree is trained on randomly selected samples. We therefore introduced a time variable into the models to fit the seasonal variation of ILI and of the positive rates of influenza viruses. The importance analysis shows that it played a significant role in improving the models. This strategy is worth considering when fitting similar RF models. Compared with other multivariate predictive methods [27, 28], RF is not subject to multicollinearity, mainly because variables are randomly selected for each tree [29, 30]. In this study, we selected predictor variables through cross-correlation analysis. The meteorological factors and their lagged terms were incorporated into the models whenever they were significantly correlated with the dependent variables. All of them showed some degree of importance, which suggests that the RF models comprehensively combined the climatic variables and their hysteresis effects.

Furthermore, the importance of the meteorological factors differed across the three models, which may suggest that the influence of meteorological factors differs between ILI, Flu-A and Flu-B. The causes of this difference and its practical significance for influenza surveillance deserve further study.

In this study, humidity and PR were not identified as major meteorological factors related to ILI activity or the positive rates of Flu-A and Flu-B, whereas temperature was identified as the main driver. This is consistent with our previous study [31]. The present study also indicates that AP plays an important role in the activity of ILI and Flu-B. An increased influenza risk associated with rising AP was also reported in another subtropical region of China, using a distributed lag nonlinear model [32].

Our study suggests that the selected meteorological variables contributed less to the fluctuations of ILI, Flu-A and Flu-B than autocorrelation did: the lagged dependent variable was the most important predictor in each model. Monamele et al. also suggested that meteorological parameters could explain no more than 30% of the variation in influenza activity [33]. Although our RF models showed desirable predictive ability, especially for ILI and Flu-B, additional factors, such as specific humidity, absolute humidity and population-specific immunity levels [8], should be evaluated to improve the prediction of type/subtype-specific activity [34].

Conclusion

The RF model is a good method for predicting influenza activity. The three RF models constructed to predict the positive rates of influenza viruses and ILI incidence performed very well. The autocorrelation and seasonal variation contained in the dependent variables are crucial for the prediction models. In addition, the effects of meteorological factors and their cumulative effects over a period of time were combined to improve the models. Further research is warranted to explore RF models with meteorological factors as well as other variables, and the approach has the potential to be a useful tool for predicting other major infectious diseases.

Acknowledgements

This work was supported by Jiangsu Provincial Major Science & Technology Demonstration Project (No. BE2017749), Jiangsu Provincial natural science foundation (No. BK20151595), Jiangsu Provincial Medical Youth Talent (No. QNRC2016542, QNRC2016539), Preventive medicine research program (Y2018074) and Key Medical Discipline of Epidemiology (No. ZDXK A2016008).

Author contributions

WL, HX conceived and designed the experiments. WL, HX, JH, CB performed the experiments. WL, JB, KX, QD analysed the data. WL, YW, YS, WS contributed materials/analysis tools. WL, XH wrote the paper. All authors read and approved the final manuscript.

Ethical standards

According to the National Health Commission of China, infectious diseases surveillance was exempt from institutional review board assessment. The dataset was anonymised in the national reporting system (CISIS) except for individuals with special access and was anonymised before data analyses.

References

1. Barnett, R (2019) Influenza. Lancet 393, 396.

2. Iuliano, AD et al. (2018) Estimates of global seasonal influenza-associated respiratory mortality: a modelling study. Lancet 391, 1285–1300.

3. Ginsberg, J et al. (2009) Detecting influenza epidemics using search engine query data. Nature 457, 1012–1014.

4. Chretien, JP et al. (2014) Influenza forecasting in human populations: a scoping review. PLoS ONE 9, e94130.

5. Axelsen, JB et al. (2014) Multiannual forecasting of seasonal influenza dynamics reveals climatic and evolutionary drivers. PNAS 111, 9538–9542.

6. Petukhova, T et al. (2018) Assessment of autoregressive integrated moving average (ARIMA), generalized linear autoregressive moving average (GLARMA), and random forest (RF) time series regression models for predicting influenza A virus frequency in swine in Ontario, Canada. PLoS ONE 13, e0198313.

7. Kane, MJ et al. (2014) Comparison of ARIMA and random forest time series models for prediction of avian influenza H5N1 outbreaks. BMC Bioinformatics 15, 276.

8. Mansiaux, Y and Carrat, F (2014) Detection of independent associations in a large epidemiologic dataset: a comparison of random forests, boosted regression trees, conventional and penalized logistic regression for identifying independent factors associated with H1N1pdm influenza infections. BMC Medical Research Methodology 14, 99.

9. Iha, Y et al. (2016) Comparative epidemiology of influenza A and B viral infection in a subtropical region: a 7-year surveillance in Okinawa, Japan. BMC Infectious Diseases 16, 650.

10. Huo, X et al. (2012) Surveillance of 16 respiratory viruses in patients with influenza-like illness in Nanjing, China. Journal of Medical Virology 84, 1980–1984.

11. Shaman, J et al. (2013) Real-time influenza forecasts during the 2012–2013 season. Nature Communications 4, 2837.

12. Nishiura, H (2011) Real-time forecasting of an epidemic using a discrete time stochastic model: a case study of pandemic influenza (H1N1-2009). BioMedical Engineering OnLine 10, 15.

13. Yu, H et al. (2013) Characterization of regional influenza seasonality patterns in China and implications for vaccination strategies: spatio-temporal modeling of surveillance data. PLoS Medicine 10, e1001552.

14. Breiman, L (2001) Random forests. Machine Learning 45, 5–32.

15. Box, GEP et al. (2008) Time Series Analysis: Forecasting and Control. Hoboken, New Jersey: John Wiley & Sons.

16. Wang, F, Wang, L and Chen, Y (2017) Detecting PM2.5's correlations between neighboring cities using a time-lagged cross-correlation coefficient. Scientific Reports 7, 10109.

17. Liaw, A and Wiener, M (2002) Classification and regression by randomForest. R News 2, 5.

18. Paul, S et al. (2017) Modeling and forecasting influenza-like illness (ILI) in Houston, Texas using three surveillance data capture mechanisms. Online Journal of Public Health Informatics 9, e187.

19. Song, X et al. (2016) Time series analysis of influenza incidence in Chinese provinces from 2004 to 2011. Medicine (Baltimore) 95, e3929.

20. Wang, C et al. (2017) Epidemiological features and forecast model analysis for the morbidity of influenza in Ningbo, China, 2006–2014. International Journal of Environmental Research and Public Health 14, 559.

21. Tamerius, J et al. (2011) Global influenza seasonality: reconciling patterns across temperate and tropical regions. Environmental Health Perspectives 119, 439–445.

22. Jaakkola, K et al. (2014) Decline in temperature and humidity increases the occurrence of influenza in cold climate. Environmental Health 13, 22.

23. Shaman, J et al. (2010) Absolute humidity and the seasonal onset of influenza in the continental United States. PLOS Biology 8, e1000316.

24. Hemmes, JH, Winkler, KC and Kool, SM (1960) Virus survival as a seasonal factor in influenza and poliomyelitis. Nature 188, 430–431.

25. Lofgren, E et al. (2007) Influenza seasonality: underlying causes and modeling theories. Journal of Virology 81, 5429–5436.

26. Lipsitch, M and Viboud, C (2009) Influenza seasonality: lifting the fog. PNAS 106, 3645–3646.

27. Chadsuthi, S et al. (2015) Modeling seasonal influenza transmission and its association with climate factors in Thailand using time-series and ARIMAX analyses. Computational and Mathematical Methods in Medicine 2015, 436495.

28. N'Gattia, AK et al. (2016) Effects of climatological parameters in modeling and forecasting seasonal influenza transmission in Abidjan, Cote d'Ivoire. BMC Public Health 16, 972.

29. Ong, J et al. (2018) Mapping dengue risk in Singapore using random forest. PLOS Neglected Tropical Diseases 12, e0006587.

30. Carvajal, TM et al. (2018) Machine learning methods reveal the temporal pattern of dengue incidence using meteorological factors in metropolitan Manila, Philippines. BMC Infectious Diseases 18, 183.

31. Dai, Q et al. (2018) The effect of ambient temperature on the activity of influenza and influenza like illness in Jiangsu Province, China. Science of The Total Environment 645, 684–691.

32. Guo, Q et al. (2019) The effects of meteorological factors on influenza among children in Guangzhou, China. Influenza and Other Respiratory Viruses 13, 166–175.

33. Monamele, GC et al. (2017) Associations between meteorological parameters and influenza activity in a subtropical country: case of five sentinel sites in Yaounde-Cameroon. PLoS ONE 12, e0186914.

34. Pan, M et al. (2019) Association of meteorological factors with seasonal activity of influenza A subtypes and B lineages in subtropical western China. Epidemiology and Infection 147, e72.