December 4, 2023


Future Depends on What You Do

Health benefits from risk information of air pollution in China

Urban protection data and inference

We designed a questionnaire to collect data on awareness of air pollution and protective behaviors across regions and population groups. After strict quality control (removing samples with obvious logical errors, missing data, or inconsistent addresses), we retained 1072 valid questionnaires (see Supplementary Figs. 2–6 for summary statistics of some important indicators in the questionnaire). This study was approved by the Ethics Committee of the Beijing Institute of Technology (No. 22-1-103). All procedures performed in this study were in accordance with the 1964 Helsinki Declaration and its later amendments or comparable ethical standards. Participants were allowed to fill in the questionnaire only after they understood the purpose of the survey and agreed to the publication of the research results. Online informed consent was obtained from all participants.

The settings of the core variables are as follows:

  • ATTRi: The attention ratio (ATTRi) is the proportion of people in group i (defined by, for example, region, gender, or age) who pay attention to air pollution information. It is computed from all samples in the survey. Each respondent was asked how often they pay attention to air pollution in daily life, with five options ordered from lowest to highest frequency: almost never, occasionally, generally, often, and almost every day. Respondents who answered “often” or above are coded as 1 and all others as 0. Those coded as 1 are considered to pay attention to air pollution information. Aggregating these codes within each group gives the proportion of people in that group who pay attention to air pollution information.

  • MRi, CODRi, and ACRi: These three variables record whether respondents wear masks or cancel outings in air-polluted weather (outside the pandemic period), and whether they have air purification equipment in their workplace and residence. Each “Yes” answer is coded as 1 and any other answer as 0. These variables are likewise used as ratios after group aggregation: the rate at which group i wears masks (MRi), cancels outings (CODRi), and has indoor air purification equipment (ACRi, the average of the workplace and residential equipment rates).

  • ODRi: The outdoor activity ratio. The survey asked for each individual’s average daily hours of outdoor activity outside the pandemic period, from which the proportion of time spent outdoors (ODRi) is calculated for group i.
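The coding and aggregation rules above can be sketched as follows. This is a minimal illustration, not the authors' processing script: the field names (`attention`, `mask`, `cancel_outing`, `purifier_work`, `purifier_home`, `outdoor_hours`), the English option labels, and the choice of a 24-hour day as the denominator for ODR are all assumptions made for the example.

```python
from collections import defaultdict

# Answer options ordered from lowest to highest attention frequency
# (labels are assumed English renderings of the questionnaire options).
FREQUENCY_SCALE = ["almost never", "occasionally", "generally", "often", "almost every day"]
ATTENTIVE_FROM = FREQUENCY_SCALE.index("often")


def is_attentive(answer):
    """Return 1 if the respondent follows air pollution information often or more."""
    return int(FREQUENCY_SCALE.index(answer) >= ATTENTIVE_FROM)


def group_ratios(respondents, group_key):
    """Aggregate individual 0/1 answers into per-group ratios."""
    buckets = defaultdict(list)
    for row in respondents:
        buckets[row[group_key]].append(row)
    result = {}
    for group, rows in buckets.items():
        n = len(rows)
        result[group] = {
            "ATTR": sum(is_attentive(r["attention"]) for r in rows) / n,
            "MR": sum(r["mask"] for r in rows) / n,
            "CODR": sum(r["cancel_outing"] for r in rows) / n,
            # ACR averages the workplace and residential purifier rates.
            "ACR": sum((r["purifier_work"] + r["purifier_home"]) / 2 for r in rows) / n,
            # ODR: mean daily outdoor hours as a share of a 24-hour day (assumed).
            "ODR": sum(r["outdoor_hours"] for r in rows) / n / 24,
        }
    return result


# Three made-up respondents, for illustration only.
respondents = [
    {"region": "north", "attention": "often", "mask": 1, "cancel_outing": 1,
     "purifier_work": 1, "purifier_home": 0, "outdoor_hours": 2.0},
    {"region": "north", "attention": "occasionally", "mask": 0, "cancel_outing": 1,
     "purifier_work": 0, "purifier_home": 0, "outdoor_hours": 4.0},
    {"region": "south", "attention": "almost every day", "mask": 1, "cancel_outing": 0,
     "purifier_work": 1, "purifier_home": 1, "outdoor_hours": 3.0},
]
ratios = group_ratios(respondents, "region")
```

The same function can be called with a different `group_key` (for example a gender or age-band field) to obtain the ratios for other groupings.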

To extrapolate the questionnaire results to all prefecture-level cities, we introduced a transfer learning method (see Supplementary Material S3). The idea of transfer learning is to exploit similarities in data, task type, or models so that models and knowledge learned in an existing field can be applied to a new one. For the problems and data in this paper, the required predictions are calculated as follows:

Step 1 Align provincial statistical characteristic data (source domain) with urban characteristic data (target domain) using the CORAL algorithm [42]:

$$D_{\mathrm{s}}=\left[F_{s1}^{T},F_{s2}^{T}\dots F_{sm}^{T}\right]_{m\times n} \tag{1}$$


$$D_{\mathrm{t}}=\left[F_{t1}^{T},F_{t2}^{T}\dots F_{tm}^{T}\right]_{m\times k} \tag{2}$$


$$C_{s}=\Sigma _{s}+eye\left(m\right) \tag{3}$$


$$C_{t}=\Sigma _{t}+eye\left(m\right) \tag{4}$$




Equations (1) and (2) represent the feature datasets of the source domain and target domain, respectively; \(F_{m}^{T}\) is the mth feature of the dataset. The source domain feature data are provincial statistical data from the China Statistical Yearbook 2020 [43], and the target domain feature data are urban statistical data from the China Urban Statistical Yearbook 2020 [44]. Both domains use the same set of statistical indicators: 18 indicators covering fields such as economics, environment, education, and population structure. Because these indicators differ greatly in magnitude between the city level and the provincial level, we divide each indicator by the total population of the region to obtain its per capita value, so that the feature scales of the source domain and target domain are the same.
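A minimal sketch of the alignment step is given below, assuming the standard CORAL recipe: whiten the source features with the source covariance, then re-colour them with the target covariance. The transformation itself is not shown in the text above, so the last two lines of `coral_align` are an assumption based on the published CORAL algorithm rather than the authors' exact formula. The function names are hypothetical, the data are synthetic, and the arrays are laid out with regions in rows and indicators in columns, which may be the transpose of the layout in the equations.

```python
import numpy as np


def matrix_power_sym(mat, power):
    """Raise a symmetric positive-definite matrix to a real power via eigendecomposition."""
    eigvals, eigvecs = np.linalg.eigh(mat)
    return (eigvecs * eigvals ** power) @ eigvecs.T


def coral_align(source, target):
    """Align source features to the second-order statistics of the target domain.

    source: (n_source, m) array, target: (n_target, m) array.
    Rows are regions; columns are the m per capita indicators.
    """
    m = source.shape[1]
    c_s = np.cov(source, rowvar=False) + np.eye(m)  # C_s = Sigma_s + eye(m)
    c_t = np.cov(target, rowvar=False) + np.eye(m)  # C_t = Sigma_t + eye(m)
    whitened = source @ matrix_power_sym(c_s, -0.5)  # remove source correlations
    return whitened @ matrix_power_sym(c_t, 0.5)     # re-colour with target correlations


# Synthetic stand-ins for provincial (source) and city-level (target) indicators.
rng = np.random.default_rng(0)
source_demo = rng.normal(size=(30, 4))
target_demo = 2.0 * rng.normal(size=(200, 4))
aligned_demo = coral_align(source_demo, target_demo)
```

Adding the identity matrix keeps both covariance matrices full rank, so the inverse square root is defined even when some indicators are nearly collinear.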

Step 2 Train a supervised machine learning model on the transformed source domain data, then use the trained model to predict the city-level variables.

The model architecture is shown in Fig. 2. \(D_{\mathrm{s}}^{new}\) is the transformed source domain feature data used as model input, and each of the five variables is the prediction target of a separate training task. We selected four machine learning models as candidates: random forest, Lasso regression, Ridge regression, and support vector machine. These models are simple and efficient in structure, and their easy-to-use regularization helps limit overfitting. During training, grid search is used to automatically select the best hyperparameters for each task’s model, and fivefold cross-validation is used to verify the accuracy of each model. We then select the best-performing model for each task and use it to predict the corresponding variable for each city from the city-level dataset (\(D_{\mathrm{t}}\)).
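The hyperparameter selection loop can be illustrated with a small, self-contained sketch. To keep it dependency-light it uses only closed-form ridge regression as a stand-in for the four candidate models, with synthetic data; the function names, the grid of `alpha` values, and the use of mean squared error as the score are assumptions for the example, not the authors' configuration.

```python
import numpy as np


def ridge_fit(x, y, alpha):
    """Closed-form ridge regression on centred data; returns (weights, intercept)."""
    x_mean, y_mean = x.mean(axis=0), y.mean()
    xc, yc = x - x_mean, y - y_mean
    w = np.linalg.solve(xc.T @ xc + alpha * np.eye(x.shape[1]), xc.T @ yc)
    return w, y_mean - x_mean @ w


def cv_mse(x, y, alpha, n_folds=5, seed=0):
    """Mean squared error of ridge regression averaged over n_folds held-out folds."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, n_folds)
    errors = []
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        w, b = ridge_fit(x[train], y[train], alpha)
        errors.append(np.mean((x[test] @ w + b - y[test]) ** 2))
    return float(np.mean(errors))


def grid_search(x, y, alphas):
    """Return the alpha with the lowest cross-validated error, plus all scores."""
    scores = {a: cv_mse(x, y, a) for a in alphas}
    return min(scores, key=scores.get), scores


# Synthetic features and target standing in for one of the five tasks.
rng = np.random.default_rng(1)
x = rng.normal(size=(60, 5))
y = x @ np.array([0.5, -0.2, 0.0, 0.3, 0.1]) + 0.05 * rng.normal(size=60)
alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
best_alpha, scores = grid_search(x, y, alphas)
```

Repeating this search for each candidate model family and each of the five targets, then keeping the family with the lowest cross-validated error per target, gives the per-task model selection described above.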

Figure 2

Model training and prediction process. RF, random forest; SVR, support vector regression.

The cross-validation and test results establish the validity and accuracy of our model (see Supplementary Material S5 and Table 1). For the age, gender, and urban and rural groups, we used the full set of original questionnaires to calculate the variables of each group (see Supplementary Material S5 and Table 2).

Calculation of equivalent PM2.5

This research builds on and extends the integrated population weighted exposure (IPWE) model created by Shen et al. [45]. The IPWE model distinguishes between household air pollution (HAP) and outside ambient air pollution (AAP) and incorporates people’s activity patterns. We added the indoor penetration of outdoor PM2.5 and people’s protective behavior driven by risk information to the model (see Supplementary Material S4) and developed the I-BEPEM to assess people’s actual PM2.5 exposure concentration.

Equation 6 expresses the I-BEPEM model based on the previous assumptions. The urban attention ratio and the protective behavior ratios are obtained from the prediction results of Section “Urban protection data and inference”, and each follows an \(N(\mu _i, \theta ^2)\) distribution, where \(\mu _i\) is the indicator’s predicted value for city i and \(\theta\) is the indicator’s standard deviation. \(\mathrm{pm}_{i,t}\) represents the average concentration of PM2.5 in city i on day t. This indicator is derived from the data of over 2000 surface air quality monitoring sites of China’s Ministry of Ecology and Environment [46]. The air quality index for city i on day t is denoted by \(AQI_{i,t}\). \(IEPE_i\) is the annual equivalent comprehensive PM2.5 exposure value for city i. \(Threshold\) is the AQI value at which the air quality level of “lightly polluted” is reached. \(DM\) is the mask’s protective effect, that is, the PM2.5 attenuation rate after filtering by the mask. The protective effect conforms to the Chinese government-issued group standard F9053 for “PM2.5 protective masks” [47]. Following Xiang et al. [48], \(DH_i\) represents the protective effect of buildings in different areas, that is, the attenuation rate of outdoor PM2.5 when it penetrates indoors. \(DAC\) is the purification efficiency of air purification equipment, that is, the attenuation rate of PM2.5 concentration after the equipment has cleaned the indoor air. This information is derived from existing measured data [49,50,51,52], and we take the mean of these studies as the decay rate value. To account for uncertainty, we assume that every decay rate follows a normal distribution whose mean is its survey or reference value (see Supplementary Material S2 for the corresponding variance settings).

$$\left\{\begin{array}{l}ODR_{i}=ODR_{i}*(1-CODR_{i}*ATTR_{i})\\ MR_{i}=MR_{i}*ATTR_{i}\\ ACR_{i}=ACR_{i}*ATTR_{i}\\ IEPE_{AAP,i}=\frac{1}{T}\left\{\begin{array}{l}\sum _{t=1}^{T}\mathrm{pm}_{i,t}*ODR_{i}*\left(MR_{i}*DM+1-MR_{i}\right),\ if\ AQI_{i,t}>Threshold\\ \sum _{t=1}^{T}\mathrm{pm}_{i,t}*ODR_{i},\ else\end{array}\right.\\ IEPE_{HAP,i}=\frac{1}{T}\left\{\begin{array}{l}\sum _{t=1}^{T}\mathrm{pm}_{i,t}*\left(1-ODR_{i}\right)*DH_{i}*\left(ACR_{i}*DAC+1-ACR_{i}\right),\ if\ AQI_{i,t}>Threshold\\ \sum _{t=1}^{T}\mathrm{pm}_{i,t}*\left(1-ODR_{i}\right)*DH_{i},\ else\end{array}\right.\\ IEPE_{i}=IEPE_{AAP,i}+IEPE_{HAP,i}.\end{array}\right. \tag{6}$$
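The exposure calculation can be traced with a short sketch that follows the equation term by term for a single city, applying the day-level condition on AQI inside the sum. The function name `iepe`, the argument names, and every numeric value in the example are hypothetical; the attenuation parameters are treated as the fraction of PM2.5 that remains after protection, which is how they enter the equation.

```python
def iepe(pm, aqi, odr, mr, codr, acr, attr, dm, dh, dac, threshold=100):
    """Annual equivalent PM2.5 exposure for one city, following the equation as printed.

    pm, aqi: sequences of daily PM2.5 concentration and overall AQI.
    odr, mr, codr, acr, attr: behaviour ratios of the city (between 0 and 1).
    dm, dh, dac: remaining PM2.5 fraction after mask, building, and purifier.
    """
    # Behaviour ratios are scaled by the share of people who follow pollution information.
    odr_eff = odr * (1 - codr * attr)
    mr_eff = mr * attr
    acr_eff = acr * attr
    aap = hap = 0.0
    for pm_t, aqi_t in zip(pm, aqi):
        if aqi_t > threshold:
            # Polluted day: masks outdoors and purifiers indoors reduce exposure.
            aap += pm_t * odr_eff * (mr_eff * dm + 1 - mr_eff)
            hap += pm_t * (1 - odr_eff) * dh * (acr_eff * dac + 1 - acr_eff)
        else:
            aap += pm_t * odr_eff
            hap += pm_t * (1 - odr_eff) * dh
    days = len(pm)
    return aap / days + hap / days


# Two illustrative days: one clean (AQI 70), one polluted (AQI 250).
params = dict(odr=0.1, mr=0.5, codr=0.5, acr=0.5, dm=0.1, dh=0.5, dac=0.3)
no_attention = iepe([50, 200], [70, 250], attr=0.0, **params)
full_attention = iepe([50, 200], [70, 250], attr=1.0, **params)
```

With these made-up inputs the exposure is lower when everyone follows pollution information (`attr=1.0`) than when nobody does (`attr=0.0`), which is the mechanism the model is built to quantify.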


Table 2 displays the indicator settings for scenarios S0–S5. “Yes” indicates that the actual value of the indicator is retained, the values 0 and 1 fix the indicator at that value, and “No” indicates that the indicator is not considered. According to our survey results, residents generally refer to the overall air quality level rather than only the AQI value of PM2.5, and they are likely to take protective measures only when the air pollution level reaches “lightly polluted” (AQI > 100) or above. Both China and the United States take the highest AQI value among all pollutants at each moment as the current overall AQI value and designate the corresponding pollutant as the primary pollutant [41,53]. According to the overall AQI value, air quality is divided into six levels: excellent, good, lightly polluted, moderately polluted, heavily polluted, and severely polluted. The difference is that when the PM2.5 concentration is less than 150 μg/m³ and PM2.5 is the primary pollutant, China’s AQI value may be lower than that of the United States (see Supplementary Material S10). Therefore, we map the Chinese air quality level to a new air quality level and AQI value based on the PM2.5 level in the US standard. In summary, we use 100 as the AQI threshold in our model. The protection level parameter for Beijing residents is set in column S5 with the subscript BJ.

Table 2 Indicator settings in different scenarios.

Premature death estimation

This study mainly uses the integrated exposure-response (IER) model developed by Burnett et al. and GBD 2019 disease data to estimate PM2.5-related premature death. The IER model is a widely recognized model for estimating the risk of premature death related to PM2.5 concentration [54], and its calculation method is shown in Eq. 7.

$$RR_{IER}\left(z\right)=\left\{\begin{array}{ll}1+\alpha \left(1-e^{-\gamma \left(z-z_{cf}\right)^{\delta }}\right), & if\ z>z_{cf}\\ 1, & else\end{array}\right. \tag{7}$$


Here, z represents the annual mean equivalent PM2.5 concentration calculated for each city in Section “Calculation of equivalent PM2.5”. \(z_{cf}\) is the minimum PM2.5 concentration above which additional risk arises. \(\alpha\), \(\gamma\), and \(\delta\) are obtained by fitting this equation. This paper focuses on the four major causes of PM2.5-related premature mortality, namely ischemic heart disease (IHD), stroke, chronic obstructive pulmonary disease (COPD), and lung cancer (LC). The \(z_{cf}\), \(\alpha\), \(\gamma\), and \(\delta\) parameter values for these four diseases are from the Institute for Health Metrics and Evaluation (IHME), with 1000 sets of simulated parameters for each disease. The final calculation method of PM2.5-related premature death for each city is shown in Eq. 8:

$$AC_{i,k}=\frac{RR_{i,k}-1}{RR_{i,k}}\times B_{k}\times P_{i}, \tag{8}$$


where \(AC_{i,k}\) and \(RR_{i,k}\) are the number of PM2.5-related additional deaths and the relative risk of disease k in the ith city or group, respectively. \(B_k\) is the baseline incidence of disease k, taken from GBD 2019 [4]. \(P_i\) is the total population of city or group i. To obtain interval estimates of PM2.5-related premature death, 1000 Monte Carlo simulations were performed for all parameters.
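The relative risk curve, the attributable-death formula, and the Monte Carlo interval can be combined in a few lines. The sketch below is illustrative only: the parameter draws are random placeholders rather than IHME parameter sets, the baseline rate and population are made up, and the use of the 2.5th and 97.5th percentiles as the interval is an assumption, since the text does not state which interval is reported.

```python
import math
import random


def rr_ier(z, z_cf, alpha, gamma, delta):
    """Relative risk from the IER curve; equals 1 at or below the counterfactual z_cf."""
    if z <= z_cf:
        return 1.0
    return 1.0 + alpha * (1.0 - math.exp(-gamma * (z - z_cf) ** delta))


def attributable_deaths(rr, baseline_rate, population):
    """PM2.5-related deaths: (RR - 1) / RR * B_k * P_i."""
    return (rr - 1.0) / rr * baseline_rate * population


def monte_carlo_interval(z, param_draws, baseline_rate, population):
    """Lower bound, median, and upper bound of attributable deaths over parameter draws."""
    deaths = sorted(
        attributable_deaths(rr_ier(z, *p), baseline_rate, population) for p in param_draws
    )
    n = len(deaths)
    return deaths[int(0.025 * n)], deaths[n // 2], deaths[min(n - 1, int(0.975 * n))]


# Placeholder draws of (z_cf, alpha, gamma, delta); not real IHME parameters.
rng = random.Random(0)
draws = [
    (rng.uniform(2, 8), rng.uniform(0.5, 1.5), rng.uniform(0.01, 0.05), rng.uniform(0.5, 1.0))
    for _ in range(1000)
]
low, median, high = monte_carlo_interval(
    z=45.0, param_draws=draws, baseline_rate=1e-3, population=1_000_000
)
```

In a fuller simulation, z itself would also be redrawn in each iteration from the exposure model, so that uncertainty in the behaviour ratios and attenuation rates propagates into the death estimates.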

Reduction amount of premature death and distribution of environmental risk information

Weibo (China’s equivalent of Twitter) and the Baidu Index are the two main sources of environmental risk information (ERI). Sina Weibo is the largest open social networking platform in China. It was founded in 2009 and had 450 million monthly active users and 250 million daily active users by 2018 [55]. Baidu is the largest search engine in China. Using distributed crawlers, we searched the public application programming interfaces (APIs) of these two platforms for content containing the environment-related keywords listed in Supplementary Table 3. After information extraction, cleaning, and conversion, approximately 2.3 million original environment-related microblogs were obtained. During 2013–2020, these microblogs were forwarded approximately 140 million times and more than 30 million people participated in the discussion. In addition to the Weibo data, we obtained the daily search index data for 294 cities during 2013–2020 as a supplement. We used all environment-related original and reposted microblogs from different regions, together with the Baidu search index, as the total distribution of regional environmental information. Equation (9) defines the per capita access to ERI:



where \(W_{i,t}\) and \(B_{i,t}\) are the total number of original and reposted environment-related microblogs and the search index in city i at time t, respectively. The time range is \([T+1, T+t]\). \(P_i\) is the total population of city i.

The relationship between the reduction of premature death and the distribution of ERI is shown in Eq. (10).

$$DDP10k_{i}=\beta \cdot ERI_{i}+\sum _{k}\gamma _{k}X_{k,i}. \tag{10}$$


\(DDP10k_i\) is the number of PM2.5-related premature deaths avoided through active protection per 10,000 people in city i. \(X_{k,i}\) denotes the kth covariate of the ith city. All variables are log transformed. \(\gamma _k\) is the coefficient of the kth covariate; β is our target coefficient, representing the percentage change in \(DDP10k_i\) for every 1% change in ERI.
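A log-log regression of this form can be fitted by ordinary least squares, as in the minimal sketch below. The function name `fit_log_log` and the synthetic data are hypothetical, and the constant column is an added assumption: the equation as printed has no intercept, and the estimation method actually used is not stated in the text.

```python
import numpy as np


def fit_log_log(ddp10k, eri, covariates):
    """OLS fit of log(DDP10k) on log(ERI) and log covariates; returns (beta, gammas).

    covariates: (n_cities, n_covariates) array of strictly positive values.
    A constant column is included as an assumption and its coefficient is discarded.
    """
    y = np.log(ddp10k)
    design = np.column_stack([np.log(eri), np.log(covariates), np.ones(len(y))])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[0], coef[1:-1]


# Synthetic cities generated with a known elasticity of 0.3, for illustration only.
rng = np.random.default_rng(0)
eri_demo = np.exp(rng.normal(size=50))
cov_demo = np.exp(rng.normal(size=(50, 2)))
ddp_demo = np.exp(0.3 * np.log(eri_demo) + np.log(cov_demo) @ np.array([0.5, -0.2]) + 1.0)
beta_hat, gamma_hat = fit_log_log(ddp_demo, eri_demo, cov_demo)
```

Because both sides are in logarithms, the fitted β reads as an elasticity: a value of 0.3 would mean that a 1% increase in per capita ERI is associated with roughly a 0.3% increase in avoided premature deaths per 10,000 people.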