This dataset covers the study area, feature importance, and spatial and temporal patterns of PM2.5 and PM10, as well as the uncertainty in estimated annual mortality. In this work, a simply structured, efficient, and robust model based on LightGBM was developed to fuse multi-source data and estimate long-term (1980-2022) historical daily ground PM concentrations in India (LongPMInd). The LightGBM model showed good accuracy, with R² values of 0.77, 0.70, and 0.66 in out-of-sample, out-of-site, and out-of-year cross-validation (CV) tests, respectively. The small performance gaps between PM2.5 training and testing (delta RMSE of 1.06, 3.83, and 7.74 µg m⁻³) indicate a low risk of overfitting. Given this strong generalization ability, a publicly accessible, long-term, high-quality daily PM2.5 and PM10 product (10 km, 1980-2022) was reconstructed. The product shows that India has experienced severe PM pollution in the Indo-Gangetic Plain (IGP), especially during winter. Since 2000, PM concentrations have increased significantly in most regions. The turning point occurred in 2018, when the Indian government launched the National Clean Air Program, after which PM2.5 concentrations decreased in most areas. Severe PM2.5 pollution has led to a continuous increase in attributable premature mortality, from 0.73 million (95% confidence interval (CI) [0.65, 0.80]) in 2000 to 1.22 million (95% CI [1.03, 1.41]) in 2019, particularly in the IGP, where attributable mortality increased from 360,000 to 600,000. LongPMInd has the potential to support applications in air quality management, public health initiatives, and climate change response.
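The three CV tests named above differ only in what is held out together: single records, whole monitoring sites, or whole years. The following is a minimal sketch of that grouping idea, not the authors' code. The record fields `station` and `year`, the fold count of 10, and the helper name `grouped_folds` are assumptions made for illustration.

```python
import random
from collections import defaultdict


def grouped_folds(records, key=None, n_folds=10, seed=0):
    """Yield (train_indices, test_indices) for each fold.

    key=None: out-of-sample CV, single records are assigned to folds.
    key=callable: all records sharing key(record) stay in the same fold,
    e.g. the station for out-of-site CV or the year for out-of-year CV.
    """
    groups = defaultdict(list)
    for i, rec in enumerate(records):
        groups[i if key is None else key(rec)].append(i)

    group_ids = list(groups)
    random.Random(seed).shuffle(group_ids)
    n_folds = min(n_folds, len(group_ids))  # e.g. only 5 years -> 5 folds

    for f in range(n_folds):
        test = sorted(i for g in group_ids[f::n_folds] for i in groups[g])
        held_out = set(test)
        train = [i for i in range(len(records)) if i not in held_out]
        yield train, test


# Hypothetical station-year records; the field names are illustrative only.
records = [{"station": s, "year": y} for s in "ABCD" for y in range(2018, 2023)]
out_of_sample = list(grouped_folds(records))
out_of_site = list(grouped_folds(records, key=lambda r: r["station"]))
out_of_year = list(grouped_folds(records, key=lambda r: r["year"]))
```

Holding out whole sites or years is a stricter test than random record splits because the model never sees the held-out location or period during training, which is consistent with the lower R² reported for those two tests.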
| Collection time | 1980/01/01 - 2022/12/31 |
|---|---|
| Collection place | India |
| Data size | 8.5 GiB |
| Data format | NetCDF (.nc) |
| Coordinate system | |
Ground observations of PM2.5 and PM10 in India for 2018-2022 were collected from the Central Pollution Control Board (CPCB) air quality monitoring network (https://www.cpcb.nic.in). Because extreme values affect model robustness, the bottom and top 0.01% of the observations were excluded. The fifth-generation ECMWF atmospheric reanalysis dataset ERA5-Land, covering 1980-2022, was used. Features were selected according to their relative importance, which is calculated from their gain; several meteorological factors with higher relative importance were retained. Data products from the Modern-Era Retrospective analysis for Research and Applications, Version 2 (MERRA-2), covering 1980-2022, were also collected, including aerosol optical depth (AOD) and aerosol components and precursors (black carbon, organic carbon, sulfate, dust, and sulfur dioxide).
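The exclusion of the bottom and top 0.01% of observations can be sketched as below. This is a minimal illustration under assumptions: the text does not say whether the cut was made by record count or by quantile threshold, so the count-based rule, the tie handling, and the function name `trim_extremes` are hypothetical.

```python
def trim_extremes(values, fraction=0.0001):
    """Drop roughly the lowest and highest `fraction` of values (0.01% = 0.0001).

    Assumption: the cut is made by record count at each tail. Values tied
    with a cutoff are kept, so slightly fewer records may be removed.
    """
    ordered = sorted(values)
    # Number of records to drop at each tail; the epsilon guards against
    # floating-point round-off in the product.
    k = int(len(ordered) * fraction + 1e-9)
    if k == 0:
        return list(values)  # too few records for the cut to remove anything
    low, high = ordered[k], ordered[-k - 1]
    return [v for v in values if low <= v <= high]


# Synthetic values standing in for station observations.
observations = [float(v) for v in range(100_000)]
kept = trim_extremes(observations)
```

With 100,000 synthetic records, 10 values are removed at each tail; the original order of the remaining values is preserved.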
Model construction: In this study, LightGBM, an efficient gradient boosting decision tree (GBDT) implementation, is used to estimate PM2.5 and PM10, and grid-search cross-validation (CV) is used to select the optimal hyperparameters. A hyperparameter selection algorithm (Algorithm S1 in the supplement) was designed to ensure the generalization ability of the model. It runs a loop that increases model complexity, then ends the loop and returns the hyperparameters when the RMSE predicted by the model does not significantly decrease or the difference between the training and prediction RMSE does not significantly increase. Features were selected according to their relative importance: 10 meteorological features, 6 emission-related features, and total aerosol extinction are used to train LightGBM and estimate PM concentrations.
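Gain-based relative importance is commonly computed as each feature's total split gain divided by the sum over all features. The sketch below assumes that definition and a simple cutoff rule; the feature names, gain values, threshold, and function names are all hypothetical and are not taken from the dataset or its code. In LightGBM, per-feature gains can be read with `Booster.feature_importance(importance_type="gain")`.

```python
def relative_importance(gains):
    """Normalize per-feature total gain so the shares sum to 1."""
    total = sum(gains.values())
    if total <= 0:
        raise ValueError("total gain must be positive")
    return {name: gain / total for name, gain in gains.items()}


def select_features(gains, threshold=0.05):
    """Return features whose share of total gain is at least `threshold`,
    ordered from most to least important. The threshold is an assumption."""
    shares = relative_importance(gains)
    ranked = sorted(shares.items(), key=lambda item: item[1], reverse=True)
    return [name for name, share in ranked if share >= threshold]


# Made-up gains for four illustrative features.
example_gains = {"aod": 50.0, "t2m": 30.0, "blh": 15.0, "noise": 5.0}
selected = select_features(example_gains, threshold=0.1)
```

Here `noise` holds 5% of the total gain and falls below the 10% cutoff, so only the other three features are kept.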
MERRA-2 is a global atmospheric reanalysis dataset released and maintained by NASA. It has been widely used in PM pollution research over India, and its reliability has been extensively analyzed (Gueymard and Yang, 2020; Navinya et al., 2020; Buchard et al., 2017). For MERRA-2 AOD, evaluations against AERONET observations indicate that MERRA-2 performs better than the Copernicus Atmosphere Monitoring Service (CAMS) reanalysis in most regions (Gueymard and Yang, 2020). Kumar et al. (2023) estimated ground PM2.5 concentrations in India using only MERRA-2 and machine learning methods, which further supports the reliability of MERRA-2 data.
This work is licensed under a Creative Commons Attribution 4.0 International License.
| # | File name | File size |
|---|---|---|
| 1 | _ncdc_meta_.json | 6.9 KiB |
| 2 | code.zip | 483.1 KiB |
| 3 | pm25_arr_10km_1980_daily.nc | 195.5 MiB |
| 4 | pm25_arr_10km_1980_monthly.nc | 6.4 MiB |
| 5 | pm25_arr_10km_1981_daily.nc | 194.9 MiB |
| 6 | pm25_arr_10km_1981_monthly.nc | 6.4 MiB |
| 7 | pm25_arr_10km_1982_daily.nc | 194.9 MiB |
| 8 | pm25_arr_10km_1982_monthly.nc | 6.4 MiB |
| 9 | pm25_arr_10km_1983_daily.nc | 194.9 MiB |
| 10 | pm25_arr_10km_1983_monthly.nc | 6.4 MiB |
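The files listed above follow a visible naming pattern, `pm25_arr_10km_<year>_<daily|monthly>.nc`. The helper below is a small sketch that builds names from that pattern; it is an inference from the ten files shown (1980-1983, PM2.5 only). Names for later years and for PM10 files are not shown in the listing and should be checked against the repository. The function name `pm25_file_names` is hypothetical.

```python
def pm25_file_names(first_year, last_year, resolution="daily"):
    """Build PM2.5 file names for an inclusive year range.

    The pattern is inferred from the listed files only; it is an assumption
    that other years follow it.
    """
    if resolution not in ("daily", "monthly"):
        raise ValueError("resolution must be 'daily' or 'monthly'")
    if first_year > last_year:
        raise ValueError("first_year must not exceed last_year")
    return [
        f"pm25_arr_10km_{year}_{resolution}.nc"
        for year in range(first_year, last_year + 1)
    ]


# Names for the daily files that appear in the listing.
listed_daily = pm25_file_names(1980, 1983)
```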
| # | Category | Title | Authors | Year |
|---|---|---|---|---|
| 1 | Paper | Reconstructing long-term (1980-2022) daily ground particulate matter concentrations in India (LongPMInd) | S. Wang, M. Zhang, H. Zhao, P. Wang, S. H. Kota, Q. Fu, C. Liu, H. Zhang | 2024 |

