Collecting and Cleansing AQI Data

Collect and cleanse hourly AQI of five cities of China (where U.S. embassy locations: Beijing, Shanghai, Guangzhou, Shenyang, Chengdu) from 2008 to 2014.

Link to raw and cleansed dataset:Baidu Cloud.


  • Data Source: stateair.net
  • Create Date: 20150406
  • Updata Date: 20150407

Defination of the dataset and introduction can be downloaded from this link: (http://www.stateair.net/web/historical/1/1.html)

Data use statement (http://www.stateair.net/web/historical/1/1.html)

Files:

  • City_PM25_From_To.csv
    CSV dataset of cleansed data of city
  • USEmbassyAQI.csv
    CSV dataset with five cities combied
  • USEmbassyAQI.RData
    In Rdata format
  • Raw/*
    Raw dataset

Cleansing Process

  • Delete the first three rows of text.
  • Delete the “Parameter” column, since all have the same value “PM2.5”.
  • Delete “DATE(LST)” column, keep “Year”, “Month”, “Day” and “Hour” columns.
  • Delete “Unit” column, since all have the same value “ug/m3”.
  • Delete “QC Name”, since all are blank.
  • Transfer “Duration” column, which takes two values of “Valid” and “Missing” to a boolean column “Valid”.

Update 20150407

There are 17 pairs of duplicates (same location and time), all of which are at 3:00 am. The duplicates have different values of AQI and the 2:00 am records are missing for all the cases, so correct the first appearance of 3:00 am as 2:00 am record. The following data is adjusted:

Row Site Year Month Day Hour Value Valid
6675 Beijing 2009 3 8 3 179 TRUE
15579 Beijing 2010 3 14 3 75 TRUE
24315 Beijing 2011 3 13 3 351 TRUE
50523 Beijing 2014 3 9 3 55 TRUE
59355 Chengdu 2012 3 11 3 -999 FALSE
68091 Chengdu 2013 3 10 3 176 TRUE
76827 Chengdu 2014 3 9 3 65 TRUE
85683 Guangzhou 2011 3 13 3 -999 FALSE
94419 Guangzhou 2012 3 11 3 -999 FALSE
103155 Guangzhou 2013 3 10 3 92 TRUE
111891 Guangzhou 2014 3 9 3 23 TRUE
120747 Shanghai 2011 3 13 3 -999 FALSE
129483 Shanghai 2012 3 11 3 89 TRUE
138219 Shanghai 2013 3 10 3 107 TRUE
146955 Shanghai 2014 3 9 3 55 TRUE
155739 Shenyang 2013 3 10 3 -999 FALSE
164475 Shenyang 2014 3 9 3 53 TRUE

Moreover, there are records with value -999 but not marked as invalid in Shanghai, such as:

Row Site Year Month Day Hour Value Valid
127905 Shanghai 2012 1 5 9 49 TRUE
127906 Shanghai 2012 1 5 10 49 TRUE
127907 Shanghai 2012 1 5 11 -999 TRUE
127908 Shanghai 2012 1 5 12 -999 TRUE
127909 Shanghai 2012 1 5 13 50 TRUE
127910 Shanghai 2012 1 5 14 50 TRUE

The two records in the middle is invalid but not properly marked. There are totally 296 cases like this that are adjusted by marking “Valid=FALSE”.

Wang, Guansong
Wang, Guansong
Data Worker

My research interests include distributed robotics, mobile computing and programmable matter.