PM2.5 and Lung Cancer (Revisited)

For this Byte, I revisit the relationship between the concentration of airborne particulate matter (PM2.5) and the mortality rate of lung and bronchus cancer. In my previous Byte, I showed that there is a "moderate correlation between PM2.5 measurements and cancer mortality rate," and remarked that my argument could be improved, since my location data was at state rather than county granularity.

For this Byte, I found another dataset with per-county statistics on lung and bronchus cancer. This allows me to study the relationship between particulate matter and lung cancer occurrence at much finer granularity. It is widely agreed in academia that a higher concentration of particulate matter leads to a higher occurrence of cancer, and the scatter plot with a regression line, created with D3.js, puts that view to the test.

Visualization of PM2.5 Measurements & Cancer Mortality Data

For ease of interpretation, the PM2.5 measurements are normalized to AQI (air quality index) values by the EPA. The higher the value, the worse the pollution. In the diagram below, darker colors mean higher AQI.

Note: Due to the large volume of data, the maps may take a while to load completely. In the meantime, you may zoom in and click on individual data points to view detailed results.

The heatmap below shows the concentration of PM2.5 in general areas. A green-to-red color spectrum is applied to AQI values of 80-197. It is important to note that even green on the map already corresponds to "moderate" air quality, and areas in red correspond to "unhealthy" air quality (source). Thankfully, the maximum AQI value in the U.S. dataset is 197, whereas in downtown Beijing the PM2.5 AQI regularly exceeds 150 (source).
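To make the color scale concrete, here is a minimal sketch of how an AQI value could be mapped to an EPA category name and to a green-to-red color. The function names `aqi_category` and `aqi_to_rgb` are hypothetical, and the linear RGB ramp is an assumption: the actual gradient used by the heatmap is not specified in this Byte.

```python
# EPA AQI category bands, as (inclusive upper bound, category name).
AQI_CATEGORIES = [
    (50, "Good"),
    (100, "Moderate"),
    (150, "Unhealthy for Sensitive Groups"),
    (200, "Unhealthy"),
    (300, "Very Unhealthy"),
    (500, "Hazardous"),
]


def aqi_category(aqi):
    """Return the EPA category name for an AQI value."""
    for upper, name in AQI_CATEGORIES:
        if aqi <= upper:
            return name
    # Values above 500 are off the official scale; treated as Hazardous here.
    return "Hazardous"


def aqi_to_rgb(aqi, lo=80, hi=197):
    """Map an AQI value to an (r, g, b) tuple on a green-to-red ramp.

    Assumption: a plain linear ramp over [lo, hi], clamped at both ends.
    The defaults are the 80-197 range mentioned in the text.
    """
    clamped = min(max(aqi, lo), hi)
    t = (clamped - lo) / (hi - lo)
    return (round(255 * t), round(255 * (1 - t)), 0)


if __name__ == "__main__":
    for value in (80, 120, 197):
        print(value, aqi_category(value), aqi_to_rgb(value))
```

With these bands, the two ends of the map's range (80 and 197) fall into "Moderate" and "Unhealthy" respectively, which matches the description above.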

The following diagram shows the mortality rate of lung and bronchus cancer in 2015 by state. This was used in my previous Byte and has poor granularity, since AQI can vary greatly across a state, e.g. San Francisco compared to San Diego in California.

Quite surprisingly, California has some regions with high PM2.5 concentration, yet overall the state has a low lung cancer mortality rate. Meanwhile, states in the Midwest (including Tennessee, Kentucky, Missouri, and West Virginia) suffer from the highest lung cancer mortality rates, despite the relatively low PM2.5 AQIs in those areas. My guess is that the healthcare system in the Midwest is weaker than in, say, California and New York, and thus the mortality rate is higher in general.

In order to normalize my data and improve its granularity, I am adopting the per-county lung cancer mortality data from this point on, and the dataset looks like the following:

The last two diagrams about lung cancer mortality show not only that states in the Midwest have the highest lung cancer mortality counts, but also that lung cancer is the most common cancer in those regions. Since I normalized the data by total cancer mortality, and the raw data is already age-corrected, we can safely conclude that the Midwest indeed suffers from lung cancer the most within the United States.

Lung Cancer Mortality Rate vs. PM2.5 AQI

In this section, we study the correlation between lung cancer and PM2.5 directly by presenting a scatter plot of the two variables. The value for the lung cancer rate is the percentage of that county's cancer mortality attributed to lung cancer; the value for per-county PM2.5 is calculated as the median of the county's AQI values. The raw data for the scatter plot is as follows:

The resulting scatter plot (drawn with D3.js) is as follows. You may pan and zoom by dragging a region of the chart with your mouse.
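For readers who want to reproduce the two per-county values described in this section, the aggregation could be sketched roughly as below. This is only an illustration under assumptions: the tuple and dictionary layouts, the function names, and the sample numbers are all made up, and the actual preprocessing behind the chart may differ.

```python
from collections import defaultdict
from statistics import median


def median_aqi_by_county(readings):
    """Group (state, county, aqi) readings and take the median AQI per county."""
    grouped = defaultdict(list)
    for state, county, aqi in readings:
        grouped[(state, county)].append(aqi)
    return {key: median(values) for key, values in grouped.items()}


def lung_cancer_percent(lung_deaths, all_cancer_deaths):
    """Lung cancer mortality as a percentage of total cancer mortality."""
    return 100.0 * lung_deaths / all_cancer_deaths


def build_scatter_points(readings, mortality):
    """Join both datasets on (state, county) and return (aqi, percent) pairs.

    `mortality` maps (state, county) -> (lung_deaths, all_cancer_deaths).
    Counties missing from either dataset are skipped.
    """
    points = []
    for key, aqi in median_aqi_by_county(readings).items():
        if key in mortality:
            lung, total = mortality[key]
            points.append((aqi, lung_cancer_percent(lung, total)))
    return points


if __name__ == "__main__":
    # Made-up sample values, for illustration only.
    sample_readings = [
        ("PA", "Allegheny", 90),
        ("PA", "Allegheny", 110),
        ("PA", "Allegheny", 100),
        ("CA", "San Diego", 85),
        ("CA", "San Diego", 95),
    ]
    sample_mortality = {
        ("PA", "Allegheny"): (30, 100),
        ("CA", "San Diego"): (20, 100),
    }
    print(build_scatter_points(sample_readings, sample_mortality))
```

Using the median rather than the mean keeps a handful of extreme readings from dominating a county's value.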

From the scatter plot above, we can see a mild linear correlation (r=0.2599) between the two variables, and the slope of the regression line is 0.24 (close to the "20%" figure claimed in academia). This indicates a non-trivial correlation between the concentration of PM2.5 and the mortality caused by lung cancer. Meanwhile, it is important to note that PM2.5 pollution mostly has mid- to long-term effects on human health, and thus comparing PM2.5 data and cancer mortality data from the same time frame might not lead to conclusions as strong as we hoped for (e.g. a stronger linear correlation).
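The two statistics quoted here, the correlation coefficient r and the slope of the regression line, can be computed from the scatter points with a few sums. The sketch below uses the textbook formulas for Pearson's r and an ordinary least-squares fit; the function name `pearson_and_slope` is hypothetical, and it is an assumption that the chart's regression line is an ordinary least-squares fit.

```python
from math import sqrt


def pearson_and_slope(xs, ys):
    """Return (r, slope, intercept) for paired samples xs and ys.

    r is Pearson's correlation coefficient; slope and intercept describe
    the ordinary least-squares line y = slope * x + intercept.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    r = sxy / sqrt(sxx * syy)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return r, slope, intercept


if __name__ == "__main__":
    # Made-up points that lie exactly on y = 2x, so r is 1 and the slope is 2.
    print(pearson_and_slope([1, 2, 3, 4], [2, 4, 6, 8]))
```

Note that r and the slope measure different things: the slope equals r multiplied by the ratio of the standard deviations of y and x, so it depends on the units of both axes, while r does not.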

If you would like to explore the relationship between the geo-coordinates, lung cancer fatality, and PM2.5 AQI, you may play around with the following diagram, which displays the pairwise correlation between the variables. You may drag a region on the diagram in order to filter by a pair of variables.

Appendix: PM2.5 Data Exploration by County

For the curious: You can enter a county name and examine the raw PM2.5 data points provided by the EPA! The entries are ranked by AQI, and only the top 500 results are displayed to reduce server load. Requests can take up to 10 seconds.

City name: (Try Allegheny, San Francisco, and San Diego!)

Result columns: State Name, County Name, City Name, AQI, Site Num, Method Name
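The lookup described in this appendix (filter by name, rank by AQI, keep the top 500) could be sketched as follows. This is a guess at the behavior, not the actual server code: the function name `top_readings`, the case-insensitive exact match on the "County Name" column, and the sample rows are all assumptions.

```python
def top_readings(rows, county_name, limit=500):
    """Return up to `limit` rows for a county, highest AQI first.

    Each row is a dict keyed by the column names listed above.
    Assumption: matching is a case-insensitive exact match on "County Name".
    """
    wanted = county_name.strip().lower()
    matches = [row for row in rows if row["County Name"].lower() == wanted]
    matches.sort(key=lambda row: row["AQI"], reverse=True)
    return matches[:limit]


if __name__ == "__main__":
    # Made-up rows, for illustration only.
    sample_rows = [
        {"State Name": "Pennsylvania", "County Name": "Allegheny", "AQI": 120},
        {"State Name": "Pennsylvania", "County Name": "Allegheny", "AQI": 150},
        {"State Name": "California", "County Name": "San Diego", "AQI": 95},
    ]
    for row in top_readings(sample_rows, "allegheny", limit=2):
        print(row["County Name"], row["AQI"])
```

Capping the result at a fixed limit after sorting is what keeps the response size bounded regardless of how many readings a county has.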