About ...

If you want to dig a little deeper, you are in the right place.

More Information

About the Data

The United States Centers for Disease Control and Prevention (CDC) maintains a database of survey data from its Behavioral Risk Factor Surveillance System (BRFSS). The data are freely available for download, and numerous visualizations are also available.

Most geographic visualizations either capture current values on a choropleth or heat map, or they capture trends using a line chart. Sometimes, though, it is useful to see both the spatial and temporal trends in one graphic, which is why this visualization tool was made.

For this tool, ten specific indicators were chosen from the recorded summary of CDC BRFSS survey responses. Eight of these indicators come directly from categories computed by the CDC and made available as part of the data.

In addition, two new categories were used for this visualization. For body mass index, the CDC uses a 5-category response that records overweight and obesity prevalence as separate categories. For this dataset, both categories were combined into a single overweight / obese category.

The raw data tracks the portion of the population who have ever had their cholesterol checked, and the portion of the population who have ever had their cholesterol checked *and* been told that it was high. For this visualization, the prevalence of high cholesterol was estimated simply by taking the fraction who had been told their cholesterol was high and dividing it by the fraction of people ever tested. Using this approach as a quantitative estimate implies that these two factors are independent, which may not be true. As a qualitative indicator, though, it seems reasonable.
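The two custom indicators described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the project's actual code: the function names are hypothetical and the input values are made up, not CDC figures.

```python
def overweight_or_obese(overweight_pct, obese_pct):
    """Combine the two BMI categories into one prevalence (in percent)."""
    return overweight_pct + obese_pct


def high_cholesterol_prevalence(told_high_pct, ever_checked_pct):
    """Estimate high-cholesterol prevalence among those ever tested (in percent).

    told_high_pct:    percent of the population told their cholesterol was high
    ever_checked_pct: percent of the population who ever had it checked
    """
    if ever_checked_pct <= 0:
        raise ValueError("ever_checked_pct must be positive")
    return 100.0 * told_high_pct / ever_checked_pct


# Made-up illustrative values, not CDC figures.
print(overweight_or_obese(35.0, 30.0))          # 65.0
print(high_cholesterol_prevalence(30.0, 80.0))  # 37.5
```

In the second example, 30% of the whole population told their cholesterol was high, out of 80% ever tested, gives an estimated 37.5% prevalence among those tested.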

The specific questions and answers used were:

| # | Question ID | Question Text | Answer IDs | How Answer Was Used |
| --- | --- | --- | --- | --- |
| 1 | _HCVU651 | Adults aged 18-64 who have any kind of health care coverage | RESP046 "Yes" (+CDC calculation) | "Yes" / Total as estimated by CDC |
| 2 | BLOODCHO | Adults who have ever had their blood cholesterol checked | RESP046 "Yes" | "Yes" / Total as estimated by CDC |
| 3 | _TOTINDA | During the past month, did you participate in any physical activities? | RESP046 "Yes" | "Yes" / Total as estimated by CDC |
| 4 | DRNKANY5 | Adults who have had at least one drink of alcohol within the past 30 days | RESP046 "Yes" (CDC calculation) | "Yes" / Total as estimated by CDC |
| 5 | _RFSMOK | Adults who are current smokers (per CDC calculation) | RESP046 "Yes" | "Yes" / Total as estimated by CDC |
| 6 | _RFHLTH | Health Status | RESP061 ("Good" or better) (CDC calculated) | "Good" or better / Total as estimated by CDC |
| 7 | Custom (based on _BMI5CAT) | Body Mass Index 5-category | "RESP039" and "RESP040" (CDC calculations) | Body Mass Index >= 25.0 / Total (based on sum of 25.0-29.9 and 30.0+ categories) |
| 8 | DIABETE3 | Have you ever been told by a doctor that you have diabetes? | RESP046 "Yes" | "Yes" / Total as estimated by CDC |
| 9 | _RFHYPE5 | Adults who have been told they have high blood pressure | RESP046 "Yes" (+CDC calculation) | "Yes" / Total as estimated by CDC |
| 10 | Custom (_RFCHOL & BLOODCHO) | Adults who have been told blood cholesterol was high / Adults who have ever had it checked | Custom (based on multiple RESP046 "Yes") | (_RFCHOL "Yes" / Total) divided by (BLOODCHO "Yes" / Total) |

About the Statistics

Please note that the statistical methods used here are reasonable choices for "exploratory grade" analysis. They are not "publication grade" and are not intended for peer review (sometimes data can simply be informative and -- academics, avert your eyes! -- fun).

The "exploratory" grade reflects two simplifications made with the data. First, the complex sample design used by the CDC to produce population estimates was not taken into account. When, for instance, it is reported that about 15% of respondents in Austin were current smokers, this does not mean that exactly 15% of survey respondents answered yes. The raw share answering yes may have been close to 15%, but the CDC then adjusted the proportion to account for the characteristics of the sample, in order to estimate what the responses would have been for a perfectly representative sample. If interested, please feel free to read all about it on the CDC website.

What the above means for statistics is that the sampling errors are not what one would normally calculate for a given number of responses. In many cases, the CDC has helpfully provided 95% confidence intervals, but from that we can only approximate the expected variance in the measurements. Which leads us to simplification #2: for this analysis, the response data is assumed to be normally distributed. These data were collected from surveys, often with several hundred to several thousand responses, so the calculated variance in responses based on the binomial distribution should be well-approximated by a normal distribution. The complex sampling design, however, may introduce unexpected features in the distribution. If you want to do publication-quality statistics, you should read the CDC's published papers (or ask them about it).
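As a rough illustration of simplification #2, a standard error can be backed out of a reported 95% confidence interval by treating the interval as symmetric and normal. The sketch below also computes the simple-random-sample binomial standard error for comparison. The function names, interval, and sample size are hypothetical, chosen only to show the arithmetic.

```python
import math

Z_95 = 1.959964  # two-sided 95% critical value of the standard normal


def se_from_ci(lower, upper, z=Z_95):
    """Approximate a standard error from a symmetric, normal-theory 95% interval."""
    return (upper - lower) / (2.0 * z)


def binomial_se(p, n):
    """Standard error of a proportion p under simple random sampling of size n."""
    return math.sqrt(p * (1.0 - p) / n)


# Made-up example: an estimate of 15% reported with an interval of 12% to 18%.
se_reported = se_from_ci(0.12, 0.18)
# Hypothetical sample size; ignores the survey weights entirely.
se_simple = binomial_se(0.15, 1000)

print(f"SE implied by the interval:   {se_reported:.4f}")
print(f"SE from the binomial formula: {se_simple:.4f}")
```

With these made-up numbers the interval-implied standard error is larger than the binomial one, which is the kind of gap a complex sampling design can produce.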

Where this matters most is in the custom categories (overweight and high cholesterol prevalence). For these categories, the error bars you see are 95% confidence intervals based on assumptions of normality as well as independence of the component measurements and a disregard of the complex sampling design. To see how much this matters, I compared the CDC's own multi-category derived error estimates with ones I computed using the above simplifications. The results agreed within about 10%, but did not coincide perfectly. Thus, the error bars on those two categories should be treated as approximate. Otherwise, the error bars correspond to the 95% confidence limits reported by the CDC, and as long as you simply take them at face value and don't try to further manipulate them for statistical purposes, they should be reliable.
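For the curious, here is one standard way to build approximate 95% intervals for a sum and for a ratio of two estimates under the assumptions named above (normality and independence, ignoring the sampling design). It uses first-order error propagation; it is a sketch of the general technique, not necessarily the exact calculation behind this tool, and the function names and inputs are hypothetical.

```python
import math

Z_95 = 1.959964  # two-sided 95% critical value of the standard normal


def ci_of_sum(a, se_a, b, se_b, z=Z_95):
    """Estimate and approximate 95% CI for a + b, assuming independence."""
    total = a + b
    se = math.sqrt(se_a ** 2 + se_b ** 2)  # variances add for independent terms
    return total, total - z * se, total + z * se


def ci_of_ratio(num, se_num, den, se_den, z=Z_95):
    """Estimate and approximate 95% CI for num / den (first-order propagation)."""
    ratio = num / den
    # Relative variances add for a ratio of independent quantities.
    rel_se = math.sqrt((se_num / num) ** 2 + (se_den / den) ** 2)
    se = abs(ratio) * rel_se
    return ratio, ratio - z * se, ratio + z * se


# Made-up proportions and standard errors, not CDC figures.
est, lo, hi = ci_of_sum(0.35, 0.03, 0.30, 0.04)
print(f"overweight / obese: {est:.3f} ({lo:.3f} to {hi:.3f})")

est, lo, hi = ci_of_ratio(0.30, 0.015, 0.80, 0.02)
print(f"high cholesterol:   {est:.3f} ({lo:.3f} to {hi:.3f})")
```

If the component measurements are correlated, as they may well be here, these intervals will be somewhat off, which is why the error bars on the two custom categories should be read as approximate.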

Thanks!

If you're still reading after all that, then you probably like data as much as I do. So you may be interested in knowing (if you don't already) that the CDC is a great place to get your hands on clean, free data sets that can be quite massive. In addition to thanking them, I would also like to thank Michelle Chandra, whose bl.ocks.org US map came in handy, as well as Trang Tran, whose d3-colorbar also added a nice touch. The GeoJSON source data (from the US Census Bureau) came via the helpful site by Eric Celeste, with some useful hints from an article by Florian Ledermann.

The data munging was done with Python 3 in a Jupyter Notebook using pandas, scipy.stats and the linear regression models from scikit-learn, and the visualization was done with a big chunk of d3.js (v5). Hint: pure d3 is great on your own time because it will pay you back with better graphics skills, but if someone else is paying by the hour, use Plotly.js if you can! Finally, I couldn't have been good enough with these tools to consider doing all this as a quick project without Danny Clifford and his fellow instructors at the UC Berkeley Data Analytics program. If you like Excel but would love to take it to the next level quickly, it's worth the effort!

Last updated 06-20-2019 by Andrew Guenthner. Released under the MIT Open Source License, with some elements governed by specific licenses. Please consult the project documentation on GitHub for details.