Applied Statistics final project using SAS to complete an epidemiology analysis of CDC Chronic Disease Indicators in the United States compared with the state of Georgia. Note: we are not very healthy in the state of Georgia.
Analysis of CDC Chronic Disease Indicators US compared with Georgia
STAT 3010.W01 Final Project: Analysis of Center for Disease Control
Chronic Disease Indicators of the United States and Georgia for Year 2005
The aim of this report is to discuss the results of a statistical analysis of the Chronic Disease
Indicators of the United States and Georgia for the year 2005, published by the Centers for Disease
Control and Prevention (CDC). The analysis covered three points: 1) determine descriptive statistics and
describe the distributions of the variables in the data set; 2) compare chronic disease indicator rates
between the United States and Georgia separately for each of five categories; and 3) draw a
random sample of 20 items from the data set, estimate the mean Chronic Disease Indicator rate in the
United States and Georgia using 95% and 99% confidence intervals, and determine whether
the population mean rate of all 50 original data points was captured by the estimated confidence
intervals. The analysis was run in SAS 9.1.3 SP4, with graphics from SAS and Minitab 15.
This dataset was chosen for its relevance to healthcare, its size, and its complexity. The
five variables (three categorical and two quantitative) of the CDC Chronic
Disease Indicators of the United States and Georgia for Year 2005 were obtained by filtering a
data set from the CDC website (http://apps.nccd.cdc.gov/cdi/Default.aspx),
with the comparison restricted to the United States and Georgia. The data and definitions
were originally developed by the Council of State and Territorial Epidemiologists together with
epidemiologists and chronic disease program directors at the state and federal level. They were refined
between 1999 and 2002, and a survey was then made for 2005.
These data have proved useful in Georgia for developing a database of the indicators by 19 health
districts, available via the internet. In addition, the Division of Diabetes Translation at the CDC
uses the data to assist diabetes programs with their surveillance and
epidemiological activities. Table 1 shows a short selection of the data, and the variable names used
in Table 1 are described in Table 2. There are 50 data points from the year 2005; six other
data points from different years were trimmed from the data set before analysis. Therefore, the results
and analysis are valid only for the year 2005.
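The trimming step is not shown in the Appendix III code. A minimal sketch of how it could be done follows; the data set names RAWCDI and CDI2005 are hypothetical, and the WHERE condition assumes YEAR was read as a numeric variable.

```sas
* Sketch only. RAWCDI is a hypothetical name for the unfiltered import;
DATA CDI2005;
SET RAWCDI;
* Assumes YEAR is numeric and would need a quoted value if it were character;
WHERE YEAR = 2005;
RUN;

* Confirm the row count, which the report gives as 50 for 2005;
PROC FREQ DATA = CDI2005;
TABLES YEAR;
RUN;
```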
The occurrences per 100,000 people in the United States and in Georgia are assessed by Chronic Disease
Indicator category.
The assessment of the quantitative and categorical variables shows the following. Table 3 gives
the descriptive statistics for the Chronic Disease Indicators of the United States and Georgia. In both,
the mean is far larger than the median (102.23 versus 25.95 for the United States, and 100.37 versus
25.90 for Georgia). Figures 1 and 2 show that the
distributions of occurrences for the United States and for Georgia are both unimodal and
positively skewed. Figures 3 and 4 further demonstrate this trend. Although the distributions are
drastically skewed, the box plots show no
outliers. Because of the skew, the most representative measure of central tendency is the median, 25.95 for
the United States and 25.90 for Georgia.
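The skew and the absence of marked outliers can both be checked directly. The sketch below is not part of the original analysis: it asks PROC MEANS for quartiles and a skewness coefficient, then redraws the box plots with BOXSTYLE = SCHEMATIC. PROC BOXPLOT defaults to the skeletal style, which runs the whiskers out to the minimum and maximum and so never flags individual points; the schematic style marks values more than 1.5 interquartile ranges beyond the quartiles. The name CDI_SORTED is hypothetical, and W2.CDICDC is the data set used in Appendix III.

```sas
* Sketch only: quantify the skew suggested by the gap between mean and median;
PROC MEANS DATA = W2.CDICDC MAXDEC=2 N MEAN MEDIAN Q1 Q3 QRANGE SKEWNESS;
VAR UNITED_STATES GEORGIA;
RUN;

* CDI_SORTED is a hypothetical name for a sorted working copy;
PROC SORT DATA = W2.CDICDC OUT = CDI_SORTED;
BY YEAR;
RUN;

* Schematic box plots mark points beyond 1.5 interquartile ranges;
PROC BOXPLOT DATA = CDI_SORTED;
PLOT UNITED_STATES*YEAR / BOXSTYLE = SCHEMATIC;
PLOT GEORGIA*YEAR / BOXSTYLE = SCHEMATIC;
RUN;
```

A positive SKEWNESS value would agree with the positive skew read from the histograms.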
Table 4 shows the frequency of each category. Cancer dominates the data with 36
of the 50 data points; this mode is over four times the count of the next leading category,
Cardiovascular Disease (8). Figures 5 and 6 reinforce this. It is notable, however, that Cancer has a
broader range of results and is skewed, whereas Cardiovascular Disease has a more even distribution.
A new categorical variable, size, was created for the occurrences in the United States and Georgia.
The occurrences were broken up into bins roughly 150 wide; the SAS code in Appendix III uses cut
points at 145, 300, 450, and 600, labeled X-Small, Small, Medium, Large, and X-Large. The contingency
table in Table 5 shows that Cancer statistics for the United States fall mostly in the X-Small range:
32 of the 36 Cancer data points are below the first cut point.
Occurrences for Georgia differ in that some results fall into the Medium range, and 50% of the
Cardiovascular results are from the X-Small category.
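The same binning can also be expressed as a user-defined format, so that the contingency tables are built without adding a variable to the data. This is an alternative sketch, not the method used in Appendix III; SIZEFMT is a hypothetical format name, and the cut points are copied from the Appendix III code.

```sas
* Sketch only: bin the rates with a format instead of IF statements;
* SIZEFMT is a hypothetical format name and the cut points follow Appendix III;
PROC FORMAT;
VALUE SIZEFMT
LOW -< 145 = 'X-Small'
145 -< 300 = 'Small'
300 -< 450 = 'Medium'
450 -< 600 = 'Large'
600 - HIGH = 'X-Large';
RUN;

* PROC FREQ groups rows by formatted value when a format is attached;
PROC FREQ DATA = W2.CDICDC;
TABLES CATEGORY*UNITED_STATES CATEGORY*GEORGIA;
FORMAT UNITED_STATES GEORGIA SIZEFMT.;
RUN;
```

One difference worth knowing: a numeric range starting at LOW does not include missing values, whereas a test such as UNITED_STATES < 145 is true for a missing value and would label it X-Small.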
The categorical variable is also shown in Figures 7 through 10. They stress again that Cancer is
the leading category by far, at 72% of all data points (Table 4). Figure 7 combines the three smallest
categories into one group, Other. The breakdowns by category for the United States and for Georgia
continue to show that Cancer and Cardiovascular Disease are the categories that warrant
further study.
Figures 11 and 12 again show the breakdown of occurrences by the newly created variable, size.
Each shows that most occurrences for both the United States and Georgia fall into the X-Small
category, at a frequency of nearly 40 of the 50 data points in each (38, or 76%, for the United States
in Table 5). Figures 13 and 14 show the category of
incidence by size on stacked bar charts for the United States and Georgia. Cancer results in the
United States fall mostly in the X-Small category, and Cardiovascular results mostly in the Small category.
The results for Georgia show that X-Small leads in all categories and accounts for the vast majority of the
Cancer category.
Finally, a random sample of 20 data points was produced in SAS. Both the 95% and 99%
confidence intervals captured the population means of all 50 data points. For the United States the
intervals were 38.69 to 200.55 (95%) and 9.00 to 230.24 (99%), where the population mean is 102.23
(Table 3). For Georgia they were 35.69 to 210.84 (95%) and 3.56 to 242.97 (99%), where the
population mean is 100.37.
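As a consistency check, the t-interval for a mean can be written out and the 99% limits recomputed from the 95% limits. The sample means and standard errors below are not reported in the source; they are back-calculated from the Table 7 endpoints, assuming the usual t-based limits with n = 20 (19 degrees of freedom).

```latex
\[
\bar{x} \pm t_{1-\alpha/2,\,n-1}\,\frac{s}{\sqrt{n}},
\qquad n = 20,\quad t_{0.975,\,19} \approx 2.0930,\quad t_{0.995,\,19} \approx 2.8609
\]
% United States: centre and standard error recovered from the 95% limits
\[
\bar{x}_{US} = \frac{38.69 + 200.55}{2} = 119.62,
\qquad
\frac{s_{US}}{\sqrt{20}} \approx \frac{200.55 - 119.62}{2.0930} \approx 38.667
\]
\[
119.62 \pm 2.8609 \times 38.667 \approx (9.00,\ 230.24)
\]
% Georgia: same steps
\[
\bar{x}_{GA} = \frac{35.69 + 210.84}{2} = 123.265,
\qquad
\frac{s_{GA}}{\sqrt{20}} \approx \frac{210.84 - 123.265}{2.0930} \approx 41.842
\]
\[
123.265 \pm 2.8609 \times 41.842 \approx (3.56,\ 242.97)
\]
```

The recomputed 99% limits agree with Table 7. Both recovered sample means sit above the population means in Table 3 (102.23 and 100.37), yet each population mean lies inside both of its intervals.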
APPENDIX I: SAS TABLES AND FIGURES
Table 1: Abbreviated Display of the Center for Disease Control
Chronic Disease Indicators of the United States and Georgia for Year 2005
Obs  CATEGORY                         INDICATOR                                          YEAR  MEASURE            UNITED_STATES  GEORGIA
1    Tobacco and Alcohol              Chronic liver disease - mortality                  2005  Crude Rate         9.3            7.5
2    Tobacco and Alcohol              Chronic liver disease - mortality                  2005  Age-adjusted Rate  8.9            8.1
3    Cancer                           Invasive cancer (all sites combined) - incidence   2005  Crude Rate         469.8          402.6
4    Cancer                           Invasive cancer (all sites combined) - incidence   2005  Age-adjusted Rate  458.4          452.0
5    Cancer                           Cancer (all sites combined) - mortality            2005  Crude Rate         188.6          157.2
. . .
48   Overarching Conditions           Premature mortality among adults aged 45-64 years  2005  Age-adjusted Rate  618.6          711.1
49   Other Diseases and Risk Factors  Asthma - mortality                                 2005  Crude Rate         1.3            1.3
50   Other Diseases and Risk Factors  Asthma - mortality                                 2005  Age-adjusted Rate  1.3            1.5
NOTE: The data for other years were minimal and thus eliminated from this data set (the numeration Obs was added automatically by SAS).
Table 2: Summary of Variables Contained in Center for Disease Control
Chronic Disease Indicators of the United States and Georgia for Year 2005
Variable Name  Label                         General Type  Specific Type        Measurement Units
Obs            Observation number            Categorical   Identifier Variable  N/A
CATEGORY       Disease category              Categorical   Nominal              N/A
INDICATOR      Disease indicator             Categorical   Nominal              N/A
YEAR           Survey year (only 2005 used)  Categorical   Nominal              N/A
MEASURE        Crude or Age adjusted rate    Categorical   Nominal              N/A
UNITED_STATES  -                             Quantitative  Interval/Ratio       Number of instances per 100,000 persons*
GEORGIA        -                             Quantitative  Interval/Ratio       Number of instances per 100,000 persons*
* standardized by the direct method to the year 2000 standard U.S. population
based on single years of age from the Census P25-1130 series estimates
Table 3: Descriptive Statistics of Center for Disease Control
Chronic Disease Indicators of the United States and Georgia for Year 2005
Variable N Mean Median Std Dev Range Minimum Maximum
UNITED_STATES 50 102.23 25.95 153.70 628.60 1.30 629.90
GEORGIA 50 100.37 25.90 163.10 719.70 1.30 721.00
Table 4: Frequency Table of Center for Disease Control
Chronic Disease Indicators by Category
CATEGORY                         Frequency  Percent  Cumulative Frequency  Cumulative Percent
Cancer                           36         72.00    36                    72.00
Cardiovascular Disease           8          16.00    44                    88.00
Other Diseases and Risk Factors  2          4.00     46                    92.00
Overarching Conditions           2          4.00     48                    96.00
Tobacco and Alcohol              2          4.00     50                    100.00
Figure 1: Histogram of Occurrences United States (per 100,000 people)
[SAS histogram: y-axis Percent (0 to 70), x-axis UNITED STATES (0 to 600)]
Figure 2: Histogram of Occurrences Georgia (per 100,000 people)
[SAS histogram: y-axis Percent (0 to 70), x-axis GEORGIA (0 to 720)]
Figure 3: Box Plot of Occurrences United States (year 2005 per 100,000 people)
[SAS box plot: y-axis UNITED STATES (0 to 800), x-axis YEAR (2005)]
Figure 4: Box Plot of Occurrences Georgia (year 2005 per 100,000 people)
[SAS box plot: y-axis GEORGIA (0 to 800), x-axis YEAR (2005)]
Figure 5: Side by Side Box Plot of Occurrences United States (per 100,000 people)
[SAS box plots by category: y-axis UNITED STATES (0 to 800), x-axis CATEGORY (Cancer through Tobacco and Alcohol)]
Figure 6: Side by Side Box Plot of Occurrences Georgia (per 100,000 people)
[SAS box plots by category: y-axis GEORGIA (0 to 800), x-axis CATEGORY (Cancer through Tobacco and Alcohol)]
Table 5: Contingency Table Category of Occurrences by United States Size
CATEGORY by US_SIZE
CATEGORY                         Statistic  Large   Small   X-Large  X-Small  Total
Cancer                           Frequency  2       2       0        32       36
                                 Percent    4.00    4.00    0.00     64.00    72.00
                                 Row Pct    5.56    5.56    0.00     88.89
                                 Col Pct    100.00  25.00   0.00     84.21
Cardiovascular Disease           Frequency  0       6       0        2        8
                                 Percent    0.00    12.00   0.00     4.00     16.00
                                 Row Pct    0.00    75.00   0.00     25.00
                                 Col Pct    0.00    75.00   0.00     5.26
Other Diseases and Risk Factors  Frequency  0       0       0        2        2
                                 Percent    0.00    0.00    0.00     4.00     4.00
                                 Row Pct    0.00    0.00    0.00     100.00
                                 Col Pct    0.00    0.00    0.00     5.26
Overarching Conditions           Frequency  0       0       2        0        2
                                 Percent    0.00    0.00    4.00     0.00     4.00
                                 Row Pct    0.00    0.00    100.00   0.00
                                 Col Pct    0.00    0.00    100.00   0.00
Tobacco and Alcohol              Frequency  0       0       0        2        2
                                 Percent    0.00    0.00    0.00     4.00     4.00
                                 Row Pct    0.00    0.00    0.00     100.00
                                 Col Pct    0.00    0.00    0.00     5.26
Total                            Frequency  2       8       2        38       50
                                 Percent    4.00    16.00   4.00     76.00    100.00
Figure 7: Pie Chart Category of Occurrences (per 100,000 people)
Figure 8: Pie Chart Category of Occurrences United States (per 100,000 people)
Figure 9: Pie Chart Category of Occurrences Georgia (per 100,000 people)
Figure 10: Bar Chart of Category of Occurrences (per 100,000 people)
Figure 11: Bar Chart of Category of Occurrences United States (per 100,000 people)
[SAS bar chart: y-axis FREQUENCY (0 to 40), x-axis US_SIZE (Large, Small, X-Large, X-Small)]
Figure 12: Bar Chart of Category of Occurrences Georgia (per 100,000 people)
[SAS bar chart: y-axis FREQUENCY (0 to 40), x-axis GA_SIZE (Large, Medium, Small, X-Large, X-Small)]
Figure 13: Stacked Bar Chart of Category of Occurrences
United States (per 100,000 people)
Figure 14: Stacked Bar Chart of Category of Occurrences
Georgia (per 100,000 people)
Table 7: 95 and 99% Confidence Intervals for United States and Georgia 20 set Sample
Lower 95% Upper 95%
Variable Label N CL for Mean CL for Mean
UNITED_STATES UNITED STATES 20 38.69 200.55
GEORGIA GEORGIA 20 35.69 210.84
Lower 99% Upper 99%
Variable Label N CL for Mean CL for Mean
UNITED_STATES UNITED STATES 20 9.00 230.24
GEORGIA GEORGIA 20 3.56 242.97
Appendix II: Figures Generated in Minitab
Figure 15: Histogram of UNITED_STATES
[Minitab histogram: y-axis Frequency (0 to 25), x-axis occurrences/100k (0 to 640)]
Figure 16: Histogram of GEORGIA
[Minitab histogram: y-axis Frequency (0 to 30), x-axis occurrences/100k (0 to 700)]
Figure 17: Boxplot of UNITED_STATES
[Minitab box plot: y-axis occurrences/100k (0 to 700)]
Figure 18: Boxplot of GEORGIA
[Minitab box plot: y-axis occurrences/100k (0 to 800)]
Figure 19: Boxplot of UNITED_STATES by CATEGORY
[Minitab box plots: y-axis occurrences/100k (0 to 700), x-axis CATEGORY (Cancer, Cardiovascular Disease, Other Diseases and Risk Factors, Overarching Conditions, Tobacco and Alcohol)]
Figure 20: Boxplot of GEORGIA by CATEGORY
[Minitab box plots: y-axis occurrences/100k (0 to 800), x-axis CATEGORY (Cancer, Cardiovascular Disease, Other Diseases and Risk Factors, Overarching Conditions, Tobacco and Alcohol)]
Figure 21: Pie Chart of CATEGORY
[Minitab pie chart of category counts: Cancer 72.0%, Cardiovascular Disease 16.0%, Other Diseases and Risk Factors 4.0%, Overarching Conditions 4.0%, Tobacco and Alcohol 4.0%]
Figure 22: Pie Chart of CATEGORY for UNITED_STATES
[Minitab pie chart, same five categories in the legend; slice labels shown: 47.5%, 27.6%, 24.5%, 0.3%, 0.0%]
Figure 23: Pie Chart of CATEGORY for GEORGIA
[Minitab pie chart, same five categories in the legend; slice labels shown: 45.3%, 28.6%, 25.7%, 0.3%, 0.0%]
Figure 24: Bar Chart of CATEGORY
[Minitab bar chart: y-axis Count (0 to 40), x-axis CATEGORY (Cancer, Cardiovascular Disease, Other Diseases and Risk Factors, Overarching Conditions, Tobacco and Alcohol)]
Figure 25: Bar Chart of United States Size
[Minitab bar chart: y-axis Count (0 to 40), x-axis US_SIZE (Large, Small, X-Large, X-Small)]
Figure 26: Bar Chart of Georgia Size
[Minitab bar chart: y-axis Count (0 to 40), x-axis GA_SIZE (Large, Medium, Small, X-Large, X-Small)]
Figure 27: Stacked Bar Chart of CATEGORY by United States Size
[Minitab stacked bar chart: y-axis Count (0 to 40), x-axis CATEGORY, legend US_SIZE (X-Small, X-Large, Small, Large)]
Figure 28: Stacked Bar Chart of CATEGORY by Georgia Size
[Minitab stacked bar chart: y-axis Count (0 to 40), x-axis CATEGORY, legend GA_SIZE (X-Small, X-Large, Small, Medium, Large)]
Appendix III: SAS Code
* FULLERTON, STAT 3010.W01, FINAL PROJECT: DATA ANALYSIS OF Center for Disease Control Chronic
Disease Indicators (CDC - CDI) of the United States and Georgia for Year 2005;
* SETTING SYSTEM OPTIONS;
DM 'LOG;CLEAR;OUT;CLEAR;';
OPTIONS LS=100 PS=75 FORMDLIM="=";
QUIT;
* Loading previously saved data set;
DATA NEWCDICDC;
SET 'V:\final.project\CDICDC';
RUN;
* Copying the saved data into a temporary WORK data set;
DATA CDICDC;
SET 'V:\final.project\CDICDC';
RUN;
* To view data in SAS;
PROC PRINT DATA = CDICDC;
RUN;
* SETTING LIBREF;
* Saving data as a permanent SAS data set;
LIBNAME W2 'V:\final.project';
DATA W2.CDICDC;
SET CDICDC;
RUN;
* IMPORT CDC - CDI DATA;
PROC IMPORT
DATAFILE = 'V:\final.project\FilChrDisIndCDC.xls'
OUT = T1
REPLACE;
RUN; QUIT;
* Variable View in SAS;
PROC CONTENTS DATA = W2.CDICDC;
RUN;
* Table 1 Dataset;
ODS RTF;
PROC PRINT DATA = W2.CDICDC;
VAR CATEGORY INDICATOR YEAR MEASURE UNITED_STATES GEORGIA;
RUN;
ODS RTF CLOSE;
* Descriptive Statistics for Quantitative Variables;
ODS RTF;
PROC MEANS DATA = W2.CDICDC MAXDEC=2 N MEAN MEDIAN STD RANGE MIN MAX;
VAR UNITED_STATES GEORGIA;
RUN;
ODS RTF CLOSE;
* Frequency Tables of Category Variables;
ODS RTF;
PROC FREQ DATA = W2.CDICDC;
TABLES CATEGORY INDICATOR MEASURE;
RUN;
ODS RTF CLOSE;
* Histograms and Boxplots;
DM 'LOG; CLEAR; OUT; CLEAR;';
PROC UNIVARIATE DATA = W2.CDICDC;
VAR UNITED_STATES GEORGIA;
HISTOGRAM;
RUN;
PROC SORT DATA = W2.CDICDC;
BY YEAR;
PROC BOXPLOT DATA = W2.CDICDC;
PLOT UNITED_STATES*YEAR;
PLOT GEORGIA*YEAR;
RUN;
* Boxplot of Occurrences by Category;
DM 'LOG; CLEAR; OUT; CLEAR; GRAPH; CLEAR';
PROC SORT DATA = W2.CDICDC;
BY CATEGORY;
PROC BOXPLOT DATA = W2.CDICDC;
PLOT UNITED_STATES*CATEGORY;
PLOT GEORGIA*CATEGORY;
RUN;
* Creating new variable (size) for contingency table analysis;
DM 'LOG;CLEAR;OUT;CLEAR';
DATA T1;
* Read each row once and assign both size labels in the same pass;
SET T1;
LENGTH US_SIZE $ 7 GA_SIZE $ 7;
* ELSE IF stops testing as soon as one range matches;
IF UNITED_STATES < 145 THEN US_SIZE = 'X-Small';
ELSE IF UNITED_STATES < 300 THEN US_SIZE = 'Small';
ELSE IF UNITED_STATES < 450 THEN US_SIZE = 'Medium';
ELSE IF UNITED_STATES < 600 THEN US_SIZE = 'Large';
ELSE US_SIZE = 'X-Large';
IF GEORGIA < 145 THEN GA_SIZE = 'X-Small';
ELSE IF GEORGIA < 300 THEN GA_SIZE = 'Small';
ELSE IF GEORGIA < 450 THEN GA_SIZE = 'Medium';
ELSE IF GEORGIA < 600 THEN GA_SIZE = 'Large';
ELSE GA_SIZE = 'X-Large';
RUN;
PROC PRINT DATA = T1;
RUN;
* Contingency Tables;
DM 'LOG;CLEAR;OUT;CLEAR';
ODS RTF;
PROC FREQ DATA = T1;
TABLES CATEGORY*US_SIZE;
RUN;
ODS RTF CLOSE;
ODS RTF;
PROC FREQ DATA = T1;
TABLES CATEGORY*GA_SIZE;
RUN;
ODS RTF CLOSE;
* Pie Charts;
PROC GCHART DATA = W2.CDICDC;
PIE CATEGORY;
GOPTIONS HTEXT = 1;
LEGEND;
RUN;
QUIT;
PROC GCHART DATA = W2.CDICDC;
PIE CATEGORY / SUMVAR = UNITED_STATES PERCENT = INSIDE;
GOPTIONS HTEXT = 1;
LEGEND;
RUN;
QUIT;
PROC GCHART DATA = W2.CDICDC;
PIE CATEGORY / SUMVAR = GEORGIA PERCENT = INSIDE;
GOPTIONS HTEXT = 1;
LEGEND;
RUN;
QUIT;
* Bar Charts;
PROC GCHART DATA = W2.CDICDC;
VBAR CATEGORY / TYPE = FREQ;
GOPTIONS HTEXT = 1;
LEGEND;
RUN;
PROC GCHART DATA = T1;
VBAR US_SIZE / TYPE = FREQ;
GOPTIONS HTEXT = 1;
LEGEND;
RUN;
PROC GCHART DATA = T1;
VBAR GA_SIZE / TYPE = FREQ;
GOPTIONS HTEXT = 1;
LEGEND;
RUN;
* Stacked Bar Charts;
PROC GCHART DATA = T1;
VBAR CATEGORY / SUBGROUP = US_SIZE;
GOPTIONS HTEXT = 1;
LEGEND;
RUN;
PROC GCHART DATA = T1;
VBAR CATEGORY / SUBGROUP = GA_SIZE;
GOPTIONS HTEXT = 1;
LEGEND;
RUN;
* Generate Random sample set of data with seed to replicate data;
DATA CDICDCN;
SET W2.CDICDC;
GROUP = RANUNI(123456);
PROC PRINT DATA = CDICDCN;
RUN;
* Sort random data to show only the first 20 observations;
PROC SORT DATA = CDICDCN;
BY GROUP;
DATA CDICDCNN;
SET CDICDCN;
IF _n_ < 21;
PROC PRINT DATA = CDICDCNN;
RUN;
* Confidence Intervals on ratio scale variables;
DM 'LOG;CLEAR;OUT;CLEAR;';
ODS RTF;
PROC MEANS DATA = CDICDCNN MAXDEC=2 N CLM ALPHA = .05;
VAR UNITED_STATES GEORGIA;
RUN;
PROC MEANS DATA = CDICDCNN MAXDEC=2 N CLM ALPHA = .01;
VAR UNITED_STATES GEORGIA;
RUN;
ODS RTF CLOSE;
* Export Data to Minitab;
PROC EXPORT
OUTFILE = 'V:\final.project\FilChrDisIndCDC.csv'
DATA = W2.CDICDC
REPLACE;
RUN;
PROC EXPORT
OUTFILE = 'V:\final.project\FilChrDisIndCDCT1.csv'
DATA = T1
REPLACE;
RUN;
QUIT;