Friday, May 17, 2013

How do I account for effects of complex sampling design (design effects) when analyzing DAWN data?

The DAWN (Drug Abuse Warning Network) employs a two-stage (stratified cluster) sample design for the selection of hospital emergency department (ED) visits caused or contributed to by drugs. The DAWN public-use file (PUF) includes the following complex design variables: variance estimation stratum (STRATA), primary sampling unit (PSU), secondary sampling unit (REPLICATE), PSU frame size (PSUFRAME), and analysis case weight (CASEWGT). The DAWN PUF has no missing values in the design variables STRATA, PSU, PSUFRAME, REPLICATE, and CASEWGT; however, analysis variables can have missing values.

The default method for estimating standard errors is Taylor series linearization for SAS, SPSS, SUDAAN, Stata, and R software, but SAS can only account the variance contribution from the first-stage. SAS is not currently able to fully and properly account for the DAWN sampling design.

Example code/syntax specific to each statistical software package given below uses the DAWN 2010 PUF. The examples estimate the proportions (or means) of alcohol-related ED visits along with the standard errors (SE) and confidence intervals by racial group. In short, the statistical analysis plan (SAP) is to obtain the mean and its SE along with confidence intervals and other related statistics. Note that the variable ALCOHOL is coded as 0 and 1; thus, the mean of ALCOHOL is namely the proportion of ALCOHOL=1.

Users should read the help document (of her/his respective statistical package) regarding how missing values are being handled if any exist in the analysis variables.

The following SUDAAN stand-alone code uses all of the design variables provided in the PUF for appropriate calculation of the variance of means/proportions.

Proc descript design=WOR filetype=sasxport data="folder-path\dawn2010.xpt" notsorted;

nest STRATA PSU REPLICATE / MISSUNIT;
totcnt PSUFRAME _minus1_ _zero_;
weight CASEWGT;
class RACE;
table RACE;
var ALCOHOL;
print mean semean lowmean upmean
/style=nchs meanfmt=f10.8 semeanfmt=f10.8 lowmeanfmt=f10.8 upmeanfmt=f10.8;

The SPSS specific code for the same analysis is:

Get file='path\dawn2010puf.sav'.
CSPLAN ANALYSIS
/PLAN FILE='folder-path\dawn_stage2.csplan'
/PLANVARS ANALYSISWEIGHT= casewgt
/DESIGN STRATA= strata CLUSTER= psu
/ESTIMATOR TYPE = EQUAL_WOR
/POPSIZE VARIABLE=psuframe
/DESIGN CLUSTER=REPLICATE
/ESTIMATOR TYPE = WR.

CSDESCRIPTIVES
/PLAN FILE= 'folder-path\dawn_stage2.csplan'
/SUMMARY VARIABLES= alcohol
/SUBPOP TABLE= race
/MEAN
/STATISTICS SE CIN(95)
/MISSING SCOPE=ANALYSIS CLASSMISSING=EXCLUDE.

Stata specific code for the same analysis is (note that Stata commands are case sensitive):

use drivename:\path\statadatafilename.dta

gen STRATA2 = PSU  # svydescribe requires distinct naming of design variables in different stages
svyset PSU[pweight=CASEWGT], strata(STRATA) fpc(PSUFRAME) || REPLICATE, strata( STRATA2)
singleunit(centered)       # note: this line wraps from previous line
save drivename:\path\mystatadata.dta, replace

svy: mean ALCOHOL, over(RACE)
estat strata
estat effects, deff

The SAS code for this analysis, disregarding the second stage variance, is:
(with replacement at the first stage of sampling; ignoring the finite population correction)

proc surveymeans data=libname.dataSetName mean clm nomcar;
strata strata;
cluster psu;
weight casewgt;

domain race;
var alcohol;
run;

R specific code for the same analysis:

load("input-path-of-folder/34083-0001-Data.rda")  #ICPSR DAWN PUF 2010 (R data)
#da31921.001.rda is created in R console from this load command
dawn9 = da31921.0001 
rm(da31921.0001)    # to free up RAM, remove the full r data frame
keepvars = c("STRATA", "PSU", "REPLICATE", "PSUFRAME", "CASEWGT", "RACE" , "ALCOHOL")
mydat = dawn9[, keepvars]           #data with a subset variables

# ALCOHOL and RACE are factor variables; we need to make them numeric and removing value labels
library(prettyR)   # this command requests for the prettyR package to be installed
mydat$ALCOHOL <- as.numeric(sub("^\\(0*([0-9]+)\\).+$", "\\1", mydat$ALCOHOL))
mydat$RACE <- as.numeric(sub("^\\(0*([0-9]+)\\).+$", "\\1", mydat$RACE))
# to prepare the fpc variable for the 1st stage of the 2-stage variance estimation design

strpsu = unique(mydat[,c("STRATA","PSU")])
strpsu$one = 1
strpsu=aggregate(strpsu$one,by=list(strpsu$STRATA), FUN=sum, na.rm=TRUE)
library(reshape)   #to rename the function
strsample = rename(strpsu, c(Group.1="STRATA", x="PSUsample"))
strframe = unique(mydat[,c("STRATA","PSUFRAME")])
strframe = strframe[order(strframe$STRATA),]
str = merge(strframe, strsample, by=c("STRATA"))
str = transform( str, n.over.N = PSUsample / PSUFRAME )
str <- subset(str, select=-c(PSUFRAME))

mydat = merge(mydat, str, by=c("STRATA"))
rm(str, strframe, strsample, strpsu)
# survey design for the Taylor-series linearization method
library(survey)  # this command requests for the survey package to be installed
options(survey.lonely.psu = "adjust" )
# create a survey design object (desg) with the DAWN design information
mydat$zero=0    # for 2nd stage fpc
desg <- svydesign(id =~PSU + REPLICATE , strata =~STRATA + PSU, fpc =~n.over.N + zero, weights =~CASEWGT, data = mydat ,  nest = TRUE )
# to calculate the means or proportions of ALCOHOL by RACE:
out = svyby(~ALCOHOL, ~RACE, design = desg , FUN=svymean, vartype=c("se","ci"), na.rm = TRUE)
print(out)   # confidence intervals for 95%

Note that the design variables (except the sampling weight variable) do not affect first-order moment statistics (such as mean, proportion, percent, totals and weighted counts). However, the design variables must be used to produce the SE estimates of descriptive and inferential statistics.

The target population of hospitals was divided into disjoint strata (denoted as STRATA) such that every hospital in the population belongs to exactly one stratum. Hospitals are the PSUs. In the first stage of sampling, within each stratum, hospitals were selected by the simple random sampling (SRS) without replacement for finite populations. Stratum size (denoted by PSUFRAME) is the population count of hospitals (i.e., PSUs) contained in the stratum. Depending upon the year and the size of the hospital, either all REPLICATEs were selected, which resulted in no contribution to the second stage variance for that hospital, or second-stage sampling at rates as small as 1/3 was performed using systematic random sampling. This resulted in sample files that averaged about 300,000 visits per year. Tabulation and analysis of these large sample files when many variables were involved proved to be computationally very expensive due to the amount of time necessary to calculate exact estimates of the second stage variance. In multi-stage sampling, the greatest portion of the variability occurs in the first stage; thus, the second-stage contribution is usually a small fraction of the first-stage variance. Realizing that unacceptable amounts of time were required to calculate the estimates of a small fraction of total variance, an approximation method was adopted that made variance estimation for large tables with many variables practical. The sampled visits within each hospital were sorted into two groups, and the between group variance was calculated and substituted for the standard second-stage variance. DAWN research simulations showed that the above approximation is sufficient for DAWN precision and efficient for DAWN standard error calculation. Note that STRATA is the first stage and PSU is the second stage stratification variable in the sample. Proper use of the REPLICATE variable should account for within-hospital variation, i.e., the second stage variance. The final analysis weight variable, CASEWGT, was adjusted for unequal probability of selection, nonresponse of hospitals, and coverage bias of ED visits to the benchmark population totals of ED visits from American Hospital Association (AHA) database.

CASEWGT is a poststratification adjusted analysis weight variable. By using CASEWGT, the weighted estimates of counts or totals obtained for any of the survey variables are the estimates for the entire universe of DAWN-eligible hospitals in the United States. Poststratification adjustments were implemented within the design strata to offset the coverage bias on the estimates. That means post-strata were not constructed from any of the ED visit characteristic variables. Therefore, the software packages correctly take into account the design effects on the variance estimates of estimated counts or totals for variables related to DAWN visits. For further details on sample design and weighting adjustment, please see the DAWN Methodology Report.

For additional technical information on options and how the various statistical software handle issues with missing values and degrees of freedom, please see the FAQ: Technical issues of sampling design analysis of DAWN data.