National Survey on Drug Use and Health (NSDUH)1 employs a multistage (stratified cluster) sample design for the selection of a representative sample from non-institutional members of United States households aged twelve and older. The NSDUH public-use file (PUF) includes the variance estimation variables (which were derived from the complex sample designs2): variance estimation stratum (VESTR), variance estimation cluster replicates (VEREP) and final analysis weight (ANALWT_C). VEREP is nested within the VESTR. It is therefore considered that the complex survey method for NSDUH PUF is a single-stage stratified clustering design, where the clusters are sampled with replacement (WR). There are no missing values in the variance estimation variables and final analysis weight, VESTR, VEREP and ANALWT_C. However, analysis variables can have missing values.
SUDAAN, all survey procedures in SAS, Stata, R and the survey add-on module in SPSS can handle data from complex sampling designs. The WR design is the default design, except in SPSS and Taylor series linearization is also the default method for variance estimation of them. Note that users should read the help document (of her/his respective statistical package) regarding how missing values are being handled if any exist in the analysis variables.
SAMPLE SYNTAX
Using analysis weights is important to get the point estimates right. Users must consider the weighting, clustering, and stratification of the survey design to produce correct standard errors (and degrees of freedom). The example code provided below shows how to specify these variables correctly, using an individual year of the NSDUH PUF, and also indicates how to calculate the proportions, standard errors (SE), and confidence intervals of the risk of smoking one or more packs of cigarettes per day by gender. This statistical analysis plan (SAP), in turn, results in two subpopulation analyses of proportions for each level of gender. The dependent or outcome variable is the risk of smoking one or more packs of cigarettes per day and is determined using the categorical variable, RSKPKCIG. Gender is determined using the categorical variable, IRSEX. Both of the variables in the NSDUH PUF file are numeric in downloadable SAS, Stata, SPSS, and R specific datasets. RSKPKCIG is coded numeric as 1 to 4 for no risk, slight risk, moderate risk, and great risk for valid values and as system missing for invalid values. IRSEX is imputation revised gender for missing values and is coded numeric as 1 for male and 2 for female.
For analysis of the NSDUH PUF file, one should consider three important things before preparing program code in a statistical software package:
- How to correctly specify the variance estimation variables including analysis weights;
- The statistical procedure along with requested statistics; and
- The domains of analysis, if any.
Each of these three considerations is discussed below. [Note the following conventions for wording in program syntax: upper case codes are statements/procedures, upper case italics are option keywords in software packages, and upper case bolded codes are variables from the input dataset.]
Specify variance estimation variables. For variance estimation, each sample program code uses the Taylor linearization method for this example SAP. The WR design method is the default with the Taylor method for all but SPSS software packages. The stratification and clustering of the complex sample design in the NSDUH PUF are described by specifying the variance estimation variables (and also the analysis weight) via the statements in the analysis procedure program code for SUDAAN and SAS software packages. For example, "NEST VESTR VEREP /MISSUNIT; WEIGHT ANALWT_C;" in SUDAAN and "STRATA VESTR; CLUSTER VEREP; WEIGHT ANALWT_C;"in SAS. Note that the order of the variables in the NEST statement is important. One should use the above block of statements (specific to NSDUH complex design) in any survey procedure program code for any SAP. For example, PROC SURVEYLOGISTIC in SAS and PROC LOGISTIC in standalone SUDAAN.
The above can be implemented by the SVYSET command in Stata as "SVYSET VEREP[pweight =ANALWT_C], STRATA(VESTR) SINGLEUNIT(centered)". SPSS requires an analysis plan design file for the data file to perform a complex survey analysis. The first block of code in the SPSS program syntax, below, for the CSPLAN ANALYSIS procedure will create such an analysis plan file.
See the "How are complex sampling variances computed by Taylor linearization method?" FAQ on the use of MISSUNIT option in SUDAAN, SINGLEUNIT(centered) option in Stata, and the NOMCAR procedure option in SAS.
Statistical procedures and requested statistics. In the SAP, RSKPKCIG is the analysis variable with four valid and missing values and IRSEX will be used to define the domains or subpopulations. [In general, the goal is to get results in proportions (not in percentages) of a (outcome or dependent) variable for multiple subpopulations.] Stata's SVY: PROPORTION produces estimates of proportions, along with SEs, for character or numeric variables. SAS's SURVEYMEANS always analyzes character variables as categorical. So, RSKPKCIG can be declared as a character variable by specifying it in both the CLASS and VAR statements to obtain the estimates of the proportions, along with SEs. R’s svyby procedure with the factor (variable) function and FUN=svymean argument produces the mean, SEs and confidence intervals, in which the factor function converts a variable to a set of dummy variables, while SUDAAN's DESCRIPT procedure and SPSS's CSDESCRIPTIVES method compute means, along with SEs, for only continuous variables. The mean estimate of a 0/1 coding dummy variable is essentially a proportion estimate. Therefore, the four dummy variables (RSK1, RSK2, RSK3, and RSK4) that were created for the RSKPKCIG variable also contain missing values. SUDAAN’s DESCRIP and SPSS’s CSDESCRIPTIVES procedures can be used for our desired analysis by obtaining means (i.e., proportions), along with SEs, for the four dummy variables of RSKPKCIG. [Note that using SVY: MEAN in Stata and SURVEYMEANS (no CLASS statement) in SAS, the same analysis of means can be obtained for a list of indicator variables of a categorical variable.]
Domains of Analysis. The CLASS and TABLES statements in SUDAAN, OVER command in Stata, DOMAIN statement in SAS and SUBPOP TABLE statement in SPSS specify the multiple subpopulations/domains for a variable (e.g., two for IRSEX) to which analyses are to be performed. For example, the following SAS program code with "DOMAIN IRSEX;" statement will output the analysis results for the variables in VAR statement for two domains; one for the male (IRSEX=1) population and the other for the female (IRSEX=2) population.
SUDAAN standalone syntax:
[The input data file that is used for this example is in the SAS Export format. SUDAAN recommends that an input dataset is sorted by the variables in the NEST statement; otherwise, NOTSORTED option must be specified in the PROC statement.]
PROC DESCRIPT filetype=SASXPORT data="path\nsduh2011.xpt" NOTSORTED;
NEST VESTR VEREP / MISSUNIT;
WEIGHT ANALWT_C;
CLASS IRSEX;
TABLES IRSEX;
VAR RSK1 RSK2 RSK3 RSK4;
PRINT mean semean lowmean upmean
/style=nchs meanfmt=f6.3 semeanfmt=f6.3 lowmeanfmt=f7.3 upmeanfmt=f7.3;
Stata specific code for the same analysis:
use drivename:\path\nsduh2011.dta
svyset VEREP[pweight=ANALWT_C], strata(VESTR) singleunit(centered)
save drivename:\path\nsduh2011svy.dta, replace
[It is a good practice to save the survey setting permanently in the data file. This allows for this saved data to be used for any subsequent survey analysis.]
use drivename:\path\nsduh2011svy.dta
SVY: PROP RSKPKCIG, OVER(IRSEX)
SAS code for this analysis:
PROC SURVEYMEANS data=sasdata NOMCAR mean clm;
STRATA VESRT;
CLUSTER VEREP;
WEIGHT ANALWT_C;
DOMAIN IRSEX / DFADJ;
CLASS RSKPKCIG;
VAR RSKPKCIG;
run;
The SPSS specific code for the same analysis:
[The example code assumes that the user's SPSS software package has the complex survey module installed. Select the first block of code below, copy, and then paste it into the SPSS Syntax Editor. Replace the ‘folder-path' by providing the path location where you would like to save the nsduh11.csplan xml file. Select and run this modified syntax code in the Syntax Editor. CSPLAN will write this Analysis Design into the nsduh11.csplan xml file. The second block of code shows how to reference the nsduh11.csplan xml file in the CSDESCRIPTIVES and other Complex Survey procedures in the current and future SPSS sessions.]
CSPLAN ANALYSIS
/PLAN FILE='folder-path\nsduh11.csplan'
/PLANVARS ANALYSISWEIGHT= ANALWT_C
/DESIGN STRATA= VESTR CLUSTER= VEREP
/ESTIMATOR TYPE = WR.
get file='path\nsduh2011.sav'.
CSDESCRIPTIVES
/PLAN FILE= 'folder-path\nsduh11.csplan'
/SUMMARY VARIABLES= RSK1 RSK2 RSK3 RSK4
/SUBPOP TABLE= IRSEX DISPLAY=LAYERED
/MEAN
/STATISTICS se cin(95)
/MISSING SCOPE=ANALYSIS CLASSMISSING=EXCLUDE.
R specific sample code for the same analysis:
load("folder-path/nsduh2011.rda")
keepvars = c("VESTR", "VEREP", "ANALWT_C", "IRSEX", "RSKPKCIG" )
nsduh11 = nsduh2011[, keepvars] #make a data file with fewer variables
library(prettyR) # requires to install the prettyR package
nsduh11$RSKPKCIG<- as.numeric(sub("^\\(0*([0-9]+)\\).+$", "\\1", nsduh11$RSKPKCIG))
nsduh11$IRSEX <- as.numeric(sub("^\\(0*([0-9]+)\\).+$", "\\1", nsduh11$IRSEX))
library(survey) #needs to install the survey package
options( survey.lonely.psu = "adjust" )
desg <- svydesign(id = ~VEREP , strata = ~VESTR , weights = ~ANALWT_C , data = nsduh11 , nest = TRUE )
# calculate the means or proportions of RSKPKCIG by the levels of IRSEX and their SEs
out = svyby(~factor(RSKPKCIG), ~IRSEX, design = desg , FUN=svymean, vartype=c("se","ci"), na.rm = TRUE)
coef(out) #extracting the means
SE(out) #extracting the SEs of means
confint(out) # 95% confidence intervals of mean
print(out) #all results
Note that the variance estimation variables (except the sampling weight variable) do not affect the mean, proportion, percent, and other first-order moment statistics. For example, in SUDAAN syntax code, the design=WR option and entire "nest VESRT VEREP;" statement in PROC DESCRIPT have no impact on mean/proportion estimates. However, the variance estimation variables must be used (e.g., design=option and nest statement as above) to produce the SE estimates of descriptive (e.g., SE of mean/proportion) and inferential statistics (e.g., confidence intervals of mean/proportion and p-value of testing hypothesis).
1Prior to 2002, data were collected under the old title - National Household Survey on Drug Abuse (NHSDA)
2For further details on the sampling design and weighting adjustment method, please see the 2011 NSDUH Methodological Resource Book