Tuesday, March 25, 2014

How do I combine NSDUH public-use file (PUF) data for analysis?

Because of the 2002 National Survey on Drug Use and Health (NSDUH) methodology changes, the 2002 data constitute a new baseline for tracking trends in substance use and other measures. As noted in the 2002 to 2013 codebooks, it is not considered appropriate to compare the 2002 to 2013 estimates with estimates from the 2001 NSDUH and the earlier NHSDA (National Household Survey on Drug Abuse) to assess trends in substance use. Although the 1999 through 2004 data are part of the same sample design, beginning with the 2002 survey respondents were given a $30 incentive payment for participation, which increased response rates for several consecutive surveys.

Statistical disclosure limitation methods were implemented on the original data file in such a way that the NSDUH PUF continues to be representative of the civilian, noninstitutionalized population of the United States. Disclosure limitation methods include micro agglomeration, optimal probabilistic substitution, optimal probabilistic subsampling, and optimal sampling weight calibration. Further, the variance estimation variables (VESTR and VEREP) were treated by coarsening, substitution, and scrambling. For the purpose of variance calculation, the sample design for the NSDUH PUFs is a stratified single-stage cluster sample design with replacement sampling.

The 2002 through 2004 NSDUH PUFs are part of one sample design, while the 2005 through 2013 PUFs are part of another. For the 2005 through 2013 surveys, adjacent survey years had 50% overlapping samples. VESTR (variance estimation stratum) is coded from 20001 to 20060 for years 2002 through 2004 in the NSDUH PUF datasets, and from 30001 to 30060 for years 2005 through 2013. VEREP (variance estimation cluster replicate) is coded as 1 and 2. The degrees of freedom (df) are 60 for national estimates from each individual survey1. When combining any years of data from 2005 through 2013, the df remain the same as for a single year (e.g., 60 for national estimates), since these years are part of the same sample design. Such a combined file can be used to obtain the standard error (SE) of estimates for individual years and/or the SE of difference estimates (e.g., contrasts of means) for comparisons between adjacent years. The df likewise remain 60 when combining any years of data from 2002 through 2004. However, when combining years of data from the two different sample designs (i.e., at least one year from 2002 through 2004 and at least one from 2005 through 2013), the df become 120 (the sum of the df for the two designs). For individual-year [inferential] estimates from such a combined file containing data from multiple years with different sample designs, users must specify the customizable degrees-of-freedom option to override the default. Alternatively, users can subset the data for a year within a procedure/method run using an appropriate statement so that the complex design is retained for the desired analysis.
When comparing estimates in two domains with different df (e.g., testing equality of the proportions of past month alcohol use for two individual survey years with different sample designs) in combined data, err on the conservative side and use the smaller degrees of freedom (see page A-2 in the 2012 NSDUH Statistical Inference Report). Note that the covariance between the estimates (e.g., proportions) in such a comparison is zero because the two designs are distinct.
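As a concrete sketch of the guidance above (toy numbers, not NSDUH estimates): with two proportion estimates from distinct designs, the covariance term is zero, so the SE of the difference is the square root of the sum of the squared SEs, and the conservative (smaller) df supplies the critical value.

```python
import math

# hypothetical proportions and SEs from two surveys with different designs
p1, se1 = 0.512, 0.006
p2, se2 = 0.498, 0.007

# distinct designs => covariance is zero, so SEs add in quadrature
se_diff = math.sqrt(se1**2 + se2**2)
t_stat = (p1 - p2) / se_diff

# conservative choice: the smaller df of the two designs (here both are 60);
# 2.000 is the two-sided 95% critical value of Student's t with df = 60
t_crit = 2.000
print(round(t_stat, 3), abs(t_stat) > t_crit)  # → 1.519 False
```

With these invented numbers the difference is not significant at the 5% level; had the combined-file default df of 120 been used, the critical value would be slightly smaller, illustrating why the smaller df is the conservative choice.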

Analysts can obtain all ratio-type estimates (including their standard errors, confidence intervals, p-values, etc.) from an analysis run of combined data. Note that the sums/totals in the cells and/or margins of the output from such a run are not always the intended estimates. If the analyst is interested in an annual estimate of a population total in addition to ratio-type estimates, the weight should be divided by the number of years that were pooled. Users should be careful in reporting and interpreting results when using the survey year variable in an analysis of pooled data with adjusted weights.
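For instance (a toy sketch; the data values and the "smoker" indicator are invented, and only ANALWT_C is an actual NSDUH weight name), dividing the pooled weight by the number of years rescales totals to an annual average while leaving ratio-type estimates such as proportions unchanged:

```python
# toy pooled records: (year, ANALWT_C, smoker 0/1); values are invented
records = [
    (2011, 1500.0, 1), (2011, 2500.0, 0),
    (2012, 1800.0, 1), (2012, 2200.0, 1),
    (2013, 1600.0, 0), (2013, 2400.0, 1),
]
k = len({year for year, _, _ in records})          # number of pooled years

# adjusted weight for estimating an (annual-average) population total
adj = [(year, w / k, y) for year, w, y in records]

total = sum(w * y for _, w, y in adj)              # annual-average total
prop_pooled = (sum(w * y for _, w, y in records) /
               sum(w for _, w, _ in records))      # ratio-type estimate
prop_adj = (sum(w * y for _, w, y in adj) /
            sum(w for _, w, _ in adj))             # unchanged by the rescaling
print(round(total, 1), abs(prop_pooled - prop_adj) < 1e-12)  # → 2633.3 True
```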



1 See Appendix A in the 2012 NSDUH Statistical Inference Report.

How are complex sampling variances computed by Taylor linearization method?

The NSDUH public-use file consists of data from a single-stage, with-replacement (WR), stratified cluster sample design. The variance estimation stratum and cluster replicate variables are VESTR and VEREP, respectively. The cluster replicates are widely known as the primary sampling units (PSUs). The calculation of variance by the Taylor linearization method and the calculation of the degrees of freedom1 for the Student's t test statistic are discussed briefly below.

A closed-form/analytic variance formula cannot be derived for nonlinear statistics such as means and proportions. The Taylor method derives a linearized variate for the nonlinear statistic of interest. It can be shown that the variance of the linearized variate of a statistic is theoretically equal to the variance of that statistic. Variance estimation by the Taylor method is therefore simply the closed-form variance of the linearized variate of the statistic. A total is a linear statistic, while a mean is nonlinear, as it is the ratio of two linear statistics. The linearized variable for the total of an analysis variable is the analysis variable itself, so the Taylor series linearization variance formula for a total and the closed-form variance formula for a total are exactly the same. SAS, SPSS, SUDAAN, Stata, and R all calculate the variance of a statistic from the deviations of the PSU totals of the linearized variate about the mean of all PSU totals of the linearized variate.
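The PSU-total computation just described can be sketched in a few lines. This is an illustrative toy example in Python, not output from any of the packages named above; the data values and design counts are invented. The linearized variate for a weighted mean is z_i = w_i(y_i - ȳ)/W, and the stratified WR variance sums, within each stratum, the squared deviations of PSU totals of z about their stratum mean, scaled by n_h/(n_h - 1):

```python
# Taylor-linearization variance of a weighted mean under a stratified,
# single-stage, with-replacement cluster design (toy data)
from collections import defaultdict

# (stratum, psu, weight, y) records: 2 strata x 2 PSUs, a few cases each
data = [
    (1, 1, 10.0, 1), (1, 1, 12.0, 0), (1, 2, 11.0, 1), (1, 2, 9.0, 1),
    (2, 1, 8.0, 0),  (2, 1, 10.0, 1), (2, 2, 12.0, 0), (2, 2, 10.0, 0),
]
W = sum(w for _, _, w, _ in data)
ybar = sum(w * y for _, _, w, y in data) / W     # weighted mean (a ratio)

# linearized variate for the mean: z_i = w_i * (y_i - ybar) / W;
# the variance of the mean equals the closed-form WR variance of total(z)
psu_totals = defaultdict(float)                  # (stratum, psu) -> sum of z_i
for h, j, w, y in data:
    psu_totals[(h, j)] += w * (y - ybar) / W

var = 0.0
for h in sorted({s for s, _ in psu_totals}):
    zh = [t for (s, _), t in psu_totals.items() if s == h]
    n_h = len(zh)
    zbar = sum(zh) / n_h                         # stratum mean of PSU totals
    var += n_h / (n_h - 1) * sum((z - zbar) ** 2 for z in zh)

se = var ** 0.5
print(round(ybar, 4), round(se, 4))
```

For a total, z_i reduces to w_i y_i, which is why the Taylor and closed-form variance formulas coincide for totals, as noted above.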

There are no single-PSU (i.e., singleton) strata in the NSDUH data, but certain analyses may encounter singleton strata while calculating the variance for a domain or subclass/subgroup/subpopulation. A stratum is called a singleton stratum when only one of its PSUs has at least one valid observation and the other PSU has none. PSUs with no observations are handled in different ways by the different software packages when calculating the variance and the degrees of freedom. The MISSUNIT option in SUDAAN, the SINGLEUNIT(centered) option in Stata, and options(survey.lonely.psu="adjust") in R handle such cases by calculating the variance contribution for those singleton strata from the deviation of the PSU total of the linearized variate about the grand mean of the sample for a particular analysis2. By default, SPSS handles this situation by assuming that there was at least one other PSU in the sample (if not, the stratum contributes zero variance); a PSU with no observations (termed a sampling zero) has a PSU total of zero and thus still contributes to the stratum variance. Moreover, an analysis can also encounter strata with no observations at all (empty strata). Users may experience such a situation in domain/subgroup analysis.
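The "centered" adjustment those options apply can be sketched as follows (a toy example with invented PSU totals, not output from SUDAAN, Stata, or R): a singleton stratum contributes the squared deviation of its lone PSU total from the grand mean of all valid PSU totals, instead of being dropped.

```python
# (stratum, psu) -> PSU total of the linearized variate (invented values)
psu_totals = {
    (1, 1): 0.12, (1, 2): -0.05,
    (2, 1): 0.03,                   # stratum 2 is a singleton in this domain
}
grand_mean = sum(psu_totals.values()) / len(psu_totals)

var = 0.0
for h in {s for s, _ in psu_totals}:
    zh = [t for (s, _), t in psu_totals.items() if s == h]
    if len(zh) == 1:
        var += (zh[0] - grand_mean) ** 2        # centered adjustment
    else:
        zbar = sum(zh) / len(zh)
        var += len(zh) / (len(zh) - 1) * sum((z - zbar) ** 2 for z in zh)
print(round(var, 6))
```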

The question is how the variance estimation and degrees-of-freedom computations handle this situation in the different software packages. SUDAAN assumes that the empty strata were part of the sample selection and treats their missing units as sampling zeros; thus each empty stratum contributes zero variance to the overall variance but, in turn, still contributes to the degrees of freedom. Stata does the same by default; with certain options, however, for instance singleunit(centered), Stata departs from this assumption of sampling zeros and treats such empty strata as structural zeros. The logic is that when a stratum has no cases at all, that stratum is assumed not to be part of the sampling for the domain and therefore contributes nothing to the overall variance. In Stata, the degrees of freedom determined with the singleunit(centered) option are smaller than those obtained by the default method for the domains where the two methods differ. In effect, the default method inflates the degrees of freedom for the variance of a subgroup estimate when empty strata are present, because the sampling zeros in empty strata contribute zero to the variance yet increase the degrees of freedom. This in turn yields a smaller p-value from the reference distribution for a hypothesis test (although the observed value of the test statistic is unaltered) and a narrower confidence interval for the parameter.

The calculation of the degrees of freedom (df) is crucial for all of these software packages and influences the calculation of inferential statistics such as confidence intervals and the p-values of test statistics. Conventionally, the df are calculated by the fixed-PSU method: the 'fixed' df are the number of PSUs minus the number of strata at the first stage of the sample design, regardless of the number of sampling stages (i.e., computed from the full data file). The SPSS and R software packages always use this fixed-PSU method for calculating the df in all aspects of analysis. This is also the default setting in SUDAAN, but users can provide a predetermined number as the df with the DDF= option. The fixed-PSU method is likewise the default in Stata, but this package has options that invoke an alternate df calculation known as the variable-PSU method. For example, Stata with the singleunit(centered) option uses the variable-PSU method, in which the 'variable' df are calculated as the number of non-empty PSUs minus the number of non-empty strata. The number of non-empty PSUs is the number of PSUs in the sample minus the number of PSUs with no observations in all singleton strata. A user can manually calculate the 'variable' df for a domain analysis and specify it in SUDAAN with the DDF=df option in the PROC statement, or specify the 'design' df in Stata with svy, dof(df):, in order to compare the estimates of inferential statistics across software packages. The DFADJ option on the DOMAIN statement in SAS computes the degrees of freedom from the non-empty strata for an analysis variable in a domain.
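The two conventions can be sketched on toy stratum/PSU counts (the domain pattern is invented; the 60-strata-by-2-PSUs full design matches the NSDUH PUF description above):

```python
# fixed df = PSUs - strata, from the full file (60 strata x 2 PSUs)
n_strata_full, n_psu_full = 60, 120
fixed_df = n_psu_full - n_strata_full          # 120 - 60 = 60

# hypothetical domain: number of PSUs with at least one observation, by stratum
psus_with_obs = {h: 2 for h in range(1, 61)}   # start from the full design
psus_with_obs[7] = 1                           # a singleton stratum
psus_with_obs[19] = 0                          # an empty stratum

# variable df = non-empty PSUs - non-empty strata (Stata's variable-PSU method)
nonempty_strata = sum(1 for n in psus_with_obs.values() if n > 0)
nonempty_psus = sum(psus_with_obs.values())
variable_df = nonempty_psus - nonempty_strata
print(fixed_df, variable_df)                   # → 60 58
```

A manually computed variable df like this is what one would pass to SUDAAN via DDF= or to Stata via svy, dof(): to align results across packages.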

SAMHDA's online data analysis system (SDA) calculates a slightly different but appropriate df: the number of PSUs in non-empty strata minus the number of non-empty strata. SAS, SPSS, and SDA handle singleton strata almost identically.



1 The confidence interval of an estimator (e.g., a mean or proportion) of a parameter is an inferential statistic calculated using the critical value of a test statistic (in practice, Student's t statistic for a mean or proportion), which is obtained from two factors: the confidence level and the degrees of freedom.

2 The grand mean is calculated from all valid PSU totals of the linearized variate for a particular domain or subclass/subgroup/subpopulation.

How do I account for complex sampling design when analyzing NSDUH data?

The National Survey on Drug Use and Health (NSDUH)1 employs a multistage (stratified cluster) sample design to select a representative sample of noninstitutionalized members of United States households aged twelve and older. The NSDUH public-use file (PUF) includes the variance estimation variables (derived from the complex sample design2): the variance estimation stratum (VESTR), the variance estimation cluster replicate (VEREP), and the final analysis weight (ANALWT_C). VEREP is nested within VESTR. The complex survey design for the NSDUH PUF is therefore treated as a single-stage stratified cluster design in which the clusters are sampled with replacement (WR). There are no missing values in VESTR, VEREP, or ANALWT_C; however, analysis variables can have missing values.

SUDAAN, all survey procedures in SAS, Stata, and R, and the survey add-on module in SPSS can handle data from complex sampling designs. The WR design is the default in all of these packages except SPSS, and Taylor series linearization is the default variance estimation method in each of them. Note that users should read the documentation for their statistical package regarding how missing values are handled if any exist in the analysis variables.

SAMPLE SYNTAX

Using analysis weights is important for getting the point estimates right. Users must also account for the clustering and stratification of the survey design to produce correct standard errors (and degrees of freedom). The example code below shows how to specify these variables correctly, using an individual year of the NSDUH PUF, and how to calculate the proportions, standard errors (SE), and confidence intervals for the risk of smoking one or more packs of cigarettes per day, by gender. This statistical analysis plan (SAP) thus amounts to two subpopulation analyses of proportions, one for each level of gender. The dependent (outcome) variable is the risk of smoking one or more packs of cigarettes per day, measured by the categorical variable RSKPKCIG. Gender is measured by the categorical variable IRSEX. Both variables are numeric in the downloadable SAS, Stata, SPSS, and R versions of the NSDUH PUF. RSKPKCIG is coded 1 to 4 for no risk, slight risk, moderate risk, and great risk, with system-missing for invalid values. IRSEX is imputation-revised gender (missing values imputed), coded 1 for male and 2 for female.

For analysis of the NSDUH PUF file, one should consider three important things before preparing program code in a statistical software package:

  1. How to correctly specify the variance estimation variables including analysis weights;
  2. The statistical procedure along with requested statistics; and
  3. The domains of analysis, if any.

Each of these three considerations is discussed below. [Note the following conventions for wording in program syntax: upper case codes are statements/procedures, upper case italics are option keywords in software packages, and upper case bolded codes are variables from the input dataset.]

  1. Specify variance estimation variables. For variance estimation, each sample program code uses the Taylor linearization method for this example SAP. The WR design is the default with the Taylor method in all of the software packages except SPSS. The stratification and clustering of the complex sample design in the NSDUH PUF are described by specifying the variance estimation variables (and also the analysis weight) via statements in the analysis procedure program code for the SUDAAN and SAS packages. For example, "NEST VESTR VEREP /MISSUNIT; WEIGHT ANALWT_C;" in SUDAAN and "STRATA VESTR; CLUSTER VEREP; WEIGHT ANALWT_C;" in SAS. Note that the order of the variables in the NEST statement is important. One should use the above block of statements (specific to the NSDUH complex design) in any survey procedure program code for any SAP, for example, PROC SURVEYLOGISTIC in SAS or PROC LOGISTIC in standalone SUDAAN.

    The above can be implemented by the SVYSET command in Stata as "SVYSET VEREP [pweight=ANALWT_C], STRATA(VESTR) SINGLEUNIT(centered)". SPSS requires an analysis plan design file for the data file to perform a complex survey analysis. The first block of code in the SPSS program syntax, below, for the CSPLAN ANALYSIS procedure will create such an analysis plan file.

    See the "How are complex sampling variances computed by Taylor linearization method?" FAQ on the use of MISSUNIT option in SUDAAN, SINGLEUNIT(centered) option in Stata, and the NOMCAR procedure option in SAS.

  2. Statistical procedures and requested statistics. In the SAP, RSKPKCIG is the analysis variable, with four valid values plus missing values, and IRSEX is used to define the domains or subpopulations. [In general, the goal is to get results as proportions (not percentages) of an outcome (dependent) variable for multiple subpopulations.] Stata's SVY: PROPORTION produces estimates of proportions, along with SEs, for character or numeric variables. SAS's SURVEYMEANS always analyzes character variables as categorical, so RSKPKCIG can be treated as categorical by specifying it in both the CLASS and VAR statements to obtain the proportion estimates, along with SEs. R's svyby function with factor(variable) and the FUN=svymean argument produces the means, SEs, and confidence intervals, where factor() converts a variable into a set of dummy variables. SUDAAN's DESCRIPT procedure and SPSS's CSDESCRIPTIVES procedure compute means, along with SEs, for continuous variables only; since the mean of a 0/1 dummy variable is simply a proportion, four dummy variables (RSK1, RSK2, RSK3, and RSK4) are created from the RSKPKCIG variable (and carry over its missing values). SUDAAN's DESCRIPT and SPSS's CSDESCRIPTIVES can then be used for the desired analysis by obtaining means (i.e., proportions), along with SEs, for the four dummy variables. [Note that with SVY: MEAN in Stata and SURVEYMEANS (with no CLASS statement) in SAS, the same analysis of means can be obtained for a list of indicator variables of a categorical variable.]

  3. Domains of Analysis. The CLASS and TABLES statements in SUDAAN, the OVER option in Stata, the DOMAIN statement in SAS, and the SUBPOP TABLE statement in SPSS specify the multiple subpopulations/domains of a variable (e.g., two for IRSEX) for which analyses are to be performed. For example, the following SAS program code with the "DOMAIN IRSEX;" statement will output the analysis results for the variables in the VAR statement for two domains: one for the male (IRSEX=1) population and the other for the female (IRSEX=2) population.

SUDAAN standalone syntax:

[The input data file that is used for this example is in the SAS Export format. SUDAAN recommends that an input dataset is sorted by the variables in the NEST statement; otherwise, NOTSORTED option must be specified in the PROC statement.]

PROC DESCRIPT filetype=SASXPORT data="path\nsduh2011.xpt" NOTSORTED;
NEST  VESTR  VEREP  / MISSUNIT;
WEIGHT  ANALWT_C;
CLASS IRSEX;
TABLES IRSEX;
VAR  RSK1  RSK2  RSK3  RSK4;
PRINT mean semean lowmean upmean
/style=nchs meanfmt=f6.3 semeanfmt=f6.3 lowmeanfmt=f7.3 upmeanfmt=f7.3;

Stata specific code for the same analysis:

use drivename:\path\nsduh2011.dta
svyset VEREP [pweight=ANALWT_C], strata(VESTR) singleunit(centered)
save drivename:\path\nsduh2011svy.dta, replace

[It is a good practice to save the survey setting permanently in the data file. This allows for this saved data to be used for any subsequent survey analysis.]

use drivename:\path\nsduh2011svy.dta
svy: proportion RSKPKCIG, over(IRSEX)

SAS code for this analysis:

PROC SURVEYMEANS data=sasdata NOMCAR mean clm;
STRATA VESTR;
CLUSTER VEREP;
WEIGHT ANALWT_C;
DOMAIN IRSEX / DFADJ;
CLASS RSKPKCIG;
VAR RSKPKCIG;
run;

The SPSS specific code for the same analysis:

[The example code assumes that the user's SPSS software package has the complex survey module installed. Select the first block of code below, copy, and then paste it into the SPSS Syntax Editor. Replace 'folder-path' with the path location where you would like to save the nsduh11.csplan xml file. Select and run this modified syntax code in the Syntax Editor. CSPLAN will write this Analysis Design into the nsduh11.csplan xml file. The second block of code shows how to reference the nsduh11.csplan xml file in the CSDESCRIPTIVES and other Complex Survey procedures in the current and future SPSS sessions.]

CSPLAN ANALYSIS
/PLAN FILE='folder-path\nsduh11.csplan'
/PLANVARS ANALYSISWEIGHT= ANALWT_C
/DESIGN STRATA= VESTR CLUSTER= VEREP
 /ESTIMATOR TYPE = WR.

get file='path\nsduh2011.sav'.

CSDESCRIPTIVES
/PLAN FILE= 'folder-path\nsduh11.csplan'
/SUMMARY VARIABLES= RSK1  RSK2  RSK3  RSK4
/SUBPOP TABLE= IRSEX DISPLAY=LAYERED
/MEAN
/STATISTICS se cin(95)
/MISSING SCOPE=ANALYSIS CLASSMISSING=EXCLUDE.

R specific sample code for the same analysis:

load("folder-path/nsduh2011.rda")
keepvars = c("VESTR", "VEREP",  "ANALWT_C", "IRSEX", "RSKPKCIG" )
nsduh11 = nsduh2011[, keepvars]             #make a data file with fewer variables

library(prettyR)   # requires the prettyR package to be installed
nsduh11$RSKPKCIG<- as.numeric(sub("^\\(0*([0-9]+)\\).+$", "\\1", nsduh11$RSKPKCIG))
nsduh11$IRSEX <- as.numeric(sub("^\\(0*([0-9]+)\\).+$", "\\1", nsduh11$IRSEX))

library(survey)  # requires the survey package to be installed
options( survey.lonely.psu = "adjust" )
desg <- svydesign(id = ~VEREP , strata = ~VESTR , weights = ~ANALWT_C , data = nsduh11 , nest = TRUE )

# calculate the means or proportions of RSKPKCIG by the levels of IRSEX and their SEs
out = svyby(~factor(RSKPKCIG), ~IRSEX, design = desg , FUN=svymean, vartype=c("se","ci"), na.rm = TRUE)
coef(out)      #extracting the means
SE(out)         #extracting the SEs of means
confint(out)  # 95% confidence intervals of mean
print(out)     #all results

Note that the variance estimation variables (as distinct from the sampling weight variable) do not affect means, proportions, percentages, and other first-order statistics. For example, in the SUDAAN syntax above, the DESIGN=WR option and the entire "NEST VESTR VEREP;" statement in PROC DESCRIPT have no impact on the mean/proportion estimates. However, the variance estimation variables must be used (e.g., the DESIGN= option and NEST statement as above) to produce the SE estimates of descriptive statistics (e.g., SE of a mean/proportion) and inferential statistics (e.g., confidence intervals of a mean/proportion and the p-value of a hypothesis test).
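A tiny demonstration of this point (toy numbers, not NSDUH data): the weighted mean depends only on the weights and the analysis variable, not on how the cases are grouped into strata or PSUs, so the design variables matter only for SEs and inference.

```python
# invented weights and a 0/1 analysis variable
w = [10.0, 12.0, 8.0, 20.0]
y = [1, 0, 1, 1]
mean = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
print(round(mean, 4))   # same value under any VESTR/VEREP grouping → 0.76
```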



1Prior to 2002, data were collected under the old title, the National Household Survey on Drug Abuse (NHSDA).

2For further details on the sampling design and weighting adjustment method, please see the 2011 NSDUH Methodological Resource Book.

Friday, March 7, 2014

What are the differences between NSDUH public-use and restricted-use data?

NSDUH public-use and restricted-use data differ in terms of access, availability, and variable groups. Users can reference the Analysis Options for NSDUH Public-use and Restricted-use Data page for help determining which available option best meets their research needs: public-use (downloadable data), SDA (online analysis of public-use data), R-DAS (online analysis with disclosure restrictions), or the Data Portal (virtual desktop access to restricted-use microdata).

Wednesday, February 26, 2014

How can I access the NCS-1, 1990-1992 study? I can no longer find it on the SAMHDA site.

The NCS-1, 1990-1992 study has been transferred from SAMHDA to the National Addiction & HIV Data Archive Program (NAHDAP) Archive. NCS-1 data and documentation files can now be accessed through the NAHDAP and ICPSR General Archive websites. We apologize for any inconvenience this transition may have caused.

Tuesday, February 11, 2014

Which variables are available in the restricted-use NSDUH data files?

The following variable crosswalk displays all variables in the restricted-use NSDUH data files and their availability for specific groups of years.

Monday, October 21, 2013

How can I access the HBSC series? I can no longer find HBSC on the SAMHDA site.

The HBSC series has been transferred from SAMHDA to the National Addiction & HIV Data Archive Program (NAHDAP) Archive. All HBSC data and documentation files can be accessed through the NAHDAP and ICPSR General Archive websites. We apologize for any inconvenience this transition may have caused.