SAMHDA FAQs: 2014

Monday, November 17, 2014

[DUPLICATE FAQ] What is the Restricted-use Data Analysis System (R-DAS)?

The R-DAS is an online analysis system that allows researchers to produce frequencies and cross-tabulations using restricted-use data files. Output is available for viewing and export. Advanced statistical methods are not available at this time.

The R-DAS does not permit listing of individual cases. Unweighted frequencies are not provided in the R-DAS codebook, and users cannot generate them (no unweighted sample sizes are provided). These limitations have been put in place to reduce the potential for disclosing confidential information.

The R-DAS provides standard errors that take into account the complex survey design. All weighted totals and point estimates are rounded to the nearest thousand, and all percents and associated statistics are rounded to one decimal point. If any cell in a table contains too few unweighted cases, then the entire table is suppressed.
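The rounding and suppression rules described above can be illustrated with a minimal Python sketch. This is not R-DAS code. The function names and the `min_unweighted_n` threshold are hypothetical, since the FAQ does not state the actual cutoff for "too few" unweighted cases.

```python
def round_total(total):
    """Round a weighted total to the nearest thousand."""
    return int(round(total / 1000.0)) * 1000


def round_percent(percent):
    """Round a percent to one decimal place."""
    return round(percent, 1)


def publish_table(cells, min_unweighted_n):
    """Return the rounded table, or None when the whole table is suppressed.

    cells: list of dicts with keys 'total', 'percent', 'unweighted_n'.
    min_unweighted_n: hypothetical threshold; the real cutoff is not given in the FAQ.
    """
    # One sparse cell is enough to suppress the entire table.
    if any(cell["unweighted_n"] < min_unweighted_n for cell in cells):
        return None
    return [
        {"total": round_total(cell["total"]), "percent": round_percent(cell["percent"])}
        for cell in cells
    ]


example_cells = [
    {"total": 12345678.0, "percent": 45.67, "unweighted_n": 250},
    {"total": 8765432.0, "percent": 54.33, "unweighted_n": 310},
]
print(publish_table(example_cells, min_unweighted_n=10))
```

Note that the unweighted counts are used only for the suppression check and never appear in the published output.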

The R-DAS does not currently allow for the creation of composite variables (i.e., the creation of new variables using other variables), but that capability is under development. The R-DAS does allow for recoding of existing continuous and categorical variables. See the SDA 3.5 help documentation for assistance with how to Temporarily Recode a Variable.

Watch the "Broadening Access to Substance Abuse and Mental Health Data with the R-DAS" webinar to learn about the National Survey on Drug Use and Health (NSDUH) data available through the R-DAS.

For more information on analyzing data with the R-DAS, consult the FAQ section on Help with the Restricted-use Data Analysis System (R-DAS).

[WEBINAR] Broadening Access to Substance Abuse and Mental Health Data with the R-DAS

Learn about the National Survey on Drug Use and Health (NSDUH) data available through the R-DAS. The Broadening Access to Substance Abuse and Mental Health Data with the R-DAS webinar provides:

A general understanding of the data and resources available through SAMHDA;
Differences between the public-use and restricted-use NSDUH data files;
Instructions on how to locate and access restricted-use NSDUH data in the R-DAS; and
A brief demonstration on how to create a cross-tabulation in the R-DAS.

Thursday, November 6, 2014

What are the differences between DAWN public-use and restricted-use data?

DAWN public-use and restricted-use data differ in terms of access, availability, and variable groups. Users can reference the Analysis Options for DAWN Public-use and Restricted-use Data page for help with determining which available option best meets their research needs: public-use (downloadable data), SDA (online analysis of public-use data), or the Data Portal (virtual desktop access to restricted-use microdata).

Tuesday, April 1, 2014

When is the next application period for access to the Data Portal?

The Data Portal call for applications ended on December 15, 2014.

For information about the application process and data available through the Data Portal, please visit the Data Portal page. For further assistance with the Data Portal, please email dataportal@icpsr.umich.edu.

Tuesday, March 25, 2014

How do I calculate variance of totals for NSDUH data?

The final analysis (poststratification adjustment) weight variable, ANALWT_C, in the NSDUH public-use file (PUF) was adjusted for unequal probability of selection, nonresponse of respondents, and coverage bias of respondents to the poststratification population totals from the United States Census 2000. All NSDUH surveys used U.S. Census 2000 except the 2002 and 2003 surveys, which used the 1990 Census, and the 2004 survey, which used 50% from each of the 1990 and 2000 Censuses. By using ANALWT_C, the weighted estimates of totals obtained for any of the survey variables are the (target population) estimates for the entire universe of civilian members of the noninstitutionalized population in the United States.

For the NSDUH PUF, a mixed-method approach is recommended for variance estimation of totals. Why a mixed approach? Because in some domain analyses the estimated domain sizes are subject to sampling variability, while in other, special domain analyses they are not. In the former case, the variance estimates of totals can be obtained directly from software package procedures that use the Taylor series linearization method. In the latter case (i.e., fixed/controlled domain sizes), the variance must be computed indirectly/manually with the method discussed below (see page i-20 in the 2011 NSDUH Statistical Inference Report). Note that the variance estimate of an estimated total will not be reliable if it is obtained from a software package procedure for fixed domains.

The analysis weight variable, ANALWT_C, was adjusted by a poststratification weight calibration method. With this method, a set of post-strata classes (i.e., demographic domains) was constructed using demographic variables of respondents (such as age, gender, and/or race), and these demographic domains were forced to match their respective U.S. Census Bureau population estimates through the weight calibration process. Each estimated domain size in the NSDUH PUF data is therefore equal to the corresponding population domain size estimate from U.S. Census data. This set of domains, considered controlled (i.e., those with fixed domain sizes), was restricted to main effects and two-way interactions in order to maintain continuity between years (age, gender, and race were used in the 2011 NSDUH for constructing the domains, see page i-22 in the 2011 NSDUH codebook; age and gender were used in the 2002 to 2010 NSDUH, see page i-21 in the 2010 NSDUH codebook). This is why the variance estimate for the estimated total of a variable of interest requires special attention and must be computed manually, outside the software procedure, whenever the domain is defined by a post-strata class variable, by a combination of such variables (such as age, gender, and/or race), or by a combination of post-strata class and non-post-strata class variables (for details about variance of totals and differences, see page i-20 in the 2011 NSDUH codebook). Examples of controlled domains are (i) the domains defined by each of the two levels of gender and (ii) the domains formed by combining the two categories of gender with three categories of age. An example of non-controlled (i.e., random) domains is the set of domains defined by each of the four levels of EMPSTATY (imputation revised employment status), because EMPSTATY is not a post-stratification weight class variable. By contrast, (iii) the domains defined by combining the levels of gender and EMPSTATY cannot be regarded as either random or controlled domains.

Suppose a user wants the total of a continuous or an indicator variable (e.g., DEPNDALC, alcohol dependence), along with its standard error (SE), by a domain variable. The output in each cell of the resulting table is for a domain analysis. The weighted total number of respondents (i.e., the estimated population or subpopulation size) in each cell is the sum of the analysis weight variable, denoted by DomainSize, and the weighted total estimate of DEPNDALC is the sum over observations of the product of the analysis weight and DEPNDALC, denoted by TotalDALC. The weighted mean alcohol dependence estimate, denoted by MeanDALC, is the ratio of TotalDALC to DomainSize. All software packages calculate the SE of TotalDALC, denoted by SE(TotalDALC), using a closed-form/analytic variance formula since TotalDALC is a linear statistic, and the SE of MeanDALC, denoted by SE(MeanDALC), using the Taylor linearization method (by default) since MeanDALC is a nonlinear statistic. For a fixed domain, SE(TotalDALC) as produced by the software will not be a reliable SE for the estimated total of alcohol dependence. The TotalDALC and MeanDALC estimates are always treated as random variables, but the DomainSize estimate is not always treated as a random variable. How the DomainSize estimate is handled depends on the way the domains are formed in an analysis using NSDUH data. Situations in which a domain size is treated as fixed/controlled for NSDUH PUF data were discussed earlier in this FAQ, with examples.

The weighted total estimate of alcohol dependence in a cell can also be computed as the weighted total number of respondents multiplied by the weighted mean alcohol dependence of respondents, i.e., (by the mean method) TotalDALC_M = DomainSize × MeanDALC. For controlled/fixed domains, an appropriate estimate of the SE for the total number of persons with a characteristic of interest (e.g., alcohol dependence in this illustration) is SE(TotalDALC_M)¹ = DomainSize × SE(MeanDALC), where SE(MeanDALC) is computed correctly by the Taylor linearization method using the software package procedures. The point estimates of TotalDALC and TotalDALC_M are the same, but their SE estimates are substantially different. None of the software packages directly produces this SE estimate for totals using the above formula. Note again that for non-controlled domains, users can obtain the variance of the estimated totals directly from the software packages.
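The arithmetic of the mean method can be illustrated with a short Python sketch. The numbers below are made up for illustration and are not NSDUH estimates; the function name is hypothetical. Because DomainSize is treated as a fixed constant in a controlled domain, Var(DomainSize × MeanDALC) = DomainSize² × Var(MeanDALC), which gives the SE formula used here.

```python
def se_total_controlled(domain_size, se_mean):
    """SE of a total in a controlled (fixed-size) domain: DomainSize * SE(Mean)."""
    return domain_size * se_mean


# Hypothetical values for one table cell (not real NSDUH estimates).
domain_size = 125_000_000.0   # sum of ANALWT_C in the domain, treated as fixed
mean_dalc = 0.034             # weighted mean (proportion) of the indicator
se_mean_dalc = 0.0012         # SE of the mean from a Taylor linearization procedure

total_dalc_m = domain_size * mean_dalc
se_total_dalc_m = se_total_controlled(domain_size, se_mean_dalc)

print(f"TotalDALC_M     = {total_dalc_m:,.0f}")
print(f"SE(TotalDALC_M) = {se_total_dalc_m:,.0f}")
```

In practice, `domain_size` and `se_mean_dalc` would come from a survey procedure such as those shown in the sample syntax that follows.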

Side note: The analysis weight variable in the DAWN data is also a poststratification adjusted weight but poststratification adjustments were implemented within the design strata to offset the (sampling frame) coverage bias on the estimates. No post-strata classes were constructed using any of the Emergency Department (ED) visit characteristic variables to force the class sizes to match the respective benchmark population totals of ED visits from the American Hospital Association (AHA) database by the weight calibration method; thus, unlike NSDUH, SE estimate of totals can be obtained directly from software package procedures.

Sample syntax to calculate SE of total for controlled Domains

All software packages produce the SE of a total statistic, but this SE is not appropriate if the total estimate is obtained for a controlled domain. Using a single year of data, the following program code shows how to compute the SEs of totals for controlled domains using the SUDAAN procedure in conjunction with the SAS data step and Stata's svy: mean procedure. The calculation of the SE for totals also requires the SE of means. Users can find help with the statistical analysis plan for the latter topic in the "How do I account for complex sampling design when analyzing NSDUH data?" FAQ.

SUDAAN in conjunction with SAS sample code

proc descript design=WR data=work.nsduh11_analysis notsorted nomarg;
nest vestr verep; weight analwt_c;
class irsex/ nofreqs;
tables irsex;
var rsk1 rsk2 rsk3 rsk4;
output wsum semean total /replace filetype=sas filename=work.sud_result1
wsumfmt=f15.3 semeanfmt=f12.10 totalfmt=f15.4; /* SE of mean with 10 decimal places to avoid rounding error */
run;

The wsum and total statistics requested in the output statement are the population size estimate (which is considered not subject to sampling variability) and the weighted total estimate in the Table cell. Note that the setotal statistic was not requested, as this SE is not appropriate.

/* Using SAS, compute: SE(total) = Domain_size*SE(mean) */
data work.sud_result1;
set work.sud_result1;
if semean gt 0.0 then setotal = wsum * semean;
run;
proc print data= work.sud_result1;
run;

Stata sample code:

A user requires some knowledge of vector-matrix (element-wise) computation for this program.

use drivename:\path\statadatafilename.dta
svyset VEREP[pweight=ANALWT_C], strata(VESTR) singleunit(centered)
save drivename:\path\mystatadata.dta, replace

svy: mean rsk1 rsk2 rsk3 rsk4, over(IRSEX)
* extract the variance-covariance matrix of the means
matrix vcov = e(V)
matrix var = vecdiag(vcov)
* extract the weighted sizes (denominator totals of the means), i.e., domain sizes
matrix wsum = e(_N_subp)
matrix setotal = J(1,8,0)
* compute SE: setotal = wsum * se(mean) for each total estimate
local j=1
while `j'<=8{
matrix setotal[1,`j'] = wsum[1,`j']*sqrt(var[1,`j'])
local j = `j' + 1
}
svy: total rsk1 rsk2 rsk3 rsk4, over(IRSEX)
matrix totals = e(b)
matrix result = (wsum', totals', setotal')
matrix colnames result = Domain_size Total SEtotal
matrix list result

SAS sample code:

proc surveymeans data=work.nsduh11 nobs nmiss mean sum sumwgt NOMCAR;
strata vestr; cluster verep; weight analwt_c;
domain irsex;
class rskpkcig;
var rskpkcig;
ods output domain = mydomain;
run;

data work.myresult(keep=IRSEX RSKPKCIG N Nmiss Domain_wgt_size Totals SE_totals);
set work.mydomain;
SE_totals = sumwgt * stdErr;
rename VarLevel=RSKPKCIG sum = Totals sumwgt = Domain_wgt_size;
run;

R sample code:

load("folder-path/mydat.rda")     # 2011 NSDUH PUF
# save the loaded data with a new "mydat" name
# RSKPKCIG and IRSEX are factor variables; we need to make them numeric.
library(prettyR)   # requires the prettyR package to be installed
mydat$RSKPKCIG <- as.numeric(sub("^\\(0*([0-9]+)\\).+$", "\\1", mydat$RSKPKCIG))
mydat$IRSEX <- as.numeric(sub("^\\(0*([0-9]+)\\).+$", "\\1", mydat$IRSEX))
# compute the totals and SEs of totals [for controlled domains]
# formula : total_hat = N * p_hat, so its SE is SE(total_hat) = N * SE(p_hat)
totals1=aggregate(mydat$ANALWT_C,by=list(mydat$RSKPKCIG, mydat$IRSEX), FUN=sum, na.rm=TRUE)

library(reshape)   #   reshape the package that has the RENAME statement command
# c(oldVARname="newVARname")
totals2 <- rename(totals1, c(Group.1="RSKPKCIG", Group.2="IRSEX",x="totals"))
# totals2 has no missing values
domsize1=aggregate(totals2$totals,by=list(totals2$IRSEX), FUN=sum)
domsize2 <- rename(domsize1, c(Group.1="IRSEX",x="domain_size"))
# merging data frames
domainSize = merge(domsize2, totals2, by=c("IRSEX"))

require(survey)   # requires to install the survey package
options( survey.lonely.psu = "adjust" )
desg <- svydesign(id = ~VEREP , strata = ~VESTR , weights = ~ANALWT_C , data = mydat, nest = TRUE )

# obtaining SEs of the proportions of the categorical variable, RSKPKCIG by IRSEX
out = SE(svyby(~factor(RSKPKCIG), ~IRSEX, design = desg, FUN=svymean, vartype=c("se"), na.rm = TRUE) )

library(reshape2)       #first install reshape2 and then use the library() command
#reshape long to wide
long2wide = dcast(domainSize, IRSEX ~ RSKPKCIG, value.var="domain_size")
long2wide$IRSEX = NULL
N = as.matrix(long2wide)        # N = fixed (domain) sizes
SE_p_bar = as.matrix(out)
se_totals = N * SE_p_bar                     # elementwise matrix multiplication
SE_totals = as.vector(t(se_totals))     # Row-wise vectorize   [row vector is a vector, column vector is a matrix]
myresult=cbind(domainSize, as.data.frame (SE_totals))
# re-arranging the variables
myresult = myresult[c("IRSEX", "RSKPKCIG", "domain_size", "totals", "SE_totals")]

print(myresult)

A related FAQ, How do I calculate variance of difference between totals for NSDUH data?, may provide additional useful information.

¹ See Section 5 in 2012 NSDUH Statistical Inference Report

How do I combine NSDUH public-use file (PUF) data for analysis?

Because of the 2002 National Survey on Drug Use and Health (NSDUH) methodology changes, the 2002 data constitute a new baseline for tracking trends in substance use and other measures. As noted in the 2002 to 2013 codebooks, it is not considered appropriate to compare the 2002 to 2013 estimates with 2001 NSDUH and earlier NHSDA (National Household Survey on Drug Abuse) estimates to assess trends in substance use. Though the 1999 through 2004 data are part of the same sample design, beginning with the 2002 survey, respondents were given a $30 incentive payment for participation, which increased response rates for several consecutive surveys.

Statistical disclosure limitation methods were implemented on the original data file in such a way that the NSDUH PUF continues to be representative of civilian members of the noninstitutionalized population in the United States. Disclosure limitation methods include micro agglomeration, optimal probabilistic substitution, optimal probabilistic subsampling, and optimal sampling weight calibration. In addition, the variance estimation variables (VESTR and VEREP) were treated by coarsening, substitution, and scrambling. For the purpose of variance calculation, the sample design for NSDUH PUFs is a stratified single-stage cluster sample design with replacement sampling.

The 2002 through 2004 NSDUH PUFs are part of one sample design, while the 2005 through 2013 PUFs are part of another. There were 50% overlapping samples for adjacent survey years for the 2005 through 2013 surveys. VESTR (variance estimation stratum) is coded from 20001 to 20060 for years 2002 through 2004 in the NSDUH PUF datasets, and from 30001 to 30060 for years 2005 through 2013. VEREP (variance estimation cluster replicate) is coded as 1 and 2. The degrees of freedom (df) are 60 for national estimates of each individual survey¹. When combining any years of data from 2005 through 2013, the df remains the same as it was for a single year (e.g., 60 for national estimates), since the samples for these years are part of the same sample design. The combined data can be used to obtain the standard error (SE) of estimates for individual years and/or the SE of difference estimates (e.g., contrasts of means) for comparisons between adjacent years. The df of 60 also remains the same when combining any years of data from 2002 through 2004, but when combining years of data from the two different sample designs (i.e., at least one year from 2002 through 2004 and at least one from 2005 through 2013), the df will be 120 (i.e., the sum of the df for the two sample designs). For individual-year [inferential] estimates from such a combined file containing data from multiple years with different sample designs, users must specify the customizable degrees-of-freedom option to override the default. Alternatively, users can subset the data for a year within a procedure/method run using an appropriate statement so that the complex design is retained for the desired analysis. When comparing estimates in two domains with different df (e.g., equality of the proportions of past month alcohol use for two individual survey years having different sample designs) in combined data, err on the conservative side and use the smaller degrees of freedom (see page A-2 in the 2012 NSDUH Statistical Inference Report). Note that the covariance estimate between the estimates (e.g., proportions) in such a comparison is zero because the two designs are distinct.
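The degrees-of-freedom rules for pooled files can be summarized in a small Python sketch. It assumes national estimates (60 df per design) and encodes only the year ranges given above; the function and constant names are hypothetical and do not correspond to any software package option.

```python
# Year ranges of the two PUF sample designs, as described in this FAQ.
DESIGNS = (range(2002, 2005), range(2005, 2014))
DF_PER_DESIGN = 60  # national estimates; assumed the same for both designs


def design_index(year):
    """Return 0 for the 2002-2004 design and 1 for the 2005-2013 design."""
    for index, years in enumerate(DESIGNS):
        if year in years:
            return index
    raise ValueError(f"year {year} is outside 2002-2013")


def pooled_df(years):
    """Default df of a pooled file: 60 per distinct sample design present."""
    return DF_PER_DESIGN * len({design_index(year) for year in years})


def single_year_df(year):
    """df to specify for an individual-year estimate, even inside a pooled file."""
    design_index(year)  # validates the year
    return DF_PER_DESIGN


print(pooled_df([2005, 2009, 2013]))  # years from one design
print(pooled_df([2003, 2010]))        # years from both designs
print(single_year_df(2010))           # override value for a single-year estimate
```

When a pooled file spans both designs, the value from `single_year_df` is the kind of number a user would supply through the package's df override for individual-year estimates.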

Analysts can obtain all of the ratio-type estimates (including their standard errors, confidence intervals, p-values, etc.) from an analysis run on combined data. Note, however, that the sums/totals in the cells and/or margins of the output from such a run are not always the intended estimates. If the analyst is interested in an annual estimate of a population total in addition to ratio-type estimates, the weight should be divided by the number of years that were pooled. Users should be careful in reporting and interpreting results when using the survey year variable in an analysis of pooled data with adjusted weights.
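A toy Python example shows why the weight adjustment matters for totals but not for ratio-type estimates. The records and weights below are invented for illustration; `annualized_weights` is a hypothetical helper name.

```python
def annualized_weights(weights, n_years):
    """Divide each analysis weight by the number of pooled survey years."""
    return [w / n_years for w in weights]


# Invented pooled records: (survey year, analysis weight, 0/1 indicator).
records = [
    (2011, 1500.0, 1),
    (2011, 2500.0, 0),
    (2012, 1800.0, 1),
    (2012, 2200.0, 1),
]
n_years = len({year for year, _, _ in records})
raw_w = [w for _, w, _ in records]
adj_w = annualized_weights(raw_w, n_years)
y = [flag for _, _, flag in records]

raw_total = sum(w * v for w, v in zip(raw_w, y))     # sums over both years
annual_total = sum(w * v for w, v in zip(adj_w, y))  # average annual total
raw_mean = raw_total / sum(raw_w)
adj_mean = annual_total / sum(adj_w)                 # ratio is unchanged

print(raw_total, annual_total, raw_mean, adj_mean)
```

The unadjusted total counts the population once per pooled year, while the mean (a ratio) is identical under both sets of weights.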

¹ See Appendix A in 2012 NSDUH Statistical Inference Report.

How are complex sampling variances computed by Taylor linearization method?

The NSDUH public-use file consists of single-stage with replacement (WR) stratified cluster design sample data. The variance estimation stratum and cluster replicate variables are VESTR and VEREP, respectively. The cluster replicates are widely known as the primary sampling units (PSU). The calculation of variance by Taylor linearization method and the calculation of the degrees of freedom¹ for Student's t test statistic will be discussed briefly below.

A closed-form/analytic variance formula cannot be derived for nonlinear statistics such as a mean or a proportion. The Taylor method derives a linearized variate for the nonlinear statistic of interest. It can be shown that the variance of the linearized variate of a statistic is, to a first-order approximation, equal to the variance of that statistic. Variance estimation by the Taylor method is therefore nothing but the variance of the linearized variate of the statistic computed by the closed-form variance method. A total is a linear statistic, whereas a mean is nonlinear because it is the ratio of two linear statistics. The linearized variable for the total of an analysis variable is the analysis variable itself, and therefore the variance formula for a total by the Taylor series linearization method and the variance formula for a total by the closed-form method are exactly the same. SAS, SPSS, SUDAAN, Stata, and R all calculate the variance of a statistic from the deviations of the PSU-totals of the linearized variate about the mean of all PSU-totals of the linearized variate.
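The computation described above can be sketched in Python for a stratified with-replacement design. This is a minimal illustration on invented data, not the implementation used by any of the packages named. It assumes the standard linearized variate for a weighted mean, z = w(y - mean)/sum(w), and deviations taken about the stratum mean of PSU-totals. Strata with fewer than two PSUs are simply skipped, which is only one of several possible treatments.

```python
from collections import defaultdict


def wr_variance(strata, psus, z):
    """Stratified WR variance of sum(z) from PSU-totals of the variate z."""
    psu_totals = defaultdict(lambda: defaultdict(float))
    for h, i, value in zip(strata, psus, z):
        psu_totals[h][i] += value
    variance = 0.0
    for totals in psu_totals.values():
        n_h = len(totals)
        if n_h < 2:
            continue  # singleton stratum: skipped in this sketch (packages differ)
        mean_h = sum(totals.values()) / n_h
        variance += n_h / (n_h - 1) * sum((t - mean_h) ** 2 for t in totals.values())
    return variance


def total_and_variance(strata, psus, weights, y):
    """Weighted total; its linearized variate is simply w * y."""
    z = [w * v for w, v in zip(weights, y)]
    return sum(z), wr_variance(strata, psus, z)


def mean_and_variance(strata, psus, weights, y):
    """Weighted mean (a ratio); linearized variate is w * (y - mean) / sum(w)."""
    x_hat = sum(weights)
    mean = sum(w * v for w, v in zip(weights, y)) / x_hat
    z = [w * (v - mean) / x_hat for w, v in zip(weights, y)]
    return mean, wr_variance(strata, psus, z)


# Invented data: two strata, two PSUs each.
strata = [1, 1, 1, 2, 2, 2]
psus = [1, 1, 2, 1, 2, 2]
weights = [10.0, 10.0, 20.0, 15.0, 5.0, 5.0]
y = [1, 0, 1, 0, 1, 1]

total, var_total = total_and_variance(strata, psus, weights, y)
mean, var_mean = mean_and_variance(strata, psus, weights, y)
print(total, var_total ** 0.5, mean, var_mean ** 0.5)
```

The same `wr_variance` routine serves both statistics; only the variate passed to it changes, which is the point made in the paragraph above.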

There are no single-PSU (i.e., singleton) strata in the NSDUH data, but certain analyses may encounter a singleton stratum while calculating the variance for a domain, subclass/subgroup, or subpopulation. A stratum is called a singleton stratum when only one PSU has at least one valid observation and the other PSU has no observations in that stratum. PSUs with no observations are handled in different ways by the different software packages when calculating the variance and the degrees of freedom. The MISSUNIT option in SUDAAN, the SINGLEUNIT(centered) option in Stata, and options(survey.lonely.psu="adjust") in R handle such cases by calculating the variance contribution of those singleton strata using the deviations of the PSU-total value of the linearized variate about the grand mean of the sample for the particular analysis². By default, SPSS handles this situation on the assumption that there was at least one other PSU in the sample (if not, that stratum contributes no variance), so that a PSU with no observations (termed a sampling zero) has a PSU-total of zero and does contribute to the stratum variance. Moreover, an analysis can also encounter strata with no observations at all (empty strata). Users may experience such a situation in domain/subgroup analysis.

The question is how variance estimation and the degrees of freedom computation handle this situation in different software packages. SUDAAN assumes there were actually no strata with non-missing cases in the population, but strata with missing cases as part of the sample selection. SUDAAN treats those missing units as sampling zeros; thus each empty stratum contributes zero variance to the overall variance and, in turn, still contributes to the degrees of freedom. Stata does the same by default, but with certain statements/options, for instance the singleunit(centered) option, Stata departs from this assumption of sampling zeros and treats such empty strata as structural zeros. The logic is that when a stratum has no cases at all, the stratum is assumed not to be part of the sampling for the domain and therefore contributes nothing to the overall variance. In Stata, the degrees of freedom determined with the singleunit(centered) option are smaller than those obtained by the default method for domains not in common. In effect, the default method increases the degrees of freedom for the variance of an estimate for subgroups with empty strata, even though much of that increase comes from empty strata treated as sampling zeros, which contribute zero to the variance. This in turn yields a smaller p-value from the reference distribution for a hypothesis test (the observed value of the test statistic is unaltered) and a narrower confidence interval for the parameter.

The calculation of degrees of freedom (df) is crucial for all these software packages and influences the calculation of inferential statistics such as confidence intervals and p-values of test statistics. Conventionally, the df is calculated by the fixed-PSU method, and the 'fixed' df is defined as the number of PSUs minus the number of strata at the first stage of the sample design, whatever the number of sampling stages (i.e., computed from the full data file). The SPSS and R software packages always use this fixed-PSU method for calculating the df in all aspects of analysis. This is the default setting in SUDAAN, but users can provide a predetermined number as the df with the DDF= option. The fixed-PSU method is also the default in Stata, but Stata has options that cause its procedures to calculate an alternate df by the method known as the variable-PSU method. For example, Stata with the singleunit(centered) option uses the variable-PSU method, in which the 'variable' df is calculated as the number of non-empty PSUs minus the number of non-empty strata. The number of non-empty PSUs is the number of PSUs in the sample minus the number of PSUs with no observations in all singleton strata. A user can manually calculate the 'variable' df for a domain analysis and specify it in SUDAAN with the DDF=df option in the PROC statement, or specify the 'design' df in Stata with svy, dof(df):, in order to compare the estimates of inferential statistics across software packages. The DFADJ option of the DOMAIN statement in SAS computes the degrees of freedom from the non-empty strata for an analysis variable in a domain.
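For users who want to work out a 'variable' df by hand before passing it to DDF= or dof(), the counting rules can be sketched in Python as follows. This is one reading of the definitions in this FAQ, not a reproduction of any package's internal logic: "variable" is taken as non-empty PSUs minus non-empty strata, and "sda" as PSUs in non-empty strata minus non-empty strata. The function name and the example domain are hypothetical.

```python
def degrees_of_freedom(design_psus, domain_psus):
    """Compute fixed, variable, and SDA-style df for a domain analysis.

    design_psus: set of (stratum, psu) pairs in the full data file.
    domain_psus: subset of pairs with at least one valid domain observation.
    """
    all_strata = {h for h, _ in design_psus}
    nonempty_strata = {h for h, _ in domain_psus}
    psus_in_nonempty_strata = sum(1 for h, _ in design_psus if h in nonempty_strata)
    return {
        "fixed": len(design_psus) - len(all_strata),
        "variable": len(domain_psus) - len(nonempty_strata),
        "sda": psus_in_nonempty_strata - len(nonempty_strata),
    }


# Full design: 60 strata with 2 PSUs each (VESTR 30001-30060, VEREP 1-2).
design = {(h, p) for h in range(30001, 30061) for p in (1, 2)}

# Hypothetical domain: one singleton stratum and one empty stratum.
domain = set(design)
domain.discard((30001, 2))                      # stratum 30001 keeps only PSU 1
domain -= {(30002, 1), (30002, 2)}              # stratum 30002 has no observations

df = degrees_of_freedom(design, domain)
print(df)
```

With this example domain, the three counting rules give three different values, which is why inferential statistics can differ slightly across packages for the same point estimate.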

SAMHDA's online data analysis system (SDA) calculates slightly different but appropriate df as the number of PSUs in non-empty strata minus the number of non-empty strata. SAS, SPSS and SDA handle singleton-strata almost equally.

¹ The confidence interval of an estimator (e.g., mean or proportion) of a parameter is an inferential statistic calculated using the critical value of a test statistic (e.g., for a mean or proportion, Student's t statistic is used in practice), which is obtained from two factors: the confidence level and the degrees of freedom.

² Grand mean is calculated from all valid PSU-totals of linearized variate for a particular domain or subclass /subgroup /subpopulation.

How do I account for complex sampling design when analyzing NSDUH data?

National Survey on Drug Use and Health (NSDUH)¹ employs a multistage (stratified cluster) sample design for the selection of a representative sample from non-institutional members of United States households aged twelve and older. The NSDUH public-use file (PUF) includes the variance estimation variables (which were derived from the complex sample designs²): variance estimation stratum (VESTR), variance estimation cluster replicates (VEREP) and final analysis weight (ANALWT_C). VEREP is nested within the VESTR. It is therefore considered that the complex survey method for NSDUH PUF is a single-stage stratified clustering design, where the clusters are sampled with replacement (WR). There are no missing values in the variance estimation variables and final analysis weight, VESTR, VEREP and ANALWT_C. However, analysis variables can have missing values.

SUDAAN, the survey procedures in SAS, Stata, R, and the survey add-on module in SPSS can all handle data from complex sampling designs. The WR design is the default design in all of these except SPSS, and Taylor series linearization is the default variance estimation method in all of them. Note that users should read the help documentation of their statistical package regarding how missing values in the analysis variables are handled.

SAMPLE SYNTAX

Using analysis weights is important to get the point estimates right. Users must consider the weighting, clustering, and stratification of the survey design to produce correct standard errors (and degrees of freedom). The example code provided below shows how to specify these variables correctly, using an individual year of the NSDUH PUF, and also indicates how to calculate the proportions, standard errors (SE), and confidence intervals of the risk of smoking one or more packs of cigarettes per day by gender. This statistical analysis plan (SAP), in turn, results in two subpopulation analyses of proportions for each level of gender. The dependent or outcome variable is the risk of smoking one or more packs of cigarettes per day and is determined using the categorical variable, RSKPKCIG. Gender is determined using the categorical variable, IRSEX. Both of the variables in the NSDUH PUF file are numeric in downloadable SAS, Stata, SPSS, and R specific datasets. RSKPKCIG is coded numeric as 1 to 4 for no risk, slight risk, moderate risk, and great risk for valid values and as system missing for invalid values. IRSEX is imputation revised gender for missing values and is coded numeric as 1 for male and 2 for female.

For analysis of the NSDUH PUF file, one should consider three important things before preparing program code in a statistical software package:

How to correctly specify the variance estimation variables including analysis weights;
The statistical procedure along with requested statistics; and
The domains of analysis, if any.

Each of these three considerations is discussed below. [Note the following conventions for wording in program syntax: upper case codes are statements/procedures, upper case italics are option keywords in software packages, and upper case bolded codes are variables from the input dataset.]

Specify variance estimation variables. For variance estimation, each sample program uses the Taylor linearization method for this example SAP. The WR design is the default with the Taylor method for all software packages except SPSS. The stratification and clustering of the complex sample design in the NSDUH PUF are described by specifying the variance estimation variables (and the analysis weight) via statements in the analysis procedure code for SUDAAN and SAS: for example, "NEST VESTR VEREP / MISSUNIT; WEIGHT ANALWT_C;" in SUDAAN and "STRATA VESTR; CLUSTER VEREP; WEIGHT ANALWT_C;" in SAS. Note that the order of the variables in the NEST statement is important. One should use the above block of statements (specific to the NSDUH complex design) in any survey procedure for any SAP, for example, PROC SURVEYLOGISTIC in SAS and PROC LOGISTIC in standalone SUDAAN.

The above can be implemented by the SVYSET command in Stata as "SVYSET VEREP[pweight =ANALWT_C], STRATA(VESTR) SINGLEUNIT(centered)". SPSS requires an analysis plan design file for the data file to perform a complex survey analysis. The first block of code in the SPSS program syntax, below, for the CSPLAN ANALYSIS procedure will create such an analysis plan file.

See the "How are complex sampling variances computed by Taylor linearization method?" FAQ on the use of MISSUNIT option in SUDAAN, SINGLEUNIT(centered) option in Stata, and the NOMCAR procedure option in SAS.
Statistical procedures and requested statistics. In the SAP, RSKPKCIG is the analysis variable with four valid and missing values and IRSEX will be used to define the domains or subpopulations. [In general, the goal is to get results in proportions (not in percentages) of a (outcome or dependent) variable for multiple subpopulations.] Stata's SVY: PROPORTION produces estimates of proportions, along with SEs, for character or numeric variables. SAS's SURVEYMEANS always analyzes character variables as categorical. So, RSKPKCIG can be declared as a character variable by specifying it in both the CLASS and VAR statements to obtain the estimates of the proportions, along with SEs. R’s svyby procedure with the factor (variable) function and FUN=svymean argument produces the mean, SEs and confidence intervals, in which the factor function converts a variable to a set of dummy variables, while SUDAAN's DESCRIPT procedure and SPSS's CSDESCRIPTIVES method compute means, along with SEs, for only continuous variables. The mean estimate of a 0/1 coding dummy variable is essentially a proportion estimate. Therefore, the four dummy variables (RSK1, RSK2, RSK3, and RSK4) that were created for the RSKPKCIG variable also contain missing values. SUDAAN’s DESCRIP and SPSS’s CSDESCRIPTIVES procedures can be used for our desired analysis by obtaining means (i.e., proportions), along with SEs, for the four dummy variables of RSKPKCIG. [Note that using SVY: MEAN in Stata and SURVEYMEANS (no CLASS statement) in SAS, the same analysis of means can be obtained for a list of indicator variables of a categorical variable.]
Domains of Analysis. The CLASS and TABLES statements in SUDAAN, the OVER option in Stata, the DOMAIN statement in SAS, and the SUBPOP TABLE subcommand in SPSS specify the subpopulations/domains (e.g., two for IRSEX) for which the analyses are to be performed. For example, the SAS program code further below, with the "DOMAIN IRSEX;" statement, outputs the analysis results for the variables in the VAR statement for two domains: one for the male (IRSEX=1) population and the other for the female (IRSEX=2) population.
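As a rough illustration of what a domain analysis estimates, the Python sketch below computes weighted category proportions separately within each IRSEX domain. The records are hypothetical toy values, the variable names are invented for this example, and only point estimates are shown; the packages above additionally compute design-based SEs for each domain.

```python
from collections import defaultdict

# Hypothetical toy records: (IRSEX, RSKPKCIG or None if missing, analysis weight)
records = [
    (1, 1, 1000.0), (1, 2, 3000.0), (1, None, 400.0),
    (2, 2, 1500.0), (2, 4, 2500.0), (2, 4, 1000.0),
]

weight_by_cell = defaultdict(float)    # (domain, category) -> sum of weights
weight_by_domain = defaultdict(float)  # domain -> sum of weights over valid values

for sex, risk, weight in records:
    if risk is None:
        continue  # records with a missing outcome are excluded from the estimate
    weight_by_cell[(sex, risk)] += weight
    weight_by_domain[sex] += weight

# Proportion of each category within its own domain
domain_props = {
    cell: total / weight_by_domain[cell[0]]
    for cell, total in weight_by_cell.items()
}

for (sex, risk), estimate in sorted(domain_props.items()):
    print(f"IRSEX={sex} RSKPKCIG={risk}: {estimate:.3f}")
```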

SUDAAN standalone syntax:

[The input data file used for this example is in the SAS Export format. SUDAAN recommends that the input dataset be sorted by the variables in the NEST statement; otherwise, the NOTSORTED option must be specified in the PROC statement.]

PROC DESCRIPT filetype=SASXPORT data="path\nsduh2011.xpt" NOTSORTED;
NEST VESTR VEREP / MISSUNIT;
WEIGHT ANALWT_C;
CLASS IRSEX;
TABLES IRSEX;
VAR RSK1 RSK2 RSK3 RSK4;
PRINT mean semean lowmean upmean
/style=nchs meanfmt=f6.3 semeanfmt=f6.3 lowmeanfmt=f7.3 upmeanfmt=f7.3;

Stata specific code for the same analysis:

use drivename:\path\nsduh2011.dta
svyset VEREP[pweight=ANALWT_C], strata(VESTR) singleunit(centered)
save drivename:\path\nsduh2011svy.dta, replace

[It is a good practice to save the survey setting permanently in the data file. This allows for this saved data to be used for any subsequent survey analysis.]

use drivename:\path\nsduh2011svy.dta
svy: proportion RSKPKCIG, over(IRSEX)

SAS code for this analysis:

PROC SURVEYMEANS data=sasdata NOMCAR mean clm;
STRATA VESTR;
CLUSTER VEREP;
WEIGHT ANALWT_C;
DOMAIN IRSEX / DFADJ;
CLASS RSKPKCIG;
VAR RSKPKCIG;
run;

The SPSS specific code for the same analysis:

[The example code assumes that the user's SPSS software package has the complex survey module installed. Select the first block of code below, copy it, and paste it into the SPSS Syntax Editor. Replace 'folder-path' with the path to the location where you would like to save the nsduh11.csplan XML file. Select and run this modified syntax in the Syntax Editor. CSPLAN will write the analysis design into the nsduh11.csplan XML file. The second block of code shows how to reference the nsduh11.csplan XML file in CSDESCRIPTIVES and other Complex Samples procedures in the current and future SPSS sessions.]

CSPLAN ANALYSIS
/PLAN FILE='folder-path\nsduh11.csplan'
/PLANVARS ANALYSISWEIGHT= ANALWT_C
/DESIGN STRATA= VESTR CLUSTER= VEREP
/ESTIMATOR TYPE = WR.

get file='path\nsduh2011.sav'.

CSDESCRIPTIVES
/PLAN FILE= 'folder-path\nsduh11.csplan'
/SUMMARY VARIABLES= RSK1 RSK2 RSK3 RSK4
/SUBPOP TABLE= IRSEX DISPLAY=LAYERED
/MEAN
/STATISTICS se cin(95)
/MISSING SCOPE=ANALYSIS CLASSMISSING=EXCLUDE.

R specific sample code for the same analysis:

load("folder-path/nsduh2011.rda")
keepvars = c("VESTR", "VEREP", "ANALWT_C", "IRSEX", "RSKPKCIG" )
nsduh11 = nsduh2011[, keepvars]             #make a data file with fewer variables

library(prettyR)   # requires the prettyR package to be installed
# strip value labels of the form "(01) Label" and keep only the numeric code
nsduh11$RSKPKCIG <- as.numeric(sub("^\\(0*([0-9]+)\\).+$", "\\1", nsduh11$RSKPKCIG))
nsduh11$IRSEX <- as.numeric(sub("^\\(0*([0-9]+)\\).+$", "\\1", nsduh11$IRSEX))

library(survey)    # requires the survey package to be installed
options( survey.lonely.psu = "adjust" )
desg <- svydesign(id = ~VEREP , strata = ~VESTR , weights = ~ANALWT_C , data = nsduh11 , nest = TRUE )

# calculate the means or proportions of RSKPKCIG by the levels of IRSEX and their SEs
out = svyby(~factor(RSKPKCIG), ~IRSEX, design = desg , FUN=svymean, vartype=c("se","ci"), na.rm = TRUE)
coef(out)      #extracting the means
SE(out)         #extracting the SEs of means
confint(out) # 95% confidence intervals of mean
print(out)     #all results

Note that the variance estimation variables (except the sampling weight variable) do not affect the mean, proportion, percent, and other first-order moment statistics. For example, in SUDAAN syntax code, the design=WR option and the entire "nest VESTR VEREP;" statement in PROC DESCRIPT have no impact on mean/proportion estimates. However, the variance estimation variables must be used (e.g., the design= option and the nest statement as above) to produce the SE estimates of descriptive statistics (e.g., SE of a mean/proportion) and inferential statistics (e.g., confidence intervals of a mean/proportion and p-values of hypothesis tests).
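This point can be illustrated with a small Python sketch. It is not the implementation used by any of the packages above: it assumes a textbook Taylor linearization variance for a weighted mean under a stratified design with PSUs treated as sampled with replacement, the data are hypothetical toy values, and strata with a single PSU are simply skipped rather than handled as MISSUNIT or singleunit(centered) would handle them.

```python
import math
from collections import defaultdict

# Hypothetical toy data: (stratum, PSU, weight, y), where y is a 0/1 indicator
data = [
    (1, 1, 100.0, 1), (1, 1, 120.0, 1), (1, 2, 90.0, 0), (1, 2, 110.0, 0),
    (2, 1, 200.0, 1), (2, 1, 180.0, 0), (2, 2, 150.0, 0), (2, 2, 170.0, 0),
]


def weighted_mean(rows):
    """Point estimate: uses the weights only, never the strata or PSUs."""
    total_weight = sum(w for _, _, w, _ in rows)
    return sum(w * y for _, _, w, y in rows) / total_weight


def linearized_se(rows):
    """SE of the weighted mean by Taylor linearization (with-replacement PSUs)."""
    total_weight = sum(w for _, _, w, _ in rows)
    mean = weighted_mean(rows)
    # Sum the linearized values w * (y - mean) / W within each PSU of each stratum
    psu_totals = defaultdict(lambda: defaultdict(float))
    for stratum, psu, w, y in rows:
        psu_totals[stratum][psu] += w * (y - mean) / total_weight
    variance = 0.0
    for totals in psu_totals.values():
        n_psu = len(totals)
        if n_psu < 2:
            continue  # simplification: single-PSU strata contribute nothing here
        center = sum(totals.values()) / n_psu
        variance += n_psu / (n_psu - 1) * sum((z - center) ** 2 for z in totals.values())
    return math.sqrt(variance)


# Same weights and outcomes, but with the design ignored:
# one stratum, and every record treated as its own PSU
no_design = [(1, i, w, y) for i, (_, _, w, y) in enumerate(data)]

print("mean with design:   ", weighted_mean(data))
print("mean without design:", weighted_mean(no_design))
print("SE with design:     ", linearized_se(data))
print("SE without design:  ", linearized_se(no_design))
```

The two means agree because only the weights enter the point estimate, while the two SEs differ because the strata and PSUs determine how the linearized values are grouped.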

¹Prior to 2002, data were collected under the old title, National Household Survey on Drug Abuse (NHSDA).

²For further details on the sampling design and weighting adjustment method, please see the 2011 NSDUH Methodological Resource Book.

Friday, March 7, 2014

What are the differences between NSDUH public-use and restricted-use data?

NSDUH public-use and restricted-use data differ in terms of access, availability, and variable groups. Users can reference the Analysis Options for NSDUH Public-use and Restricted-use Data page for help with determining which available option best meets their research needs: public-use (downloadable data), SDA (online analysis of public-use data), R-DAS (online analysis with disclosure restrictions), or the Data Portal (virtual desktop access to restricted-use microdata).

Wednesday, February 26, 2014

How can I access the NCS-1, 1990-1992 study? I can no longer find it on the SAMHDA site.

The NCS-1, 1990-1992 study has been transferred from SAMHDA to the National Addiction & HIV Data Archive Program (NAHDAP) Archive. NCS-1 data and documentation files can now be accessed through the NAHDAP and ICPSR General Archive websites. We apologize for any inconvenience this transition may have caused.

Tuesday, February 11, 2014

Which variables are available in the restricted-use NSDUH data files?

The following variable crosswalk displays all variables in the restricted-use NSDUH data files and their availability for specific groups of years.

Restricted-use data variable crosswalk