Friday, May 1, 2015

[WEBINAR] Online Analysis of SAMHSA Public-Use Data with Survey Documentation and Analysis (SDA)

Learn the fundamentals of analyzing SAMHSA public-use data online without specialized software or downloading of data. The Online Analysis of SAMHSA Public-Use Data with Survey Documentation and Analysis (SDA) webinar provides:

  • A general overview of SDA's interface, analytic functions, and available help resources; and
  • A demonstration of how to use SDA for variable recoding and computation, crosstabulation, comparison of means, and regression.

Thursday, March 12, 2015

How do I calculate variance of difference between totals for NSDUH data?

Before reading this document, users should review the "How do I calculate variance of totals for NSDUH data?" FAQ.

Sample syntax to calculate the SE of contrast between totals for controlled Domains of YEAR

The following programming code shows how to compute the SEs of the differences of two totals and the p-values for the tests of differences of two totals when total estimates are considered to be from controlled domains. In this example, we  will  test the differences between totals of DEPNDALC (alcohol dependence) for three levels of YEAR (2006 versus 2005 and 2007 versus 2005) using a combined file containing 2005-2007 data.

For controlled domains: SE of the contrast between totals1 (i.e., t2-t1) is:

SE(t2-t1) = sqrt(V(t1)+V(t2)-2*COV(t1,t2)),
where,
V(t1) = wsum1*wsum1*V(mean1) and
COV(t1,t2) = wsum1*wsum2*COV(mean1,mean2)

The covariance, COV(t1,t2), between two totals of t1 and t2 will be computed from the estimated covariance, COV(mean1,mean2), between the means or proportions. Note that these total estimates may take the form of total estimates from: (a) two different survey years (e.g., 2005 and 2006) within a combined file; (b) total estimates from sets of combined survey data (e.g., 2005-2006 annual totals and 2007-2008 annual totals) in a combined file where, for instance, ANALWT_C2 is an adjusted weight calculated by multiplying ANALWT_C by 0.5, or (c) total estimates for populations of interest within a single survey year. The following code is based on the data file of condition (a).

  • SUDAAN in conjunction with SAS code:

    proc descript data=work.nsduh5to7  design=WR  notsorted nomarg;
    nest vestr verep; weight analwt_c;
    var DEPNDALC;
    class YEAR/ nofreqs;
    diffvar YEAR=(6 5)/name="contrast Year6 vs 5";
    diffvar YEAR=(7 5)/name="contrast Year7 vs 5";
    output wsum total/replace filetype=sas filename=work.diff_total wsumfmt=f15.3 totalfmt=f15.4;
    run;

  • Previous sample syntax demonstrated how to calculate variance of the total for a controlled domain. The following pieces of code show how to compute the covariance between two totals. This is accomplished after extracting the covariance between means (from SUDAAN output work.cov1 in (1)) of respective domains.

    proc descript  data=work.nsduh5to7  design=WR  notsorted NOMARG;
    nest vestr verep; weight analwt_c;
    var DEPNDALC;
    class YEAR / nofreqs;
    tables YEAR;
    output / meancov=default filetype=sas filename=work.cov1 replace; * --------- (1);
    output wsum semean total /replace  filetype=sas filename=work.semean_domains
    wsumfmt=f15.3 semeanfmt=f12.10 totalfmt=f15.4;
    run;

    The following data step subsets the rows (i.e., rows when IDNUM=2) and columns (i.e., 3 B's and ROWNUM variables) to extract the variance-covariance matrix. As a result of the NOMARG option, there were 3 B's variables in work.cov101 because YEAR in the Tables statement has 3 levels.

    data work.cov1 (keep=rownum b01--b03);
    set work.cov101; /* by default adds 01 for meancov= option in (1) */
    if idnum = 2;   
    attrib _all_ label ='';
    run;

    The next data step will extract the covariance estimates between the first level and a successive level of YEAR. These estimates correspond to variable B01 for rownum=2 onwards.

    data work.covar (keep=covmean);
    set work.cov1;
    if rownum ne 1 then do; covmean=b01; end;
    if rownum ne 1 then output;
    run;

    data work.yr1(rename=(wsum=wsum1 total=total1) keep=YEAR wsum vmean1 total)
    work.yr2(rename=(wsum=wsum2 total=total2) keep=YEAR wsum vmean2 total)
    work.yr3(rename=(wsum=wsum2 total=total2 vmean3=vmean2) keep=YEAR wsum vmean3 total);

    set work.semean_domains;
    vmean1 = semean**2;
    vmean2 = semean**2;
    vmean3 = semean**2;
    if YEAR = 5 then output work.yr1;
    if YEAR = 6 then output work.yr2;
    if YEAR = 7 then output work.yr3;
    attrib _all_ label ='';
    run;

    data work.t12;
    merge work.yr1 work.yr2;
    data work.t13;
    merge work.yr1 work.yr3;
    data work.variances;
    merge work.t12 work.t13;
    by YEAR;
    run;

    In work.diff_total dataset, wsum is the sum of weights from two domains involved in the contrast and total is the contrast estimates between two respective totals.

    data work.contrast_totals(rename=(wsum=cntrwsum total=cntrtotal));
    length Cntr_Year $ 8;
    merge work.diff_total(keep=wsum total) work.variances work.covar;
    if year = 6 then do;
    Cntr_year = ‘y6 vs y5';
    secntrtotal = sqrt(wsum1**2*vmean1+wsum2**2*vmean2-2*wsum1*wsum2*covmean);
    end;
    if year = 7 then do;
    Cntr_year = ‘y7 vs y5';
    secntrtotal = sqrt(wsum1**2*vmean1+wsum2**2*vmean2/*-2*wsum1*wsum2*covmean*/);*?(2);
    end;
    if secntrtotal gt 0.0 then do;
    t_val = total/secntrtotal;
    p_value = 2*(1-probt(abs(t_val),60)); *df=120-60=60; *p-value computed for 2-sided test;
    end;
    run;

    Note that we deleted the covariance term from equation (2) above for nonconsecutive years (see footnote -20 on page-13 in the 2011 NSDUH Sample Design Report.)

      proc print data=work.contrast_totals noobs;
      var Cntr_Year cntrwsum cntrtotal secntrtotal t_val p_value;
      run;

    • Stata code for the same analysis:

    This assumes some knowledge of vector-matrix computation for the syntax.

    svy: mean DEPNDALC, over (YEAR)
    mat cov = e(V)
    svy: total DEPNDALC, over (YEAR)
    # extracts (numerator weighted) totals of DEPNDALC for each domain
    mat totals = e(b)
    # extracts denominator weighted sizes
    mat wsum = e(_N_subp)
    mat vmean = (cov[1,1],cov[2,2],cov[3,3])

    # compute contrast between totals and sum of the weights from two domains in contrast
    mat cntrtotal = (totals[1,2]-totals[1,1], totals[1,3]-totals[1,1])
    mat cntrwsum = (wsum[1,2]+wsum[1,1], wsum [1,3]+wsum [1,1])

    # compute the variances and covariances of totals
    mat vt1 = (wsum[1,1]^2*cov[1,1])
    mat vt2 = (wsum[1,2]^2*cov[2,2])
    mat vt3 = (wsum[1,3]^2*cov[3,3])
    mat cvt12 = (wsum[1,1]*wsum[1,2]*cov[1,2])
    mat cvt13 = (wsum[1,1]*wsum[1,3]*cov[1,3])

    # covariance-term from variance-formula of second contrast is omitted because of nonadjacent years
    mat secntrtotal = (sqrt(vt1[1,1]+vt2[1,1]-2.0*cvt12[1,1]), sqrt(vt1[1,1]+vt3[1,1]/*-2.0*cvt13[1,1]*/))
    mat t_val = (cntrtotal[1,1]/secntrtotal[1,1], cntrtotal[1,2]/secntrtotal[1,2])
    # p value for two-sided test (df=60=120-60)
    mat p_val = (2*ttail(60,t_val[1,1]), 2*ttail(60,t_val[1,2]))
    mat result = (cntrwsum', cntrtotal', secntrtotal', t_val', p_val')
    mat colnames result = CombWgtSize contrTotal SEcontrTotal t_stat pvalue
    mat rownames result = alcohol:Y6_vs_5 alcohol:Y7_vs_5
    # command below displays the pooled weighted size of two domains, difference between two totals, SE of difference, Student's t statistic, and the p-value of the test
    mat list result

    R does not currently compute the covariance matrix for the Taylor linearization method.



    1 See Sections 5 and 7 in 2012 NSDUH Statistical Inference Report.

    Monday, November 17, 2014

    [DUPLICATE FAQ] What is the Restricted-use Data Analysis System (R-DAS)?

    The R-DAS is an online analysis system that allows researchers to produce frequencies and cross-tabulations using restricted-use data files. The R-DAS provides output that is available for viewing and export. The R-DAS provides tables and frequencies. Advanced statistical methods are not available at this time.

    The R-DAS does not permit listing of individual cases and does not provide unweighted frequencies in the R-DAS codebook, nor are users able to generate unweighted frequencies (no unweighted sample sizes are provided). These limitations have been put in place to reduce the potential for disclosing confidential information.

    The R-DAS provides standard errors that take into account the complex survey design. All weighted totals and point estimates are rounded to the nearest thousand, and all percents and associated statistics are rounded to one decimal point. If any cell in a table contains too few unweighted cases, then the entire table is suppressed.

    The R-DAS does not currently allow for the creation of composite variables (i.e., the creation of new variables using other variables), but that capability is under development. The R-DAS does allow for recoding of existing continuous and categorical variables. See the SDA 3.5 help documentation for assistance with how to Temporarily Recode a Variable.

    R-DAS webinar

    Watch the "Broadening Access to Substance Abuse and Mental Health Data with the R-DAS" webinar to learn about the National Survey on Drug Use and Health (NSDUH) data available through the R-DAS.

    For more information on analyzing data with the R-DAS, consult the FAQ section on Help with the Restricted-use Data Analysis System (R-DAS).

    [WEBINAR] Broadening Access to Substance Abuse and Mental Health Data with the R-DAS

    Learn about the National Survey on Drug Use and Health (NSDUH) data available through the R-DAS. The Broadening Access to Substance Abuse and Mental Health Data with the R-DAS webinar provides:

    • A general understanding of the data and resources available through SAMHDA;
    • Differences between the public-use and restricted-use NSDUH data files;
    • Instructions on how to locate and access restricted-use NSDUH data in the R-DAS; and
    • A brief demonstration on how to create a cross-tabulation in the R-DAS.

    Thursday, November 6, 2014

    What are the differences between DAWN public-use and restricted-use data?

    DAWN public-use and restricted-use data differ in terms of access, availability, and variable groups. Users can reference the Analysis Options for DAWN Public-use and Restricted-use Data page for help with determining which available option best meets their research needs: public-use (downloadable data), SDA (?) (online analysis of public-use data), or the Data Portal (?) (virtual desktop access to restricted-use microdata).

    Tuesday, April 1, 2014

    When is the next application period for access to the Data Portal?

    The Data Portal call for applications ended on December 15, 2014.

    Sign up for SAMHDA News to be notified about future Data Portal call for applications.

    For information about the application process and data available through the Data Portal, please visit the Data Portal page. For further assistance with the Data Portal, please email dataportal@icpsr.umich.edu.

    Tuesday, March 25, 2014

    How do I calculate variance of totals for NSDUH data?

    The final analysis (poststratification adjustment) weight variable, ANALWT_C, in the NSDUH public-use file (PUF) file was adjusted for unequal probability of selection, nonresponse of respondents, and coverage bias of respondents to the poststratification population totals from United States Census 2000. All NSDUH surveys used U.S. Census 2000 except the2002 and 2003 surveys, which used the 1990 Census, and the 2004 survey, which used 50% from each of 1990 and 2000 Censuses. By using ANALWT_C, the weighted estimates of totals obtained for any of the survey variables are the (target population) estimates for the entire universe of civilian members of the noninstitutionalized population in the United States.

    For the NSDUH PUF file, mixed method approaches are recommended for variance estimation of totals. Why mixed approaches? Because in some domain analyses, the estimated domain sizes are subject to sampling variability, and in other special domain analyses, the estimated domain sizes are not subject to sampling variability. We can obtain the variance estimates of totals for the former case directly from the procedures of software packages using the Taylor series linearization method. For the latter situation (i.e., fixed/controlled domain sizes), this variance can indirectly/manually be computed with the method discussed below (see page i-20 in the 2011 NSDUH Statistical Inference Report.) Note that variance estimates of the estimated total of a variable will not be reliable if it is obtained by a software package procedure for fixed-domains.

    The analysis weight variable, ANALWT_C, was adjusted by a poststratification weight calibration method. With this method, a set of post-strata classes (i.e., demographic domains) were constructed using some demographic variables of respondents (such as age, gender, and/or race), and these demographic domains were forced to match their respective U.S. Census Bureau population estimates through the weight calibration process. Each estimated domain size of a domain in the NSDUH PUF data is equal to the corresponding population domain size estimate of U.S. Census data. This set of domains, considered controlled (i.e., those with fixed domain sizes), was restricted to main effects and two-way interactions in order to maintain continuity between years (age, gender, and race were used in the 2011 NSDUH  for constructing the domains; see page i-22 in the 2011 NSDUH codebook, and age and gender were used in the 2002 to 2010 NSDUH; see page i-21 in the 2010 NSDUH codebook). This is why the variance estimates for the estimated total of an interested variable for a domain, which is specific to post-strata class variable, or a combination of variables (such as age, gender, and/or race), or a combination of post-strata class and non-post-strata class variable(s), requires special attention to manually compute the variance estimate outside the procedure (for details about variance of totals and differences see pages i-20 in the 2011 NSDUH codebook). Examples of controlled domains are (i) the domains are defined by each of the two levels of Gender and (ii) the domains are constituted by the combination of the levels of two categories of Gender and three categories of Age. An example of non-controlled (i.e., random) domains may be defined by each of the four levels of EMPSTATY (imputation revised employment status) because EMPSTATY is not a post-stratification weight class variable. Whereas (iii) the domains defined by the combination of the levels between gender and EMPSTATY cannot be regarded as random or controlled domains.

    Suppose a user desires to obtain the total of a continuous or an indicator (e.g., DEPNDALC, the alcohol dependence) variable, along with the standard error (SE), of the respondents by a domain variable. The output in each of the cells of the Table is for a domain analysis. The weighted total number of the respondents (i.e., the estimated sub/population size) in each cell is the sum of the analysis weight variable, denoted by DomainSize, and the weighted total estimate of DEPNDALC is the sum of the product of the analysis weight and DEPNDALC in observations, denoted by TotalDALC. The weighted mean alcohol dependence estimate, denoted by MeanDALC, is the ratio between TotalDALC and DomainSize. All software packages calculate the SE of TotalDALC, denoted by SE(TotalDALC), using a closed-form/analytic variance formula since TotalDALC is a linear statistic, and the SE of MeanDALC, denoted by SE(MeanDALC), using the Taylor linearization method (as default) since MeanDALC is a nonlinear statistic. For a fixed-domain, this SE estimate of SE(TotalDALC) of the estimated total of alcohol dependence will not be reliable. The TotalDALC and the MeanDALC estimates are always considered as random variables but the DomainSize estimate is not always treated as a random variable. How the DomainSize estimate is handled depends on the way the domains are formed in an analysis using NSDUH data. Situations for which a domain size will be treated as fixed/controlled for NSDUH PUF data were discussed earlier in this FAQ with examples.

    The weighted total estimate of alcohol dependence in a cell can also be computed from the weighted total respondents multiplied by the weighted mean alcohol dependence of respondents, i.e., (by mean method) TotalDALC _M = DomainSize x MeanDALC. For controlled/fixed domains, an appropriate estimate of the SE for the total number of persons with a characteristic (e.g., alcohol dependence in this illustration) of interest is SE(TotalDALC_M)1 = (DomainSize) * SE(MeanDALC), where SE(MeanDALC) is computed correctly by the Taylor linearization method using the software package procedures. The point estimates of TotalDALC and TotalDALC_M are the same, but their SE estimates are substantially different. None of the software packages directly produce this SE estimate for totals using the above formula. Note again that for non-controlled domains, the users can directly obtain the variance of the estimated totals using the software packages.

    Side note: The analysis weight variable in the DAWN data is also a poststratification adjusted weight but poststratification adjustments were implemented within the design strata to offset the (sampling frame) coverage bias on the estimates. No post-strata classes were constructed using any of the Emergency Department (ED) visit characteristic variables to force the class sizes to match the respective benchmark population totals of ED visits from the American Hospital Association (AHA) database by the weight calibration method; thus, unlike NSDUH, SE estimate of totals can be obtained directly from software package procedures.

    Sample syntax to calculate SE of total for controlled Domains

    All software packages produce the SE of a total statistic but this SE is not appropriate if this total estimate is obtained for a controlled domain. Using an individual year data, the following program code shows how to compute the SEs of the totals from controlled domains using the SUDAAN procedure in conjunction with the SAS data step and Stata's svy: mean procedure. The calculation of SE for totals also requires the SE of means. Users can find help with the statistical analysis plan for the later topic in the "How do I account the complex sampling design when analyzing NSDUH data?" FAQ.  

    • SUDAAN in conjunction with SAS sample code

      proc descript design=WR  data=work.nsduh11_analysis notsorted nomarg;
      nest vestr verep; weight analwt_c;
      class irsex/ nofreqs;
      tables irsex;
      var rsk1 rsk2 rsk3 rsk4;
      output wsum semean total /replace  filetype=sas filename=work.sud_result1
      wsumfmt=f15.3 semeanfmt=f12.10 totalfmt=f15.4;  #SE of mean with 10 decimal points to avoid rounding error
      run;

    The wsum and total statistics requested in the output statement are the population size estimate (which is considered not subject to sampling variability) and the weighted total estimate in the Table cell. Note that the setotal statistic was not requested, as this SE is not appropriate.

      #Using SAS, compute: SE(total) = Domain_size*SE(mean);
      data work.sud_result1;
      set work.sud_result1;
      if semean gt 0.0 then setotal = wsum * semean;
      run;
      proc print data= work.sud_result1;
      run;

    • Stata sample code:

    A user requires some knowledge of vector-matrix (element-wise) computation for this program.

      use drivename:\path\statadatafilename.dta
      svyset VEREP[pweight=ANALWT_C], strata(VESTR)  singleunit(centered)
      save drivename:\path\mystatadata.dta, replace

      svy: mean rsk1 rsk2 rsk3 rsk4, over (IRSEX)
      #extracts variance-covariance matrix for means
      matrix vcov = e(V)
      matrix var = vecdiag(vcov)
      #extracts weighted size (denominator totals of mean), i.e., domain sizes
      matrix wsum = e(_N_subp)
      matrix setotal = J(1,8,0)
      #compute SE: setotal = wsum * se(mean) for each total estimates
      local j=1
      while `j'<=8{
      matrix setotal[1,`j'] = wsum[1,`j']*sqrt(var[1,`j'])
      local j = `j' + 1
      }
      svy: total rsk1 rsk2 rsk3 rsk4, over (IRSEX)
      matrix totals = e(b)
      matrix result = (wsum', totals', setotal')
      matrix colnames result = Domain_size  Total  SEtotal
      matrix list result

  • SAS sample code:

    proc surveymeans data= work.nsduh11 nobs nmiss mean sum sumwgt NOMCAR;

    strata  vestr; cluster  verep;  weight  analwt_c;
    domain irsex;
    class rskpkcig;
    var  rskpkcig;
    ods output domain = mydomain;

    run;

    data work.myresult(keep=IRSEX RSKPKCIG N Nmiss Domain_wgt_size Totals SE_totals);

    set work.mydomain;
    SE_totals = sumwgt * stdErr;
    rename VarLevel=RSKPKCIG sum = Totals  sumwgt = Domain_wgt_size;

    run;

  • R sample code:

    load("folder-path/mydat.rda")     #2011 NSDUH PUF
    # save the loaded data with a new “mydat“ name # RSKPKCIG and IRSEX are factor variables; we need to make them numeric.
    library(prettyR)   # requires to install the prettyR package
    mydat$RSKPKCIG<- as.numeric(sub("^\\(0*([0-9]+)\\).+$", "\\1", mydat$RSKPKCIG))
    mydat$IRSEX <- as.numeric(sub("^\\(0*([0-9]+)\\).+$", "\\1", mydat$IRSEX))

    # compute the totals and SEs of totals [for controlled domains]
    # formula : total_hat = N * p_hat, so its SE is SE(total_hat) = N * SE(p_hat)
    totals1=aggregate(mydat$ANALWT_C,by=list(mydat$RSKPKCIG, mydat$IRSEX), FUN=sum, na.rm=TRUE)

    library(reshape)   #   reshape the package that has the RENAME statement command
    # c(oldVARname="newVARname")
    totals2 <- rename(totals1, c(Group.1="RSKPKCIG", Group.2="IRSEX",x="totals")) 
    # totals2 has no missing values
    domsize1=aggregate(totals2$totals,by=list(totals2$IRSEX), FUN=sum) 
    domsize2 <- rename(domsize1, c(Group.1="IRSEX",x="domain_size"))  
    # merging data frames
    domainSize = merge(domsize2, totals2, by=c("IRSEX"))

    require(survey)   # requires to install the survey package
    options( survey.lonely.psu = "adjust" )
    desg <- svydesign(id = ~VEREP , strata = ~VESTR , weights = ~ANALWT_C , data = mydat, nest = TRUE )

    # obtaining SEs of the proportions of the categorical variable, RSKPKCIG by IRSEX
    out = SE(svyby(~factor(RSKPKCIG), ~IRSEX, design = desg, FUN=svymean, vartype=c("se"), na.rm = TRUE) )

    library(reshape2)       #first install reshape2 and then use the library() command
    #reshape long to wide
    long2wide = dcast(domainSize, IRSEX ~ RSKPKCIG, value.var="domain_size")   long2wide$IRSEX = NULL
    N = as.matrix(long2wide)        # N = fixed (domain) sizes
    SE_p_bar = as.matrix(out)
    se_totals = N * SE_p_bar                     # elementwise matrix multiplication
    SE_totals = as.vector(t(se_totals))     # Row-wise vectorize   [row vector is a vector, column vector is a matrix]
    myresult=cbind(domainSize, as.data.frame (SE_totals))
    # re-arranging the variables
    myresult = myresult[c("IRSEX", "RSKPKCIG", "domain_size", "totals", "SE_totals")]

    print(myresult)

  • A related FAQ, How do I calculate variance of difference between totals for NSDUH data?, may provide additional useful information.



    1 See Section 5 in 2012 NSDUH Statistical Inference Report

    .