Tuesday, March 25, 2014

How do I calculate variance of totals for NSDUH data?

The final analysis (poststratification-adjusted) weight variable, ANALWT_C, in the NSDUH public-use file (PUF) was adjusted for unequal probabilities of selection, nonresponse, and coverage bias relative to the poststratification population totals from the 2000 U.S. Census. All NSDUH surveys used the 2000 U.S. Census except the 2002 and 2003 surveys, which used the 1990 Census, and the 2004 survey, which used 50% from each of the 1990 and 2000 Censuses. With ANALWT_C, the weighted estimates of totals obtained for any of the survey variables are the (target population) estimates for the entire universe of civilian members of the noninstitutionalized population in the United States.

For the NSDUH PUF, a mixed approach is recommended for estimating the variance of totals. Why mixed? In some domain analyses the estimated domain sizes are subject to sampling variability; in other, special domain analyses they are not. In the former case, the variance estimates of totals can be obtained directly from software package procedures that use the Taylor series linearization method. In the latter case (i.e., fixed/controlled domain sizes), the variance must be computed indirectly, by hand, with the method discussed below (see page i-20 in the 2011 NSDUH Statistical Inference Report). Note that the variance estimate of an estimated total will not be reliable if it is taken directly from a software package procedure for a fixed domain.

The analysis weight variable, ANALWT_C, was adjusted by a poststratification weight calibration method. With this method, a set of post-strata classes (i.e., demographic domains) was constructed from demographic variables of the respondents (such as age, gender, and/or race), and these demographic domains were forced to match their respective U.S. Census Bureau population estimates through the weight calibration process. As a result, the estimated size of each such domain in the NSDUH PUF data equals the corresponding population domain size estimate from the U.S. Census data. This set of domains, considered controlled (i.e., those with fixed domain sizes), was restricted to main effects and two-way interactions in order to maintain continuity between years (age, gender, and race were used to construct the domains in the 2011 NSDUH; see page i-22 in the 2011 NSDUH codebook. Age and gender were used in the 2002 to 2010 NSDUH; see page i-21 in the 2010 NSDUH codebook). This is why the variance estimate for the estimated total of a variable of interest requires special attention, and must be computed manually outside the software procedure, whenever the domain is defined by a post-strata class variable, by a combination of such variables (such as age, gender, and/or race), or by a combination of post-strata class and non-post-strata class variables (for details about the variance of totals and differences, see page i-20 in the 2011 NSDUH codebook).

Examples of controlled domains are (i) the domains defined by the two levels of Gender and (ii) the domains formed by crossing the two categories of Gender with three categories of Age. An example of non-controlled (i.e., random) domains is the set defined by the four levels of EMPSTATY (imputation-revised employment status), because EMPSTATY is not a poststratification weight class variable. In contrast, (iii) the domains defined by crossing the levels of Gender and EMPSTATY cannot be regarded as either random or controlled domains.
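The three cases (i) to (iii) can be sketched as a small lookup. This is only an illustration of the rule as described in this FAQ: the labels "age", "gender", and "race" are placeholders rather than PUF variable names, the function classify_domain is hypothetical, and treating anything beyond a two-way interaction as "neither" is an assumption made here, not something the NSDUH documentation is quoted as saying.

```python
# Poststratification variables as described for the 2011 NSDUH.
# These labels are illustrative placeholders, not actual PUF variable names.
POSTSTRATA_VARS = frozenset({"age", "gender", "race"})


def classify_domain(domain_vars, poststrata_vars=POSTSTRATA_VARS, max_order=2):
    """Classify a domain definition as 'controlled', 'random', or 'neither'.

    domain_vars     : names of the variables that are crossed to form the domain
    poststrata_vars : variables used as poststratification weight classes
    max_order       : highest interaction order that was calibrated (2 = two-way)
    """
    domain_vars = set(domain_vars)
    if not domain_vars:
        raise ValueError("at least one domain variable is required")
    calibrated = domain_vars & set(poststrata_vars)
    if not calibrated:
        # No post-strata variable involved: domain size is a random quantity.
        return "random"
    if calibrated == domain_vars and len(domain_vars) <= max_order:
        # Main effect or two-way interaction of post-strata variables.
        return "controlled"
    # Mixed definitions, or (by assumption) interactions above max_order.
    return "neither"


print(classify_domain({"gender"}))              # case (i)
print(classify_domain({"gender", "age"}))       # case (ii)
print(classify_domain({"EMPSTATY"}))            # random domain
print(classify_domain({"gender", "EMPSTATY"}))  # case (iii)
```

For the 2002 to 2010 surveys, the same sketch would be called with poststrata_vars={"age", "gender"}, following the description above.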

Suppose a user wants the total of a continuous or an indicator variable (e.g., DEPNDALC, alcohol dependence), along with its standard error (SE), by a domain variable. Each cell of the resulting Table is the output of one domain analysis. The weighted total number of respondents in a cell (i.e., the estimated population or subpopulation size) is the sum of the analysis weights, denoted by DomainSize, and the weighted total estimate of DEPNDALC is the sum over observations of the product of the analysis weight and DEPNDALC, denoted by TotalDALC. The weighted mean alcohol dependence estimate, denoted by MeanDALC, is the ratio of TotalDALC to DomainSize. All software packages calculate the SE of TotalDALC, denoted by SE(TotalDALC), using a closed-form/analytic variance formula because TotalDALC is a linear statistic, and the SE of MeanDALC, denoted by SE(MeanDALC), using the Taylor linearization method (by default) because MeanDALC is a nonlinear statistic. For a fixed domain, the SE(TotalDALC) produced this way will not be reliable. The TotalDALC and MeanDALC estimates are always treated as random variables, but the DomainSize estimate is not always treated as a random variable; how it is handled depends on how the domains are formed in an analysis of NSDUH data. The situations in which a domain size is treated as fixed/controlled for NSDUH PUF data were discussed, with examples, earlier in this FAQ.
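To make the three point estimates concrete, here is a minimal Python sketch on made-up records. The weights, domain labels, and indicator values are invented for illustration, and the helper domain_estimates is hypothetical; real analyses would read ANALWT_C and DEPNDALC from the PUF.

```python
# Toy records: (analysis weight, domain label, 0/1 indicator).
# All values are made up for illustration only.
records = [
    (1500.0, "male", 1),
    (2000.0, "male", 0),
    (2500.0, "male", 1),
    (1000.0, "female", 0),
    (3000.0, "female", 1),
]


def domain_estimates(records, domain):
    """Return (DomainSize, Total, Mean) for one domain."""
    rows = [(w, y) for (w, d, y) in records if d == domain]
    domain_size = sum(w for w, _ in rows)   # sum of analysis weights
    total = sum(w * y for w, y in rows)     # sum of weight * indicator
    mean = total / domain_size              # ratio of the two sums
    return domain_size, total, mean


for label in ("male", "female"):
    size, total, mean = domain_estimates(records, label)
    print(f"{label}: DomainSize={size:.0f} Total={total:.0f} Mean={mean:.4f}")
```

The mean is a ratio of two weighted sums, which is why its SE calls for linearization, whereas the total is a single weighted sum.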

The weighted total estimate of alcohol dependence in a cell can also be computed as the weighted total number of respondents multiplied by the weighted mean alcohol dependence of respondents, i.e., (by the mean method) TotalDALC_M = DomainSize * MeanDALC. For controlled/fixed domains, an appropriate estimate of the SE for the total number of persons with a characteristic of interest (alcohol dependence in this illustration) is SE(TotalDALC_M) = DomainSize * SE(MeanDALC) (see footnote 1), where SE(MeanDALC) is computed correctly by the Taylor linearization method in the software package procedures. The point estimates TotalDALC and TotalDALC_M are the same, but their SE estimates are substantially different. None of the software packages directly produces the SE estimate for totals given by the above formula. Note again that for non-controlled domains, users can obtain the variance of the estimated totals directly from the software packages.
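The difference between the two SE estimates can be seen on a toy stratified sample. The sketch below is an illustration under stated assumptions, not the code any package runs: the data are invented, the helper names wr_variance and domain_total_se are hypothetical, and the variance estimator is the standard with-replacement formula based on PSU totals within strata, with the mean linearized as (y - mean) / DomainSize.

```python
import math
from collections import defaultdict

# Toy sample rows: (stratum, psu, weight, sex, y). All values are made up.
SAMPLE = [
    (1, 1, 100.0, 1, 1), (1, 1, 100.0, 2, 0),
    (1, 2, 100.0, 1, 0), (1, 2, 200.0, 1, 1),
    (2, 1, 150.0, 1, 1), (2, 1, 50.0, 2, 1),
    (2, 2, 100.0, 1, 0), (2, 2, 100.0, 2, 0),
]


def wr_variance(sample, score):
    """With-replacement variance estimate of sum(weight * score(row))."""
    psu_totals = defaultdict(lambda: defaultdict(float))
    for row in sample:
        stratum, psu, weight = row[0], row[1], row[2]
        psu_totals[stratum][psu] += weight * score(row)
    variance = 0.0
    for totals in psu_totals.values():
        n = len(totals)  # number of PSUs in the stratum (needs at least 2)
        center = sum(totals.values()) / n
        variance += n / (n - 1) * sum((t - center) ** 2 for t in totals.values())
    return variance


def domain_total_se(sample, sex):
    """Point estimates and both SE versions for the domain sex == `sex`."""
    in_domain = [r for r in sample if r[3] == sex]
    size = sum(r[2] for r in in_domain)
    total = sum(r[2] * r[4] for r in in_domain)
    mean = total / size
    # SE of the total when the domain size is treated as random:
    # rows outside the domain contribute zero but stay in the sample.
    se_total_random = math.sqrt(
        wr_variance(sample, lambda r: r[4] if r[3] == sex else 0.0))
    # SE of the mean via Taylor linearization of the ratio total / size.
    se_mean = math.sqrt(
        wr_variance(sample, lambda r: (r[4] - mean) / size if r[3] == sex else 0.0))
    return {
        "size": size, "total": total, "mean": mean, "se_mean": se_mean,
        "se_total_random": se_total_random,
        "se_total_fixed": size * se_mean,  # DomainSize * SE(Mean)
    }


result = domain_total_se(SAMPLE, sex=1)
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```

In this toy sample the fixed-domain SE comes out smaller than the random-domain SE, because the variability of the domain size itself is removed from the calculation.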

Side note: The analysis weight variable in the DAWN data is also a poststratification-adjusted weight, but the poststratification adjustments were implemented within the design strata to offset the (sampling frame) coverage bias in the estimates. No post-strata classes were constructed from the Emergency Department (ED) visit characteristic variables to force the class sizes to match the respective benchmark population totals of ED visits from the American Hospital Association (AHA) database by the weight calibration method; thus, unlike for NSDUH, SE estimates of totals can be obtained directly from software package procedures.

Sample syntax to calculate the SE of a total for controlled domains

All software packages produce the SE of a total, but this SE is not appropriate if the total is estimated for a controlled domain. Using a single year of data, the following program code shows how to compute the SEs of totals for controlled domains using the SUDAAN procedure in conjunction with the SAS data step, Stata's svy: mean procedure, SAS, and R. The calculation of the SE for totals also requires the SE of means. Users can find help with the statistical analysis plan for the latter topic in the "How do I account the complex sampling design when analyzing NSDUH data?" FAQ.

  • SUDAAN in conjunction with SAS sample code

    proc descript design=WR data=work.nsduh11_analysis notsorted nomarg;
    nest vestr verep;
    weight analwt_c;
    class irsex / nofreqs;
    tables irsex;
    var rsk1 rsk2 rsk3 rsk4;
    /* SE of mean is written with 10 decimal places to avoid rounding error */
    output wsum semean total / replace filetype=sas filename=work.sud_result1
           wsumfmt=f15.3 semeanfmt=f12.10 totalfmt=f15.4;
    run;

The wsum and total statistics requested in the output statement are the population size estimate (which is considered not subject to sampling variability) and the weighted total estimate in the Table cell. Note that the setotal statistic was not requested, as this SE is not appropriate.
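The reason for keeping 10 decimal places on the SE of the mean can be illustrated with simple arithmetic: the SE is later multiplied by a domain size in the millions, so any rounding error is scaled up by the same factor. The domain size and SE below are hypothetical values chosen only to show the order of magnitude.

```python
domain_size = 130_000_000.0  # hypothetical weighted domain size (wsum)
se_mean = 0.0012345678       # hypothetical SE of a mean (semean)

exact = domain_size * se_mean

# Error in SE(total) when the SE of the mean is rounded before multiplying.
rounding_error = {}
for digits in (4, 6, 10):
    rounded = domain_size * round(se_mean, digits)
    rounding_error[digits] = abs(rounded - exact)
    print(f"{digits:>2} decimals: SE(total) off by {rounding_error[digits]:,.1f} persons")
```

With these made-up numbers, rounding to four decimals shifts the SE of the total by thousands of persons, while ten decimals leaves it essentially unchanged.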

    /* Using SAS, compute: SE(total) = Domain_size * SE(mean) */
    data work.sud_result1;
    set work.sud_result1;
    if semean gt 0.0 then setotal = wsum * semean;
    run;
    proc print data= work.sud_result1;
    run;

  • Stata sample code:

This program requires some familiarity with element-wise vector and matrix computation.

    use drivename:\path\statadatafilename.dta
    svyset VEREP[pweight=ANALWT_C], strata(VESTR)  singleunit(centered)
    save drivename:\path\mystatadata.dta, replace

    svy: mean rsk1 rsk2 rsk3 rsk4, over (IRSEX)
    * extract the variance-covariance matrix of the means
    matrix vcov = e(V)
    matrix var = vecdiag(vcov)
    * extract the weighted sizes (denominator totals of the means), i.e., the domain sizes
    matrix wsum = e(_N_subp)
    matrix setotal = J(1,8,0)
    * compute the SE: setotal = wsum * se(mean) for each of the 8 total estimates
    local j=1
    while `j'<=8 {
    matrix setotal[1,`j'] = wsum[1,`j']*sqrt(var[1,`j'])
    local j = `j' + 1
    }
    svy: total rsk1 rsk2 rsk3 rsk4, over (IRSEX)
    matrix totals = e(b)
    matrix result = (wsum', totals', setotal')
    matrix colnames result = Domain_size  Total  SEtotal
    matrix list result

  • SAS sample code:

    proc surveymeans data= work.nsduh11 nobs nmiss mean sum sumwgt NOMCAR;

    strata  vestr; cluster  verep;  weight  analwt_c;
    domain irsex;
    class rskpkcig;
    var  rskpkcig;
    ods output domain = mydomain;

    run;

    data work.myresult(keep=IRSEX RSKPKCIG N Nmiss Domain_wgt_size Totals SE_totals);

    set work.mydomain;
    SE_totals = sumwgt * stdErr;
    rename VarLevel=RSKPKCIG sum = Totals  sumwgt = Domain_wgt_size;

    run;

  • R sample code:

    load("folder-path/mydat.rda")     #2011 NSDUH PUF
    # save the loaded data under the new name "mydat"
    # RSKPKCIG and IRSEX are factor variables; we need to make them numeric.
    library(prettyR)   # requires to install the prettyR package
    mydat$RSKPKCIG<- as.numeric(sub("^\\(0*([0-9]+)\\).+$", "\\1", mydat$RSKPKCIG))
    mydat$IRSEX <- as.numeric(sub("^\\(0*([0-9]+)\\).+$", "\\1", mydat$IRSEX))

    # compute the totals and SEs of totals [for controlled domains]
    # formula : total_hat = N * p_hat, so its SE is SE(total_hat) = N * SE(p_hat)
    totals1=aggregate(mydat$ANALWT_C,by=list(mydat$RSKPKCIG, mydat$IRSEX), FUN=sum, na.rm=TRUE)

    library(reshape)   # the reshape package provides the rename() function
    # c(oldVARname="newVARname")
    totals2 <- rename(totals1, c(Group.1="RSKPKCIG", Group.2="IRSEX",x="totals")) 
    # totals2 has no missing values
    domsize1=aggregate(totals2$totals,by=list(totals2$IRSEX), FUN=sum) 
    domsize2 <- rename(domsize1, c(Group.1="IRSEX",x="domain_size"))  
    # merging data frames
    domainSize = merge(domsize2, totals2, by=c("IRSEX"))

    require(survey)   # requires to install the survey package
    options( survey.lonely.psu = "adjust" )
    desg <- svydesign(id = ~VEREP , strata = ~VESTR , weights = ~ANALWT_C , data = mydat, nest = TRUE )

    # obtaining SEs of the proportions of the categorical variable, RSKPKCIG by IRSEX
    out = SE(svyby(~factor(RSKPKCIG), ~IRSEX, design = desg, FUN=svymean, vartype=c("se"), na.rm = TRUE) )

    library(reshape2)       #first install reshape2 and then use the library() command
    #reshape long to wide
    long2wide = dcast(domainSize, IRSEX ~ RSKPKCIG, value.var="domain_size")
    long2wide$IRSEX = NULL
    N = as.matrix(long2wide)        # N = fixed (domain) sizes
    SE_p_bar = as.matrix(out)
    se_totals = N * SE_p_bar                     # elementwise matrix multiplication
    SE_totals = as.vector(t(se_totals))     # Row-wise vectorize   [row vector is a vector, column vector is a matrix]
    myresult=cbind(domainSize, as.data.frame (SE_totals))
    # re-arranging the variables
    myresult = myresult[c("IRSEX", "RSKPKCIG", "domain_size", "totals", "SE_totals")]

    print(myresult)

  • A related FAQ, How do I calculate variance of difference between totals for NSDUH data?, may provide additional useful information.



    1 See Section 5 in the 2012 NSDUH Statistical Inference Report.
