Monday, October 21, 2013

How can I access the HBSC series? I can no longer find HBSC on the SAMHDA site.

The HBSC series has been transferred from SAMHDA to the National Addiction & HIV Data Archive Program (NAHDAP). All HBSC data and documentation files can be accessed through the NAHDAP and ICPSR General Archive websites. We apologize for any inconvenience this transition may have caused.

Thursday, September 19, 2013

Do I have to be concerned about disclosure when using the NSDUH data?

The NSDUH data provided through SAMHDA by the Center for Behavioral Health Statistics and Quality (CBHSQ) are to be used for research and statistical purposes only. The data must not be used to identify a respondent. To reduce the risk of respondent identification, CBHSQ uses a number of disclosure limitation methods on the NSDUH data. For published estimates, no further disclosure limitation methods need to be applied.

The public-use files and the corresponding estimates from the SAMHDA online analysis system (SDA) also have disclosure limitation steps applied to the data; therefore, no further steps need to be taken by the data user. For details on the disclosure limitation methodology used, please refer to the introductory text in the codebook for a given year.

The R-DAS data files have additional disclosure limitation protections applied to them. Tables are only produced when certain minimum cell size and other criteria are met for all cells. The output is also limited to weighted estimates (rounded to the nearest thousand) and no unweighted sample sizes are produced. Therefore, a user does not have to be concerned with disclosure if R-DAS produces a table for the user.

Beyond a finite set of sample size tables released to the public, CBHSQ does not make detailed sample sizes publicly available. This policy is intended to minimize potential disclosure risk. CBHSQ requires unweighted sample size numbers to be rounded to the nearest hundred when these numbers are generated from restricted-use data files.

Data Portal users must also round sample size numbers to the nearest hundred prior to using information outside of the Data Portal.
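For example, in R (the language used for other examples on this site), rounding to the nearest hundred can be done with a negative digits argument to round(); the object name n_unweighted below is purely illustrative:

# hypothetical unweighted sample sizes from a restricted-use analysis
n_unweighted <- c(1234, 86, 4571)
# round to the nearest hundred before reporting outside the Data Portal
round(n_unweighted, digits = -2)
# returns 1200  100 4600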

Wednesday, August 14, 2013

What are the technical details on the complex sample design for DAWN?

Primary sampling units (PSUs) are hospitals within strata, and secondary sampling units (SSUs) are records of emergency department (ED) visits within PSUs. Hospitals selected with probability equal to one in the first stage of sampling are "certainty hospitals"; in those strata, every hospital on the frame is selected. The sampling fraction for stratum h is fh = nh/Nh, where nh is the number of sampled hospitals in the stratum and Nh is the corresponding population frame count given in the variable PSUFRAME. Sampling in the remaining strata was without replacement (WOR) from finite populations. For certainty strata, fh = 1, so the finite population correction factor (1 - fh) is zero and those strata contribute no variance at the first stage of sampling. The ED visit records of certainty hospitals were still randomly sampled rather than completely enumerated. To account for the within-hospital variation in ED visits, the DAWN PUF provides the additional design variable REPLICATE for the second stage of sampling, which is required for correct statistical inference. In sum, each stratum has at least two hospitals (PSUs), each hospital has exactly two replicates (SSUs), and each replicate contains numerous ED visit records.

There are some issues with variance estimation using the Taylor method, and with the calculation of degrees of freedom, that should be noted. The SAS, SPSS, Stata, SUDAAN, and R software packages calculate the variance contribution for each stage of the design from the deviations between each unit's value (i.e., total) and the mean of all units' values within that stage. ("Unit" refers to the PSU at the first sampling stage and the SSU at the second.) There are no single-unit (singleton) strata in the full DAWN PUF data, but certain analyses may encounter a singleton stratum while calculating the variance for a domain (also called a subclass, subgroup, or subpopulation). A singleton stratum occurs when a single unit (PSU or SSU) in a stratum has at least one observation and the other units in that stratum have none.

Units with no observations are handled differently by the different software packages when calculating the variance and degrees of freedom. The MISSUNIT option in SUDAAN, the singleunit(centered) option in Stata, and options(survey.lonely.psu = "adjust") in R handle such cases by calculating the variance contribution of a singleton stratum from the deviation between that unit's total and the grand mean of the sample. By default, SPSS assumes that at least one other unit was present at that stage in the sample (if not, the stratum contributes zero variance); units with no observations (sampling zeros) are then given unit totals of 0 and do contribute to the stratum variance.

An analysis can also encounter strata with no observations at all (empty strata), which users may experience in domain analysis. The question is how the software packages account for such strata in the overall variance and degrees of freedom. SUDAAN assumes that strata with no observed cases in the domain were nonetheless part of the sample selection; SUDAAN and R treat the missing units as sampling zeros. Each empty stratum therefore contributes zero variance to the overall variance but still contributes to the degrees of freedom. Stata does the same by default, but certain options, such as singleunit(centered), depart from this assumption of sampling zeros and treat empty strata as structural zeros. The logic is that when a stratum has no cases at all, it should be assumed not to be part of the sampling for that domain and should contribute nothing to the overall variance. In Stata, the degrees of freedom determined with the singleunit(centered) option are therefore smaller than those obtained by the default method when the domain does not appear in every stratum. Note that, with the options stated above, the variance estimates from SUDAAN, Stata, and R are always the same whether the assumption of sampling zeros is retained or not, but the packages can produce different degrees of freedom.

The calculation of degrees of freedom (df) is crucial for all of these software packages and influences inferential statistics such as confidence intervals and the p-values of test statistics. Conventionally, df is calculated by the fixed-PSU method: the "fixed" df is the number of PSUs minus the number of strata at the first stage of the sample design, regardless of the number of sampling stages (i.e., computed from the full data file). SPSS and R always use this fixed-PSU method in all analyses. It is also the default in SUDAAN, but users can supply a predetermined df with the DDF= option. The fixed-PSU method is likewise the default in Stata, but certain options cause Stata to calculate an alternate df by the variable-PSU method. For example, with the singleunit(centered) option Stata uses the variable-PSU method, in which the "variable" df is the number of non-empty PSUs minus the number of non-empty strata; the number of non-empty PSUs is the number of PSUs in the sample minus the number of PSUs with no observations in singleton strata. Users can manually calculate the "variable" df for a domain analysis and specify it in SUDAAN with the DDF=df option in the PROC statement, or specify the "design" df in Stata with svy, dof(df): , in order to compare inferential statistics across software packages.
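As an illustration of the fixed-PSU method, the "fixed" df can be computed directly from the design variables. The following R sketch is illustrative only; it assumes a data frame mydat containing the DAWN design variables STRATA and PSU and a survey design object desg, as constructed in the DAWN analysis FAQ below:

# fixed-PSU degrees of freedom: (number of PSUs) - (number of strata),
# computed from the full file regardless of any domain restriction
n_psu    <- nrow(unique(mydat[, c("STRATA", "PSU")]))
n_strata <- length(unique(mydat$STRATA))
fixed_df <- n_psu - n_strata
# the survey package reports the same quantity for a design object:
# library(survey); degf(desg)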

SAMHDA's online data analysis system (SDA) calculates a slightly different but appropriate df: the number of PSUs in non-empty strata minus the number of non-empty strata. SAS, SPSS, and SDA handle singleton strata in almost the same way. Note that SAS and SDA can only take into account the first-stage sampling design effects. DAWN data in SDA use a modified (pseudo) single-stage stratified cluster sample that was prepared for compatibility with SDA's complex survey data analysis capability.
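The domain-specific df used by SDA can be sketched the same way. In the hypothetical R snippet below, the domain is defined by RACE == 3 purely for illustration, and mydat is the same data frame assumed above:

# strata that contain at least one domain observation ("non-empty" strata)
dom <- mydat[mydat$RACE == 3, ]
nonempty_strata <- unique(dom$STRATA)
# all sample PSUs that belong to those non-empty strata
psu_list <- unique(mydat[, c("STRATA", "PSU")])
n_psu_nonempty <- sum(psu_list$STRATA %in% nonempty_strata)
# SDA-style df: PSUs in non-empty strata minus the number of non-empty strata
sda_df <- n_psu_nonempty - length(nonempty_strata)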

For related technical information, please see the FAQ: Accounting for the effects of complex sampling design (design effects) when analyzing DAWN data.

Friday, May 17, 2013

How do I account for effects of complex sampling design (design effects) when analyzing DAWN data?

The DAWN (Drug Abuse Warning Network) employs a two-stage (stratified cluster) sample design for the selection of hospital emergency department (ED) visits caused or contributed to by drugs. The DAWN public-use file (PUF) includes the following complex design variables: variance estimation stratum (STRATA), primary sampling unit (PSU), secondary sampling unit (REPLICATE), PSU frame size (PSUFRAME), and analysis case weight (CASEWGT). The DAWN PUF has no missing values in the design variables STRATA, PSU, PSUFRAME, REPLICATE, and CASEWGT; however, analysis variables can have missing values.

The default method for estimating standard errors in SAS, SPSS, SUDAAN, Stata, and R is Taylor series linearization, but SAS can only account for the variance contribution from the first stage; SAS is not currently able to fully and properly account for the DAWN sampling design.

Example code/syntax specific to each statistical software package is given below using the DAWN 2010 PUF. The examples estimate the proportions (or means) of alcohol-related ED visits, along with their standard errors (SE) and confidence intervals, by racial group. In short, the statistical analysis plan (SAP) is to obtain the mean and its SE along with confidence intervals and other related statistics. Note that the variable ALCOHOL is coded 0/1; thus, the mean of ALCOHOL is simply the proportion of visits with ALCOHOL=1.

Users should consult the documentation of their statistical package regarding how missing values are handled if any exist in the analysis variables.

The following SUDAAN stand-alone code uses all of the design variables provided in the PUF for appropriate calculation of the variance of means/proportions.

Proc descript design=WOR filetype=sasxport data="folder-path\dawn2010.xpt" notsorted;
  nest STRATA PSU REPLICATE / MISSUNIT;
  totcnt PSUFRAME _minus1_ _zero_;
  weight CASEWGT;
  class RACE;
  table RACE;
  var ALCOHOL;
  print mean semean lowmean upmean
    /style=nchs meanfmt=f10.8 semeanfmt=f10.8 lowmeanfmt=f10.8 upmeanfmt=f10.8;

The SPSS-specific code for the same analysis is:

Get file='path\dawn2010puf.sav'.
CSPLAN ANALYSIS
/PLAN FILE='folder-path\dawn_stage2.csplan'
/PLANVARS ANALYSISWEIGHT= casewgt
/DESIGN STRATA= strata CLUSTER= psu
/ESTIMATOR TYPE = EQUAL_WOR
/POPSIZE VARIABLE=psuframe
/DESIGN CLUSTER=REPLICATE
/ESTIMATOR TYPE = WR.

CSDESCRIPTIVES
/PLAN FILE= 'folder-path\dawn_stage2.csplan'
/SUMMARY VARIABLES= alcohol
/SUBPOP TABLE= race
/MEAN
/STATISTICS SE CIN(95)
/MISSING SCOPE=ANALYSIS CLASSMISSING=EXCLUDE.

The Stata-specific code for the same analysis is (note that Stata commands are case sensitive):

use drivename:\path\statadatafilename.dta

gen STRATA2 = PSU    // svydescribe requires distinct names for the design variables used in different stages
svyset PSU [pweight=CASEWGT], strata(STRATA) fpc(PSUFRAME) || REPLICATE, strata(STRATA2) ///
    singleunit(centered)
save drivename:\path\mystatadata.dta, replace

svy: mean ALCOHOL, over(RACE)
estat strata
estat effects, deff

The SAS code for this analysis, which disregards the second-stage variance (treating the first stage as sampled with replacement and ignoring the finite population correction), is:

proc surveymeans data=libname.dataSetName mean clm nomcar;
strata strata;
cluster psu;
weight casewgt;

domain race;
var alcohol;
run;

The R-specific code for the same analysis is:

load("input-path-of-folder/34083-0001-Data.rda")  #ICPSR DAWN PUF 2010 (R data)
#da31921.001.rda is created in R console from this load command
dawn9 = da31921.0001 
rm(da31921.0001)    # to free up RAM, remove the full r data frame
keepvars = c("STRATA", "PSU", "REPLICATE", "PSUFRAME", "CASEWGT", "RACE" , "ALCOHOL")
mydat = dawn9[, keepvars]           # data frame with the subset of variables

# ALCOHOL and RACE are factor variables with labels like "(1) ..."; convert them to numeric codes
library(prettyR)   # load the prettyR package (it must already be installed)
mydat$ALCOHOL <- as.numeric(sub("^\\(0*([0-9]+)\\).+$", "\\1", mydat$ALCOHOL))   # keep only the numeric code
mydat$RACE <- as.numeric(sub("^\\(0*([0-9]+)\\).+$", "\\1", mydat$RACE))
# to prepare the fpc variable for the 1st stage of the 2-stage variance estimation design

strpsu = unique(mydat[,c("STRATA","PSU")])
strpsu$one = 1
strpsu=aggregate(strpsu$one,by=list(strpsu$STRATA), FUN=sum, na.rm=TRUE)
library(reshape)   # for the rename() function (package must be installed)
strsample = rename(strpsu, c(Group.1="STRATA", x="PSUsample"))
strframe = unique(mydat[,c("STRATA","PSUFRAME")])
strframe = strframe[order(strframe$STRATA),]
str = merge(strframe, strsample, by=c("STRATA"))
str = transform( str, n.over.N = PSUsample / PSUFRAME )
str <- subset(str, select=-c(PSUFRAME))

mydat = merge(mydat, str, by=c("STRATA"))
rm(str, strframe, strsample, strpsu)
# survey design for the Taylor-series linearization method
library(survey)  # load the survey package (it must already be installed)
options(survey.lonely.psu = "adjust" )
# create a survey design object (desg) with the DAWN design information
mydat$zero=0    # for 2nd stage fpc
desg <- svydesign(id =~PSU + REPLICATE , strata =~STRATA + PSU, fpc =~n.over.N + zero, weights =~CASEWGT, data = mydat ,  nest = TRUE )
# to calculate the means or proportions of ALCOHOL by RACE:
out = svyby(~ALCOHOL, ~RACE, design = desg , FUN=svymean, vartype=c("se","ci"), na.rm = TRUE)
print(out)   # confidence intervals for 95%

Note that the design variables (except the sampling weight variable) do not affect first-order moment statistics (such as mean, proportion, percent, totals and weighted counts). However, the design variables must be used to produce the SE estimates of descriptive and inferential statistics.
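Continuing the R example above (and assuming the mydat and desg objects created there), this can be checked directly:

# the design-based estimate and the simple weighted mean give the same point estimate
svymean(~ALCOHOL, desg, na.rm = TRUE)
with(mydat, weighted.mean(ALCOHOL, CASEWGT, na.rm = TRUE))
# but only the design-based estimate carries a valid standard error:
SE(svymean(~ALCOHOL, desg, na.rm = TRUE))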

The target population of hospitals was divided into disjoint strata (denoted STRATA) such that every hospital in the population belongs to exactly one stratum. Hospitals are the PSUs. In the first stage of sampling, hospitals were selected within each stratum by simple random sampling (SRS) without replacement from a finite population. The stratum size (denoted PSUFRAME) is the population count of hospitals (i.e., PSUs) contained in the stratum. Depending on the year and the size of the hospital, either all REPLICATEs were selected, which resulted in no second-stage variance contribution for that hospital, or second-stage sampling at rates as low as 1/3 was performed using systematic random sampling. This resulted in sample files averaging about 300,000 visits per year.

Tabulation and analysis of these large sample files proved computationally very expensive when many variables were involved, because of the time needed to calculate exact estimates of the second-stage variance. In multi-stage sampling, the greatest portion of the variability comes from the first stage, so the second-stage contribution is usually a small fraction of the first-stage variance. Because unacceptable amounts of time were required to estimate what is a small fraction of the total variance, an approximation method was adopted that makes variance estimation practical for large tables with many variables: the sampled visits within each hospital were sorted into two groups, and the between-group variance was calculated and substituted for the standard second-stage variance. DAWN research simulations showed that this approximation is sufficient for DAWN precision and efficient for DAWN standard error calculation.

Note that STRATA is the first-stage stratification variable and PSU serves as the second-stage stratification variable in the sample. Proper use of the REPLICATE variable accounts for within-hospital variation, i.e., the second-stage variance. The final analysis weight variable, CASEWGT, was adjusted for unequal probabilities of selection, for hospital nonresponse, and for coverage bias of ED visits relative to the benchmark population totals of ED visits from the American Hospital Association (AHA) database.

CASEWGT is a poststratification adjusted analysis weight variable. By using CASEWGT, the weighted estimates of counts or totals obtained for any of the survey variables are the estimates for the entire universe of DAWN-eligible hospitals in the United States. Poststratification adjustments were implemented within the design strata to offset the coverage bias on the estimates. That means post-strata were not constructed from any of the ED visit characteristic variables. Therefore, the software packages correctly take into account the design effects on the variance estimates of estimated counts or totals for variables related to DAWN visits. For further details on sample design and weighting adjustment, please see the DAWN Methodology Report.

For additional technical information on options and how the various statistical software handle issues with missing values and degrees of freedom, please see the FAQ: Technical issues of sampling design analysis of DAWN data.

Wednesday, April 10, 2013

How do I perform a homogeneity test of proportions or percentages in the R-DAS?

The R-DAS does not have the Comparison of Means analysis available. However, the Frequencies/Crosstabulation program has a Summary Statistics option that performs a test of independence (or no association) between two categorical variables using the Rao-Scott F statistics. These statistics take the complex design effect into account. The test of independence of a two-way contingency table is equivalent to the test of homogeneity of row (or column) percents (StataCorp 2011, pp. 141-142). The null hypothesis for the latter test is that the row (or column) percents are equal for every category of the column (or row) variable.

For example, if your variable of interest is level of alcohol consumption placed in the Column field, you can use the row percentage option and the resulting table output to assess, approximately, the homogeneity (or lack of homogeneity) of the row (ethnic) groups across the levels of alcohol consumption. In other words, this test determines whether the distributions of the ethnic groups across the alcohol levels are equal.

The Rao-Scott F statistics are calculated from the contingency table for the Row by Column variables. The test is significant at the x% level of significance if the p-value of the Rao-Scott F statistic is less than x%; overall, the test then concludes that there is an association (dependence) between the Row and Column variables. The first screenshot shows the RACE4 x ALCREC (recoded) table output for Total percent with the Summary Statistics box checked. From this display of cell percents (i.e., total percents), confidence intervals, and weighted cell frequencies, it is difficult to compare the prevalence of alcohol use across racial groups. To interpret the output with regard to the test of homogeneity, it helps to display the table differently. The second screenshot shows the contingency table output for Row percent: a larger percentage (63.7%) of whites had consumed alcohol within the last 30 days than blacks, others, or Hispanics. Since only the way the percentages are displayed has changed, the Rao-Scott F statistic is identical for both screenshots.
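R-DAS computes these statistics internally, so no additional software is needed. For users who want to run the same type of design-adjusted test outside R-DAS, the R survey package offers an analogous calculation; the sketch below is illustrative only and assumes a survey design object desg and two categorical variables named RACE4 and ALCREC:

library(survey)
# Rao-Scott test of independence between two categorical variables;
# the default statistic = "F" gives the second-order Rao-Scott correction reported as an F statistic
svychisq(~RACE4 + ALCREC, design = desg, statistic = "F")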

Reference:

StataCorp. 2011. Stata Survey Data Reference Manual, Release 12. Statistical Software. College Station, TX: StataCorp LP.

Monday, April 8, 2013

Is there a way to compare multiple means using the MEANS analytic option in SDA?

Yes. There are ways to compare multiple means (k-1 comparisons) using the Comparison of Means Program in SDA. Note that if a dependent variable is coded as 0/1, then the mean of the dependent variable is essentially the proportion.

In the Means Program, there are several choices in the "Main statistic to display" dropdown. The default is to display the means of the dependent variable for the categories of the required Row variable; the Row variable categories define the domains for subpopulation analysis. Selecting the "differences from Row category" option allows you to choose a base category in the "If differences from a row or column, indicate base category" box. When you run the analysis with this selection, each of the other row cells shows the difference between that cell's mean and the base Row category's mean. This selection, together with the z/t-statistic and p-value options, produces the (k-1) comparisons of means or proportions and their associated statistics from a single Comparison of Means run. This is a comparison of domain means for an outcome variable, where the domains are defined by the categories of a Row-only variable.

NOTES:

  1. In comparison of means testing, there are k(k-1)/2 differences of means or proportions to compare among k domains (subpopulations or subgroups). In a single table run, the Means program simultaneously tests the differences of (k-1) pairs of means from a base category mean of a k-category classification variable. You can therefore obtain all k(k-1)/2 distinct comparison tests by choosing a different base category in each of (k-1) separate Means program runs. For a Row variable with k=4 levels (say, A, B, C, and D), you will obtain 6 tests of differences of means (i.e., B-A, C-A, D-A with base A; C-B, D-B with base B; and D-C with base C) from 3 (=4-1) separate MEANS program runs using A, B, and C as the base categories, respectively.
  2. The computation in the R-DAS analysis system is equivalent to the Frequencies/Crosstabulation program module in SDA. It is not possible to perform a (k-1) comparison of means or proportions using the R-DAS or the SDA Frequencies/Crosstabulation program, although it is possible to perform a test of homogeneity of row (or column) percents for a two-way table. There is a separate FAQ: How do I perform a homogeneity test of proportions or percentages in the R-DAS?

Additional information on using the "Main statistic to display" option in the MEANS program can be found here:  http://www.icpsr.umich.edu/SDAHELP/helpan.htm#mstats
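Outside SDA, an analogous set of (k-1) base-category comparisons can be produced with the R survey package by fitting a design-based linear model with a factor covariate: each coefficient is then the difference between that category's mean and the base category's mean, with a design-based SE, t statistic, and p-value. The sketch below is illustrative only; the design object desg, the 0/1 outcome ALCOHOL, and the classification variable RACE are assumed names, not taken from SDA itself:

library(survey)
# coefficients on factor(RACE) are differences from the reference (base) category
fit <- svyglm(ALCOHOL ~ factor(RACE), design = desg)
summary(fit)
# choosing a different base category is equivalent to releveling the factor:
fit2 <- svyglm(ALCOHOL ~ relevel(factor(RACE), ref = "2"), design = desg)
summary(fit2)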

Monday, March 11, 2013

How do I use setup files to import plain text (ASCII) data?

Many SAMHDA data collections that contain ASCII data files are accompanied by setup files that allow users to read the text files into statistical software packages. Because raw ASCII data files are difficult to interpret visually, statistical software is needed to define, manipulate, extract, and analyze the variables and cases within them. SAMHDA currently provides setup files for the SAS, SPSS, and Stata statistical software packages.

Utilizing Setup Files

ICPSR has prepared tutorials on how to analyze data using setup files:

  • ASCII Data File + SAS Setup Files: PDF PPT

  • ASCII Data File + SPSS Setup Files: PDF PPT

  • ASCII Data File + Stata Setup Files: PDF PPT

You can find video tutorials addressing this topic on the ICPSR YouTube channel.
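If you work in a package for which no setup file is provided, fixed-width ASCII data can also be read directly once the column positions are known from the codebook or setup file. The following R sketch is illustrative only; the file name, widths, and variable names are hypothetical placeholders rather than values from an actual SAMHDA study:

# read a fixed-width ASCII file using column positions taken from the codebook/setup file
dat <- read.fwf("path/to/study-data.txt",
                widths = c(2, 1, 3),                      # column widths from the setup file
                col.names = c("AGE", "SEX", "WEIGHT"))    # illustrative variable names
str(dat)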

Troubleshooting Setup Files

Many statistical packages will not run a setup file unless you reset the Windows default setting that hides file extensions.

Resetting File Extensions

In Windows 7 and Windows XP, file extensions are hidden by default; however, SAS, SPSS and Stata need to see the file extension to run setup files. ICPSR has created a short tutorial to show you how to reset the file extension option in Windows, so your setup files will run properly.

You can find video tutorials addressing this topic on the ICPSR YouTube channel.