Monday, October 21, 2013

How can I access the HBSC series? I can no longer find HBSC on the SAMHDA site.

The HBSC series has been transferred from SAMHDA to the National Addiction & HIV Data Archive Program (NAHDAP). All HBSC data and documentation files can be accessed through the NAHDAP and ICPSR General Archive websites. We apologize for any inconvenience this transition may have caused.

Thursday, September 19, 2013

Do I have to be concerned about disclosure when using the NSDUH data?

The NSDUH data provided through SAMHDA by the Center for Behavioral Health Statistics and Quality (CBHSQ) are to be used for research and statistical purposes only. The data must not be used to identify a respondent. To reduce the risk of respondent identification, CBHSQ uses a number of disclosure limitation methods on the NSDUH data. For published estimates, no further disclosure limitation methods need to be applied.

The public-use files and the corresponding estimates from the SAMHDA online analysis system (SDA) also have disclosure limitation steps applied, so no further steps need to be taken by the data user. For details on the disclosure limitation methodology used, please refer to the introductory text in the codebook for a given year.

The R-DAS data files have additional disclosure limitation protections applied to them. Tables are produced only when certain minimum cell size and other criteria are met for all cells. The output is also limited to weighted estimates (rounded to the nearest thousand), and no unweighted sample sizes are produced. Therefore, users do not need to take additional disclosure precautions when the R-DAS produces a table.

Beyond a finite set of sample size tables released to the public, CBHSQ does not make detailed sample sizes available. This policy is intended to minimize potential disclosure risk. CBHSQ requires unweighted sample sizes to be rounded to the nearest hundred when they are generated from restricted-use data files.

Data Portal users must also round sample size numbers to the nearest hundred prior to using information outside of the Data Portal.
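For example, rounding an unweighted sample size to the nearest hundred can be done in one line of R (a minimal illustration; the count shown here is made up):

round(2347, digits = -2)   # returns 2300, the count rounded to the nearest hundred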

Wednesday, August 14, 2013

What are the technical details on the complex sample design for DAWN?

Primary sampling units (PSUs) are hospitals within strata, and secondary sampling units (SSUs) are records of emergency department (ED) visits within PSUs. Hospitals selected with probability equal to one in the first stage of sampling are "certainty hospitals"; in these strata, every hospital on the frame is selected. Because first-stage sampling was without replacement (WOR) from finite populations, the finite population correction factor (1 - fh) is zero for strata made up of certainty hospitals, so those strata contribute no variance at the first stage of sampling. Here fh = nh/Nh, where nh is the number of sampled hospitals in stratum h and Nh is the corresponding population (frame) count given in the variable PSUFRAME. The ED visit records of certainty hospitals were randomly sampled, i.e., visits were not a complete enumeration. To account for the within-hospital variation in ED visits, the DAWN PUF provides an additional design variable, REPLICATE, for the second stage of sampling; it is required for correct statistical inference. In sum, each stratum has at least 2 hospitals (PSUs), each hospital has exactly two replicates (SSUs), and each replicate contains numerous ED visit records.
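As an illustration, the stratum sampling fractions and the certainty strata can be derived directly from the PUF design variables. The following is a minimal R sketch, assuming the PUF has been loaded into a data frame named dawn containing the STRATA, PSU, and PSUFRAME variables:

strpsu <- unique(dawn[, c("STRATA", "PSU")])                 # distinct stratum-hospital pairs in the sample
n_h <- aggregate(PSU ~ STRATA, data = strpsu, FUN = length)  # sampled hospitals per stratum
names(n_h)[2] <- "n_h"
N_h <- unique(dawn[, c("STRATA", "PSUFRAME")])               # frame count N_h per stratum
fpc <- merge(n_h, N_h, by = "STRATA")
fpc$f_h <- fpc$n_h / fpc$PSUFRAME                            # sampling fraction f_h = n_h / N_h
fpc$certainty <- fpc$f_h == 1                                # (1 - f_h) = 0: no first-stage variance contribution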

There are some issues with variance estimation under the Taylor series method, and with the calculation of degrees of freedom, that should be noted. The SAS, SPSS, Stata, SUDAAN, and R software packages calculate the variance contribution for each stage of the design using the deviations between each unit's value (i.e., total) and the mean of all units' values within that stage. (Unit refers to the PSU at the first sampling stage and the SSU at the second.) There are no single-unit (singleton) strata in the DAWN PUF itself, but certain analyses may encounter singleton strata while calculating the variance for a domain (subclass, subgroup, or subpopulation). A singleton stratum occurs when a single unit (PSU or SSU) has at least one observation and the other units in that stratum have none. The software packages handle units with no observations in different ways when calculating the variance and degrees of freedom. The MISSUNIT option in SUDAAN, the singleunit(centered) option in Stata, and options(survey.lonely.psu = "adjust") in R handle such cases by calculating the variance contribution for a singleton stratum from the deviation between that unit's total and the grand mean of the sample. By default, SPSS assumes that at least one other unit was present at that stage of the sample (if not, the stratum contributes zero variance); units with no observations (sampling zeros) therefore have unit totals of 0 and do contribute to the stratum variance.

An analysis can also encounter strata with no observations at all (empty strata), typically in domain analysis. The question is how the software packages account for such strata in the overall variance and the degrees of freedom. SUDAAN treats strata that are empty for a given domain as part of the sample selection; SUDAAN and R treat the missing units as sampling zeros, so each empty stratum contributes zero variance to the overall variance but still counts toward the degrees of freedom. Stata does the same by default, but certain Stata options, for instance singleunit(centered), depart from this assumption and treat empty strata as structural zeros. The logic is that when a stratum has no cases at all, it should be regarded as not part of the sampling for that domain and should contribute nothing to the overall variance. In Stata, the degrees of freedom obtained with the singleunit(centered) option are therefore smaller than those obtained by the default method in instances where the domain is not represented in every stratum. Note that, with the options stated above, the variance estimates from SUDAAN, Stata, and R are always the same whether the sampling-zero assumption is retained or not, but the packages produce different degrees of freedom.
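In R, the singleton-stratum adjustment referenced above is controlled through package options. A minimal sketch follows; the survey.adjust.domain.lonely setting is our assumption of how to extend the same adjustment to domain estimates, so verify it against the documentation of your version of the survey package:

library(survey)                                  # load the survey package
options(survey.lonely.psu = "adjust")            # center a lone PSU's deviation at the grand mean
options(survey.adjust.domain.lonely = TRUE)      # apply the same adjustment in domain analyses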

The calculation of degrees of freedom (df) is crucial for all of these software packages and influences inferential statistics such as confidence intervals and the p-values of test statistics. Conventionally, the df is calculated by the fixed-PSU method: the 'fixed' df is the number of PSUs minus the number of strata at the first stage of the design, computed from the full data file regardless of the number of sampling stages. SPSS and R always use this fixed-PSU method in all analyses. It is also the default in SUDAAN, but users can supply a predetermined df with the DDF= option. The fixed-PSU method is likewise the default in Stata, but certain options cause Stata procedures to calculate an alternate df by the variable-PSU method. For example, with the singleunit(centered) option Stata calculates the 'variable' df as the number of non-empty PSUs minus the number of non-empty strata, where the number of non-empty PSUs is the number of PSUs in the sample MINUS the number of PSUs with no observations in all singleton strata. Users can manually calculate the 'variable' df for a domain analysis and specify it in SUDAAN with the DDF=df option on the PROC statement, or specify the design df in Stata with svy, dof(df):, in order to compare inferential statistics across software packages.
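For a quick check of the fixed-PSU degrees of freedom in R, the survey package provides degf(), which returns the number of first-stage PSUs minus the number of strata for a design object. A minimal sketch, assuming the svydesign object desg created in the DAWN design-effects FAQ below:

library(survey)
degf(desg)   # 'fixed' df = number of PSUs - number of strata from the full design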

SAMHDA's online data analysis system (SDA) calculates a slightly different but appropriate df: the number of PSUs in non-empty strata minus the number of non-empty strata. SAS, SPSS, and SDA handle singleton strata in much the same way. Note that SAS and SDA can only take into account the first-stage sampling design effects. The DAWN data in SDA use a modified (pseudo) single-stage stratified cluster sample that was prepared for compatibility with SDA's complex survey data analysis capability.

For related technical information, please see the FAQ: Accounting for the effects of complex sampling design (design effects) when analyzing DAWN data.

Friday, May 17, 2013

How do I account for effects of complex sampling design (design effects) when analyzing DAWN data?

The DAWN (Drug Abuse Warning Network) employs a two-stage (stratified cluster) sample design for the selection of hospital emergency department (ED) visits caused or contributed to by drugs. The DAWN public-use file (PUF) includes the following complex design variables: variance estimation stratum (STRATA), primary sampling unit (PSU), secondary sampling unit (REPLICATE), PSU frame size (PSUFRAME), and analysis case weight (CASEWGT). The DAWN PUF has no missing values in the design variables STRATA, PSU, PSUFRAME, REPLICATE, and CASEWGT; however, analysis variables can have missing values.

The default method for estimating standard errors in SAS, SPSS, SUDAAN, Stata, and R is Taylor series linearization, but SAS can only account for the variance contribution from the first stage. SAS is not currently able to fully and properly account for the DAWN sampling design.

Example code/syntax specific to each statistical software package is given below using the DAWN 2010 PUF. The examples estimate the proportions (or means) of alcohol-related ED visits, along with the standard errors (SE) and confidence intervals, by racial group. In short, the statistical analysis plan (SAP) is to obtain the mean and its SE along with confidence intervals and other related statistics. Note that the variable ALCOHOL is coded 0/1; thus, the mean of ALCOHOL is simply the proportion of cases with ALCOHOL=1.

Users should read the help documentation of their respective statistical package regarding how missing values are handled, if any exist in the analysis variables.

The following SUDAAN stand-alone code uses all of the design variables provided in the PUF for appropriate calculation of the variance of means/proportions.

Proc descript design=WOR filetype=sasxport data="folder-path\dawn2010.xpt" notsorted;

nest STRATA PSU REPLICATE / MISSUNIT;
totcnt PSUFRAME _minus1_ _zero_;
weight CASEWGT;
class RACE;
table RACE;
var ALCOHOL;
print mean semean lowmean upmean
/style=nchs meanfmt=f10.8 semeanfmt=f10.8 lowmeanfmt=f10.8 upmeanfmt=f10.8;

The SPSS-specific code for the same analysis is:

Get file='path\dawn2010puf.sav'.
CSPLAN ANALYSIS
/PLAN FILE='folder-path\dawn_stage2.csplan'
/PLANVARS ANALYSISWEIGHT= casewgt
/DESIGN STRATA= strata CLUSTER= psu
/ESTIMATOR TYPE = EQUAL_WOR
/POPSIZE VARIABLE=psuframe
/DESIGN CLUSTER=REPLICATE
/ESTIMATOR TYPE = WR.

CSDESCRIPTIVES
/PLAN FILE= 'folder-path\dawn_stage2.csplan'
/SUMMARY VARIABLES= alcohol
/SUBPOP TABLE= race
/MEAN
/STATISTICS SE CIN(95)
/MISSING SCOPE=ANALYSIS CLASSMISSING=EXCLUDE.

Stata-specific code for the same analysis is (note that Stata commands are case sensitive):

use drivename:\path\statadatafilename.dta

gen STRATA2 = PSU   // svydescribe requires distinct naming of design variables in different stages
// the svyset command below is a single line
svyset PSU [pweight=CASEWGT], strata(STRATA) fpc(PSUFRAME) || REPLICATE, strata(STRATA2) singleunit(centered)
save drivename:\path\mystatadata.dta, replace

svy: mean ALCOHOL, over(RACE)
estat strata
estat effects, deff

The SAS code for this analysis, disregarding the second-stage variance (i.e., treating the first stage as sampled with replacement and ignoring the finite population correction), is:

proc surveymeans data=libname.dataSetName mean clm nomcar;
strata strata;
cluster psu;
weight casewgt;

domain race;
var alcohol;
run;

R-specific code for the same analysis:

load("input-path-of-folder/34083-0001-Data.rda")  #ICPSR DAWN PUF 2010 (R data)
#da31921.001.rda is created in R console from this load command
dawn9 = da31921.0001 
rm(da31921.0001)    # to free up RAM, remove the full r data frame
keepvars = c("STRATA", "PSU", "REPLICATE", "PSUFRAME", "CASEWGT", "RACE" , "ALCOHOL")
mydat = dawn9[, keepvars]           #data with a subset variables

# ALCOHOL and RACE are factor variables; make them numeric by stripping the value labels
library(prettyR)   # load the prettyR package (install it first if it is not already installed)
mydat$ALCOHOL <- as.numeric(sub("^\\(0*([0-9]+)\\).+$", "\\1", mydat$ALCOHOL))
mydat$RACE <- as.numeric(sub("^\\(0*([0-9]+)\\).+$", "\\1", mydat$RACE))
# to prepare the fpc variable for the 1st stage of the 2-stage variance estimation design

strpsu = unique(mydat[,c("STRATA","PSU")])
strpsu$one = 1
strpsu=aggregate(strpsu$one,by=list(strpsu$STRATA), FUN=sum, na.rm=TRUE)
library(reshape)   # for the rename() function
strsample = rename(strpsu, c(Group.1="STRATA", x="PSUsample"))
strframe = unique(mydat[,c("STRATA","PSUFRAME")])
strframe = strframe[order(strframe$STRATA),]
str = merge(strframe, strsample, by=c("STRATA"))
str = transform( str, n.over.N = PSUsample / PSUFRAME )
str <- subset(str, select=-c(PSUFRAME))

mydat = merge(mydat, str, by=c("STRATA"))
rm(str, strframe, strsample, strpsu)
# survey design for the Taylor-series linearization method
library(survey)  # load the survey package (install it first if it is not already installed)
options(survey.lonely.psu = "adjust" )
# create a survey design object (desg) with the DAWN design information
mydat$zero=0    # for 2nd stage fpc
desg <- svydesign(id =~PSU + REPLICATE , strata =~STRATA + PSU, fpc =~n.over.N + zero, weights =~CASEWGT, data = mydat ,  nest = TRUE )
# to calculate the means or proportions of ALCOHOL by RACE:
out = svyby(~ALCOHOL, ~RACE, design = desg , FUN=svymean, vartype=c("se","ci"), na.rm = TRUE)
print(out)   # confidence intervals for 95%
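Two optional follow-up calls may be useful; this is a sketch that assumes the desg and out objects created above:

confint(out)                                         # 95% confidence intervals for the domain means
svymean(~ALCOHOL, desg, deff = TRUE, na.rm = TRUE)   # overall proportion with its design effect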

Note that the design variables (except the sampling weight variable) do not affect first-order moment statistics such as means, proportions, percents, totals, and weighted counts. However, the design variables must be used to produce correct SE estimates for descriptive and inferential statistics.

The target population of hospitals was divided into disjoint strata (denoted STRATA) such that every hospital in the population belongs to exactly one stratum. Hospitals are the PSUs. In the first stage of sampling, hospitals were selected within each stratum by simple random sampling (SRS) without replacement from the finite population. The stratum size (denoted PSUFRAME) is the population count of hospitals (i.e., PSUs) contained in the stratum. Depending upon the year and the size of the hospital, either all REPLICATEs were selected, which resulted in no contribution to the second-stage variance for that hospital, or second-stage sampling at rates as small as 1/3 was performed using systematic random sampling. This resulted in sample files that averaged about 300,000 visits per year. Tabulation and analysis of these large files, when many variables were involved, proved computationally expensive because of the time needed to calculate exact estimates of the second-stage variance. In multi-stage sampling, the greatest portion of the variability arises in the first stage, so the second-stage contribution is usually a small fraction of the first-stage variance. Because unacceptable amounts of time were required to estimate this small fraction of the total variance, an approximation method was adopted that made variance estimation for large tables with many variables practical: the sampled visits within each hospital were sorted into two groups, and the between-group variance was calculated and substituted for the standard second-stage variance. DAWN research simulations showed that this approximation is sufficiently precise and efficient for DAWN standard error calculation. Note that STRATA is the first-stage and PSU the second-stage stratification variable in the sample; proper use of the REPLICATE variable accounts for within-hospital variation, i.e., the second-stage variance. The final analysis weight variable, CASEWGT, was adjusted for unequal probabilities of selection, hospital nonresponse, and coverage bias of ED visits relative to benchmark population totals of ED visits from the American Hospital Association (AHA) database.

CASEWGT is a poststratification-adjusted analysis weight variable. By using CASEWGT, the weighted estimates of counts or totals obtained for any of the survey variables are estimates for the entire universe of DAWN-eligible hospitals in the United States. Poststratification adjustments were implemented within the design strata to offset coverage bias in the estimates; that is, post-strata were not constructed from any of the ED visit characteristic variables. Therefore, the software packages correctly take into account the design effects on the variance estimates of estimated counts or totals for variables related to DAWN visits. For further details on sample design and weighting adjustments, please see the DAWN Methodology Report.
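If weighted national totals are of interest, the same design object can be used. A minimal sketch, assuming the desg object and mydat data frame from the code above:

svytotal(~ALCOHOL, desg, na.rm = TRUE)   # estimated total of alcohol-related ED visits, with SE
sum(mydat$CASEWGT)                       # weighted count of all ED visits in the DAWN universe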

For additional technical information on options and how the various statistical software handle issues with missing values and degrees of freedom, please see the FAQ: Technical issues of sampling design analysis of DAWN data.

Wednesday, April 10, 2013

How do I perform a homogeneity test of proportions or percentages in the R-DAS?

The R-DAS does not have the Comparison of Means analysis available. However, the Frequencies/Crosstabulation program has a Summary Statistics option that performs a test of independence (no association) between two categorical variables using Rao-Scott F statistics, which take the complex design effect into account. The test of independence of a two-way contingency table is equivalent to the test of homogeneity of row (or column) percents (StataCorp 2011, pp. 141-142). The null hypothesis for the latter test is that the row (or column) percents are equal for every category of the column (or row) variable.

For example, if your variable of interest is levels of alcohol consumption in the Column field, you can use the row percentage option and the resulting table output to approximately determine homogeneity (or the lack of homogeneity) of row (ethnic) groups among the levels of alcohol consumption. In other words, this test determines whether the distributions of each of the ethnic groups (among the alcohol levels) are equal.

The Rao-Scott F statistics are calculated from the contingency table of the Row by Column variables. The test is significant at the x% level of significance if the p-value of the Rao-Scott F statistic is less than x%; in that case the test concludes that there is an association (dependence) between the Row and Column variables. The first screenshot shows the RACE4 x ALCREC (recoded) table output for Total percent with the Summary Statistics box checked. From this display of cell percents (i.e., total percents), confidence intervals, and weighted cell frequencies, it is difficult to compare the prevalence of alcohol use across racial groups. To interpret the output with regard to the test of homogeneity, it is more useful to display the table as row percents. The second screenshot shows the contingency table output for Row percent: a larger percentage (63.7%) of whites consumed alcohol within the last 30 days than blacks, others, or Hispanics. Since only the way the percentages are displayed was changed, the Rao-Scott F statistic is identical in both screenshots.
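For users analyzing a public-use file directly in R rather than through the R-DAS, the same kind of Rao-Scott test can be obtained with the survey package. A minimal sketch, where des is a hypothetical svydesign object and RACE4 and ALCREC stand in for the recoded variables shown in the screenshots:

library(survey)
svychisq(~RACE4 + ALCREC, design = des, statistic = "F")   # Rao-Scott second-order F test of independence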

Reference:

StataCorp. 2011. Stata Survey Data Reference Manual, Release 12. Statistical Software. College Station, TX: StataCorp LP.

Monday, April 8, 2013

Is there a way to compare multiple means using the MEANS analytic option in SDA?

Yes. There are ways to compare multiple means (k-1 comparisons) using the Comparison of Means Program in SDA. Note that if a dependent variable is coded as 0/1, then the mean of the dependent variable is essentially the proportion.

In the Means Program, there are several choices in the "Main statistic to display" dropdown. The default is to display the means of the dependent variable for the categories of the required Row variable; the Row variable categories define the domains for subpopulation analysis. Selecting the "differences from Row category" option allows you to choose a base category in the "If differences from a row or column, indicate base category" box. When you run the analysis with this selection, each of the other row cells shows the difference between that cell's mean and the base Row category's mean. This selection, together with the z/t-statistic and p-value options, produces the (k-1) comparisons of means or proportions and the associated statistics from a single Comparison of Means run. This is a comparison of domain means for an outcome variable, where the domains are defined by the categories of a Row-only variable; an equivalent calculation in R is sketched below.
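An equivalent set of (k-1) comparisons can be produced outside SDA with a survey-weighted regression, where the intercept is the base-category mean and each coefficient is the difference from that base. A minimal R sketch, assuming a hypothetical svydesign object des, a 0/1 dependent variable Y, and a k-level classification variable GROUP:

library(survey)
fit <- svyglm(Y ~ factor(GROUP), design = des)   # identity link: coefficients are mean differences
summary(fit)                                     # each coefficient is the difference from the base
                                                 # category, with its SE, t statistic, and p-value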

NOTES:

  1. In comparison of means testing, there are k(k-1)/2 differences of means or proportions to compare across k domains (subpopulations or subgroups). In a single table run, the Means program simultaneously tests the differences of (k-1) pairs of means from the base category mean of a k-category classification variable. You therefore obtain all k(k-1)/2 distinct comparison tests by choosing a different base category in each of (k-1) separate Means program runs. For a Row variable with k=4 levels (say, A, B, C, and D), you obtain 6 tests of differences of means (B-A, C-A, D-A with base A; C-B, D-B with base B; and D-C with base C) from 3 (=4-1) separate Means program runs, using A, B, and C as the base categories respectively.
  2. The computing in the R-DAS analysis system is equivalent to the Frequencies/Crosstabulation program module in SDA. It is not possible to perform (k-1) comparisons of means or proportions using the R-DAS or the SDA Frequencies/Crosstabulation program, but it is possible to perform a test of homogeneity of row (or column) percents for a two-way table. There is a separate FAQ on how to perform a homogeneity test of proportions or percentages using the Frequencies/Crosstab program in R-DAS.

Additional information on using the "Main statistic to display" option in the MEANS program can be found here:  http://www.icpsr.umich.edu/SDAHELP/helpan.htm#mstats

Monday, March 11, 2013

How do I use setup files to import plain text (ASCII) data?

Many SAMHDA data collections that contain ASCII data files are accompanied by setup files that allow users to read the text files into statistical software packages. Because plain alphanumeric data files are impractical to interpret visually, statistical software is needed to define, manipulate, extract, and analyze the variables and cases within them. SAMHDA currently provides setup files for the SAS, SPSS, and Stata statistical software packages.
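For illustration, this is what a setup file automates: it supplies the column positions, variable names, and value labels needed to read the raw text. A minimal R sketch with hypothetical file name, column widths, and variable names (the SAMHDA setup files themselves are for SAS, SPSS, and Stata):

# read a fixed-width ASCII file; all positions and names below are made up for illustration
dat <- read.fwf("DS0001-Data.txt",
                widths = c(2, 1, 3),
                col.names = c("AGE", "SEX", "WEIGHT"))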

Utilizing Setup Files

ICPSR has prepared tutorials on how to analyze data using setup files:

  • ASCII Data File + SAS Setup Files: PDF PPT

  • ASCII Data File + SPSS Setup Files: PDF PPT

  • ASCII Data File + Stata Setup Files: PDF PPT

You can find video tutorials addressing this topic on the ICPSR YouTube channel.

Troubleshooting Setup Files

Many statistical packages will not run a setup file unless you reset the Windows default setting that hides file extensions.

Resetting File Extensions

In Windows 7 and Windows XP, file extensions are hidden by default; however, SAS, SPSS and Stata need to see the file extension to run setup files. ICPSR has created a short tutorial to show you how to reset the file extension option in Windows, so your setup files will run properly.

You can find video tutorials addressing this topic on the ICPSR YouTube channel.

Wednesday, February 6, 2013

How do I find the number of cases with any mention of a specific drug in DAWN?

The Drug Abuse Warning Network (DAWN) public-use data file includes information on one or more substances contributing to an emergency department (ED) visit. In other words, some cases report only a single substance and other cases report multiple substances (e.g., cocaine, simvastatin, and Benadryl all present in the same ED visit). Beginning with the 2009 DAWN public-use file, information is included for up to 22 drugs reported in the ED visit, an increase from the 16 drug mentions available in previous years of DAWN. Within the DAWN data, there is currently no automated way to search for a specific drug name across all drug mentions (up to 22) for each case in order to produce the total number of ED cases involving a particular substance. For instance, suppose a case reports only a single substance, codeine; it is recorded in the variable DRUGID_1 with the value 11 (codeine's code), and the remaining variables, DRUGID_2 to DRUGID_22, have missing codes. Suppose another case reports four substances (ibuprofen, simvastatin, codeine, and warfarin). These substances are recorded in the first four variables, DRUGID_1 to DRUGID_4 (14 in DRUGID_1 for ibuprofen, 468 in DRUGID_2 for simvastatin, 11 in DRUGID_3 for codeine, and 21 in DRUGID_4 for warfarin), with the remaining DRUGID variables having missing codes.

Determining the number of DAWN ED cases that involve a particular drug is important for many types of analyses and reports.

Some information may be obtained via the Excel files available on the SAMHSA website. These tables provide weighted national estimates for a particular drug or category. Tables that provide weighted estimates for some metropolitan areas are also available.

For users looking for a specific drug that is not included in the SAMHSA tables, or for those interested in more detailed statistical analysis, we provide the following programming code to create a new variable that specifically answers the question, "Is drug _______ present at the time of the ED visit, Yes/No?" This FAQ provides sample code for five environments: the online analysis system (SDA), SAS, SPSS, Stata, and R. Some knowledge of at least one of these is required.

NOTE: The sample SDA, SAS, SPSS, Stata, and R code below uses the DRUGID variables and the DRUGID value/code that corresponds to codeine in the 2009 DAWN for illustration purposes. When creating your own "Is substance _____ present in the person's system?" variable, find and replace the relevant information via the following steps to obtain the drug information you are specifically seeking.

  1. Determine the year of DAWN data for which you wish to create the customized variable.
  2. Decide which substance name/categorization variable is best suited to the information you are seeking: choose one of DRUGID, SDLED_1, SDLED_2, SDLED_3, SDLED_4, SDLED_5, or SDLED_6.
  3. In the PDF or HTML codebook, find the numeric value that corresponds to the drug name/category you are investigating.
  4. Decide on a name for your newly computed variable (it should not be a variable name that already exists in the original DAWN data file).
  5. Replace the variable name, the drug ID/category code, and the final new variable name information in the sample code with the information you identified in the previous steps.

Example 1: SDA

The SDA system can be used to obtain the number of cases in which a particular drug was reported. The first step is to use the PDF codebook appendix to look up the drug ID number for the substance you are interested in. For this example we have selected codeine (drug ID number = 11) as the drug of interest.

In SDA, select the "Compute a new variable" option from the "Create Variables" dropdown. See Exhibit 1, below.

Exhibit 1

Once you are in the "Compute a new variable" function, the field "Expression to Define the New Variable" is shown. See Exhibit 2, below.

Exhibit 2

The code to enter into the "Expression to Define the New Variable" field for this codeine example is:

IF (DRUGID_1 eq 11 OR DRUGID_2 eq 11 OR DRUGID_3 eq 11 OR DRUGID_4 eq 11 OR DRUGID_5 eq 11 OR DRUGID_6 eq 11 OR DRUGID_7 eq 11 OR DRUGID_8 eq 11 OR DRUGID_9 eq 11 OR DRUGID_10 eq 11 OR DRUGID_11 eq 11 OR DRUGID_12 eq 11 OR DRUGID_13 eq 11 OR DRUGID_14 eq 11 OR DRUGID_15 eq 11 OR DRUGID_16 eq 11 OR DRUGID_17 eq 11 OR DRUGID_18 eq 11 OR DRUGID_19 eq 11 OR DRUGID_20 eq 11 OR DRUGID_21 eq 11 OR DRUGID_22 eq 11)
CODEINE = 1
ELSE
CODEINE = 0

It is very important to specify "Yes" for "Include missing-data values in the computation?". Otherwise, nearly all cases would be set to missing for the new computed variable because of the large number of missing values in the higher-numbered DRUGID variables (e.g., DRUGID_22).

In the SDA Compute Program, specifying a variable label, value labels, and descriptive text (question text) is optional. These options are useful to add if you intend to paste SDA output results into a document (note: all tables and graphs produced by SDA can be copied and pasted into Excel or Word).

As with all other functions in SDA, nearly every field and option has help documentation that can be selected by clicking on the field of interest.

Example 2: SAS

data work.tmp;
set 'drivename:\filepath\filename';

Codeine=0;
if (drugid_1= 11 or drugid_2= 11 or drugid_3= 11 or drugid_4= 11 or drugid_5= 11
or drugid_6= 11 or drugid_7= 11 or drugid_8= 11 or drugid_9= 11 or drugid_10= 11
or drugid_11= 11 or drugid_12= 11 or drugid_13= 11 or drugid_14= 11 or drugid_15= 11
or drugid_16= 11 or drugid_17= 11 or drugid_18= 11 or drugid_19= 11 or drugid_20= 11
or drugid_21= 11 or drugid_22= 11) then Codeine=1;
run;

Example 3: SPSS

compute Codeine=0.
if (drugid_1= 11 or drugid_2= 11  or drugid_3= 11 or drugid_4= 11 or drugid_5= 11 or drugid_6= 11
or drugid_7= 11 or drugid_8= 11 or drugid_9= 11 or drugid_10= 11  or drugid_11= 11  or drugid_12= 11
or drugid_13= 11 or drugid_14= 11 or drugid_15= 11 or drugid_16= 11 or drugid_17= 11 or drugid_18= 11
or drugid_19= 11 or drugid_20= 11 or drugid_21= 11 or drugid_22= 11)   Codeine=1.

Example 4: Stata

*drop Codeine
gen Codeine=0
replace  Codeine=1 if drugid_1== 11 | drugid_2== 11 | drugid_3== 11 | drugid_4== 11 | drugid_5== 11 | drugid_6== 11  | drugid_7== 11 | drugid_8== 11 | drugid_9== 11 | drugid_10== 11 | drugid_11== 11 | drugid_12== 11 | drugid_13== 11  | drugid_14== 11 | drugid_15== 11 | drugid_16== 11 | drugid_17== 11 | drugid_18== 11 | drugid_19== 11 | drugid_20 == 11 | drugid_21== 11 | drugid_22== 11

Example 5: R

# get R format downloaded DAWN 2009 PUF file into R console

load("d:/~/Desktop/ICPSR_31921/DS0001/31921-0001-Data.rda")
# the load command creates the data frame da31921.0001 (by default)
dawn9 = da31921.0001  # make a copy of data file with a shorter name
rm(da31921.0001)    # removing the duplicate data

codeine01 = as.numeric(as.numeric(dawn9$DRUGID_1)==11)
codeine01[is.na(codeine01)] <- 0  # replacing NA by 0
codeine02 = as.numeric(as.numeric(dawn9$DRUGID_2)==11)
codeine02[is.na(codeine02)] <- 0
codeine03 = as.numeric(as.numeric(dawn9$DRUGID_3)==11)
codeine03[is.na(codeine03)] <- 0
codeine04 = as.numeric(as.numeric(dawn9$DRUGID_4)==11)
codeine04[is.na(codeine04)] <- 0
codeine05 = as.numeric(as.numeric(dawn9$DRUGID_5)==11)
codeine05[is.na(codeine05)] <- 0
codeine06 = as.numeric(as.numeric(dawn9$DRUGID_6)==11)
codeine06[is.na(codeine06)] <- 0
codeine07 = as.numeric(as.numeric(dawn9$DRUGID_7)==11)
codeine07[is.na(codeine07)] <- 0
codeine08 = as.numeric(as.numeric(dawn9$DRUGID_8)==11)
codeine08[is.na(codeine08)] <- 0
codeine09 = as.numeric(as.numeric(dawn9$DRUGID_9)==11)
codeine09[is.na(codeine09)] <- 0
codeine10 = as.numeric(as.numeric(dawn9$DRUGID_10)==11)
codeine10[is.na(codeine10)] <- 0
codeine11 = as.numeric(as.numeric(dawn9$DRUGID_11)==11)
codeine11[is.na(codeine11)] <- 0
codeine12 = as.numeric(as.numeric(dawn9$DRUGID_12)==11)
codeine12[is.na(codeine12)] <- 0
codeine13 = as.numeric(as.numeric(dawn9$DRUGID_13)==11)
codeine13[is.na(codeine13)] <- 0
codeine14 = as.numeric(as.numeric(dawn9$DRUGID_14)==11)
codeine14[is.na(codeine14)] <- 0
codeine15 = as.numeric(as.numeric(dawn9$DRUGID_15)==11)
codeine15[is.na(codeine15)] <- 0
codeine16 = as.numeric(as.numeric(dawn9$DRUGID_16)==11)
codeine16[is.na(codeine16)] <- 0
codeine17 = as.numeric(as.numeric(dawn9$DRUGID_17)==11)
codeine17[is.na(codeine17)] <- 0
codeine18 = as.numeric(as.numeric(dawn9$DRUGID_18)==11)
codeine18[is.na(codeine18)] <- 0
codeine19 = as.numeric(as.numeric(dawn9$DRUGID_19)==11)
codeine19[is.na(codeine19)] <- 0
codeine20 = as.numeric(as.numeric(dawn9$DRUGID_20)==11)
codeine20[is.na(codeine20)] <- 0
codeine21 = as.numeric(as.numeric(dawn9$DRUGID_21)==11)
codeine21[is.na(codeine21)] <- 0
codeine22 = as.numeric(as.numeric(dawn9$DRUGID_22)==11)
codeine22[is.na(codeine22)] <- 0

codeine = codeine01 + codeine02 + codeine03 + codeine04 + codeine05 + codeine06 + codeine07 + codeine08 + codeine09 + codeine10 + codeine11 + codeine12 + codeine13 + codeine14 + codeine15 + codeine16 + codeine17 + codeine18 + codeine19 + codeine20 + codeine21 + codeine22

rm(codeine01, codeine02, codeine03, codeine04, codeine05, codeine06, codeine07, codeine08, codeine09, codeine10, codeine11, codeine12, codeine13, codeine14, codeine15, codeine16, codeine17, codeine18, codeine19, codeine20, codeine21, codeine22)

table(codeine)  # to display the counts
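For users comfortable with R, the 22 indicator variables above can be built in a single step. The following is a minimal sketch that is intended to be equivalent in behavior to the step-by-step block above (it assumes the dawn9 data frame created earlier):

# count, for each ED visit, how many of the 22 drug-mention variables equal 11 (codeine)
drugvars <- paste0("DRUGID_", 1:22)
codeine  <- rowSums(sapply(dawn9[, drugvars], function(x) as.numeric(x) == 11), na.rm = TRUE)
table(codeine)   # same counts as the step-by-step version above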

How can I transfer output from SDA and R-DAS to a document, spreadsheet, or presentation?

You can copy and paste output, including tables and charts, from SDA and R-DAS into a document, spreadsheet, or presentation. To copy and paste, use your mouse to highlight the output and then click "copy". When pasting the output into a document, use the "paste special" option to retain the same display as in SDA and R-DAS. To transfer data only, paste the output using the standard paste option.

If you have a PDF file creator or print driver, you can also print the output to a PDF file.

Thursday, January 17, 2013

My results in R-DAS were blocked by the disclosure protection settings. How do I avoid having my output blocked?

Because of confidentiality concerns, we are unable to provide specific details about what is causing the disclosure protection settings to block output for a specific analytic run. However, we are able to provide solutions for several common reasons that analytic results are blocked.

When output is blocked, you may get one of these messages:

  • "The Row Total is equal to the value of one of the cells."
  • "To preserve confidentiality, tables cannot be displayed when the number of observations in any cell in the table is too low."

Definitions of the various blocked result messages are available in another FAQ.

Below are several examples of analytic requests where the results were blocked, and possible solutions for how to change your request to receive some analytic results.

Example 1: A user runs a crosstabulation where State is the column variable.

Possible solutions:

If interested in a single state, you might try placing the State variable in the Filter field to specify the one state for analysis. For example, entering STATE(1) in the filter field will give you results for just Alabama. Focusing your analysis on only one state might help you avoid a circumstance where a different state is causing your results to be blocked.

Another option would be to use a geographic variable such as Census Region or Division to avoid the low record counts that can cause your results to be blocked.

Example 2: A user runs a crosstabulation where AGE is the column variable.

Possible Solutions:

The AGE variable spans an age range from 12 to 103 years old. You could try using one of the categorized age variables within the data file.

Alternatively, you could utilize the temporary recode feature in R-DAS that allows you to recode a variable into fewer categories.

Help documentation on doing temporary recodes can be found at: http://www.icpsr.umich.edu/icpsrweb/content/SAMHDA/help/helpan.htm#recode

Example 3: A user runs a three-way crosstabulation using the Row, Column, and Control fields. However, the results are blocked, and the user has no idea which variable or combination of variables contains the low record count.

Possible solutions:

Run frequencies for the variables in your analysis one at a time. One variable may stand out as having a value with a particularly low weighted frequency, and it is possible that a value has such a small record count that even the univariate frequency is blocked. If one variable does stand out as the primary cause of the problem, check whether a similar variable with fewer categories exists, or do a temporary recode to create larger record counts.

If no single variable stands out as causing the problem, try running crosstabs on two of your variables at a time. If any combination of values from the two variables has a particularly low weighted frequency, that combination may be the cause of the problem. If one combination does stand out, you could substitute similar variables with fewer categories, or do a temporary recode on one or more of your variables to create larger record counts for the categories/values that are the likely cause of the problem.

My results in R-DAS were blocked by the disclosure protection settings. What do the various messages mean?

Below are descriptions of the most common messages that display when analytic results are blocked.

  1. The Row Total is equal to the value of one of the cells.

    This message refers to a built-in disclosure limitation protection for specific crosstab output. In the following 5 x 3 crosstabulation example, the sum of the 4th row (0 + 5 + 0 = 5) is equal to a single cell in that row. The whole table is suppressed when this happens.

    6 15 8
    9 17 8
    3 20 5
    0 5 0
    30 4 7

  2. To preserve confidentiality, tables cannot be displayed when the number of observations in any cell is too low.

    This error message indicates that at least one cell in the frequency table or crosstabulation does not meet the threshold established by CBHSQ/SAMHSA for protecting the confidentiality of respondents.

  3. To preserve confidentiality, analyses are not permitted to use the following variable(s): 'variable name'

    This message appears when one of the complex design variables (weight, strata, or cluster) is entered into one of the analysis fields (e.g., Row). While the complex sampling design variables are used by the R-DAS system to calculate accurate statistics, they are not available for direct analysis because of the potential disclosure risk involved.

What is the Data Portal?

The Data Portal provides secure remote access to confidential data from the Center for Behavioral Health Statistics and Quality (CBHSQ), Substance Abuse and Mental Health Services Administration (SAMHSA).

CBHSQ confidential data can only be accessed remotely through the Data Portal using special software. This virtual computing environment has been designed to provide authorized researchers access to confidential data for approved research projects. The Data Portal can only be accessed from approved computer location(s) and IP address(es) at the researcher's organization. Users are required to maintain the confidentiality of the data in the Data Portal. Researchers cannot transfer data into or out of the Data Portal.

The goal of the Data Portal is to maximize the use of CBHSQ data for important research and policy analyses, while conforming to Federal law and protecting identifiable data from disclosure.

What is the process for Data Portal approval and access?

The application process is described in detail in section 3 of the Data Portal Confidentiality Procedures Manual. An abbreviated description of the application process follows.

For each research project, the organization(s) must complete the Application for Access. Completed applications are to be submitted to SAMHDA at dataportal@icpsr.umich.edu. (The application does not need to be signed and does not need to include CVs.)

Once a complete application is submitted to SAMHDA, the Center for Behavioral Health Statistics and Quality (CBHSQ) will review the contents of the application for completeness. CBHSQ will verify that only eligible individuals will have access to the data.

CBHSQ can only approve a limited number of applications. If more completed applications are received than Data Portal resources can support, additional criteria for evaluating the applications will be used. The primary criteria for selection are:

  • The behavioral health impact of the proposed project and its potential contribution to, and alignment with, the missions of the Department of Health & Human Services and SAMHSA,
  • How well the research is aligned with the purpose1 for which the data were collected, and
  • Whether the data requested is suitable for the proposed research project given data limitations (available sample size or survey content).

CBHSQ will also consider secondary evaluation criteria:

  • Available resources needed by CBHSQ to prepare the data file and the cost of site inspection.
  • The experience and capabilities of the research team.

Once the application has been approved, all individuals listed on the application must participate in confidentiality training. The project team will be notified about how this training will be conducted.

After the training is completed, the applicant submits the required paperwork:

  1. Confidential Data Use and Nondisclosure Agreement (CDUNA)
  2. Designation of Agent and Affidavit of Non-Disclosure Form
  3. Declaration of Nondisclosure (for federal employees only)

Approved applicants have six (6) months to complete the required confidentiality training and submit the required forms. Applications will be terminated for any applicant who fails to meet these requirements within six (6) months of application approval. Applicants with closed applications will need to reapply for Data Portal access during a future call for applications.

When the original signed CDUNA and affidavit(s) are received by CBHSQ and CBHSQ determines they are complete and final, the Principal Project Officer (PPO) and project team will be authorized to access the Data Portal. A copy of the signed and approved CDUNA will be sent to the PPO.

An email will be sent to each approved project team member listed on the application with information on how to access the custom dataset, which will contain only the variables that were requested and approved. Access to these data is allowed only for approved project members who have signed affidavits within the last year.

1The Data Portal provides access to Drug Abuse Warning Network (DAWN) and National Survey on Drug Use and Health (NSDUH) data sets. For descriptions of these data sets, see DAWN and NSDUH resources.

How can I get help with the Data Portal?

For questions and assistance with the Data Portal, please email dataportal@icpsr.umich.edu.

SAMHDA also operates a toll-free helpline (888-741-7242) Monday through Friday, 8:00 a.m. to 5:00 p.m. (EST). The local helpline number is (734) 615-9524. Staff try to respond to email and helpline questions within one business day. Answers to many questions can be found in the Data Portal Confidentiality Procedures Manual.