Wednesday, August 14, 2013

What are the technical details on the complex sample design for DAWN?

Primary sampling units (PSUs) are hospitals within strata, and secondary sampling units (SSUs) are records of emergency department (ED) visits within PSUs. Hospitals selected with a probability equal to one in the first stage of sampling are "certainty hospitals"; in a stratum of certainty hospitals, all hospitals in the stratum are selected. The finite population correction factor (1-fh) is therefore zero for strata with certainty hospitals, and consequently those strata make no variance contribution at the first stage of sampling. (In the other strata, sampling was without replacement (WOR) from finite populations.) Here fh=nh/Nh, where nh is the count of sampled hospitals in the h-th stratum and Nh is the corresponding population (frame) count given in the variable PSUFRAME. The ED visit records of certainty hospitals were still randomly chosen, i.e., the visits are not a complete enumeration. To take into account the within-hospital variation for ED visits, the DAWN PUF provides an additional design variable, REPLICATE, for the second stage of sampling, which is required for correct statistical inference. In sum, each stratum has at least 2 hospitals (PSUs), each hospital has exactly two replicates (SSUs), and each replicate should have numerous ED visit records.
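As an illustration of why certainty strata drop out at the first stage, the sketch below computes one stratum's first-stage contribution to the variance of an estimated total using the common textbook form (1-fh) * nh/(nh-1) * sum of squared deviations of PSU totals from their stratum mean. The function name, argument names, and the example totals are hypothetical and are not taken from the DAWN PUF; the exact formula used by any given software package may differ in detail.

```python
from statistics import mean

def first_stage_variance(psu_totals, frame_count):
    """First-stage variance contribution of one stratum (illustrative sketch).

    psu_totals  : weighted totals, one per sampled hospital (PSU) in the stratum
    frame_count : N_h, the number of hospitals on the frame (PSUFRAME)
    """
    n_h = len(psu_totals)
    if n_h < 2:
        raise ValueError("need at least two PSUs to estimate a stratum variance")
    f_h = n_h / frame_count              # sampling fraction f_h = n_h / N_h
    ybar = mean(psu_totals)              # mean of the PSU totals in the stratum
    ss = sum((y - ybar) ** 2 for y in psu_totals)
    return (1 - f_h) * n_h / (n_h - 1) * ss

# Hypothetical stratum of three hospitals with made-up weighted totals.
totals = [120.0, 95.0, 140.0]
certainty = first_stage_variance(totals, frame_count=3)   # all 3 of 3 selected
sampled = first_stage_variance(totals, frame_count=30)    # 3 of 30 selected
print(certainty, sampled)
```

With all frame hospitals selected, fh equals one and the contribution vanishes regardless of how much the hospital totals differ; the remaining variance for such strata comes from the second stage (the replicates).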

Some issues with variance estimation by the Taylor method and with the calculation of degrees of freedom should be noted. The SAS, SPSS, Stata, SUDAAN, and R software packages calculate the variance contribution for each stage of the design using the deviations between each unit's value (i.e., total) and the mean of all units' values within the stage. (Unit indicates the PSU for the first sampling stage and the SSU for the second.) There are no single-unit (i.e., singleton) strata in the DAWN PUF data, but certain analyses may encounter a singleton stratum while calculating the variance for a domain (also called a subclass, subgroup, or subpopulation). A singleton stratum occurs when a single unit (PSU or SSU) has at least one observation and the other units in that stratum have none. The software packages handle units with no observations in different ways when calculating the variance and degrees of freedom. The MISSUNIT option in SUDAAN, the singleunit(center) option in Stata, and options(survey.lonely.psu = "adjust") in R handle such cases by calculating the variance contribution of a singleton stratum from the deviation between that unit's total and the grand mean of the sample. By default, SPSS assumes that there was at least one other unit in the sample at that stage (if not, the stratum contributes a null variance); units with no observations (sampling zeros) therefore have unit totals of 0 and do contribute to the stratum variance.

An analysis can also encounter strata with no observations at all (empty strata), typically in domain analysis. The question is how each software package handles this situation when computing the overall variance and the degrees of freedom. SUDAAN assumes that empty strata are a result of the sample selection rather than a feature of the population, i.e., that the stratum does contain such cases in the population even though the sample includes none. SUDAAN and R treat those missing units as sampling zeros. Thus, each empty stratum contributes zero variance to the overall variance but still contributes to the degrees of freedom. Stata does the same by default, but certain Stata procedures have options, for instance the singleunit(center) option, that depart from the assumption of sampling zeros and treat such empty strata as structural zeros. The logic is that when a stratum has no cases at all, one should assume that the stratum is not part of the sampling for the domain, so it contributes nothing to the overall variance. In Stata, the degrees of freedom determined with the singleunit(center) option are smaller than those obtained by the default method in those instances where domains are not common to all strata. Note that the variance estimates from the SUDAAN, Stata, and R software packages with the options stated above are always the same whether the assumption of sampling zeros is retained or dropped, but the degrees of freedom differ.
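The two treatments of a singleton stratum described above can be contrasted with a small sketch. It shows only the idea: one function centers the lone unit's total on the grand mean of all unit totals in the sample, the other keeps the empty unit in the stratum with a total of 0 (a sampling zero). The function names and numbers are hypothetical, the finite population correction is left out for brevity, and the scaling applied by SUDAAN, Stata, R, or SPSS is not reproduced here.

```python
from statistics import mean

def centered_contribution(singleton_total, all_unit_totals):
    """Squared deviation of the lone unit's total from the grand mean
    of all unit totals in the sample (grand-mean centering idea)."""
    grand_mean = mean(all_unit_totals)
    return (singleton_total - grand_mean) ** 2

def sampling_zero_contribution(unit_totals):
    """Usual within-stratum formula, keeping empty units as totals of 0."""
    n = len(unit_totals)
    ybar = mean(unit_totals)
    return n / (n - 1) * sum((y - ybar) ** 2 for y in unit_totals)

# Hypothetical domain totals: the stratum has two units, but only one
# of them has any observations in the domain (total 40.0).
grand = centered_contribution(40.0, all_unit_totals=[40.0, 10.0, 30.0, 20.0])
zeros = sampling_zero_contribution([40.0, 0.0])
print(grand, zeros)
```

The point of the comparison is that the unit with no observations is either ignored (and the remaining unit is compared with the rest of the sample) or retained as a genuine zero that adds to the stratum's spread.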

The calculation of degrees of freedom (df) is crucial in all of these software packages because it influences inferential statistics such as confidence intervals and the p-values of test statistics. Conventionally, the df is calculated by the fixed-PSU method: the 'fixed' df is the number of PSUs minus the number of strata at the first stage of the sample design, whatever the number of sampling stages, and is computed from the full data file. The SPSS and R software packages always use the fixed-PSU method for calculating the df in all aspects of analysis. It is also the default in SUDAAN, but users can supply a predetermined df with the DDF= option. The fixed-PSU method is the default in Stata as well, but Stata has options that invoke an alternate calculation known as the variable-PSU method. For example, Stata with the singleunit(center) option uses the variable-PSU method, in which the 'variable' df is the number of non-empty PSUs minus the number of non-empty strata. The number of non-empty PSUs is the number of PSUs in the sample minus the number of PSUs with no observation in all singleton strata. To compare inferential statistics across software packages, users can manually calculate the 'variable' df for a domain analysis and specify it in SUDAAN with the DDF=df option in the PROC statement, or specify the 'design' df in Stata with svy, dof(df):.
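A manual df count of the kind described above might be sketched as follows. The sketch assumes that "non-empty" simply means "has at least one observation in the domain"; the narrower definition of non-empty PSUs given in the text may yield a different count for some designs, so results should be checked against the software's own output. The PSU and stratum identifiers are made up.

```python
def design_df(psu_strata, domain_psus):
    """Return (fixed_df, variable_df) for a domain analysis.

    psu_strata  : dict mapping each sampled PSU id to its stratum id
                  (full data file)
    domain_psus : set of PSU ids with at least one domain observation
    """
    # Fixed-PSU method: all PSUs minus all strata, from the full file.
    fixed_df = len(psu_strata) - len(set(psu_strata.values()))

    # Variable-PSU method (assumed reading): count only PSUs and strata
    # that actually contain domain observations.
    nonempty_psus = [p for p in psu_strata if p in domain_psus]
    nonempty_strata = {psu_strata[p] for p in nonempty_psus}
    variable_df = len(nonempty_psus) - len(nonempty_strata)
    return fixed_df, variable_df

# Hypothetical design: 7 hospitals in 3 strata; the domain appears in
# two hospitals of stratum A, one hospital of stratum B, none of C.
design = {"h1": "A", "h2": "A", "h3": "A",
          "h4": "B", "h5": "B",
          "h6": "C", "h7": "C"}
fixed_df, variable_df = design_df(design, {"h1", "h2", "h4"})
print(fixed_df, variable_df)
```

The resulting value could then be passed to SUDAAN through DDF= or to Stata through svy, dof(df):, as noted above.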

SAMHDA's online data analysis system (called SDA) calculates a slightly different but appropriate df: the number of PSUs in non-empty strata minus the number of non-empty strata. SAS, SPSS, and SDA handle singleton strata in nearly the same way. Note that SAS and SDA can take into account only the first-stage sampling design effects. DAWN data in SDA use a modified (pseudo) single-stage stratified cluster sample that was prepared for compatibility with SDA's complex survey data analysis capability.
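The SDA rule stated above can be written out as a short count. This is only a sketch of the stated rule, not SDA's actual code; it assumes a stratum is "non-empty" when at least one of its PSUs has an observation in the domain, and the identifiers are hypothetical.

```python
def sda_style_df(psu_strata, domain_psus):
    """PSUs in non-empty strata minus the number of non-empty strata.

    psu_strata  : dict mapping each sampled PSU id to its stratum id
    domain_psus : set of PSU ids with at least one domain observation
    """
    nonempty_strata = {psu_strata[p] for p in domain_psus}
    # Every PSU of a non-empty stratum counts, even if that PSU itself
    # has no domain observations.
    psus_in_nonempty = [p for p, s in psu_strata.items()
                        if s in nonempty_strata]
    return len(psus_in_nonempty) - len(nonempty_strata)

# Hypothetical design: 7 hospitals in 3 strata; stratum C is empty
# for the domain, so its two hospitals are not counted.
design = {"h1": "A", "h2": "A", "h3": "A",
          "h4": "B", "h5": "B",
          "h6": "C", "h7": "C"}
sda_df = sda_style_df(design, {"h1", "h2", "h4"})
print(sda_df)
```

In this made-up design the count is 5 PSUs minus 2 strata, which differs from the full-file count of 7 PSUs minus 3 strata only because stratum C has no domain cases.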

For related technical information, please see the FAQ: Accounting for the effects of complex sampling design (design effects) when analyzing DAWN data.