Tuesday, March 25, 2014

How are complex sampling variances computed by the Taylor linearization method?

The NSDUH public-use file consists of sample data from a single-stage, with-replacement (WR), stratified cluster design. The variance estimation stratum and cluster replicate variables are VESTR and VEREP, respectively. The cluster replicates are widely known as primary sampling units (PSUs). The calculation of variance by the Taylor linearization method and the calculation of the degrees of freedom¹ for Student's t test statistic are discussed briefly below.

An exact closed-form (analytic) variance formula cannot be derived for nonlinear statistics such as a mean or a proportion. The Taylor method instead derives a linearized variate for the nonlinear statistic of interest. It can be shown that the variance of the linearized variate of a statistic is, to a first-order approximation, equal to the variance of that statistic. Variance estimation by the Taylor method is therefore nothing more than applying the closed-form variance formula to the linearized variate of the statistic. A total is a linear statistic, whereas a mean is nonlinear because it is the ratio of two linear statistics. The linearized variate for the total of an analysis variable is that analysis variable itself, so the variance formula for a total by Taylor series linearization and the variance formula for a total by the closed-form method are exactly the same. SAS, SPSS, SUDAAN, Stata, and R all calculate the variance of a statistic from the deviations of the PSU-totals of the linearized variate about the mean of the PSU-totals of the linearized variate within each stratum.
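
The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the code of any of the packages named: it assumes a with-replacement design with at least two PSUs per stratum, the commonly used linearized variate for a weighted mean, z = w(y - mean)/sum(w), and a hypothetical record layout of (stratum, PSU, weight, y). The data are made up.

```python
from collections import defaultdict


def taylor_variance_of_mean(records):
    """Return (mean, variance) of the weighted mean sum(w*y) / sum(w).

    records: iterable of (stratum, psu, weight, y) tuples (hypothetical layout).
    """
    records = list(records)
    w_total = sum(w for _, _, w, _ in records)
    mean = sum(w * y for _, _, w, y in records) / w_total

    # PSU totals of the linearized variate z = w * (y - mean) / sum(w).
    psu_totals = defaultdict(lambda: defaultdict(float))
    for stratum, psu, w, y in records:
        psu_totals[stratum][psu] += w * (y - mean) / w_total

    variance = 0.0
    for stratum, totals in psu_totals.items():
        n_h = len(totals)
        if n_h < 2:
            raise ValueError(f"stratum {stratum!r} has fewer than two PSUs")
        # Deviations of PSU totals about their within-stratum mean.
        z_bar = sum(totals.values()) / n_h
        squared_dev = sum((z - z_bar) ** 2 for z in totals.values())
        variance += n_h / (n_h - 1) * squared_dev
    return mean, variance


# Made-up data: two strata, two PSUs each, all weights equal to 1.
records = [
    (1, 1, 1.0, 1.0), (1, 1, 1.0, 3.0), (1, 2, 1.0, 2.0), (1, 2, 1.0, 4.0),
    (2, 1, 1.0, 5.0), (2, 1, 1.0, 7.0), (2, 2, 1.0, 6.0), (2, 2, 1.0, 8.0),
]
mean, variance = taylor_variance_of_mean(records)
print(mean, variance)
```

The sketch deliberately raises an error for a stratum with a single PSU, because the factor n_h/(n_h - 1) is undefined in that case.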

There are no single-PSU (i.e., singleton) strata in the NSDUH data, but certain analyses may encounter a singleton stratum when calculating the variance for a domain (subclass, subgroup, or subpopulation). A stratum is called a singleton stratum when only one of its PSUs has at least one valid observation and the other PSU has none. Different software packages handle PSUs with no observations in different ways when calculating the variance and the degrees of freedom. The MISSUNIT option in SUDAAN, the singleunit(centered) option in Stata, and options(survey.lonely.psu="adjust") in R handle such cases by calculating the variance contribution of each singleton stratum from the deviation of its PSU-total of the linearized variate about the grand mean of the sample for the particular analysis². By default, SPSS assumes that the stratum had at least one other PSU in the sample (if not, that stratum contributes no variance); a PSU with no observations (termed a sampling zero) is then given a PSU-total of zero and does contribute to the stratum variance. An analysis can also encounter strata with no observations at all (empty strata), typically in domain or subgroup analysis.
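
To make the two treatments of a singleton stratum concrete, here is a rough Python sketch. It takes made-up PSU-totals of a linearized variate and compares (a) padding the missing PSU with a total of zero (the sampling-zero view) with (b) centering the lone PSU-total at the grand mean of all valid PSU-totals. The function names, the two-PSU-per-stratum design, and the scaling factor of 1 for a singleton stratum in (b) are assumptions made for illustration; the exact scaling differs by package.

```python
def stratum_contribution(psu_totals):
    """WR variance contribution of one stratum from its PSU totals."""
    n_h = len(psu_totals)
    z_bar = sum(psu_totals) / n_h
    return n_h / (n_h - 1) * sum((z - z_bar) ** 2 for z in psu_totals)


def variance_sampling_zeros(strata, n_design_psus=2):
    """Give every PSU without domain observations a total of zero."""
    total = 0.0
    for totals in strata.values():
        padded = list(totals) + [0.0] * (n_design_psus - len(totals))
        total += stratum_contribution(padded)
    return total


def variance_centered(strata):
    """Center singleton strata at the grand mean of all valid PSU totals."""
    valid = [z for totals in strata.values() for z in totals]
    grand_mean = sum(valid) / len(valid)
    total = 0.0
    for totals in strata.values():
        if not totals:
            continue  # an empty stratum contributes nothing here
        if len(totals) == 1:
            # Assumption: scaling factor of 1 for the lone PSU.
            total += (totals[0] - grand_mean) ** 2
        else:
            total += stratum_contribution(totals)
    return total


# Made-up PSU totals: stratum "A" is complete, stratum "B" is a singleton.
strata = {"A": [1.0, 3.0], "B": [5.0]}
v_zeros = variance_sampling_zeros(strata)
v_centered = variance_centered(strata)
print(v_zeros, v_centered)
```

With these numbers the two treatments give noticeably different variances, which is why the same domain estimate can carry different standard errors across packages.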

The question is how variance estimation and the degrees of freedom computation handle this situation in different software packages. SUDAAN assumes that such strata are empty only as a result of sample selection, not because the domain is absent from those strata in the population. SUDAAN treats the missing units as sampling zeros; each empty stratum therefore contributes zero to the overall variance but, in turn, still counts toward the degrees of freedom. Stata does the same by default, but with certain options, for instance singleunit(centered), it departs from the sampling-zero assumption and treats such empty strata as structural zeros. The logic is that a stratum with no cases at all is assumed not to be part of the sampling for the domain and therefore contributes nothing to the overall variance. In Stata, the degrees of freedom determined with the singleunit(centered) option are smaller than those obtained by the default method for domains that do not appear in every stratum. In effect, the default method yields larger degrees of freedom for the variance of an estimate for subgroups with empty strata, even though much of that increase comes from sampling zeros in empty strata that contribute nothing to the variance. Consequently, the p-value of a hypothesis test is smaller (although the observed value of the test statistic is unchanged) and the confidence interval for the parameter is narrower.

The calculation of degrees of freedom (df) is crucial in all these software packages because it influences inferential statistics such as confidence intervals and p-values of test statistics. Conventionally, the df is calculated by the fixed-PSU method: the 'fixed' df is the number of PSUs minus the number of strata at the first stage of the sample design, whatever the number of sampling stages (i.e., counted from the full data file). SPSS and R always use the fixed-PSU method for calculating the df. It is also the default in SUDAAN, but users can supply a predetermined df with the DDF= option. The fixed-PSU method is the default in Stata as well, but Stata has options that calculate an alternate df by what is known as the variable-PSU method. For example, Stata with the singleunit(centered) option uses the variable-PSU method, in which the 'variable' df is the number of non-empty PSUs minus the number of non-empty strata. The number of non-empty PSUs is the number of PSUs in the sample minus the number of PSUs with no observations in all singleton strata. To compare inferential statistics across software packages, a user can calculate the 'variable' df for a domain analysis manually and specify it in SUDAAN with the DDF=df option in the PROC statement, or specify the 'design' df in Stata with svy, dof(df):. In SAS, the DFADJ option on the DOMAIN statement computes the degrees of freedom from the non-empty strata for an analysis variable in a domain.
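
For a manual check, the two df definitions can be computed from a table of domain observation counts per design PSU. The Python sketch below is illustrative only: the `design` mapping and the function names are hypothetical, the counts are made up, and PSUs in completely empty strata are assumed to count as empty PSUs.

```python
def fixed_df(design):
    """Fixed-PSU df: all design PSUs minus all design strata."""
    return sum(len(psus) for psus in design.values()) - len(design)


def variable_df(design):
    """Variable-PSU df: non-empty PSUs minus non-empty strata."""
    non_empty_psus = sum(
        1 for psus in design.values() for n in psus if n > 0
    )
    non_empty_strata = sum(
        1 for psus in design.values() if any(n > 0 for n in psus)
    )
    return non_empty_psus - non_empty_strata


# Made-up domain counts per PSU: s2 is a singleton stratum, s3 is empty.
design = {"s1": [12, 9], "s2": [7, 0], "s3": [0, 0], "s4": [4, 6]}
df_fixed = fixed_df(design)
df_variable = variable_df(design)
print(df_fixed, df_variable)
```

The value of `df_variable` is the number one would pass to SUDAAN's DDF= option or Stata's dof() option when trying to align results across packages.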

SAMHDA's online data analysis system (SDA) calculates a slightly different but appropriate df: the number of PSUs in non-empty strata minus the number of non-empty strata. SAS, SPSS, and SDA handle singleton strata in almost the same way.
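
The SDA rule can be sketched the same way. This Python fragment is an illustration under one stated assumption: every design PSU in a non-empty stratum is counted, including a PSU that has no domain observations. The `domain_counts` mapping and the function name are hypothetical and the counts are made up.

```python
def sda_style_df(domain_counts):
    """PSUs in non-empty strata minus the number of non-empty strata."""
    non_empty = [
        psus for psus in domain_counts.values() if any(n > 0 for n in psus)
    ]
    # Assumption: all design PSUs of a non-empty stratum are counted.
    return sum(len(psus) for psus in non_empty) - len(non_empty)


# Made-up counts: one complete stratum, one singleton, one empty stratum.
domain_counts = {"s1": [12, 9], "s2": [7, 0], "s3": [0, 0]}
df_sda = sda_style_df(domain_counts)
print(df_sda)
```

Under this reading, a singleton stratum still adds one degree of freedom while an empty stratum adds none.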



1 The confidence interval of an estimator (e.g., a mean or proportion) of a parameter is an inferential statistic calculated using the critical value of a test statistic (in practice, Student's t statistic for a mean or proportion). The critical value is determined by two factors: the confidence level and the degrees of freedom.

2 The grand mean is calculated from all valid PSU-totals of the linearized variate for a particular domain, subclass, subgroup, or subpopulation.
