Thursday, January 19, 2012

How does SAMHDA prepare public-use data files for release?

SAMHDA follows a series of steps for archiving each new SAMHSA data set. SAMHDA works with SAMHSA program staff to make any necessary corrections to the data and remedy any problems uncovered during data review.

Processing a study for public use requires that all variables, missing data codes, and coding schemes be standardized across elements of a study. This stage of processing may be lengthy depending on the data and completeness of materials received. All variables must be examined to ensure that each is identified and labeled. When variables are not thoroughly described, SAMHDA staff consult the documentation and/or questionnaires.

Each study is assessed to determine if any issues of respondent confidentiality exist, and checks are made for problems arising from either direct or indirect identifying variables. Direct identifiers may be blanked or deleted to safeguard privacy before releasing the data to the public. Reducing the disclosure risk introduced by indirect identifiers may involve recoding the data. For example, dates may be converted to time intervals; this allows for time-lapse analyses without providing exact dates that might permit identification of respondents. Similarly, variables such as age and income may be converted to categories.
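
The recoding described above can be sketched in a few lines of Python. This is only an illustration, not SAMHDA's actual procedure: the field names (admit_date, discharge_date, age) and the age cut points are hypothetical and chosen for the example.

```python
from datetime import date


def days_between(start: date, end: date) -> int:
    """Replace two exact dates with the elapsed interval in days."""
    return (end - start).days


def age_category(age: int) -> str:
    """Collapse an exact age into a broad band (cut points are illustrative)."""
    bands = [(12, 17, "12-17"), (18, 25, "18-25"), (26, 34, "26-34"), (35, 49, "35-49")]
    for low, high, label in bands:
        if low <= age <= high:
            return label
    return "50+" if age >= 50 else "under 12"


def recode(record: dict) -> dict:
    """Return a public-use copy of a record without exact dates or exact age."""
    return {
        "days_in_treatment": days_between(record["admit_date"], record["discharge_date"]),
        "age_group": age_category(record["age"]),
    }


example = recode(
    {"admit_date": date(2011, 1, 10), "discharge_date": date(2011, 2, 9), "age": 23}
)
print(example)
```

The recoded record keeps the length of stay and a broad age group, so duration analyses remain possible while the exact dates and age are no longer present.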

The technical characteristics of the documentation are verified against the data to ensure that the data and documentation match. Information relating to the data collection as a whole is examined (e.g., number of cases, number of variables, number of data files, record length, data structure, and how multiple files are linked). User-defined missing data codes and weights are documented, and inter-field consistency checks are performed. Value labels are added when they are not part of the files that were received.
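
As a rough sketch of what verifying documentation against data can look like for an ASCII rectangular (fixed-width) file, the function below compares the number of cases and the record length with documented values. The function name and its parameters are assumptions made for this example; they do not describe any SAMHDA tool.

```python
def check_rectangular(lines, expected_cases, expected_record_length):
    """Compare a fixed-width file's shape with its documented values.

    `lines` holds one record per item, without line terminators.
    Returns a list of human-readable problems (empty if everything matches).
    """
    problems = []
    if len(lines) != expected_cases:
        problems.append(f"case count {len(lines)} != documented {expected_cases}")
    for number, line in enumerate(lines, start=1):
        if len(line) != expected_record_length:
            problems.append(
                f"record {number}: length {len(line)} != documented {expected_record_length}"
            )
    return problems


# Hypothetical three-record file whose last record is one character short.
report = check_rectangular(["0012", "0023", "003"], expected_cases=3, expected_record_length=4)
print(report)
```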

After the initial processing is complete, further quality checks are made. For example, the observed frequencies are verified against the reported frequencies and checks are made for consistency of survey responses and skip patterns. Data files are also reformatted to the smallest possible size for optimum transfer speed over the Internet.
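
The two quality checks mentioned above can be illustrated with a minimal Python sketch. The variable names (smoke, cigs), the code values, and the "inapplicable" code of -9 are hypothetical; real studies define their own codes in the codebook.

```python
from collections import Counter


def frequency_mismatches(values, reported):
    """Return {code: (observed, reported)} for every code whose counts differ."""
    observed = Counter(values)
    codes = set(observed) | set(reported)
    return {
        code: (observed.get(code, 0), reported.get(code, 0))
        for code in codes
        if observed.get(code, 0) != reported.get(code, 0)
    }


def skip_violations(records, filter_var, skip_value, follow_up, inapplicable_code):
    """Return indices of records that should have skipped `follow_up` but answered it."""
    return [
        index
        for index, record in enumerate(records)
        if record[filter_var] == skip_value and record[follow_up] != inapplicable_code
    ]


# The codebook (hypothetically) reports two cases with code 2; the data hold one.
mismatches = frequency_mismatches([1, 1, 2, 9], {1: 2, 2: 2, 9: 1})

# Respondents with smoke == 2 ("no") should carry the inapplicable code on cigs.
records = [
    {"smoke": 2, "cigs": -9},
    {"smoke": 2, "cigs": 10},
    {"smoke": 1, "cigs": 5},
]
violations = skip_violations(records, "smoke", 2, "cigs", -9)
print(mismatches, violations)
```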

Finally, public-use data files are released in the following formats: SAS Transport (CPORT), SPSS System, Stata System, R system, ASCII tab-delimited, and ASCII rectangular with SAS, SPSS, and Stata data definition statements (setup files). Supplemental files containing optional commands are available for the SAS Transport and Stata System files. When possible, data sets and codebooks are prepared for compatibility with SAMHDA's public-use online analysis system (SDA).
