Friday, January 20, 2012

How do I interpret a record from an ASCII data file?

Data files in SAMHDA are usually distributed as columnar ASCII files that consist of rows and columns of alphanumeric characters. Since ASCII data files are text files, they can be opened in any word processing program or Internet browser. However, the alphanumeric characters are not meaningful without the help of a codebook or setup files to identify the columns of the ASCII data file as particular variables.

This example illustrates how to interpret an ASCII data file for the Treatment Episode Data Set - Discharges (TEDS-D), 2009 (ICPSR 33621).

The data file consists of 1,620,588 cases or observations, which in this example are treatment discharges. Example 1 shows the first 10 lines of data in this file. The first observation, or line of data, is highlighted in yellow.

Example 1: The first case or line of data in the data file

The data file is a fixed-format file with a logical record length of 127, meaning that each line consists of exactly 127 characters. These 127 characters correspond to 65 variables, or data items. In Example 2, the first and last columns are highlighted. The first column is labeled “1” and contains a value of “0”; the last column is labeled “127” and contains a value of “8.”

Example 2: Each record is the same length (127 characters long)
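
For readers who want to confirm the fixed record length themselves, the check can be sketched in a few lines of Python. This is only an illustration: the function name and the sample lines are made up for this example and are not part of the TEDS-D distribution. The only value taken from the text above is the record length of 127.

```python
RECORD_LENGTH = 127  # logical record length stated in the text above


def check_record_lengths(lines, expected=RECORD_LENGTH):
    """Return the 1-based numbers of lines whose length differs from `expected`.

    Trailing newline characters are ignored, since they are line
    terminators rather than data columns.
    """
    bad = []
    for number, line in enumerate(lines, start=1):
        if len(line.rstrip("\r\n")) != expected:
            bad.append(number)
    return bad


# Made-up sample lines (not real TEDS-D data): the first has 127
# characters, the second is deliberately too short.
sample = ["0" * 126 + "8\n", "0" * 100 + "\n"]
print(check_record_lengths(sample))
```

To run the same check on a real file, the open file object can be passed in place of `sample`, since iterating over a text file yields one line at a time.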

In order to know which columns comprise particular variables, it is necessary to refer to the TEDS-D, 2009 codebook. The following examples illustrate how to read the first five variables from this ASCII data file, beginning with the first record (row) and counting from left to right:

VARIABLE 1

CASEID-CASE IDENTIFICATION NUMBER: This variable is positioned in column locations 1 through 8 and contains the value "1" for the first record (highlighted in red). This value represents the first sequential case identification number and is used to uniquely identify a given record in the data file.

Example 3: Variable 1 in Columns 1-8

VARIABLE 2

YEAR-YEAR OF DISCHARGE: This variable is positioned in column locations 9 through 12 and represents the year of the client's discharge from substance abuse treatment. Each record in the data file has the value "2009."

Example 4: Variable 2 in Columns 9-12

VARIABLE 3

AGE-AGE (RECODED): This variable is positioned in column locations 13 through 14 and contains the value "6" for the first record. This value represents the age category “25-29.”

Example 5: Variable 3 in Columns 13-14

VARIABLE 4

GENDER-SEX: This variable is positioned in column locations 15 through 16 and contains the value "1" for the first record. This code identifies the sex of this client as "MALE."

Example 6: Variable 4 in Columns 15-16

VARIABLE 5

RACE-RACE: This variable is positioned in column locations 17 through 18 and contains the value “4” for the first record. This code identifies this client as “BLACK OR AFRICAN AMERICAN.”

Example 7: Variable 5 in Columns 17-18
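
The column-by-column reading shown in Examples 3 through 7 can also be sketched as a short Python script. The column positions and the three value labels below come from the descriptions above; everything else is an assumption made for illustration, including the function name, the padding of each field (leading zeros or spaces), and the sample record, which is made up rather than copied from the data file. The codebook remains the authority for the full layout and all value labels.

```python
# (variable name, first column, last column), 1-based and inclusive,
# matching the column locations described in the text above.
LAYOUT = [
    ("CASEID", 1, 8),
    ("YEAR", 9, 12),
    ("AGE", 13, 14),
    ("GENDER", 15, 16),
    ("RACE", 17, 18),
]

# Only the codes mentioned in the text; the codebook lists the rest.
VALUE_LABELS = {
    "AGE": {6: "25-29"},
    "GENDER": {1: "MALE"},
    "RACE": {4: "BLACK OR AFRICAN AMERICAN"},
}


def parse_record(line, layout=LAYOUT):
    """Slice one fixed-width line into a dict of integer values.

    Codebook columns start at 1 and include the last column, while
    Python slices start at 0 and exclude the end, hence `start - 1`.
    A field that is entirely blank is returned as None.
    """
    record = {}
    for name, start, end in layout:
        field = line[start - 1:end].strip()
        record[name] = int(field) if field else None
    return record


# Made-up 127-character record (field padding is an assumption).
sample = ("00000001" + "2009" + " 6" + " 1" + " 4").ljust(127, "0")

record = parse_record(sample)
for name, value in record.items():
    label = VALUE_LABELS.get(name, {}).get(value, "")
    print(name, value, label)
```

Extending `LAYOUT` with the remaining entries from the codebook would read all 65 variables in the same way.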

Commercially available statistical software packages such as SAS, SPSS, and Stata may make it easier to interpret data files and to subset the variables and/or cases as needed.
