The data_check program (AKA: convert)

Data Dictionary format.

Terms

  • Variable ranges - the definition of all the possible values that a variable can take, i.e. its valid possibilities.
  • Header Row - the row where each element is taken to be the name of the column.

General file format.

  • The first non-blank line should contain the name of the data set as the only element in that row.
  • The first row after the title row with a value in the first column is the column header row.
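
The two rules above can be sketched in Python as follows. This is only an illustration, not the data_check implementation: the function name find_title_and_header is hypothetical, and it assumes the file has already been read into a list of rows, each row being a list of cell strings.

```python
def find_title_and_header(rows):
    """Return (data set name, index of the column header row).

    Hypothetical helper: `rows` is assumed to be a list of rows,
    each a list of cell strings (empty string for an empty cell).
    """
    title = None
    for i, row in enumerate(rows):
        cells = [cell.strip() for cell in row]
        non_empty = [cell for cell in cells if cell]
        if title is None:
            if not non_empty:
                continue  # skip blank lines before the title row
            if len(non_empty) != 1:
                raise ValueError("title row must contain exactly one element")
            title = non_empty[0]
        elif cells and cells[0]:
            # First row after the title with a value in the first column.
            return title, i
    raise ValueError("no title row or no column header row found")
```

Applied to a sheet that has a row such as "part 1" between the title and the intended header, this sketch returns the index of the "part 1" row, which is why such a file is rejected.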

Examples

Valid
    Test Data Set      
           
Question                     Variable Name  Variable Type  Answer Values  Answer Meanings  Access Location
How many chickens are there  CHICKNUM       int            0 - 99

Invalid (the "part 1" row has a value in the first column, so it is taken to be the column header row)
    Test Data Set      
part 1          
Question                     Variable Name  Variable Type  Answer Values  Answer Meanings  Access Location
How many chickens are there  CHICKNUM       int            0 - 99

Column Headers

This section lays out the format of the data dictionary for use with the data_check program.

  • Variable names must be in a column named variable name.
  • Variable types must be in a column named variable type.
  • Variable ranges must be in a column named answer values.
  • There must be a column named answer meaning.
  • All column headers are case insensitive.
  • All variable lines must have a variable name entry and a variable type entry.

Example
blah  variable name  foo   variable type  bar   answer values  answer meaning
asd   CAT            asdf  int            asdf  1-9
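
A minimal Python sketch of how the required columns could be located in a header row, ignoring case and any extra columns. The names REQUIRED_COLUMNS and locate_columns are made up for this illustration and are not part of data_check.

```python
# The four column names required by the rules above, in lower case.
REQUIRED_COLUMNS = ("variable name", "variable type",
                    "answer values", "answer meaning")


def locate_columns(header_row):
    """Map each required column name to its position in the header row."""
    # Column headers are case insensitive, so compare in lower case.
    positions = {cell.strip().lower(): i for i, cell in enumerate(header_row)}
    missing = [name for name in REQUIRED_COLUMNS if name not in positions]
    if missing:
        raise ValueError("missing required column(s): " + ", ".join(missing))
    return {name: positions[name] for name in REQUIRED_COLUMNS}
```

Unrelated columns such as blah, foo and bar in the example are simply skipped.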

Variable Blocks

Variables of the same type can be grouped together to share the same answer values and answer meaning columns.
  • All variables in a variable block must be of the same type.
  • A variable block is started by the first non-blank line after the end of another variable block.
  • A variable block is ended by a row that has no elements in it.

variable name  variable type  answer values  answer meaning
BOB            int            1-5
DOUG           int
CAT            int

DOG            enum           1,2,3          1 = Yes
BIRD           enum                          1 = No
RAT            enum                          3 = Maybe
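
The block rules can be sketched in Python as below. This is an illustration under assumptions: rows are lists of cell strings taken from below the header row, and the helper names split_blocks and block_type are hypothetical.

```python
def split_blocks(rows):
    """Group rows into variable blocks; a row with no elements ends a block."""
    blocks, current = [], []
    for row in rows:
        if any(cell.strip() for cell in row):
            current.append(row)      # non-blank row: part of the current block
        elif current:
            blocks.append(current)   # blank row: close the current block
            current = []
    if current:
        blocks.append(current)       # last block if no trailing blank row
    return blocks


def block_type(block, type_col):
    """Return the single variable type of a block, or raise if types differ."""
    types = {row[type_col].strip() for row in block}
    if len(types) != 1:
        raise ValueError("a variable block must all be the same type")
    return types.pop()
```

With the example table, this yields one int block (BOB, DOUG, CAT) and one enum block (DOG, BIRD, RAT).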
       

Recognized Variable Types

data_check requires that all variables be declared as a type. These types are listed here.
  • int - type for all variables that are integers but do not represent a year. e.g. 1, 3, and 5.
  • float - type for all numeric variables that are not integers. e.g. 1.23, 5.62, and 10e-5.
  • string - type for variables that are alphanumeric in nature. e.g. "Hello", "Cat92", and "5B".
  • enum - type for variables that are categorical in nature.
  • date - type for variables that are dates. e.g. "1/27/1900"
  • year - type for variables that are year numbers. e.g. 1985, 1960, and 1906.
    • Note: years cannot be entered as 2 digit numbers, e.g. 02 for 2002.
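
As a rough illustration of what these types mean for a raw cell value, the Python sketch below checks a string against a declared type. Several points are assumptions, not documented data_check behaviour: dates are taken to be month/day/year (consistent with "1/27/1900"), years are taken to be exactly four digits, a float accepts anything Python's float() can parse, and enum is left out because it depends on the value-to-category mapping. The function name matches_type is hypothetical.

```python
import re
from datetime import datetime


def matches_type(text, var_type):
    """Return True if the raw cell text looks like a value of var_type."""
    text = text.strip()
    if var_type == "int":
        return re.fullmatch(r"[+-]?\d+", text) is not None
    if var_type == "float":
        try:
            float(text)  # assumption: anything float() parses is accepted
            return True
        except ValueError:
            return False
    if var_type == "string":
        return True
    if var_type == "year":
        # Assumption: exactly four digits, so "02" is rejected.
        return re.fullmatch(r"\d{4}", text) is not None
    if var_type == "date":
        try:
            datetime.strptime(text, "%m/%d/%Y")  # assumed month/day/year
            return True
        except ValueError:
            return False
    raise ValueError("type not handled in this sketch: " + var_type)
```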

Non-String Range Checks

All variables can have range checks. All range checks except those for strings are represented in the same way.

  • A range check is a comma separated list of test values. A value passes range checking if it matches any of the test values in the range check list.
  • Test values can be either a value of the variable type or a range of that variable type.
  • A range is specified as value1 - value2, where value1 < value2 and both value1 and value2 are of the variable type.

Examples

  • 1, 3, 4, 5, 10 - 20, 50 to 60
  • 1 or 2
  • 1/27/1956, 1/1/2003 to 12/31/2004, 1/1/2002 - 1/12/2002
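
The range check format can be sketched in Python as follows. This is an illustration only. It assumes, based on the examples above rather than on any stated rule, that "to" may be used in place of "-" and "or" in place of a comma, that range bounds are inclusive, and that dates are month/day/year. The names parse_range_check, passes and parse_date are hypothetical.

```python
import re
from datetime import datetime

# value1 - value2, or value1 to value2 (the latter inferred from the examples)
_RANGE = re.compile(r"^(\S+?)\s*(?:-|\bto\b)\s*(\S+)$")


def parse_date(text):
    """Convert a month/day/year string to a date (assumed date format)."""
    return datetime.strptime(text.strip(), "%m/%d/%Y").date()


def parse_range_check(spec, convert=int):
    """Turn a range check string into a list of (low, high) test values."""
    tests = []
    for item in re.split(r",|\bor\b", spec):
        item = item.strip()
        if not item:
            continue
        try:
            value = convert(item)        # a single value of the variable type
            tests.append((value, value))
            continue
        except ValueError:
            pass
        match = _RANGE.match(item)
        if match is None:
            raise ValueError("cannot parse test value: " + repr(item))
        low, high = convert(match.group(1)), convert(match.group(2))
        if not low < high:
            raise ValueError("range requires value1 < value2: " + repr(item))
        tests.append((low, high))
    return tests


def passes(value, tests):
    """A value passes if it matches any test value (bounds assumed inclusive)."""
    return any(low <= value <= high for low, high in tests)
```

For example, parse_range_check("1, 3, 4, 5, 10 - 20, 50 to 60") accepts 15 and 55 but not 2, and passing convert=parse_date handles the date example in the same way.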

Categorical Variables

Categorical variables, i.e. variables of type enum, must have their value to category mapping defined.
  • 0 = Some string defines that a value of 0 maps to the category of Some string.
Below is an example of the format of the data dictionary that is fed into the data_check program.

Examples

  • 0 = Tuna
  • 1 = Yes
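
A small Python sketch of reading such mapping entries into a dictionary. The helper names are hypothetical, and keeping the values as strings (rather than converting them to numbers) is a choice made for this illustration.

```python
def parse_category(entry):
    """Split a 'value = category' entry into (value, category)."""
    value, sep, category = entry.partition("=")
    if not sep or not value.strip():
        raise ValueError("expected 'value = category': " + repr(entry))
    return value.strip(), category.strip()


def build_mapping(entries):
    """Build the value to category mapping for one enum variable block."""
    mapping = {}
    for entry in entries:
        value, category = parse_category(entry)
        mapping[value] = category
    return mapping
```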

Blank Values

Blank values are values for which the subject did not fill in an answer. If the data set uses a specific value to indicate blank values, this can be specified by placing a BLANK indicator in the answer meaning column within the variable's block.
  • BLANK = value defines that the numeric value given by value is actually a blank value.
  • BLANK = "Some value" defines that the string value Some value is actually a blank value.

Examples

  • BLANK = 0
  • BLANK = "whogabooga hee"
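
The BLANK indicator could be recognized with a sketch like the one below. It assumes the keyword is written in upper case as shown and that an unquoted value is numeric. The function name parse_blank is hypothetical.

```python
import re

# BLANK = "quoted string"   or   BLANK = number
_BLANK = re.compile(r'^\s*BLANK\s*=\s*(?:"(.*)"|(\S+))\s*$')


def parse_blank(entry):
    """Return the blank marker from an answer meaning entry, or None.

    A quoted value is returned as a string; an unquoted value is
    returned as an int when possible, otherwise as a float.
    """
    match = _BLANK.match(entry)
    if match is None:
        return None  # not a BLANK indicator
    if match.group(1) is not None:
        return match.group(1)
    text = match.group(2)
    try:
        return int(text)
    except ValueError:
        return float(text)  # raises ValueError if the value is not numeric
```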

String Range checking

String range checking is defined in the following manner.
  • In the answer values column place a regular expression defining in broad terms what the string should look like. The regular expression is written as /expression/. If the string does not match the regular expression it fails range checking.
  • Optionally, to specify more complicated constraints within a string, first use ()'s in the regular expression to select which parts of the string to compare. Then, in the answer meaning column of the variable's block, specify a possible values list, declared as [ range check ]. The list has as many elements as there are ()'s in the regular expression, separated by the | character, and each element is a range check. Every element must match the corresponding ()-selected part of the string for the list to match. A variable block may contain multiple possible values list lines; the range check succeeds if any one of them matches.

Examples

Variable Name  Variable Type  Answer Values         Answer Meaning
DOG            string         /\A(\w{3})\s*(\d)\e/  [GIA, BOB | 5-7]
                                                    [REA, HAS | 1, 9]

  • DOG will pass the range check on values of "GIA6", "BOB 5", "REA1".
  • DOG will fail the range check on values of "GEB6", "BOB9", "HAS4".
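
The DOG example can be reproduced with the Python sketch below. It is an illustration under assumptions, not the data_check implementation: the pattern is written with Python's \Z end-of-string anchor in place of the trailing \e shown in the table (assumed here to be meant as an end anchor), numeric ranges inside a list element are treated as inclusive integer ranges, and all other entries are compared as literal strings. The helper names are hypothetical.

```python
import re


def parse_value_list(entry):
    """Parse '[GIA, BOB | 5-7]' into [['GIA', 'BOB'], ['5-7']]."""
    entry = entry.strip()
    if not (entry.startswith("[") and entry.endswith("]")):
        raise ValueError("possible values list must be written as [ ... ]")
    return [[item.strip() for item in element.split(",")]
            for element in entry[1:-1].split("|")]


def item_matches(captured, item):
    """Compare one ()-selected part of the string with one test value."""
    bounds = re.fullmatch(r"(\d+)\s*-\s*(\d+)", item)
    if bounds and captured.isdigit():
        return int(bounds.group(1)) <= int(captured) <= int(bounds.group(2))
    return captured == item


def string_range_check(value, pattern, value_lists):
    """Return True if value matches pattern and any possible values list."""
    match = re.search(pattern, value)
    if match is None:
        return False
    if not value_lists:
        return True  # only the regular expression was given
    groups = match.groups()
    for elements in value_lists:
        if len(elements) != len(groups):
            raise ValueError("list needs one element per () in the pattern")
        if all(any(item_matches(group, item) for item in element)
               for group, element in zip(groups, elements)):
            return True
    return False


DOG_PATTERN = r"\A(\w{3})\s*(\d)\Z"  # \Z substituted for \e (assumption)
DOG_LISTS = [parse_value_list("[GIA, BOB | 5-7]"),
             parse_value_list("[REA, HAS | 1, 9]")]
```

With these definitions, string_range_check("BOB 5", DOG_PATTERN, DOG_LISTS) succeeds, while "BOB9" fails because 9 is outside 5-7 and BOB is not in the second list.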
Topic revision: r4 - 27 Nov 2017, DalePlummer
 
