the data_check program (AKA: convert)
Data Dictionary format.
Terms
- Variable ranges - the definition of all the possible values that a variable can take, i.e. its valid possibilities.
- Header Row - the row where each element is taken to be the name of the column.
General file format
- The first non-blank line should contain the name of the data set as the only element in that row.
- The first row after the title row with a value in the first column is the column header row.
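The two rules above can be sketched as a small lookup over already-parsed rows. This is only an illustration, not data_check's actual code: the function name `find_title_and_header` and the representation of a row as a list of cell strings are assumptions.

```python
def find_title_and_header(rows):
    """Return (title, header_row_index) for rows given as lists of cell strings."""
    title = None
    title_index = None
    for i, row in enumerate(rows):
        filled = [cell.strip() for cell in row if cell.strip()]
        if not filled:
            continue  # skip blank rows that precede the title
        if len(filled) != 1:
            raise ValueError("the title row must contain exactly one element")
        title, title_index = filled[0], i
        break
    if title is None:
        raise ValueError("no title row found")

    # The header row is the first later row with a value in its first column.
    for i in range(title_index + 1, len(rows)):
        row = rows[i]
        if row and row[0].strip():
            return title, i
    raise ValueError("no column header row found")


rows = [
    ["", "", "Test Data Set", "", "", ""],
    ["", "", "", "", "", ""],
    ["Question", "Variable Name", "Variable Type",
     "Answer Values", "Answer Meanings", "Access Location"],
    ["How many chickens are there", "CHICKNUM", "int", "0 - 99", "", ""],
]
title, header_index = find_title_and_header(rows)
print(title, header_index)  # prints: Test Data Set 2
```

If a row holding only "part 1" in its first column sat directly after the title, this sketch would return that row as the header row, which is why such a layout is a problem.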
Examples
Valid
|  |  | Test Data Set |  |  |  |
|  |  |  |  |  |  |
| Question | Variable Name | Variable Type | Answer Values | Answer Meanings | Access Location |
| How many chickens are there | CHICKNUM | int | 0 - 99 |  |  |
Invalid
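|  |  | Test Data Set |  |  |  |
| part 1 |  |  |  |  |  |
| Question | Variable Name | Variable Type | Answer Values | Answer Meanings | Access Location |
| How many chickens are there | CHICKNUM | int | 0 - 99 |  |  |
This layout is invalid because "part 1" has a value in the first column, so that row, rather than the Question row, is taken to be the column header row.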
Column Headers
This section lays out the format of the data dictionary for use with the data_check program.
- Variable names must be in a column named "variable name".
- Variable types must be in a column named "variable type".
- Variable ranges must be in a column named "answer values".
- There must be a column named "answer meaning".
- All column headers are case-insensitive.
- All variable lines must have a "variable name" entry and a "variable type" entry.
Example
| blah | variable name | foo | variable type | bar | answer values | answer meaning |
| asd | CAT | asdf | int | asdf | 1-9 |  |
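As a rough illustration of the header rules, the sketch below finds the required columns regardless of case and of any extra columns around them. The names `REQUIRED`, `locate_columns`, and `check_variable_row` are hypothetical, and matching on the exact lower-cased header text is an assumption about how the comparison is done.

```python
REQUIRED = ("variable name", "variable type", "answer values", "answer meaning")


def locate_columns(header_row):
    """Map each required header name to its column index, ignoring case."""
    positions = {}
    for index, cell in enumerate(header_row):
        name = cell.strip().lower()
        if name in REQUIRED and name not in positions:
            positions[name] = index
    missing = [name for name in REQUIRED if name not in positions]
    if missing:
        raise ValueError("missing required columns: " + ", ".join(missing))
    return positions


def check_variable_row(row, positions):
    """A variable line needs both a variable name and a variable type entry."""
    name = row[positions["variable name"]].strip()
    vtype = row[positions["variable type"]].strip()
    return bool(name) and bool(vtype)


header = ["blah", "Variable Name", "foo", "VARIABLE TYPE",
          "bar", "answer values", "answer meaning"]
positions = locate_columns(header)
print(positions["variable name"], positions["variable type"])  # prints: 1 3
print(check_variable_row(["asd", "CAT", "asdf", "int", "asdf", "1-9", ""], positions))  # prints: True
```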
Variable Blocks
Variables of the same type can be grouped together to share the same "answer values" and "answer meaning" columns.
- All variables in a variable block must be of the same type.
- A variable block is started by the first non-blank line after the end of another variable block.
- A variable block is ended by a row that has no elements in it.
| variable name | variable type | answer values | answer meaning |
| BOB | int | 1-5 |  |
| DOUG | int |  |  |
| CAT | int |  |  |
|  |  |  |  |
| DOG | enum | 1,2,3 | 1 = Yes |
| BIRD | enum |  | 2 = No |
| RAT | enum |  | 3 = Maybe |
|  |  |  |  |
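One way to picture the block rules is the sketch below, which splits data rows at empty rows and gathers what each block shares. It is an illustration under assumptions: rows are lists of four cells in the order name, type, answer values, answer meaning, and the function names `split_blocks` and `summarize` are made up for this example.

```python
def split_blocks(rows):
    """Group rows into variable blocks; a row with no elements ends a block."""
    blocks, current = [], []
    for row in rows:
        if any(cell.strip() for cell in row):
            current.append(row)
        elif current:
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks


def summarize(block):
    """Collect the names, the single shared type, and the shared values and meanings."""
    names = [row[0].strip() for row in block if row[0].strip()]
    types = {row[1].strip().lower() for row in block if row[1].strip()}
    if len(types) != 1:
        raise ValueError("a variable block must all be the same type")
    values = [row[2].strip() for row in block if row[2].strip()]
    meanings = [row[3].strip() for row in block if row[3].strip()]
    return {"names": names, "type": types.pop(), "values": values, "meanings": meanings}


rows = [
    ["BOB", "int", "1-5", ""],
    ["DOUG", "int", "", ""],
    ["CAT", "int", "", ""],
    ["", "", "", ""],
    ["DOG", "enum", "1,2,3", "1 = Yes"],
    ["BIRD", "enum", "", "2 = No"],
    ["RAT", "enum", "", "3 = Maybe"],
    ["", "", "", ""],
]
for block in split_blocks(rows):
    info = summarize(block)
    print(info["names"], info["type"], info["values"])
```

Under this reading, BOB, DOUG, and CAT all share the 1-5 range, and DOG, BIRD, and RAT all share the three category meanings.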
Recognized Variable Types
data_check requires every variable to be declared with a type. The recognized types are listed here.
- int - type for all variables that are integers but do not represent a year, e.g. 1, 3, and 5.
- float - type for all numeric variables that are not integers, e.g. 1.23, 5.62, and 10e-5.
- string - type for variables that are alphanumeric in nature, e.g. "Hello", "Cat92", and "5B".
- enum - type for variables that are categorical in nature.
- date - type for variables that are dates, e.g. "1/27/1900".
- year - type for variables that are year numbers, e.g. 1985, 1960, and 1906.
- Note: years cannot be entered as 2-digit numbers, e.g. 02 for 2002.
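To make the type list concrete, here is a minimal sketch of converting a raw cell to a typed value. It is not data_check's implementation. The function name `parse_value` is hypothetical, the month/day/year date layout is inferred from the "1/27/1900" example, requiring exactly four digits for a year is an assumption, and enum values are simply kept as text.

```python
from datetime import datetime


def parse_value(text, vtype):
    """Convert a raw cell to a Python value for the given type, or raise ValueError."""
    text = text.strip()
    vtype = vtype.strip().lower()
    if vtype == "int":
        return int(text)
    if vtype == "float":
        return float(text)
    if vtype in ("string", "enum"):
        return text  # assumption: enum values are compared as text
    if vtype == "date":
        # assumption: month/day/year, as in the "1/27/1900" example
        return datetime.strptime(text, "%m/%d/%Y").date()
    if vtype == "year":
        # assumption: a valid year is exactly four ASCII digits
        if not (text.isascii() and text.isdigit() and len(text) == 4):
            raise ValueError("year must be a 4-digit number: " + repr(text))
        return int(text)
    raise ValueError("unrecognized variable type: " + repr(vtype))


print(parse_value("5", "int"))            # prints: 5
print(parse_value("1/27/1900", "date"))   # prints: 1900-01-27
print(parse_value("1985", "year"))        # prints: 1985
```

A 2-digit year such as 02 raises ValueError in this sketch, matching the note above.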
Non-String Range Checks
All variables can have range checks. All range checks except those for strings are written in the same way.
- A range check is a comma-separated list of test values. A value passes range checking if it matches any of the test values in the range check list.
- Test values can be either a value of the variable type or a range of that variable type.
- A range is specified as value1 - value2, where value1 < value2 and both value1 and value2 are of the variable type.
Examples
- 1, 3, 4, 5, 10 - 20, 50 to 60
- 1 or 2
- 1/27/1956, 1/1/2003 to 12/31/2004, 1/1/2002 - 1/12/2002
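The range check rules can be sketched as a small parser plus a membership test. Several points here are assumptions rather than documented behavior: "to" and "or" appear only in the examples, so treating them as synonyms for "-" and "," is a guess; range bounds are taken to be inclusive; and negative range bounds are not handled. The names `parse_range_check` and `passes` are hypothetical.

```python
import re


def parse_range_check(text, parse):
    """Turn a range check into (low, high) pairs; high is None for a single value."""
    tests = []
    for item in re.split(r",|\bor\b", text):
        item = item.strip()
        if not item:
            continue
        try:
            tests.append((parse(item), None))  # a single value of the variable type
            continue
        except ValueError:
            pass
        # Otherwise expect "value1 - value2" or "value1 to value2".
        parts = re.split(r"\s+to\s+|\s*-\s*", item, maxsplit=1)
        if len(parts) != 2:
            raise ValueError("not a value or a range: " + repr(item))
        low, high = parse(parts[0]), parse(parts[1])
        if not low < high:
            raise ValueError("range must have value1 < value2: " + repr(item))
        tests.append((low, high))
    return tests


def passes(value, tests):
    """A value passes if it matches any test value (range bounds assumed inclusive)."""
    for low, high in tests:
        if high is None:
            if value == low:
                return True
        elif low <= value <= high:
            return True
    return False


tests = parse_range_check("1, 3, 4, 5, 10 - 20, 50 to 60", int)
print(passes(15, tests), passes(2, tests))  # prints: True False
```

Passing a different `parse` function, for example one built on `datetime.strptime`, would let the same sketch handle the date example.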
Categorical Variables
Categorical variables, i.e. variables of type enum, must have their value-to-category mapping defined.
- 0 = Some string defines that a value of 0 maps to the category of Some string.
Below is an example of the format of the data dictionary that is fed into the data_check program.
Examples
Blank Values
Blank values are values for which the subject did not fill in an answer. If the data set uses a specific value to indicate blank values, this can be specified by placing a BLANK indicator in the "answer meaning" column within the variable's block.
- BLANK = value defines that the numeric value given as value is actually a blank value.
- BLANK = "Some value" defines that the string value Some value is actually a blank value.
Examples
- BLANK = 0
- BLANK = "whogabooga hee"
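The "answer meaning" entries described so far (category mappings and BLANK indicators) could be read with something like the sketch below. It is illustrative only: the function name `parse_meanings` is hypothetical, category keys are kept as text, the BLANK keyword is matched exactly as written in upper case, and entries without an "=" are skipped; all of these are assumptions.

```python
import re

_ENTRY = re.compile(r"^\s*(.+?)\s*=\s*(.+?)\s*$")


def parse_meanings(entries):
    """Split answer meaning entries into a value-to-category map and a BLANK marker."""
    categories, blank = {}, None
    for entry in entries:
        match = _ENTRY.match(entry)
        if match is None:
            continue  # not a "key = value" entry
        key, value = match.groups()
        if key == "BLANK":
            if len(value) >= 2 and value[0] == '"' and value[-1] == '"':
                blank = value[1:-1]  # quoted: a string blank value
            else:
                try:
                    blank = int(value)  # unquoted: a numeric blank value
                except ValueError:
                    blank = float(value)
        else:
            categories[key] = value
    return categories, blank


print(parse_meanings(["1 = Yes", "2 = No", "3 = Maybe", "BLANK = 0"]))
# prints: ({'1': 'Yes', '2': 'No', '3': 'Maybe'}, 0)
print(parse_meanings(['BLANK = "whogabooga hee"']))
# prints: ({}, 'whogabooga hee')
```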
String Range checking
String range checking is defined in the following manner.
- In the "answer values" column, place a regular expression defining in broad terms what the string should look like. The regular expression is written as /expression/. If the string does not match the regular expression, it fails range checking.
- Site with more information about regular expressions.
- Optionally, to specify more complicated relations within a string, first use ()'s in the regular expression to select the parts of the string to compare. Then, in the "answer meaning" column within the variable's block, specify a possible values list, declared as [ range check ]. The list has as many elements as there are ()'s in the regular expression, separated by the | character, and each element is a range check. Every element of a possible values list must match the corresponding part of the string within its () for the list to be considered a match. Multiple possible values list lines may exist in each variable block. The range check succeeds if any one of the possible values lists matches.
Examples
| Variable Name | Variable Type | Answer Values | Answer Meaning |
| DOG | string | /\A(\w{3})\s*(\d)\e/ | [GIA, BOB &vbar; 5-7] |
|  |  |  | [REA, HAS &vbar; 1, 9] |
- DOG will pass the range check on values of "GIA6", "BOB 5", and "REA1".
- DOG will fail the range check on values of "GEB6", "BOB9", and "HAS4".
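The DOG example can be reproduced with the following sketch, which applies the regular expression and then compares each captured group against the possible values lists. This is an approximation, not data_check's code. Python's `re` module rejects the trailing `\e` of the original expression, so the sketch substitutes `\Z` on the assumption that an end-of-string anchor was intended. The function names are hypothetical, and list elements are limited to literal values and inclusive integer ranges.

```python
import re


def element_matches(text, check):
    """True if text equals a listed value or falls inside a listed integer range."""
    for item in check.split(","):
        item = item.strip()
        bounds = re.fullmatch(r"(\d+)\s*-\s*(\d+)", item)
        if bounds and text.isdigit():
            if int(bounds.group(1)) <= int(text) <= int(bounds.group(2)):
                return True
        elif item == text:
            return True
    return False


def string_passes(value, pattern, value_lists):
    """Apply the regular expression, then require at least one list to match fully."""
    match = re.search(pattern, value)
    if match is None:
        return False
    if not value_lists:
        return True
    groups = match.groups()
    for line in value_lists:
        checks = [part.strip() for part in line.strip("[] ").split("|")]
        if len(checks) == len(groups) and all(
            element_matches(group, check) for group, check in zip(groups, checks)
        ):
            return True
    return False


pattern = r"\A(\w{3})\s*(\d)\Z"  # assumption: \Z stands in for the original \e
lists = ["[GIA, BOB | 5-7]", "[REA, HAS | 1, 9]"]
for value in ["GIA6", "BOB 5", "REA1", "GEB6", "BOB9", "HAS4"]:
    print(value, string_passes(value, pattern, lists))
```

With these inputs the first three values pass and the last three fail, in line with the two bullets above.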