Add DataGroupings and CriteriaManagers
DataGroupings
Working on a large project and having to create the same 6 groupings several times (including creating columns in the InformationSet for each one) gave us the idea of creating several DataGroupings at once from some sort of specification. Currently, DataGrouping has a Load method and its own saved format, but that format cannot hold several groupings in a single file and is DataSet dependent.
Ideas
1
This was my first idea: give the name of the InformationSet column to use and say which descriptors go into which group. The grouping then takes the same name as the info set column. However, this still requires us to create those columns in the info set beforehand, even though the number of columns actually needed to do all the groupings for the current Massion data set C (as of 23 Feb 2005) is only 4. In this example, the first line of each grouping is the InformationSet column name and the other two lines list the descriptors that go into each group.
normal-cancer
1:Normal
2:Cancer
nodal0-nodal123
1:N0
2:N1,N2,N3
2
The number of columns needed to do all the groupings for the current Massion data set C (as of 23 Feb 2005) is only 4. Those are Cases-Controls, Histology, Path Stage, N(odal), and M(etastasis). We end up expanding this information into 1 column per comparison. If we could use descriptors from different columns, building the groupings would be easy. I'm not yet sure how to make the file, but it shouldn't be hard. In this example, the first line of each grouping is the DataGrouping name and the other two lines list InformationSet column and descriptor pairs, separated by semi-colons, so that a group can draw on several descriptors or several columns.
Normal-Cancer
1:cases-controls,controls
2:cases-controls,cases
Nodal0-123
1:N,0
2:N,1;N,2;N,3
Normal-Stage 1
1:cases-controls,controls
2:path stage,1a;path stage,1b;path stage,1
Working Format
A DataGrouping is specified by a name line followed by the specifications for groups 1 and 2, normally one line each. A group specification is the integer group number (1 or 2), a colon, and then an InformationSet column name with its list of descriptors in parentheses, separated by commas. Multiple descriptors can be used for each column and multiple columns can be used for each group, with the columns separated by a semi-colon. The two groups do not have to use the same columns. The Normal-Stage 1 example below also repeats group number 2 on a second line to add more descriptors to that group. DataGrouping names should be unique and descriptive. Don't forget the blank line at the end.
Normal-Cancer
1:cases-controls(controls)
2:cases-controls(cases)
Nodal0-123
1:N(0)
2:N(1,2,3)
Normal-Stage 1
1:cases-controls(controls)
2:path stage(1a,1b)
2:path stage(1)
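The working format above is simple enough to read with a small parser. The following is only a sketch in Python, not the project's actual Load code: the function name parse_groupings and the returned dictionary layout are hypothetical, and it assumes that any non-blank line that does not begin with "number:" is a DataGrouping name.

```python
import re

# A group line looks like "2:path stage(1a,1b)"; anything else is treated as a name.
GROUP_LINE = re.compile(r"^(\d+)\s*:(.*)$")
# One "column(descriptor,descriptor,...)" term within a group line.
TERM = re.compile(r"\s*([^();]+?)\s*\(([^()]*)\)\s*")

def parse_groupings(text):
    """Return {grouping name: {group number: [(column, descriptor), ...]}}."""
    groupings = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # blank lines only separate groupings
        match = GROUP_LINE.match(line)
        if match is None:
            # Name line: names are expected to be unique.
            if line in groupings:
                raise ValueError(f"duplicate grouping name {line!r}")
            current = groupings[line] = {}
            continue
        if current is None:
            raise ValueError(f"group line before any name: {line!r}")
        # Repeated group numbers accumulate into the same group.
        pairs = current.setdefault(int(match.group(1)), [])
        for term in match.group(2).split(";"):
            parsed = TERM.fullmatch(term)
            if parsed is None:
                raise ValueError(f"cannot parse {term!r} in line {line!r}")
            column, descriptors = parsed.groups()
            pairs.extend((column, d.strip()) for d in descriptors.split(","))
    return groupings
```

Calling parse_groupings on the three groupings above would give Normal-Stage 1 a group 2 containing the pairs for 1a, 1b and 1 under the column "path stage".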
Logical Conditions
Should we also add some way to perform a logical condition in addition to the mass grouping? For instance, one column of information has the case-control description and another has the training-testing description. Otherwise, we will have to manually create the case-control-training and case-control-testing columns.
UPDATED:
The current format uses the order of the lines to put columns into groups, and a group number of 0 is allowed. This can be used to introduce some simple logic, or just to select the descriptor used for the null group, which defaults to "null".
Example
This example uses order to divide the normal-cancer comparison into training and testing groups. Notice the group 0 specifications.
Normal-Cancer
1:cases-controls(controls)
2:cases-controls(cases)
Normal-Cancer-Train
0:train-test(test)
1:cases-controls(controls)
2:cases-controls(cases)
Normal-Cancer-Test
0:train-test(train)
1:cases-controls(controls)
2:cases-controls(cases)
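One way to read "uses order" is that a sample goes to the group of the first line it matches, so a group 0 line placed first removes those samples before groups 1 and 2 are considered. The Python sketch below illustrates that reading only. The first-match rule, the fallback to group 0 for unmatched samples, and the names assign_group and train_spec are all assumptions, not confirmed behaviour of DataGrouping.

```python
def assign_group(spec_lines, sample):
    """Return the group number of the first spec line the sample matches.

    spec_lines: [(group number, column, set of descriptors), ...] in file order.
    sample: {column name: descriptor} for a single sample.
    Assumption: a sample that matches no line lands in group 0 (the null group).
    """
    for group, column, descriptors in spec_lines:
        if sample.get(column) in descriptors:
            return group
    return 0

# The Normal-Cancer-Train grouping from the example, in file order.
train_spec = [
    (0, "train-test", {"test"}),
    (1, "cases-controls", {"controls"}),
    (2, "cases-controls", {"cases"}),
]
```

Under this reading, a cancer sample marked "test" is sent to group 0 by Normal-Cancer-Train, while a cancer sample marked "train" falls through to group 2.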
CriteriaManagers
A large project that needs the same cutoff manager(s) entered several times can also benefit from being able to load one or more at a time. Since CriteriaManager and CutoffManager already have sufficient Load and Save methods, we can do something similar with several managers in a single file. In the examples, we could probably change "Criteria Used:" to the name.
Single CutoffManager
This is the standard format for a single CutoffManager.
Cutoff 0005
[user prefilter would go here or leave blank]
( numPass / 6 ) * ( 1 - info ) * ( asam_std + cvalue_std + dvalue_std + fvalue_std + tvalue_std + wga_std )
asam > 2
wga > 0.005
prob_c < 0.0005
prob_d < 0.0005
prob_f < 0.0005
prob_t < 0.0005
numPass > 1
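For illustration, the layout shown above can be read as a name line, a prefilter line (possibly blank), a formula line, and then one criterion per line. The Python sketch below follows that reading. It is not the existing CutoffManager.Load code, and the function name parse_cutoff_manager and the returned dictionary keys are hypothetical.

```python
import re

# One criterion such as "prob_c < 0.0005": statistic, comparison, threshold.
CRITERION = re.compile(r"^(\w+)\s*(<=|>=|==|!=|<|>)\s*(\S+)$")

def parse_cutoff_manager(text):
    """Split one CutoffManager specification into its parts."""
    lines = text.strip("\n").splitlines()
    if len(lines) < 3:
        raise ValueError("expected name, prefilter and formula lines")
    # Assumed fixed order: name, prefilter (may be blank), formula.
    name, prefilter, formula = (line.strip() for line in lines[:3])
    criteria = []
    for line in lines[3:]:
        line = line.strip()
        if not line:
            continue
        match = CRITERION.match(line)
        if match is None:
            raise ValueError(f"cannot parse criterion {line!r}")
        statistic, comparison, threshold = match.groups()
        criteria.append((statistic, comparison, float(threshold)))
    return {"name": name, "prefilter": prefilter,
            "formula": formula, "criteria": criteria}
```

The formula is kept as an uninterpreted string here, since how it is evaluated is outside what this page describes.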
Multiple CutoffManagers
We can re-use most of the CutoffManager file format and add some delimiter to separate the individual specifications. Unfortunately, we can't re-use the code in CriteriaManager.Load because it lacks SignLead and SignFollow specification and we also need to be able to distinguish between TopN and Cutoff managers. For a manager to be recognized as a TopN, the name MUST start with "top" (not case-sensitive) and the first integer in the name is used as [N].
Cutoff 0005
[user prefilter would go here or leave blank]
( numPass / 6 ) * ( 1 - info ) * ( asam_std + cvalue_std + dvalue_std + fvalue_std + tvalue_std + wga_std )
tvalue fvalue cvalue dvalue wga asam
asam > 2
wga > 0.005
prob_c < 0.0005
prob_d < 0.0005
prob_f < 0.0005
prob_t < 0.0005
numPass > 1
Cutoff 0001
[user prefilter would go here or leave blank]
( numPass / 6 ) * ( 1 - info ) * ( asam_std + cvalue_std + dvalue_std + fvalue_std + tvalue_std + wga_std )
tvalue fvalue cvalue dvalue wga asam
asam > 4
wga > 0.01
prob_c < 0.0001
prob_d < 0.0001
prob_f < 0.0001
prob_t < 0.0001
numPass > 1
Top 200
[user prefilter would go here or leave blank]
( numPass / 7 ) * ( 1 - info ) * ( asam_std + cvalue_std + dvalue_std + fvalue_std + tvalue_std + wga_std )
tvalue fvalue cvalue dvalue wga asam
prob_t ASC NaN
prob_f ASC NaN
prob_c ASC NaN
prob_d ASC NaN
wga DESC NaN
asam DESC NaN
info == 0
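The naming rule for telling TopN managers from Cutoff managers can be sketched as follows in Python. The function name classify_manager and the returned tuple shape are hypothetical, and raising an error when a "top" name contains no integer is an assumption, since that case is not covered above.

```python
import re

def classify_manager(name):
    """Return ("topn", N) or ("cutoff", None) based on the manager name."""
    # Rule from the text: a TopN name starts with "top", ignoring case.
    if name.strip().lower().startswith("top"):
        # The first integer found in the name is used as N.
        match = re.search(r"\d+", name)
        if match is None:
            # Assumption: a TopN name without an integer is an error.
            raise ValueError(f"TopN manager name has no integer: {name!r}")
        return ("topn", int(match.group()))
    return ("cutoff", None)
```

With the three managers in the example, "Cutoff 0005" and "Cutoff 0001" would be treated as Cutoff managers and "Top 200" as a TopN manager with N = 200.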