Add DataGroupings and CriteriaManagers

DataGroupings

Working on a large project and having to create the same 6 groupings several times (including creating columns in the InformationSet for each one) gave us the idea of creating several DataGroupings at once from some sort of specification. Currently, DataGrouping has a Load method and its own saved format, but that format cannot hold several groupings in a single file and is DataSet dependent.

Ideas

1

This was my first idea: give the name of the InformationSet column to use and say which descriptors go into which group. The grouping then takes the same name as the info set column. However, this still requires us to create those columns in the info set beforehand. In reality, the number of columns needed to do all the groupings for the current Massion data set C (as of 23 Feb 2005) is only 4. In this example, the first line of each group is the InformationSet column name and the other two lines list the descriptors that go into each group.

    normal-cancer
      1:Normal
      2:Cancer
    nodal0-nodal123
      1:N0
      2:N1,N2,N3

2

The number of columns needed to do all the groupings for the current Massion data set C (as of 23 Feb 2005) is only 4. Those are Cases-Controls, Histology, Path Stage, N(odal), and M(etastasis). We end up expanding this information into 1 column per comparison. If we could use descriptors from different columns, it would be easy. I'm not yet sure how to make the file, but it shouldn't be hard. In this example, the first line of each group is the DataGrouping name and the other two lines list InformationSet column and descriptor pairs, with several pairs separated by semicolons, which allows multiple columns to be used.

    Normal-Cancer
      1:cases-controls,controls
      2:cases-controls,cases
    Nodal0-123
      1:N,0
      2:N,1;N,2;N,3
    Normal-Stage 1
      1:cases-controls,controls
      2:path stage,1a;path stage,1b;path stage,1

Working Format

A DataGrouping is specified by a name line followed by the specification lines for groups 1 and 2. Each group specification is an integer group number (1 or 2), a colon, and an InformationSet column name with its descriptors listed in parentheses. Multiple descriptors can be used for each column and multiple columns can be used for each group. Also, different columns can be used for each group, separated by a semi-colon. DataGrouping names should be unique and descriptive. Don't forget the blank line at the end.

    Normal-Cancer
      1:cases-controls(controls)
      2:cases-controls(cases)

    Nodal0-123
      1:N(0)
      2:N(1,2,3)

    Normal-Stage 1
      1:cases-controls(controls)
      2:path stage(1a,1b)
      2:path stage(1)
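
The working format above could be read with something like the following. This is only a minimal sketch in Python, not the actual loader: the function name `parse_groupings` and the returned structure are hypothetical, and it assumes one column per group line (the repeated-line form shown for "Normal-Stage 1"), not the semi-colon form.

```python
import re

# One group line looks like "1:cases-controls(controls)" or "2:N(1,2,3)".
GROUP_LINE = re.compile(r"^(\d+):(.+)\((.*)\)$")


def parse_groupings(text):
    """Parse blank-line-separated grouping blocks (hypothetical helper).

    Returns {grouping name: [(group number, column, [descriptors]), ...]}
    with the group lines kept in file order.
    """
    groupings = {}
    name = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            name = None  # a blank line closes the current grouping
            continue
        if name is None:
            # First non-blank line of a block is the DataGrouping name.
            if line in groupings:
                raise ValueError(f"duplicate grouping name: {line!r}")
            name = line
            groupings[name] = []
            continue
        match = GROUP_LINE.match(line)
        if match is None:
            raise ValueError(f"unrecognized group line: {line!r}")
        number, column, descriptors = match.groups()
        groupings[name].append(
            (int(number), column.strip(), [d.strip() for d in descriptors.split(",")])
        )
    return groupings


EXAMPLE = """\
Normal-Cancer
  1:cases-controls(controls)
  2:cases-controls(cases)

Normal-Stage 1
  1:cases-controls(controls)
  2:path stage(1a,1b)
  2:path stage(1)

"""

for grouping_name, group_lines in parse_groupings(EXAMPLE).items():
    print(grouping_name, group_lines)
```

Rejecting duplicate names follows the requirement that DataGrouping names be unique; everything else about error handling here is a guess.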

Logical Conditions

Should we also add some way to apply a logical condition in addition to the mass grouping? For instance, one column of information has the case-control description and another has the training-testing description. Without such a condition, we will have to manually create the case-control-training and case-control-testing columns.

UPDATED: The current format uses order to put columns into groups, including 0. This can be used to introduce some simple logic or just select the descriptor used for the null group, which defaults to "null".

Example

This example uses order to divide the normal-cancer comparison into training and testing groups. Notice the group 0 specifications.

    Normal-Cancer
      1:cases-controls(controls)
      2:cases-controls(cases)

    Normal-Cancer-Train
      0:train-test(test)
      1:cases-controls(controls)
      2:cases-controls(cases)

    Normal-Cancer-Test
      0:train-test(train)
      1:cases-controls(controls)
      2:cases-controls(cases)
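
One way to read "uses order" is that the group lines are checked top to bottom and the first line that matches a sample decides its group. That reading is an assumption, as is sending unmatched samples to group 0; the helper name `assign_group` and the sample dictionaries are made up for illustration. A minimal Python sketch:

```python
def assign_group(sample, group_lines):
    """Return the group number of the first line that matches the sample.

    sample: {column name: descriptor} for one sample.
    group_lines: [(group number, column, [descriptors]), ...] in file order.
    Assumption: a sample that matches no line falls into the null group 0.
    """
    for number, column, descriptors in group_lines:
        if sample.get(column) in descriptors:
            return number
    return 0


# The Normal-Cancer-Train grouping from the example above.
NORMAL_CANCER_TRAIN = [
    (0, "train-test", ["test"]),
    (1, "cases-controls", ["controls"]),
    (2, "cases-controls", ["cases"]),
]

training_case = {"train-test": "train", "cases-controls": "cases"}
testing_case = {"train-test": "test", "cases-controls": "cases"}

print(assign_group(training_case, NORMAL_CANCER_TRAIN))  # 2
print(assign_group(testing_case, NORMAL_CANCER_TRAIN))   # 0
```

Under this reading, listing the group 0 line first is what keeps the testing samples out of groups 1 and 2 in the training comparison.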

CriteriaManagers

A large project also means entering the same cutoff manager(s) several times, so it would help to be able to load one or more at a time. Since CriteriaManager and CutoffManager already have sufficient Load and Save methods, we can do something similar with several managers in a single file. In the examples, we could probably change "Criteria Used:" to the name.

Single CutoffManager

This is the standard format for a single CutoffManager.

    Cutoff 0005
    
    [user prefilter would go here or leave blank]
    
    ( numPass / 6 ) * ( 1 - info ) * ( asam_std + cvalue_std + dvalue_std + fvalue_std + tvalue_std + wga_std )
    
    asam > 2
    wga > 0.005
    prob_c < 0.0005
    prob_d < 0.0005
    prob_f < 0.0005
    prob_t < 0.0005
    numPass > 1
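
To illustrate how the criterion lines at the end of this format break down, here is a small Python sketch. It is not CutoffManager's own code: `parse_criterion` and `passes` are hypothetical names, the operator table only covers the symbols that appear in the examples on this page, and how numPass itself is computed is not specified here, so the sketch treats every name as a plain value to look up.

```python
import operator

# Comparison symbols seen in the examples on this page (assumed complete).
OPERATORS = {">": operator.gt, "<": operator.lt, "==": operator.eq}


def parse_criterion(line):
    """Split a line such as "asam > 2" into ("asam", ">", 2.0)."""
    name, symbol, value = line.split()
    if symbol not in OPERATORS:
        raise ValueError(f"unsupported operator in {line!r}")
    return name, symbol, float(value)


def passes(values, criterion):
    """Check one parsed criterion against a {name: number} dictionary."""
    name, symbol, threshold = criterion
    return OPERATORS[symbol](values[name], threshold)


criteria = [parse_criterion(line) for line in ("asam > 2", "wga > 0.005", "prob_t < 0.0005")]

# Made-up statistic values for a single feature, for illustration only.
feature = {"asam": 3.1, "wga": 0.002, "prob_t": 0.0001}

for criterion in criteria:
    print(criterion, passes(feature, criterion))
```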

Multiple CutoffManagers

We can re-use most of the CutoffManager file format and add some delimiter to separate the individual specifications. Unfortunately, we can't re-use the code in CriteriaManager.Load because it lacks the SignLead and SignFollow specifications, and we also need to be able to distinguish between TopN and Cutoff managers. For a manager to be recognized as a TopN, its name MUST start with "top" (not case-sensitive); the first integer in the name is used as [N].

    Cutoff 0005
    
    [user prefilter would go here or leave blank]
    
    ( numPass / 6 ) * ( 1 - info ) * ( asam_std + cvalue_std + dvalue_std + fvalue_std + tvalue_std + wga_std )
    
    tvalue fvalue cvalue dvalue wga asam

    asam > 2
    wga > 0.005
    prob_c < 0.0005
    prob_d < 0.0005
    prob_f < 0.0005
    prob_t < 0.0005
    numPass > 1

    Cutoff 0001
    
    [user prefilter would go here or leave blank]
    
    ( numPass / 6 ) * ( 1 - info ) * ( asam_std + cvalue_std + dvalue_std + fvalue_std + tvalue_std + wga_std )
    
    tvalue fvalue cvalue dvalue wga asam

    asam > 4
    wga > 0.01
    prob_c < 0.0001
    prob_d < 0.0001
    prob_f < 0.0001
    prob_t < 0.0001
    numPass > 1

    Top 200

    [user prefilter would go here or leave blank]
    
    ( numPass / 7 ) * ( 1 - info ) * ( asam_std + cvalue_std + dvalue_std + fvalue_std + tvalue_std + wga_std )
    
    tvalue fvalue cvalue dvalue wga asam

    prob_t ASC NaN
    prob_f ASC NaN
    prob_c ASC NaN
    prob_d ASC NaN
    wga DESC NaN
    asam DESC NaN
    info == 0
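
The naming rule for telling TopN and Cutoff managers apart can be sketched as follows. The function name `classify_manager` and its return values are hypothetical, and raising an error when a "top" name contains no integer is an assumption, since the rule does not say what should happen in that case.

```python
import re


def classify_manager(name):
    """Return ("TopN", n) or ("Cutoff", None) based on the manager name.

    A name starting with "top" (any case) is a TopN manager, and the
    first integer found in the name is taken as N.
    """
    stripped = name.strip()
    if stripped.lower().startswith("top"):
        match = re.search(r"\d+", stripped)
        if match is None:
            # Assumed behaviour: a TopN name without a number is an error.
            raise ValueError(f"TopN manager name has no integer: {name!r}")
        return ("TopN", int(match.group()))
    return ("Cutoff", None)


for manager_name in ("Cutoff 0005", "Cutoff 0001", "Top 200"):
    print(manager_name, classify_manager(manager_name))
```

With the three names from the example file, this yields two Cutoff managers and one TopN manager with N = 200.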

Topic revision: 11 May 2005, WillGray
 