Add DataGroupings and CriteriaManagers

DataGroupings

Working on a large project and having to create the same 6 groupings several times (including creating columns in the InformationSet for each one) gave us the idea of creating several DataGroupings at once from some sort of specification. Currently, DataGrouping has a Load method and its own saved format, but that format cannot hold several groupings in a single file and is DataSet dependent.

Logical Conditions

Should we also add some way to apply a logical condition across columns in addition to the mass grouping? For instance, one column of information has the case-control description and another has the training-testing description. Without such a condition, we will have to manually create the case-control-training and case-control-testing columns.
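
As a rough illustration of what such a condition would save us, the combined column could be derived from the two existing ones instead of being typed in by hand. This is only a sketch: the plain dict stands in for an InformationSet, and `combine_columns` is a hypothetical helper, not an existing method.

```python
def combine_columns(info, columns, sep="-"):
    """Join the descriptors of several columns, sample by sample.

    `info` is a stand-in for an InformationSet: a dict mapping a column
    name to a list with one descriptor per sample (an assumption made
    for this sketch only).
    """
    return [sep.join(values) for values in zip(*(info[c] for c in columns))]


info = {
    "case-control": ["case", "control", "case"],
    "training-testing": ["training", "training", "testing"],
}
combined = combine_columns(info, ["case-control", "training-testing"])
```

Each entry of `combined` is one sample's composite descriptor, which a grouping could then select on directly.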

Idea 1

This was my first idea: give the name of the InformationSet column to use and list which descriptors go into which group. The grouping then takes the same name as the info set column. However, this still requires us to create those columns in the info set beforehand, even though the number of columns actually needed to do all the groupings for the current Massion data set C (as of 23 Feb 2005) is only 4. In this example, the first line of each grouping is the InformationSet column name and the other two lines list the descriptors that go into each group.

    normal-cancer
      1:Normal
      2:Cancer
    nodal0-nodal123
      1:N0
      2:N1,N2,N3

Idea 2

The number of columns needed to do all the groupings for the current Massion data set C (as of 23 Feb 2005) is only 4. Those are Cases-Controls, Histology, Path Stage, N(odal), and M(etastasis). We end up expanding this information into 1 column per comparison. If we could use descriptors from different columns, it would be easy. I'm not yet sure how to make the file, but it shouldn't be hard. In this example, the first line of each grouping is the DataGrouping name and the other two lines list InformationSet column and descriptor pairs, one group per line. Several pairs can be given for a group, separated by semicolons, which allows multiple descriptors and multiple columns to be used.

    Normal-Cancer
      1:cases-controls,controls
      2:cases-controls,cases
    Nodal0-123
      1:N,0
      2:N,1;N,2;N,3
    Normal-Stage 1
      1:cases-controls,controls
      2:path stage,1a;path stage,1b;path stage,1

Working Format

A DataGrouping will be specified by three lines: a name, then a specification for group 1 and one for group 2. Each group specification is the integer group number (1 or 2), a colon, and one or more InformationSet column names, each followed by its list of descriptors in parentheses. Multiple descriptors, separated by commas, can be used for each column, and multiple columns, separated by semicolons, can be used for each group. The two groups may also use different columns. DataGrouping names should be unique and descriptive.

    Normal-Cancer
      1:cases-controls(controls)
      2:cases-controls(cases)
    Nodal0-123
      1:N(0)
      2:N(1,2,3)
    Normal-Stage 1
      1:cases-controls(controls)
      2:path stage(1a,1b);path stage(1)
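
To check that the working format is unambiguous, here is a minimal parsing sketch. It is not the DataGrouping Load method; `parse_group_spec` and `parse_groupings` are hypothetical names, and the sketch assumes that column names and descriptors never contain parentheses, colons, semicolons, or commas.

```python
import re

# One "column(descriptor,descriptor)" term of a group specification.
_TERM = re.compile(r"\s*([^()]+?)\s*\(([^()]*)\)\s*")


def parse_group_spec(line):
    """Parse e.g. '2:path stage(1a,1b);path stage(1)' into (2, {column: [descriptors]})."""
    number, _, body = line.strip().partition(":")
    columns = {}
    for term in body.split(";"):
        match = _TERM.fullmatch(term)
        if match is None:
            raise ValueError(f"bad column term: {term!r}")
        column, descriptors = match.groups()
        # Repeated columns within one group are merged into a single list.
        columns.setdefault(column, []).extend(
            d.strip() for d in descriptors.split(",") if d.strip()
        )
    return int(number), columns


def parse_groupings(text):
    """Parse a whole file of three-line grouping specifications."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if len(lines) % 3 != 0:
        raise ValueError("expected three lines per DataGrouping")
    groupings = {}
    for i in range(0, len(lines), 3):
        name = lines[i].strip()
        if name in groupings:
            raise ValueError(f"duplicate DataGrouping name: {name!r}")
        groups = dict(parse_group_spec(ln) for ln in lines[i + 1:i + 3])
        if set(groups) != {1, 2}:
            raise ValueError(f"{name!r} must define groups 1 and 2")
        groupings[name] = groups
    return groupings
```

Called on the example above, `parse_groupings` returns one entry per grouping name, each mapping group number to a dict of column name to descriptor list. Rejecting duplicate names enforces the uniqueness rule at load time.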

CriteriaManagers

Entering the same cutoff manager(s) several times for a large project is another task that would benefit from being able to load one or more at a time. Since CriteriaManager and CutoffManager already have sufficient Load and Save methods, we can do something similar with several managers in a single file. In the examples, we could probably replace the "Criteria Used:" line with the manager name.

Single CutoffManager

This is the standard format for a single CutoffManager.

    Cutoff 0005:
    
    [user prefilter would go here]
    
    ( numPass / 6 ) * ( 1 - info ) * ( asam_std + cvalue_std + dvalue_std + fvalue_std + tvalue_std + wga_std )
    
    asam > 2
    wga > 0.005
    prob_c < 0.0005
    prob_d < 0.0005
    prob_f < 0.0005
    prob_t < 0.0005
    numPass > 1
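
For readers unfamiliar with the criteria lines at the end of the file, the sketch below shows one way to read a line such as `asam > 2` and test it against a row of values. It is an illustration only, not the CutoffManager code: `parse_criterion` and `check` are hypothetical names, and the sketch deliberately reports each criterion separately rather than assuming how the real manager combines them.

```python
import operator
import re

_OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge}
_CRITERION = re.compile(r"^\s*(\w+)\s*(<=|>=|<|>)\s*([-+0-9.eE]+)\s*$")


def parse_criterion(line):
    """Parse 'asam > 2' into ('asam', '>', 2.0)."""
    match = _CRITERION.match(line)
    if match is None:
        raise ValueError(f"not a criterion line: {line!r}")
    name, op, threshold = match.groups()
    return name, op, float(threshold)


def check(criteria, values):
    """Return {statistic name: True/False} for one row of statistic values."""
    return {name: _OPS[op](values[name], threshold)
            for name, op, threshold in criteria}
```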

Multiple CutoffManagers

We can re-use the CutoffManager file format and add some delimiter to separate the individual specifications.

    Cutoff 0005
    
    [user prefilter would go here]
    
    ( numPass / 6 ) * ( 1 - info ) * ( asam_std + cvalue_std + dvalue_std + fvalue_std + tvalue_std + wga_std )
    
    asam > 2
    wga > 0.005
    prob_c < 0.0005
    prob_d < 0.0005
    prob_f < 0.0005
    prob_t < 0.0005
    numPass > 1

    ---- (or some other delimiter)

    Cutoff 0001
    
    [user prefilter would go here]
    
    ( numPass / 6 ) * ( 1 - info ) * ( asam_std + cvalue_std + dvalue_std + fvalue_std + tvalue_std + wga_std )
    
    asam > 4
    wga > 0.01
    prob_c < 0.0001
    prob_d < 0.0001
    prob_f < 0.0001
    prob_t < 0.0001
    numPass > 1
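
Splitting such a file back into individual specifications could look like the sketch below. It assumes the delimiter is a line beginning with `----` and that the first non-blank line of each block is the manager name; `split_managers` is a hypothetical helper, and each returned block would still be handed to the existing Load code for the actual parsing.

```python
def split_managers(text, delimiter="----"):
    """Split a multi-manager file into {name: specification text}."""
    blocks, current = [], []
    for line in text.splitlines():
        if line.strip().startswith(delimiter):
            # A delimiter line closes the current block.
            blocks.append(current)
            current = []
        else:
            current.append(line)
    blocks.append(current)

    managers = {}
    for block in blocks:
        body = "\n".join(block).strip()
        if not body:
            continue  # tolerate empty blocks, e.g. a trailing delimiter
        # Accept the name with or without a trailing colon.
        name = body.splitlines()[0].strip().rstrip(":")
        if name in managers:
            raise ValueError(f"duplicate manager name: {name!r}")
        managers[name] = body
    return managers
```

Keeping each block's text intact means the single-manager format does not need to change; only the splitting step is new.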

Topic revision: r6 - 09 Mar 2005, WillGray
 
