
Large Data Set Capability

- Detailed Time Line and Tasks
- Goals
- Obstructions
- Data Set MemoryMap
- Data partitioning
- Run partitioning

Detailed Time Line and Tasks

Large DataSet Implementation: Apr 18-May 06 (3 weeks)

Implement objects.
Build the interface.

Large DataSet Testing: May 09-May 20 (2 weeks)

Test to ensure a complete, correct product.

Goals

Ability to read very large datasets.
Ability to generate scores on very large datasets.
Ability to perform full cross-validation on very large datasets.
Transparent modifications to the current system.

Obstructions

1.2 GB per-process memory limit.
MemoryMap library: What are the limits?

Methods

Data Set MemoryMap

Read only

Read through the file at load to determine where each row begins.
Store the row beginning locations in memory.
- Index by name, id, index, or all three?
Keep last 20 rows accessed in a buffer queue.
- Index by name, id, index, or all three?
- Include a DiscardBuffer method?
When accessing a point not in memory, load its row and add it to the buffer queue.
MemoryMap mode specified at load.
- DataSet.Load or program start?
- Since some functionality will be disabled or lost, it may be better to use a subclass (inheritance and polymorphism) than to modify DataSet.
  - WfccmLargeDataSetSubClassProsCons - IDataSet
Could be done with FileStream and StreamReader.
- Use StreamReader.ReadLine().Length + 2 (DOS) or + 1 (UNIX/Mac) to get the number of characters, which equals the number of bytes for ASCII, in a single line including its line ending.
- Combine with Partitioning to save on seeking and reading.
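
The read-only steps above can be sketched roughly as follows. This is only an illustration, not the WFCCM implementation: the class name RowOffsetIndex and its members are hypothetical, rows are indexed by position only, and the file is assumed to be ASCII with one consistent line ending throughout.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical helper, not part of WFCCM: records where each row begins in an
// ASCII text file and keeps the most recently read rows in a small FIFO queue.
public sealed class RowOffsetIndex : IDisposable
{
    private readonly FileStream _stream;
    private readonly List<long> _offsets = new List<long>();
    private readonly Queue<KeyValuePair<int, string>> _recent =
        new Queue<KeyValuePair<int, string>>();
    private readonly int _bufferSize;

    // newlineLength: 2 for DOS (CR LF) files, 1 for UNIX or Mac files.
    public RowOffsetIndex(string path, int newlineLength, int bufferSize = 20)
    {
        _bufferSize = bufferSize;
        // One pass at load: each row starts where the previous row ended.
        using (var reader = new StreamReader(path, Encoding.ASCII))
        {
            long offset = 0;
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                _offsets.Add(offset);
                offset += line.Length + newlineLength;
            }
        }
        _stream = new FileStream(path, FileMode.Open, FileAccess.Read);
    }

    public int RowCount { get { return _offsets.Count; } }

    public string ReadRow(int index)
    {
        // Serve the row from the buffer queue when it is already there.
        foreach (var entry in _recent)
            if (entry.Key == index) return entry.Value;

        // Otherwise seek to the stored offset and read exactly one line.
        _stream.Seek(_offsets[index], SeekOrigin.Begin);
        string row;
        // leaveOpen = true keeps the shared FileStream usable afterwards.
        using (var reader = new StreamReader(_stream, Encoding.ASCII, false, 4096, true))
            row = reader.ReadLine();

        // Queue the row and drop the oldest entry once the buffer is full.
        _recent.Enqueue(new KeyValuePair<int, string>(index, row));
        if (_recent.Count > _bufferSize) _recent.Dequeue();
        return row;
    }

    public void DiscardBuffer() { _recent.Clear(); }

    public void Dispose() { _stream.Dispose(); }
}
```

Only the offsets (8 bytes per row) and at most 20 rows stay in memory. Indexing by name or id as well would need an extra dictionary from key to row position, which this sketch leaves out.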

Read/Write

Must know the size of the data set.
Create memory map on disk.
Save to a standard file.
Must have free space to hold memory map.
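
A read/write map might look something like the sketch below. It assumes a runtime that provides System.IO.MemoryMappedFiles (not necessarily the MemoryMap library under evaluation here) and assumes values are stored as 8-byte doubles in row-major order. MappedMatrix is a hypothetical name.

```csharp
using System;
using System.IO;
using System.IO.MemoryMappedFiles;

// Hypothetical read/write store: a rows x columns matrix of doubles backed
// by a memory-mapped file on disk.
public sealed class MappedMatrix : IDisposable
{
    private readonly MemoryMappedFile _file;
    private readonly MemoryMappedViewAccessor _view;
    private readonly int _columns;

    public MappedMatrix(string path, int rows, int columns)
    {
        _columns = columns;
        // The data set size must be known up front: it fixes the capacity of
        // the map, and the disk needs that much free space.
        long capacity = (long)rows * columns * sizeof(double);
        _file = MemoryMappedFile.CreateFromFile(path, FileMode.OpenOrCreate, null, capacity);
        _view = _file.CreateViewAccessor();
    }

    // Byte position of one value, assuming row-major layout.
    private long Position(int row, int column)
    {
        return ((long)row * _columns + column) * sizeof(double);
    }

    public double Get(int row, int column)
    {
        return _view.ReadDouble(Position(row, column));
    }

    public void Set(int row, int column, double value)
    {
        _view.Write(Position(row, column), value);
    }

    public void Dispose()
    {
        _view.Dispose();
        _file.Dispose();
    }
}
```

The backing file here is a raw binary image, so saving to a standard file would still be a separate export step.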

Large File abilities

Merge/Append two large files on disk with error checking.
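
One possible shape for the append case is sketched below. The page does not say what the error checking covers, so this sketch assumes it means that every appended row has the same number of tab-separated fields as the first row of the target. It also assumes the target already ends with a line break. LargeFileTools and AppendRows are hypothetical names.

```csharp
using System;
using System.IO;

public static class LargeFileTools
{
    // Appends the rows of sourcePath to targetPath one line at a time, so
    // neither file is ever held in memory. Returns the number of rows appended.
    public static long AppendRows(string targetPath, string sourcePath)
    {
        // The first row of the target defines the expected field count.
        int expected;
        using (var head = new StreamReader(targetPath))
        {
            string first = head.ReadLine();
            if (first == null) throw new InvalidDataException("Target file is empty.");
            expected = first.Split('\t').Length;
        }

        // First pass validates only; nothing is written if the source is bad.
        long rows = 0;
        using (var check = new StreamReader(sourcePath))
        {
            string line;
            while ((line = check.ReadLine()) != null)
            {
                rows++;
                if (line.Split('\t').Length != expected)
                    throw new InvalidDataException(
                        "Source row " + rows + " has the wrong number of fields.");
            }
        }

        // Second pass streams the validated rows onto the end of the target.
        using (var reader = new StreamReader(sourcePath))
        using (var writer = new StreamWriter(targetPath, true))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
                writer.WriteLine(line);
        }
        return rows;
    }
}
```

Reading the source twice costs extra I/O but leaves the target untouched when a check fails.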

Data partitioning

DataSet Partitioning to Save Memory

Emulate SAS behavior of loading only part of a file
Implement in the application when DataSets are loaded and used
- Make multiple calls to generate scores for a single test
  - Exception: SAM
Add a constructor/method for loading a partial DataSet
Using IEnumerable, a foreach loop works:
PartitionedDataSet pds = new PartitionedDataSet(filename);
foreach (DataSet ds in pds)
{
    // do some work
}
- DataSet has a method LoadPart for loading a set number of rows.
- Score has a method MergeSave for saving the stats from the parts of the DataSet loaded with LoadPart.
- These two could be combined with some wrapper object like PartitionedDataSet that would determine a good number of rows for each partition.
Add methods to DataSet/Score for an Append and/or Merge Write.
Combine with MemoryMap for large data?
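
A minimal sketch of how the PartitionedDataSet wrapper could support the foreach loop above, using an iterator. The DataSet class here is only a stand-in holding raw text rows, since the project's real DataSet is not shown on this page, and the rowsPerPart parameter and its default are assumptions.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.IO;

// Stand-in for the project's DataSet, which is not shown here: just the
// raw text rows of one partition.
public class DataSet
{
    public readonly List<string> Rows = new List<string>();
}

// Hypothetical wrapper: enumerating it loads the file one partition at a time.
public class PartitionedDataSet : IEnumerable<DataSet>
{
    private readonly string _filename;
    private readonly int _rowsPerPart;

    public PartitionedDataSet(string filename, int rowsPerPart = 1000)
    {
        if (rowsPerPart < 1) throw new ArgumentOutOfRangeException("rowsPerPart");
        _filename = filename;
        _rowsPerPart = rowsPerPart;
    }

    public IEnumerator<DataSet> GetEnumerator()
    {
        using (var reader = new StreamReader(_filename))
        {
            var part = new DataSet();
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                part.Rows.Add(line);
                if (part.Rows.Count == _rowsPerPart)
                {
                    yield return part;      // hand one partition to the caller
                    part = new DataSet();   // start the next one
                }
            }
            if (part.Rows.Count > 0) yield return part;  // final short partition
        }
    }

    IEnumerator IEnumerable.GetEnumerator() { return GetEnumerator(); }
}
```

Because the iterator is lazy, only one partition is alive at a time as long as the caller does not keep references to earlier ones. Per-partition statistics would still need something like the MergeSave step to combine them.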

Questions/Problems

How many parts should the data be divided into?
- Fixed number of rows (variables).
- Number of rows based on number of columns (observations) so as not to exceed memory limits.
Solution: MemoryMapDataSet
- SAM uses the entire DataSet to compute s0 (fudge factor).
- Distance requires the entire DataSet.

Other Memory Considerations

How can we save memory within the program?
- Write scores to file and load when needed.

Run partitioning

Topic revision: r13 - 23 May 2005, WillGray
