Large Data Set Capability
Detailed Time Line and Tasks
Goals
Obstructions
Data Set MemoryMap
Read only
Read/Write
Large File abilities
Data partitioning
DataSet Partitioning to Save Memory
Questions/Problems
Other Memory Considerations
Run partitioning
Detailed Time Line and Tasks
Large DataSet Implementation
Apr 18-May 06
3 weeks
Implement objects.
Build the interface.
Large DataSet Testing
May 09-May 20
2 weeks
Test to ensure complete, correct product.
Goals
Ability to read very large datasets.
Ability to generate scores on very large datasets.
Ability to perform full cross-validation on very large datasets.
Transparent modifications to the current system.
Obstructions
1.2 GB per process memory limit.
MemoryMap library: What are the limits?
Methods
Data Set MemoryMap
Read only
Read through the file at load to determine row beginning locations.
Store row beginning locations in memory.
Index by name, id, index, or all three?
Keep last 20 rows accessed in a buffer queue.
Index by name, id, index, or all three?
Include a DiscardBuffer method?
When accessing a point not in memory, queue it.
MemoryMap mode specified at load.
DataSet.Load or program start?
Since some functionality will be disabled or lost, it may be better to use a subclass (inheritance and polymorphism) than to modify DataSet.
Wfccm Large DataSet SubClass ProsCons - IDataSet
Could be done with FileStream and StreamReader.
Use StreamReader.ReadLine().Length + 2 (DOS line endings) or + 1 (UNIX/Mac line endings) to get the number of characters, and for ASCII the number of bytes, in a single line.
Combine with Partitioning to save on seeking and reading.
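The read-only approach above (index the row beginnings once, then seek on demand and keep a small buffer of rows) can be sketched as follows. This is only an illustration under stated assumptions: RowIndexedFile, GetRow and the bufferSize parameter are hypothetical names, the file is assumed to be plain ASCII with one consistent newline sequence, and the buffer is a simple first-in, first-out queue of the most recently loaded rows.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical helper, not part of the existing DataSet code.
// Assumes an ASCII file whose lines all end with the same newline sequence.
public sealed class RowIndexedFile : IDisposable
{
    private readonly FileStream stream;
    private readonly List<long> rowStarts = new List<long>();  // byte offset of each row
    private readonly List<int> rowLengths = new List<int>();   // row length without newline
    private readonly Dictionary<int, string> buffer = new Dictionary<int, string>();
    private readonly Queue<int> bufferOrder = new Queue<int>();
    private readonly int bufferSize;

    public RowIndexedFile(string path, int newlineLength, int bufferSize = 20)
    {
        if (bufferSize <= 0) throw new ArgumentOutOfRangeException("bufferSize");
        this.bufferSize = bufferSize;

        // Read through the file once and record where each row begins.
        long offset = 0;
        using (var reader = new StreamReader(path, Encoding.ASCII))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                rowStarts.Add(offset);
                rowLengths.Add(line.Length);
                offset += line.Length + newlineLength;  // 2 for DOS, 1 for UNIX/Mac
            }
        }
        stream = new FileStream(path, FileMode.Open, FileAccess.Read);
    }

    public int RowCount { get { return rowStarts.Count; } }

    public string GetRow(int index)
    {
        string row;
        if (buffer.TryGetValue(index, out row))
            return row;  // row is already buffered

        // Seek straight to the row and read exactly its bytes.
        var bytes = new byte[rowLengths[index]];
        stream.Seek(rowStarts[index], SeekOrigin.Begin);
        int read = 0;
        while (read < bytes.Length)
        {
            int n = stream.Read(bytes, read, bytes.Length - read);
            if (n == 0) throw new EndOfStreamException();
            read += n;
        }
        row = Encoding.ASCII.GetString(bytes);

        // Queue the row; drop the oldest loaded row once the buffer is full.
        if (bufferOrder.Count == bufferSize)
            buffer.Remove(bufferOrder.Dequeue());
        buffer[index] = row;
        bufferOrder.Enqueue(index);
        return row;
    }

    // Empties the row buffer without touching the offset index.
    public void DiscardBuffer()
    {
        buffer.Clear();
        bufferOrder.Clear();
    }

    public void Dispose() { stream.Dispose(); }
}
```

Only the offsets and lengths stay in memory, so the cost per row is a few bytes regardless of how wide the row is. Indexing by name or id would need an additional dictionary from key to row number, which this sketch leaves out.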
Read/Write
Must know the size of the data set.
Create memory map on disk.
Save to a standard file.
Must have free space to hold memory map.
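A read/write map of known size could look roughly like the sketch below. It assumes the System.IO.MemoryMappedFiles API available in .NET 4.0 and later, which may not be the MemoryMap library this page has in mind; MappedMatrix and its layout (row-major, 8-byte doubles) are hypothetical choices made only for illustration.

```csharp
using System;
using System.IO;
using System.IO.MemoryMappedFiles;

// Hypothetical wrapper: a rows x columns table of doubles backed by a file on disk.
public sealed class MappedMatrix : IDisposable
{
    private readonly MemoryMappedFile map;
    private readonly MemoryMappedViewAccessor view;
    private readonly long columns;

    public MappedMatrix(string path, long rows, long columns)
    {
        this.columns = columns;
        // The size must be known up front: the backing file is created at full
        // capacity, so the disk needs that much free space.
        long capacity = rows * columns * sizeof(double);
        map = MemoryMappedFile.CreateFromFile(path, FileMode.Create, null, capacity);
        view = map.CreateViewAccessor();
    }

    // Values are stored row-major; each access reads or writes 8 bytes in the map.
    public double this[long row, long column]
    {
        get { return view.ReadDouble((row * columns + column) * sizeof(double)); }
        set { view.Write((row * columns + column) * sizeof(double), value); }
    }

    public void Dispose()
    {
        view.Dispose();
        map.Dispose();
    }
}
```

A single view over the whole file still consumes address space, so under a per-process memory limit a real implementation would probably map a window at a time with CreateViewAccessor(offset, size). Saving to a standard file would then be a separate export pass over the map.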
Large File abilities
Merge/Append two large files on disk with error checking.
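One way to append on disk without loading either file is to stream the source line by line. The sketch below is an assumption about what "error checking" could mean here: it checks that every appended row has the same number of delimiter-separated fields as the first row of the target. LargeFileTools and AppendRows are hypothetical names, and the target is assumed to end with a newline.

```csharp
using System;
using System.IO;

// Hypothetical helper, not part of the existing code base.
public static class LargeFileTools
{
    // Appends the rows of sourcePath to targetPath one line at a time and
    // returns the number of rows appended.
    public static long AppendRows(string targetPath, string sourcePath, char delimiter)
    {
        // Use the first row of the target as the reference field count.
        int expectedFields;
        using (var head = new StreamReader(targetPath))
        {
            string first = head.ReadLine();
            if (first == null)
                throw new InvalidDataException("Target file is empty.");
            expectedFields = first.Split(delimiter).Length;
        }

        long appended = 0;
        using (var reader = new StreamReader(sourcePath))
        using (var writer = new StreamWriter(targetPath, true))  // true = append
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                int fields = line.Split(delimiter).Length;
                if (fields != expectedFields)
                    throw new InvalidDataException(
                        "Row " + (appended + 1) + " has " + fields +
                        " fields; expected " + expectedFields + ".");
                writer.WriteLine(line);
                appended++;
            }
        }
        return appended;
    }
}
```

As written, a bad row stops the append part-way and leaves the earlier rows in the target. Validating the source in a first pass, or appending to a temporary copy, would avoid that at the cost of extra I/O.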
Data partitioning
DataSet Partitioning to Save Memory
Emulate SAS behavior of loading only part of a file
Implement in the application when DataSets are loaded and used
Make multiple calls to generate scores for a single test
Exception: SAM
Add a constructor/method for loading a partial DataSet
Using IEnumerable, a foreach loop works:
PartitionedDataSet pds = new PartitionedDataSet(filename);
foreach (DataSet ds in pds)
{
    // do some work
}
DataSet has a method LoadPart for loading a set number of rows.
Score has a method MergeSave for saving the stats from the parts of the DataSet loaded with LoadPart.
These two could be combined with some wrapper object like PartitionedDataSet that would determine a good number of rows for each partition.
Add methods to DataSet/Score for an Append and/or Merge Write.
Combine with MemoryMap for large data?
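The foreach pattern above only needs the wrapper to implement IEnumerable. A minimal sketch of how such a wrapper might produce its parts is shown below; PartitionedRows is a hypothetical stand-in that yields plain lists of text rows rather than real DataSet objects, and the fixed rowsPerPart value is an assumption, since the notes leave the partition size open.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.IO;

// Hypothetical stand-in for PartitionedDataSet: yields the file in parts of
// at most rowsPerPart rows, so only one part is held in memory at a time.
public sealed class PartitionedRows : IEnumerable<List<string>>
{
    private readonly string path;
    private readonly int rowsPerPart;

    public PartitionedRows(string path, int rowsPerPart)
    {
        if (rowsPerPart <= 0) throw new ArgumentOutOfRangeException("rowsPerPart");
        this.path = path;
        this.rowsPerPart = rowsPerPart;
    }

    public IEnumerator<List<string>> GetEnumerator()
    {
        using (var reader = new StreamReader(path))
        {
            var part = new List<string>(rowsPerPart);
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                part.Add(line);
                if (part.Count == rowsPerPart)
                {
                    yield return part;                      // hand over a full part
                    part = new List<string>(rowsPerPart);   // start the next one
                }
            }
            if (part.Count > 0)
                yield return part;                          // final, shorter part
        }
    }

    IEnumerator IEnumerable.GetEnumerator() { return GetEnumerator(); }
}
```

Because the parts are produced lazily, the file stays open only while the loop runs, and breaking out of the foreach disposes the reader. A real PartitionedDataSet would presumably call something like LoadPart inside the loop instead of collecting raw lines.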
Questions/Problems
How many parts should the data be divided into?
Fixed number of rows (variables).
Number of rows based on number of columns (observations) so as not to exceed memory limits.
Solution:
MemoryMapDataSet
SAM uses the entire DataSet to compute s0 (fudge factor).
Distance requires the entire DataSet.
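For the option of sizing partitions by column count, the arithmetic is a simple division of a memory budget by the size of one row. The sketch below assumes each value is stored as an 8-byte double and that the caller chooses a budget well under the per-process limit; PartitionSizing and RowsPerPartition are hypothetical names, and the calculation ignores per-object overhead.

```csharp
using System;

// Hypothetical helper for choosing a partition size.
public static class PartitionSizing
{
    // Returns how many rows fit in memoryBudgetBytes when each row holds
    // `columns` values of bytesPerValue bytes. Always returns at least 1.
    public static int RowsPerPartition(long memoryBudgetBytes, int columns,
                                       int bytesPerValue = sizeof(double))
    {
        if (columns <= 0 || bytesPerValue <= 0)
            throw new ArgumentOutOfRangeException();

        long bytesPerRow = (long)columns * bytesPerValue;
        long rows = memoryBudgetBytes / bytesPerRow;
        return (int)Math.Max(1, Math.Min(rows, int.MaxValue));
    }
}
```

For example, a 256 MB budget (268,435,456 bytes) with 1,000 columns of 8-byte values gives 33,554 rows per partition. This does not help SAM or Distance, which need the entire DataSet at once.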
Other Memory Considerations
How can we save memory within the program?
Write scores to file and load when needed.
Run partitioning
Topic revision: 23 May 2005, WillGray