Large Data Set Capability

Detailed Time Line and Tasks

Large DataSet Implementation: Apr 18-May 06 (3 weeks)
  1. Implement objects.
  2. Build the interface.

Large DataSet Testing: May 09-May 20 (2 weeks)
  1. Test to ensure a complete, correct product.


  • Ability to read very large datasets.
  • Ability to generate scores on very large datasets.
  • Ability to perform full cross-validation on very large datasets.
  • Transparent modifications to the current system.


  • 1.2 GB per-process memory limit.
  • MemoryMap library: What are the limits?


Data Set MemoryMap

Read only

  • Read through the file at load to determine row beginning locations.
  • Store row beginning locations in memory.
    • Index by name, id, index, or all three?
  • Keep last 20 rows accessed in a buffer queue.
    • Index by name, id, index, or all three?
    • Include a DiscardBuffer method?
  • When accessing a point not in memory, queue it.
  • MemoryMap mode specified at load.
  • Could be done with FileStream and StreamReader.
    • Use StreamReader.ReadLine().Length + 2 (DOS, CRLF) or + 1 (UNIX/Mac) to get the number of bytes in a single line, assuming a single-byte encoding such as ASCII.
    • Combine with Partitioning to save on seeking and reading.
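
The read-only design above (one pass at load to record row start offsets, then seeking on demand while keeping the last 20 rows buffered) might look roughly like the sketch below. It is only an illustration under stated assumptions: the class name RowIndexedFile and all of its members are hypothetical, rows are indexed by position only, the file is assumed to use a single-byte encoding such as ASCII, and the scan looks for raw '\n' bytes instead of summing ReadLine().Length so that DOS and UNIX line endings are handled the same way.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical helper: indexes row start offsets once, then reads rows on demand.
public sealed class RowIndexedFile : IDisposable
{
    private const int BufferedRows = 20;   // "last 20 rows" from the notes above
    private readonly FileStream stream;
    private readonly long length;
    private readonly List<long> rowStarts = new List<long>();
    private readonly Dictionary<int, string> buffer = new Dictionary<int, string>();
    private readonly Queue<int> order = new Queue<int>();

    public RowIndexedFile(string path)
    {
        stream = new FileStream(path, FileMode.Open, FileAccess.Read);
        long pos = 0;
        bool atRowStart = true;
        int b;
        // Single pass over the file: the first byte after each '\n' starts a row.
        while ((b = stream.ReadByte()) != -1)
        {
            if (atRowStart) { rowStarts.Add(pos); atRowStart = false; }
            if (b == '\n') atRowStart = true;
            pos++;
        }
        length = pos;
    }

    public int RowCount { get { return rowStarts.Count; } }

    public string ReadRow(int index)
    {
        string row;
        if (buffer.TryGetValue(index, out row)) return row;   // already buffered

        long start = rowStarts[index];
        long end = index + 1 < rowStarts.Count ? rowStarts[index + 1] : length;
        byte[] bytes = new byte[end - start];
        stream.Seek(start, SeekOrigin.Begin);
        int read = 0;
        while (read < bytes.Length)
        {
            int n = stream.Read(bytes, read, bytes.Length - read);
            if (n == 0) break;
            read += n;
        }
        row = Encoding.ASCII.GetString(bytes, 0, read).TrimEnd('\r', '\n');

        // Queue the row, discarding the oldest loaded row once the buffer is full.
        if (order.Count == BufferedRows) buffer.Remove(order.Dequeue());
        order.Enqueue(index);
        buffer[index] = row;
        return row;
    }

    // Empties the row buffer (the DiscardBuffer idea raised in the notes).
    public void DiscardBuffer()
    {
        buffer.Clear();
        order.Clear();
    }

    public void Dispose() { stream.Dispose(); }
}
```

Only the offsets (8 bytes per row) and at most 20 row strings stay in memory. Indexing by name or id would need an extra dictionary from key to row number, which this sketch leaves out.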


  • Must know the size of the data set.
  • Create memory map on disk.
  • Save to a standard file.
  • Must have free space to hold memory map.

Large File abilities

  • Merge/Append two large files on disk with error checking.
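
One way the append case could be approached is to stream the second file onto the end of the first in fixed-size chunks. The sketch below is an assumption-laden illustration, not a design decision from these notes: the LargeFileTools class and its Append method are hypothetical, and the "error checking" shown (source must exist, result must have the combined length) is only one possible interpretation.

```csharp
using System;
using System.IO;

public static class LargeFileTools
{
    // Appends source to target in 64 KB chunks so neither file is held in memory.
    public static void Append(string targetPath, string sourcePath)
    {
        if (!File.Exists(sourcePath))
            throw new FileNotFoundException("Source file not found.", sourcePath);

        long targetBefore = File.Exists(targetPath) ? new FileInfo(targetPath).Length : 0;
        long expected = targetBefore + new FileInfo(sourcePath).Length;

        using (FileStream source = new FileStream(sourcePath, FileMode.Open, FileAccess.Read))
        using (FileStream target = new FileStream(targetPath, FileMode.Append, FileAccess.Write))
        {
            byte[] chunk = new byte[64 * 1024];
            int n;
            while ((n = source.Read(chunk, 0, chunk.Length)) > 0)
                target.Write(chunk, 0, n);
        }

        // Assumed check: the result must be exactly as long as both inputs combined.
        long actual = new FileInfo(targetPath).Length;
        if (actual != expected)
            throw new IOException("Append produced " + actual + " bytes, expected " + expected + ".");
    }
}
```

A real merge would probably also need to validate that both files share the same columns, which depends on the DataSet file format and is not attempted here.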

Data partitioning

DataSet Partitioning to Save Memory

  • Emulate SAS behavior of loading only part of a file
  • Implement in the application when DataSets are loaded and used
    • Make multiple calls to generate scores for a single test
      • Exception: SAM
  • Add a constructor/method for loading a partial DataSet
  • Using IEnumerable, a foreach loop works:
        PartitionedDataSet pds = new PartitionedDataSet(filename);
        foreach (DataSet ds in pds)
        {
              // do some work on this partition
        }
    • DataSet has a method LoadPart for loading a set number of rows.
    • Score has a method MergeSave for saving the stats from the parts of the DataSet loaded with LoadPart.
    • These two could be combined with some wrapper object like PartitionedDataSet that would determine a good number of rows for each partition.
  • Add methods to DataSet/Score for an Append and/or Merge Write.
  • Combine with MemoryMap for large data?
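
The wrapper idea in the list above could be sketched with an iterator that reads a fixed number of rows at a time. Everything here is illustrative: the DataSet class is a bare stand-in for the project's real class, the rowsPerPart constructor parameter is an assumption (the notes suggest the wrapper would choose a good size itself), and header rows are not treated specially.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.IO;

// Stand-in for the project's DataSet class, reduced to a list of raw rows.
public class DataSet
{
    public readonly List<string> Rows = new List<string>();
}

// Hypothetical wrapper that yields one partial DataSet at a time.
public class PartitionedDataSet : IEnumerable<DataSet>
{
    private readonly string filename;
    private readonly int rowsPerPart;

    public PartitionedDataSet(string filename, int rowsPerPart)
    {
        if (rowsPerPart <= 0) throw new ArgumentOutOfRangeException("rowsPerPart");
        this.filename = filename;
        this.rowsPerPart = rowsPerPart;
    }

    public IEnumerator<DataSet> GetEnumerator()
    {
        using (StreamReader reader = new StreamReader(filename))
        {
            DataSet part = new DataSet();
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                part.Rows.Add(line);
                if (part.Rows.Count == rowsPerPart)
                {
                    yield return part;
                    part = new DataSet();   // the previous part can now be collected
                }
            }
            if (part.Rows.Count > 0) yield return part;   // final, shorter part
        }
    }

    IEnumerator IEnumerable.GetEnumerator() { return GetEnumerator(); }
}
```

Because the iterator is lazy, only one partition is alive at a time as long as the loop body does not keep references to earlier parts; a per-part score would then be folded into the running result by something like the MergeSave method mentioned above.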


  • How many parts should the data be divided into?
    • Fixed number of rows (variables).
    • Number of rows based on number of columns (observations) so as not to exceed memory limits.
  • Solution: MemoryMapDataSet
    • SAM uses the entire DataSet to compute s0 (fudge factor).
    • Distance requires the entire DataSet.
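
For the second sizing option (rows per partition derived from the column count), the arithmetic could be as simple as the sketch below. The PartitionSizing class is hypothetical, and it assumes every value is stored as an 8-byte double with no per-row overhead, so a real budget would need some headroom below the 1.2 GB per-process limit.

```csharp
using System;

public static class PartitionSizing
{
    // Rows that fit in budgetBytes when each row holds `columns` 8-byte doubles.
    public static long RowsPerPartition(long budgetBytes, int columns)
    {
        if (columns <= 0) throw new ArgumentOutOfRangeException("columns");
        long bytesPerRow = (long)columns * sizeof(double);
        // Always allow at least one row, even if a single row exceeds the budget.
        return Math.Max(1, budgetBytes / bytesPerRow);
    }
}
```

For example, with an assumed 256 MB budget (268,435,456 bytes) and 10,000 columns, each row takes 80,000 bytes and RowsPerPartition returns 3355.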

Other Memory Considerations

  • How can we save memory within the program?
    • Write scores to file and load when needed.

Run partitioning

Topic revision: r13 - 23 May 2005, WillGray
