Large DataSet SubClass Pros and Cons
Description
To support datasets that are too large to hold in memory, we might need to make the read-only MemoryMapDataSet a subclass of DataSet.
Pros
- Polymorphism
  - DataSet ds = new MemoryMapDataSet("file.txt");
Cons
- Inheritance
  - Methods that modify DataSet will need to be disabled?
  - Keep some of the same interface.
  - Will need to change parts of the interface.
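The inheritance concern above can be sketched roughly as follows. This is only an illustration: it assumes DataSet exposes virtual GetDataPoint and SetDataPoint members, and the array-backed storage and the (rows, cols) constructors are hypothetical stand-ins rather than the real classes.

```csharp
using System;

// Hypothetical stand-in for the existing in-memory DataSet.
public class DataSet
{
    protected readonly double[,] data;

    public DataSet(int rows, int cols)
    {
        data = new double[rows, cols];
    }

    public virtual double GetDataPoint(int row, int col)
    {
        return data[row, col];
    }

    public virtual void SetDataPoint(int row, int col, double value)
    {
        data[row, col] = value;
    }
}

// Read-only subclass. It inherits SetDataPoint and can only disable it at run time.
// A real version would read values from the file rather than the base class array.
public class MemoryMapDataSet : DataSet
{
    public MemoryMapDataSet(int rows, int cols) : base(rows, cols) { }

    public override void SetDataPoint(int row, int col, double value)
    {
        throw new NotSupportedException("MemoryMapDataSet is read-only.");
    }
}
```

Code that holds a DataSet reference still compiles against SetDataPoint and only finds out at run time that the call is unsupported, which is the main cost of the subclass approach.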
Alternatives
Interface instead of SubClass -- IDataSet
- List common methods between DataSet and MemoryMapDataSet
  - GetDataPoint(int, int)
  - NumCols
  - NumRows
  - Name
  - GetColumnName(int)
  - GetRowName(int)
  - GetRowId(int)
  - GetColumnIndexFromName(string)
  - GetRowIndexFromId(int)
  - Log ?
    - DataSet.Log(int) -> Log(string, int)
    - MemoryMapDataSet.Log(string, int)
  - Average ?
    - DataSet.Average(DataGrouping [, bool]) -> add Average(string, DataGrouping [, bool])
    - MemoryMapDataSet.Average(string, DataGrouping [, bool])
- MemoryMap implementation of DataSet
  - FileStream
  - StreamReader
  - Buffer
  - Other DataSet members for column names and row ids/names, etc.
  - Other TODO:
    - Read-ahead optimization to reduce FileStream.Seek() usage
    - Vary buffer size based on DataSet size (number of columns)
- Explicit Interface Implementation (see MSDN tutorial)
  - Methods that can only be called with an interface reference to the object.
  - IDataSet ds = new DataSet("file.txt");
  - IDataSet ds = new MemoryMapDataSet("file.txt");
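The interface alternative, together with the explicit interface implementation idea, might look something like the sketch below. Only a few of the common members are shown. The array-based constructor is a stand-in for the file-name constructor, and the meaning of the Log(string, int) parameters is not specified in these notes; an output name and a logarithm base are assumed here purely for illustration.

```csharp
using System;

// Read-only view shared by both implementations (a subset of the members listed above).
public interface IDataSet
{
    int NumRows { get; }
    int NumCols { get; }
    double GetDataPoint(int row, int col);

    // Assumed meaning: log-transform with the given base, writing to outputName.
    void Log(string outputName, int logBase);
}

// Hypothetical array-backed stand-in for the existing DataSet.
public class DataSet : IDataSet
{
    private readonly double[,] data;

    public DataSet(double[,] values)
    {
        data = values;
    }

    public int NumRows { get { return data.GetLength(0); } }
    public int NumCols { get { return data.GetLength(1); } }

    public double GetDataPoint(int row, int col)
    {
        return data[row, col];
    }

    // Existing in-place variant, callable on a DataSet reference.
    public void Log(int logBase)
    {
        for (int r = 0; r < NumRows; r++)
            for (int c = 0; c < NumCols; c++)
                data[r, c] = Math.Log(data[r, c], logBase);
    }

    // Explicit implementation: reachable only through an IDataSet reference.
    void IDataSet.Log(string outputName, int logBase)
    {
        // Placeholder: outputName is ignored here; a real version would write the result there.
        Log(logBase);
    }
}
```

With this shape, calling Log("out.txt", 2) through a DataSet variable does not compile, while the same call through an IDataSet variable does. That keeps the existing DataSet.Log(int) signature intact for current callers.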
Issues
The Distance module uses DataSet.SetDataPoint to modify data points by the feature weights from the STD function. For this reason, Distance will continue to use DataSet rather than IDataSet. IDataSet.Filter will return a DataSet, so after filtering the data must be small enough to be held in memory.
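A rough sketch of that flow is shown below. Several details are assumptions, since these notes do not give them: Filter is assumed to take the row indices to keep, the weights are assumed to be applied by multiplication, and ApplyWeights is a hypothetical helper name. The array-backed DataSet stands in for both the large source and the filtered result.

```csharp
using System;

public interface IDataSet
{
    int NumRows { get; }
    int NumCols { get; }
    double GetDataPoint(int row, int col);

    // Assumed signature: copy only the listed rows into an in-memory DataSet.
    DataSet Filter(int[] rowsToKeep);
}

// Hypothetical array-backed stand-in for the mutable DataSet.
public class DataSet : IDataSet
{
    private readonly double[,] data;

    public DataSet(double[,] values)
    {
        data = values;
    }

    public int NumRows { get { return data.GetLength(0); } }
    public int NumCols { get { return data.GetLength(1); } }

    public double GetDataPoint(int row, int col)
    {
        return data[row, col];
    }

    public void SetDataPoint(int row, int col, double value)
    {
        data[row, col] = value;
    }

    public DataSet Filter(int[] rowsToKeep)
    {
        // The copy is what must fit in memory after filtering.
        double[,] copy = new double[rowsToKeep.Length, NumCols];
        for (int i = 0; i < rowsToKeep.Length; i++)
            for (int c = 0; c < NumCols; c++)
                copy[i, c] = GetDataPoint(rowsToKeep[i], c);
        return new DataSet(copy);
    }
}

public static class Distance
{
    // Needs the mutable DataSet because it writes through SetDataPoint.
    public static void ApplyWeights(DataSet ds, double[] weights)
    {
        for (int r = 0; r < ds.NumRows; r++)
            for (int c = 0; c < ds.NumCols; c++)
                ds.SetDataPoint(r, c, ds.GetDataPoint(r, c) * weights[c]);
    }
}
```

In use, the source would be held as an IDataSet (for example a MemoryMapDataSet), filtered down, and only the resulting DataSet passed to Distance, so the source is never written to.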