Some Issues and Current Recommendations Regarding Research Data Management, for Principal Investigators

There are many approaches to implementing a data management system for a research group. The following discussion is written from the perspective of observational research. Many investigators collect data using Microsoft Excel. This is really appropriate only when the entire database can be viewed on no more than three screens of spreadsheets (to allow visual quality control), because Excel provides little enforced data validation. Microsoft Access allows forms-based data entry with data validation. The open source PhOSCo system is a very promising alternative because it allows remote data entry and has extensive data checking and audit trail features. Open source OpenOffice appears promising as a forms-based data entry front end for MySQL and other open source database engines. Here are some issues to consider.
  1. The Department of Biostatistics emphasizes the use of open source software and will have limited access to Microsoft applications. Therefore we prefer to work with open source database applications whenever possible.
  2. Our goal is to have data systems in place that allow us to automatically create analysis files (e.g., for R). This will accomplish two things:
    1. minimize the time required to begin analyzing data
    2. create fully annotated analysis files that yield statistical graphics and tables that are easier to interpret (with respect to value labels, variable labels, units of measurement, date, time, and date/time variables, etc.)
  3. We believe that research databases should be stored on a Biostatistics server or other central School of Medicine server, for the following reasons:
    1. security - rigorous access control
    2. continuity - we always know where the data may be found when a faculty member, fellow, or resident leaves
    3. HIPAA compliance - central storage avoids a problem we see in some research groups, which lose track of who has copies of the data (paper and electronic)
    4. backups - done daily with off-site storage of backup files in case of catastrophe
  4. We want all research databases we use to allow remote data entry from anywhere over the Internet, using either a web browser or a TCP/IP client that may be installed on any operating system.
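
The goal in item 2, automatic creation of annotated analysis files, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not our production pipeline: Python's built-in sqlite3 module stands in for the server database engine, and the table name visits, the DATA_DICTIONARY layout, and the function export_analysis_file are hypothetical names invented for the example. The idea is that value labels, variable labels, and units live in one data dictionary and travel with the exported data rather than being re-entered by the analyst.

```python
import csv
import io
import json
import sqlite3

# Hypothetical data dictionary: variable name -> label, units, value labels.
DATA_DICTIONARY = {
    "sbp": {"label": "Systolic blood pressure", "units": "mmHg"},
    "sex": {"label": "Sex", "values": {"1": "Female", "2": "Male"}},
    "visit_date": {"label": "Date of visit", "type": "date"},
}


def export_analysis_file(conn, table, dictionary):
    """Return (csv_text, metadata_json) for one table.

    Coded values are replaced by their value labels; variable labels
    and units are returned separately as JSON metadata.
    """
    # The table name is interpolated directly, so it must come from trusted code.
    cursor = conn.execute(f"SELECT * FROM {table}")
    names = [col[0] for col in cursor.description]
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(names)
    for row in cursor:
        decoded = []
        for name, value in zip(names, row):
            value_labels = dictionary.get(name, {}).get("values", {})
            # Fall back to the raw value when no value label is defined.
            decoded.append(value_labels.get(str(value), value))
        writer.writerow(decoded)
    metadata = {name: dictionary.get(name, {}) for name in names}
    return out.getvalue(), json.dumps(metadata, indent=2, sort_keys=True)


def demo():
    """Build a tiny in-memory table (made-up rows) and export it."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE visits (id INTEGER, sex INTEGER, sbp REAL, visit_date TEXT)"
    )
    conn.executemany(
        "INSERT INTO visits VALUES (?, ?, ?, ?)",
        [(1, 1, 120.0, "2005-03-01"), (2, 2, 135.5, "2005-03-02")],
    )
    return export_analysis_file(conn, "visits", DATA_DICTIONARY)
```

Calling demo() returns the CSV text and the JSON metadata as two strings. An analysis environment such as R could read both and attach the labels and units to the variables; how that is done would depend on the tools chosen.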

Our Current Choices for Research Database Software

  1. PhOSCo for multi-site clinical trials or any application in which data queries and an audit trail are needed. This involves running a database engine on one of our servers and installing a Java TCP/IP application on each user workstation. The database engine is still to be determined: Oracle is the one that PhOSCo works with out of the box, but we are investigating open source database engines such as MySQL and PostgreSQL.
  2. OpenOffice for less complex applications. OpenOffice for Windows or Linux may be downloaded and must be installed on each workstation. This will use the MySQL database engine on our server.
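
Both choices above depend on values being checked at the time of entry. As a rough illustration of the kind of type and range checks a forms-based front end would enforce, here is a minimal sketch. It does not show how PhOSCo or OpenOffice implement validation; the RULES table, the field names and limits, and the function validate_record are all hypothetical.

```python
from datetime import date

# Hypothetical field rules; names and limits are illustrative only.
RULES = {
    "age": {"type": int, "min": 0, "max": 120},
    "sbp": {"type": float, "min": 50.0, "max": 300.0},
    "visit_date": {"type": date},
}


def validate_record(record, rules=RULES):
    """Return a list of readable problems; an empty list means the record passes."""
    problems = []
    for field, rule in rules.items():
        # A field that is absent or None is reported as missing.
        if field not in record or record[field] is None:
            problems.append(f"{field}: missing")
            continue
        value = record[field]
        # Type check first, so range comparisons below are always meaningful.
        if not isinstance(value, rule["type"]):
            problems.append(f"{field}: expected {rule['type'].__name__}")
            continue
        if "min" in rule and value < rule["min"]:
            problems.append(f"{field}: {value} below minimum {rule['min']}")
        if "max" in rule and value > rule["max"]:
            problems.append(f"{field}: {value} above maximum {rule['max']}")
    return problems
```

A data entry form would call validate_record before saving and show the returned messages to the person entering the data, so that errors are caught at the source rather than during analysis.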


For Divisions that wish us to support some of their research database management needs, we need the appropriate percent effort covered for a clinical data manager (who implements electronic case report forms, trains data entry personnel at the sites, and monitors data quality) and for a systems analyst. It may be possible to have clinical data managers employed in the Division learn how to implement the electronic data capture methods.

