The Metaclinic Database System:

A Metadata Approach to Building Research Database Systems

 

 

Thomas Caldwell

Preston Research Building

Vanderbilt University Medical Center

Nashville, TN 37232

Phone: 615.936.2740

Fax: 615.936.1790

tom.caldwell@vanderbilt.edu

Keywords: metadata, oracle, medical research database, biomedical informatics

 


Abstract

 

The Metaclinic Database System is a metadata based system for creating normalized research databases. The apache application server and Oracle database engine support the applications at the lowest level. Access to data is available from application and developer interfaces utilizing standard browsers. Databases are created from a set of data forms built from sections and data elements. The main interface provides data entry, browsing, searching and reporting capabilities for the medical research community. Developers can use rapid prototyping methodology and advanced administration tools to reduce errors and increase productivity. Developers can create entire Metaclinic database applications without writing a single SQL statement. Metadata cached in the application server boosts application performance.

 

1. Introduction

 

In today’s modern world, medical research is being conducted on a much larger scale than has ever before been experienced. As research has expanded, the need for database systems to house research data has grown and expanded as well. New systems must provide easy access for data entry and analysis and provide tight control over that access. The ubiquitous nature of the world wide web makes it a natural choice for developers of these systems, but the asynchronous transactions that are used by web browsers are more difficult to design and blend into a cohesive application package. In addition, new HIPAA regulations that are now in effect are presenting new security challenges to developers. The need for restricted views of the database is now becoming a standard feature in the modern research environment. Furthermore, new database systems must be able to reach out and connect with other databases. Gone are the days that a single monolithic database can effectively house and control all of the data necessary in comprehensive medical research. And how will distributed databases communicate with each other? No one really knows yet.

The Metaclinic Database System is designed to create patient-centric, normalized databases from metadata that resides in separate database tables (metatables) that are transparent to the average database user.  

There are many advantages to systems that are based on metadata. From the metadata, web-based forms can be projected onto a standard internet browser for data entry, updates, and browsing. The extra abstraction layer provided by the metadata also facilitates the funneling of all database inserts and updates through a common library of subroutines which greatly increases database consistency and integrity. It also means that a Metaclinic application developer never has to write a single line of SQL code. The same metadata that is used to create the data entry forms can also be used to quickly generate and modify the tables which house the research data associated with the forms. The juxtaposition of html characteristics and database attributes creates a very tight coupling between the user interface and the backend database system and therefore helps to minimize the chance of programmer error. Once the forms are created from the metadata there are built-in reporting tools that are available immediately to the researchers and data managers. In fact, every data element in the research tables is available for reporting from this interface. Data managers can easily create simple reports for locating missing data and data entry errors.

Other built-in features exist as well. For the project manager and data analyst, there is also an advanced reporting interface that brings all of the power of the SQL database interface language to their finger tips. And when the need arises to transfer data to and from other databases and distributed data stores, the metadata can be used to ease the complexity and repetitive nature of these tasks.

 

 

2. System Architecture

 

The Metaclinic Database System is designed to run on an Apache webserver that includes the mod_perl module and standard DBI database interfaces. The backend database system can be Oracle (preferred) or MSAccess. Our current environment includes several Oracle 9i database systems running on Sun 220 and 240 servers and apache webservers version 1.x running on linux and Windows 2000 servers. Our Windows servers are presently being discontinued.

The main application scripts and modules are written in Perl and originally ran on apache webservers as standard CGI scripts. The simplicity of that environment was very attractive but as the size and complexity of applications began to grow, it soon became obvious that mod_perl would be a better operating environment due to the memory caching features that are available there. Metadata intrinsically wants to be cached and, indeed, the current software architecture takes advantage of memory caching to greatly improve performance on distributed servers.

 

3. User Interface

 

In the Metaclinic Database System, for each research project there is often, but not always, a single application interface to understand. This interface is divided into 3 areas of interaction. There is a control panel on the far left with various controls and options. Forms and reports are displayed in the larger right panel of the browser, and a small panel at the bottom displays general information and error messages to the user.

The control panel allows the user to select and browse data via a series of forms and data keys that have been tailored to a specific research project. Usually these forms will contain data related to a particular area of research as determined by the primary investigator or a team of researchers. Other controls on the control panel generate predefined reports or create new reports. New reports can draw from any of the research data elements included on any of the application forms. If the user decides to create a new report, then s/he selects the fields of interest and any data filtering for the report and proceeds to display or download the report results.

A report can also be named and saved for later use as a ‘canned’ report. The user also may have access to some special application reports that often involve special or detailed report formatting or data links to other applications. For example, letters to research participants can be easily generated and merged with specific data from the clinical database for subjects that meet a given set of criteria.

For the more technically minded user, there is also a button that generates a complete list of data elements with a map to the form or forms where the element resides as well as the database table where the data for this element is kept.

 

4. Forms

 

In a Metaclinic application the data form becomes the basic unit of interaction and primary link between the backend database and the application interface. This is where the abstract layer of metadata begins to come into play. Web forms are created by the Metaclinic Perl scripts from a hierarchy of data sections. Each section contains other sections and/or individual data elements. Individual data elements are the atomic units of data presentation and storage. The overall appearance of the form will be determined by the data elements, their labels and html text associated with the various sections that comprise the section hierarchy of the form.

For each form, there is an entry in the meta_form table that defines and describes it. A form has a form type, form browse order, and a primary section that designates the top of the section hierarchy. Each form also contains a database table name that becomes the default location for any data elements that have a 1:1 relationship with the primary key of the table. On most forms the patient table is used for this. Optionally, a form may also contain html header text, a list of javascipt modules that must be loaded when the form is loaded onto the browser, and other code that is also required when the form loads.

 

5. Sections

 

The primary section from the meta_form entry acts as the root for the section tree which defines the contents of each form. The metadata for defining sections is divided into two metatables – one to describe each section (meta_section table) and one to describe the contents of each section (meta_section_entry table).

The most interesting feature of a section is the section type. The most basic type of section is ‘simple’ which may contain other sections or individual data elements. Usually the root section of the form will be of this type and will provide a container for other sections that contain sections and data elements organized in a logical fashion. As each section in the tree is traversed and expanded, the features of the form unfold down the browser page. A section of type ‘readonly’ is very similar to ‘simple’ except that the contents of the section and all contained sections are viewable but not changeable.

Sections of this type are also used for grouping related data elements. For instance, a section called ‘patient_name’ might contain the data elements – firstname, mi, and lastname. This section could then be included in a standard form header section that resides in the primary section of all forms so that the patient name always appears the same and only appears in the metatables as a single section entry. And so a single change to the patient_name section would then be reflected on all of the forms. If the section were ‘readonly’, then none of the individual name fields would be modifiable on these forms.

In addition to ‘simple’ and ‘readonly’ sections, there are also three section types that allow for presenting data in various repeating formats - ‘multiple’, ‘fixed’ and ‘repeat’. Sections of these types always have a separate table associated with them to capture the multiple entries that are inherent to these section types. The primary keys for these tables are composite keys comprised of the patientid and a system generated item number. The patientid links the repeating entries back to the patient table. The item number performs two roles here. It combines with the patientid to satisfy uniqueness constraints and it provides a monotonically increasing numeric field which can be used to order row items in the repeating sections on the browser page.

A section that is type ‘multiple’ presents a group of data elements on a form in tabular or grid format. As in the other repeating section types, at least one of the data elements must be specified as ‘required’ which, when submitted to the Metaclinic application server, triggers the need to store the complete row of data into the database. If the required field were previously entered but later removed or erased, then the whole row would be removed from the database.

‘Fixed’ sections are closely tied to a codeclass. A ‘code’ is one of the basic data element types that will be discussed in more detail later. For now, a code is a system generated number that has a text associated with it, mainly for the purpose of presenting a controlled vocabulary to the user. A codeclass is a group of related codes. When associated with a ‘fixed’ section, for each code description in a codeclass there is a row on a grid. Other data elements in the ‘fixed’ section complete the rest of the columns.

‘Repeat’ sections are special in that they may contain ‘simple’ sections or one of the other repeating types already mentioned. This allows repeating data to be presented in a non-tabular format or as sets of repeating tabular data. This is particularly useful when the developer is designing a form which will capture the same tabular data on multiple dates and times. It also provides a way to capture and present tabular data that might be presented in a clumsy fashion due to the width of the data elements involved. So rather than forcing a user to scroll horizontally to complete a grid, the same information can be displayed in a more browser friendly fashion down the page.

Other fields exist in the meta_section table that enable and define various presentation aspects for sections. In particular, the number of table rows and a row numbering scheme may be defined.

 

6. Data Elements

 

Sections are comprised of individual data elements in addition to other sections. Data elements have a type of ‘simple’, ‘code’, ‘date’, or ‘label’ depending on the type of data that will be stored in the database. Data elements provide the fundamental bridge necessary between the representation and capture of data on the browser and the definition for the storage of that data in the backend database. Having this information in close proximity is a great asset to the application developer.

A ‘simple’ field can contain scalar string or numeric data and typically is displayed on the html browser as an input text, textarea, or checkbox field. A ‘code’ field, as mentioned earlier, is a system generated number that has an associated description text field. Several codes are grouped together into a codeclass that is assigned to a data element of type ‘code’. This has the overall effect of creating a controlled vocabulary for a data element. ‘Code’ elements are typically presented as dropdown lists or radio buttons on the browser page. Codeclasses may be reused, that is assigned to more that one data element. A ‘date’ element is used to represent date data. A single format is used to present ‘date’ elements which gives a consistent feel to the application.

For each of these data element types there is a separate metatable, but there are some common attributes that transcend them. All have an html text field, a database field name and database data type that are required and, optionally, a label. ‘Label’ elements are merely text descriptions that are used to enhance the form appearance. No data are actually stored in the research database in connection with these elements.

 

7. Database Architecture

 

From a database user’s point of view, the database that houses the research data should be a normalized database. Our goal is a database normalized to the 3rd normal form. This provides a consistent and flexible database for general use.

The same metatables that are used to build the html forms are used to generate the supporting normalized research database. Since most of the information that researchers are seeking centers on the patient, the developer can simply specify the patient table as the default table for every form. By doing this, all elements on a form that are not included in repeating sections are entered as fields in the patient table. The patient table thus acts as a ‘primary’ table for the research database. Since fields in the primary table may originate from data elements on multiple forms, form developers must be careful when assigning database field names to the individual data elements so that naming conflicts are avoided.

Data elements that are contained in repeating sections are stored in tables that are defined in the repeating section definition entry in the meta_section table. These tables always have a composite key, part of which is the link to the primary key of the primary table and another part which is a system generated number for uniqueness. If any unique information is being collected by a repeating section, additional unique key constraints for them may be defined in the meta_table_key metatable which is then passed on to the backend database as a database constraint.

The form developer is not necessarily confined to creating forms for gathering patient-centric data. Complete new interfaces have been developed using the same programming libraries and metatables with different browsing and searching schemes and different ‘primary’ tables.

Now, for each table there is an entry in the meta_table_key table for each key field that is needed. Composite keys require more than one entry and the field order must be specified. Also, there is usually a key generation facility associated with at least part of the primary key. In Oracle this would be a sequence object, and in Access a data type of autonumber would be used. This allows the systems to automatically generate unique key information for new entries

 

8. External Data Access

.

Tables are not the only source of data that forms can access and present. Data views into external databases can be created manually from within the backend database system and referenced by forms. This has been an excellent way for research groups to connect to other Oracle databases for browsing, searching and reporting.  For obvious reasons, these manually created views are usually ‘readonly’ with access controlled by the external databases. For reporting purposes there is also a metatable designed to relate fields from the local database to fields in the external sources utilizing a common key. At Vanderbilt, we have used this feature to connect our clinical research patient databases to our central tissue, microarray and proteomics databases using the tissue id and medical record number as common keys.

As forms can reach out to other databases for data, conversely other applications can reach into the Metaclinic research databases to access data. Thus, a developer can create a separate Metaclinic application that has access to a subset of the data of a Metaclinic research database. We have used this approach to create readonly, anonymous views (HIPAA compliant) of various research databases with sensitive information completely unavailable from the more restrictive application.

 

9. User Tools

 

The metatables and metadata provide a rich source of opportunity for creating tools for the application users (typically data managers and analysts) and developers. As mentioned earlier, the major tools for the application users are the data and form browsers and the user reporting tool built into the control panel on the main application page. The key fields for the data browser control are determined by the user and implemented by the developer independently for each Metaclinic application. Multiple keys may be specified for data browsing and searching, although only one can be active at a time. Typical keys would be ‘last name’, ‘medical record number’, or ‘study id’. The form browser determines which form is actually  presented. Using the two type of browsing methods together, the user can move from patient to patient while focusing on a single form or from form to form for the same patient.

The built-in reporting tool, also known as user reporting, represents a major time saver for the end user and developer as well. In many systems, reporting support often trails weeks or months behind data entry and storage. As a result, data managers are often blind to the overall contents of the database, and early database entry errors and anomalies can easily go unnoticed. Since the reporting tool can take advantage of the metadata, reporting is available as soon as a form is built or modified. The developer’s involvement during the early stages of application review is minimized because the data manager is not dependent on the developer for support. The same is true for the data analyst who can now easily download any data of interest without developer assistance. And, when form changes are necessary in the normal course of a project, the developer does not have to delve into the reporting library to retrofit any new changes. Of course, canned reports that reference obsolete data elements must be changed, but often that is an easy job for the original report creator. Also, for reports that include the primary key (patientid), there are html links automatically generated for each data element on the report that links to the form where the reported data element resides.

Other buttons on the control panel create a quick report of all patients and a complete database field index. The latter report provides the field label, field section, database table and other organizational information that are useful to the user and developer alike.

 

10. Developer Tools

 

The best tools are reserved for the developers. In the typical Metaclinic development environment (using Oracle) there is a schema and an apache server used only for development. There is also a schema and an apache server used only for production applications. In our environment, we actually use the same Oracle database server to support both types of schemas but we carefully devote our apache servers to only development or only production systems.

Developers use the application design tool to create forms by entering the various metadata attributes into the various metatables. Within the design tool, there are different interfaces for forms, sections, the various data element types, codeclasses, system defaults, table keys and advanced reporting entries. It is best to create a new form in an iterative fashion, adding small portions to a form at a time and then viewing the results from a browser.

When satisfied with the physical layout of the form, the developer uses the admin interface to generate or modify the database tables which house the data associated with the form. After storing some test data from the new or modified form, the developer is ready to upload the changes to the corresponding production system. This is easily accomplished by using the admin tool from the production application system which is linked via the apache Metaclinic configuration setup to the development database. When the metatables are uploaded, the production forms become exact copies of the development forms. Finally, the necessary production database tables are saved, then generated or modified and reloaded with the original data. The upgraded database is ready and available to the users. A simple form can easily be created and tested in an hour or two, although a complex form may sometimes take longer to complete. The upload process and table creation may take several seconds. If a table has been modified, usually the primary or patient table, saving the old data, dropping the old table, and regenerating and reloading the new table will often take several seconds as well.

An option available from the admin tool creates a script of SQL statements that will create all database production tables for an application. There is another option for moving individual forms and their dependent metadata from one Metaclinic application to another.

 

11. Import/Export Tools

 

The most recent addition to the Metaclinic toolbox is a separate interface for importing and for exporting data. Although both tools reside on the same interface, they actually work independently of each other.

The export facility is the simpler of the two. It merely creates and authorizes access to data views that reside on the underlying database system. An export is defined in two stages. First, the export table definition or definitions must be created. Then the export object is created and associated with one or more of the export table definitions.

An export table definition has a name and is associated with a particular Metaclinic database schema. It also has a list of data elements that are available from the data view for export. Optionally the date format for all ‘date’ entries may be specified, but usually it will be the Metaclinic date default from the defaults metatable.

Once the requisite export table definitions are created, the developer can group one or more into an export object. An export object has a name that can be referenced by an import object from another Metaclinic application. Associated with the export object, there is a list of database users and the access that each user has to the export. When the export object definition is finalized, the underlying data views and access lists are created for each of the export table definitions that are included in the export. If access is needed for a database user from an external database server, a local proxy user can be created and granted access to the export views. Access to the Metaclinic database schema should not be granted directly to export object users.

The import process is inherently more complicated. Data can arrive from different sources and different source types. In addition, the quality and consistency of the data may vary greatly from source to source and so possible errors must be handled. If data is imported into ‘code’ data elements then it may be necessary to create codeclass mappings for the incoming data. For versatility, date fields in various formats must be supported. Also, an import may be generating new rows in a table or it may be updating existing rows or both! In any event, like the export, an import is defined in the same two basic steps i.e. define the import source and define the import object in association with the import source.

An import source has a name and a source type. Currently the Metaclinic system supports three types of data sources - Metaclinic exports (described above), standard database views, and delimited flat files. Once the name and source type are defined, specific information for locating and accessing the source data is available to the definition process. Metaclinic exports and standard database views are very similar because both ultimately depend on access to data from database views. Typical information needed is the name of Metaclinic export object or view, username (schema) of the owner of the export object or view, and the DBI source information and proxy username if the database is located on another server. For delimited file sources, the filename with its directory and the delimiter character for the file are required. A checkbox denotes whether or not field names are included as the first line of the delimited file.

Once the data source is entered, the user can define the import object. The import object has a name and references a previously defined import source. Multiple import objects can reference the same import source. One or more local Metaclinic tables are then selected as the final repositories for the incoming data. If new entries in the primary table will be created as part of the import process, then the primary table should be the first table selected in the list of import tables. Now, for each of the possible import tables a mapping must be established between the import tables and the export sources. The connection information from the import source provides access to locate the exported data elements or fields. Next, each data element or field that is desired for this import is selected and a mapping established. As part of the field mapping process, table key maps must also be designated. Additional mappings must be performed for incoming data that are mapped to Metaclinic ‘code’ data elements.

At the same time, for each import table the developer can also set the date format, determine the update authority for the import (export or import), and designate the type of import (new only, update only, or both).

After all the preliminary work of defining the import object, the final import process only requires the push of a button for completion. But before that happens,  

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

the importer can view the external data as it exists on the export source or s/he can perform a preliminary import to determine if the import will go as planned. During the preliminary import, the export source is examined row by row and candidates for insertion and update are designated, and import errors are highlighted. There is also an import summary produced that captures these results.

 

12. Summary

 

Clearly, web accessible application systems that are based on metadata offer major advantages for users, developers and institutions. Users get a consistent and familiar interface as they move between forms, reports and applications. They also get timely access to their data. Since the applications are available from the web, updates and fixes can be delivered with a minimum of service interruption. Developers get the tools they need for rapid prototyping and easy maintenance without getting mired in the repetitive details that can cause database inconsistencies and mistakes. Institutions benefit because a single developer can support multiple database applications instead of just one or two. The potential for easily sharing data between disparate groups within a research facility and even between institutions is now within their grasps.