The Metaclinic Database System:
A Metadata Approach to Building Research
Database Systems
Thomas Caldwell
Phone: 615.936.2740
Fax: 615.936.1790
tom.caldwell@vanderbilt.edu
Keywords: metadata, oracle, medical research database, biomedical
informatics
Abstract
The Metaclinic Database System is a
metadata based system for creating normalized research databases. The apache
application server and Oracle database engine support the applications at the
lowest level. Access to data is available from application and developer
interfaces utilizing standard browsers. Databases are created from a set of data
forms built from sections and data elements. The main interface provides data
entry, browsing, searching and reporting capabilities for the medical research
community. Developers can use rapid prototyping methodology and advanced
administration tools to reduce errors and increase productivity. Developers can
create entire Metaclinic database applications without writing a single SQL
statement. Metadata cached in the application server boosts application
performance.
1. Introduction
In today’s modern world, medical research is being
conducted on a much larger scale than has ever before been experienced. As
research has expanded, the need for database systems to house research data has
grown and expanded as well. New systems must provide easy access for data entry
and analysis and provide tight control over that access. The ubiquitous nature
of the world wide web makes it a natural choice for developers of these
systems, but the asynchronous transactions that are used by web browsers are
more difficult to design and blend into a cohesive application package. In
addition, new HIPAA regulations that are now in effect are presenting new security
challenges to developers. The need for restricted views of the database is now
becoming a standard feature in the modern research environment. Furthermore,
new database systems must be able to reach out and connect with other
databases. Gone are the days that a single monolithic database can effectively
house and control all of the data necessary in comprehensive medical research. And
how will distributed databases communicate with each other? No one really knows
yet.
The Metaclinic Database System is designed to create
patient-centric, normalized databases from metadata that resides in separate
database tables (metatables) that are transparent to the average database user.
There are many advantages to systems that are based on
metadata. From the metadata, web-based forms can be projected onto a standard
internet browser for data entry, updates, and browsing. The extra abstraction
layer provided by the metadata also facilitates the funneling of all database
inserts and updates through a common library of subroutines which greatly
increases database consistency and integrity. It also means that a Metaclinic
application developer never has to write a single line of SQL code. The same
metadata that is used to create the data entry forms can also be used to quickly
generate and modify the tables which house the research data associated with
the forms. The juxtaposition of html characteristics and database attributes creates
a very tight coupling between the user interface and the backend database
system and therefore helps to minimize the chance of programmer error. Once the
forms are created from the metadata there are built-in reporting tools that are
available immediately to the researchers and data managers. In fact, every data
element in the research tables is available for reporting from this interface. Data
managers can easily create simple reports for locating missing data and data
entry errors.
Other built-in features exist as well. For the project
manager and data analyst, there is also an advanced reporting interface that
brings all of the power of the SQL database interface language to their finger
tips. And when the need arises to transfer data to and from other databases and
distributed data stores, the metadata can be used to ease the complexity and
repetitive nature of these tasks.
2. System Architecture
The Metaclinic Database System is designed to run on
an Apache webserver that includes the mod_perl module and standard DBI database
interfaces. The backend database system can be Oracle (preferred) or MSAccess.
Our current environment includes several Oracle 9i database systems running on
Sun 220 and 240 servers and apache webservers version 1.x running on linux and
Windows 2000 servers. Our Windows servers are presently being discontinued.
The main application scripts and modules are written
in Perl and originally ran on apache webservers as standard CGI scripts. The
simplicity of that environment was very attractive but as the size and
complexity of applications began to grow, it soon became obvious that mod_perl
would be a better operating environment due to the memory caching features that
are available there. Metadata intrinsically wants to be cached and, indeed, the
current software architecture takes advantage of memory caching to greatly
improve performance on distributed servers.
3. User Interface
In the Metaclinic Database System, for each research project
there is often, but not always, a single application interface to understand.
This interface is divided into 3 areas of interaction. There is a control panel
on the far left with various controls and options. Forms and reports are displayed
in the larger right panel of the browser, and a small panel at the bottom
displays general information and error messages to the user.
The control panel allows the user to select and browse
data via a series of forms and data keys that have been tailored to a specific
research project. Usually these forms will contain data related to a particular
area of research as determined by the primary investigator or a team of
researchers. Other controls on the control panel generate predefined reports or
create new reports. New reports can draw from any of the research data elements
included on any of the application forms. If the user decides to create a new
report, then s/he selects the fields of interest and any data filtering for the
report and proceeds to display or download the report results.
A report can also be named and saved for later use as
a ‘canned’ report. The user also may have access to some special application
reports that often involve special or detailed report formatting or data links
to other applications. For example, letters to research participants can be
easily generated and merged with specific data from the clinical database for
subjects that meet a given set of criteria.
For the more technically minded user, there is also a
button that generates a complete list of data elements with a map to the form
or forms where the element resides as well as the database table where the data
for this element is kept.
4. Forms
In a Metaclinic application the data form becomes the basic unit of interaction and
primary link between the backend database and the application interface. This
is where the abstract layer of metadata begins to come into play. Web forms are
created by the Metaclinic Perl scripts from a hierarchy of data sections. Each
section contains other sections and/or individual data elements. Individual
data elements are the atomic units of data presentation and storage. The
overall appearance of the form will be determined by the data elements, their
labels and html text associated with the various sections that comprise the
section hierarchy of the form.
For each form, there is an entry in the meta_form
table that defines and describes it. A form has a form type, form browse order,
and a primary section that designates the top of the section hierarchy. Each
form also contains a database table name that becomes the default location for
any data elements that have a 1:1 relationship with the primary key of the
table. On most forms the patient table is used for this. Optionally, a form may
also contain html header text, a list of javascipt modules that must be loaded
when the form is loaded onto the browser, and other code that is also required
when the form loads.
5. Sections
The primary section from the meta_form entry acts as
the root for the section tree which defines the contents of each form. The metadata
for defining sections is divided into two metatables – one to describe each
section (meta_section table) and one to describe the contents of each section
(meta_section_entry table).
The most interesting feature of a section is the
section type. The most basic type of section is ‘simple’ which may contain
other sections or individual data elements. Usually the root section of the
form will be of this type and will provide a container for other sections that
contain sections and data elements organized in a logical fashion. As each
section in the tree is traversed and expanded, the features of the form unfold
down the browser page. A section of type ‘readonly’ is very similar to ‘simple’
except that the contents of the section and all contained sections are viewable
but not changeable.
Sections of this type are also used for grouping
related data elements. For instance, a section called ‘patient_name’ might
contain the data elements – firstname, mi, and lastname. This section could
then be included in a standard form header section that resides in the primary
section of all forms so that the patient name always appears the same and only
appears in the metatables as a single section entry. And so a single change to
the patient_name section would then be reflected on all of the forms. If the
section were ‘readonly’, then none of the individual name fields would be
modifiable on these forms.
In addition to ‘simple’ and ‘readonly’ sections, there
are also three section types that allow for presenting data in various
repeating formats - ‘multiple’, ‘fixed’ and ‘repeat’. Sections of these types always
have a separate table associated with them to capture the multiple entries that
are inherent to these section types. The primary keys for these tables are
composite keys comprised of the patientid and a system generated item number.
The patientid links the repeating entries back to the patient table. The item
number performs two roles here. It combines with the patientid to satisfy
uniqueness constraints and it provides a monotonically increasing numeric field
which can be used to order row items in the repeating sections on the browser
page.
A section that is type ‘multiple’ presents a group of data
elements on a form in tabular or grid format. As in the other repeating section
types, at least one of the data elements must be specified as ‘required’ which,
when submitted to the Metaclinic application server, triggers the need to store
the complete row of data into the database. If the required field were previously
entered but later removed or erased, then the whole row would be removed from
the database.
‘Fixed’ sections are closely tied to a codeclass. A
‘code’ is one of the basic data element types that will be discussed in more
detail later. For now, a code is a system generated number that has a text
associated with it, mainly for the purpose of presenting a controlled
vocabulary to the user. A codeclass is a group of related codes. When
associated with a ‘fixed’ section, for each code description in a codeclass
there is a row on a grid. Other data elements in the ‘fixed’ section complete
the rest of the columns.
‘Repeat’ sections are special in that they may contain
‘simple’ sections or one of the other repeating types already mentioned. This
allows repeating data to be presented in a non-tabular format or as sets of
repeating tabular data. This is particularly useful when the developer is
designing a form which will capture the same tabular data on multiple dates and
times. It also provides a way to capture and present tabular data that might be
presented in a clumsy fashion due to the width of the data elements involved.
So rather than forcing a user to scroll horizontally to complete a grid, the same
information can be displayed in a more browser friendly fashion down the page.
Other fields exist in the meta_section table that enable
and define various presentation aspects for sections. In particular, the number
of table rows and a row numbering scheme may be defined.
6. Data Elements
Sections are comprised of individual data elements in
addition to other sections. Data elements have a type of ‘simple’, ‘code’,
‘date’, or ‘label’ depending on the type of data that will be stored in the
database. Data elements provide the fundamental bridge necessary between the
representation and capture of data on the browser and the definition for the
storage of that data in the backend database. Having this information in close
proximity is a great asset to the application developer.
A ‘simple’ field can contain scalar string or numeric
data and typically is displayed on the html browser as an input text, textarea,
or checkbox field. A ‘code’ field, as mentioned earlier, is a system generated
number that has an associated description text field. Several codes are grouped
together into a codeclass that is assigned to a data element of type ‘code’.
This has the overall effect of creating a controlled vocabulary for a data
element. ‘Code’ elements are typically presented as dropdown lists or radio
buttons on the browser page. Codeclasses may be reused, that is assigned to
more that one data element. A ‘date’ element is used to represent date data. A
single format is used to present ‘date’ elements which gives a consistent feel
to the application.
For each of these data element types there is a
separate metatable, but there are some common attributes that transcend them.
All have an html text field, a database field name and database data type that
are required and, optionally, a label. ‘Label’ elements are merely text
descriptions that are used to enhance the form appearance. No data are actually
stored in the research database in connection with these elements.
7. Database Architecture
From a database user’s point of view, the database
that houses the research data should be a normalized database. Our goal is a
database normalized to the 3rd normal form. This provides a
consistent and flexible database for general use.
The same metatables that are used to build the html
forms are used to generate the supporting normalized research database. Since
most of the information that researchers are seeking centers on the patient,
the developer can simply specify the patient table as the default table for
every form. By doing this, all elements on a form that are not included in
repeating sections are entered as fields in the patient table. The patient
table thus acts as a ‘primary’ table for the research database. Since fields in
the primary table may originate from data elements on multiple forms, form
developers must be careful when assigning database field names to the
individual data elements so that naming conflicts are avoided.
Data elements that are contained in repeating sections
are stored in tables that are defined in the repeating section definition entry
in the meta_section table. These tables always have a composite key, part of
which is the link to the primary key of the primary table and another part
which is a system generated number for uniqueness. If any unique information is
being collected by a repeating section, additional unique key constraints for
them may be defined in the meta_table_key metatable which is then passed on to
the backend database as a database constraint.
The form developer is not necessarily confined to
creating forms for gathering patient-centric data. Complete new interfaces have
been developed using the same programming libraries and metatables with
different browsing and searching schemes and different ‘primary’ tables.
Now, for each table there is an entry in the
meta_table_key table for each key field that is needed. Composite keys require more
than one entry and the field order must be specified. Also, there is usually a
key generation facility associated with at least part of the primary key. In Oracle
this would be a sequence object, and in Access a data type of autonumber would
be used. This allows the systems to automatically generate unique key
information for new entries
8. External Data Access
.
Tables are not the only source of data that forms can access
and present. Data views into external databases can be created manually from
within the backend database system and referenced by forms. This has been an
excellent way for research groups to connect to other Oracle databases for
browsing, searching and reporting. For
obvious reasons, these manually created views are usually ‘readonly’ with
access controlled by the external databases. For reporting purposes there is
also a metatable designed to relate fields from the local database to fields in
the external sources utilizing a common key. At Vanderbilt, we have used this
feature to connect our clinical research patient databases to our central
tissue, microarray and proteomics databases using the tissue id and medical
record number as common keys.
As forms can reach out to other databases for data,
conversely other applications can reach into the Metaclinic research databases
to access data. Thus, a developer can create a separate Metaclinic application
that has access to a subset of the data of a Metaclinic research database. We
have used this approach to create readonly, anonymous views (HIPAA compliant)
of various research databases with sensitive information completely unavailable
from the more restrictive application.
9. User Tools
The metatables and metadata provide a rich source of
opportunity for creating tools for the application users (typically data
managers and analysts) and developers. As mentioned earlier, the major tools
for the application users are the data and form browsers and the user reporting
tool built into the control panel on the main application page. The key fields
for the data browser control are determined by the user and implemented by the developer
independently for each Metaclinic application. Multiple keys may be specified
for data browsing and searching, although only one can be active at a time.
Typical keys would be ‘last name’, ‘medical record number’, or ‘study id’. The
form browser determines which form is actually presented. Using the two type of browsing
methods together, the user can move from patient to patient while focusing on a
single form or from form to form for the same patient.
The built-in reporting tool, also known as user
reporting, represents a major time saver for the end user and developer as
well. In many systems, reporting support often trails weeks or months behind
data entry and storage. As a result, data managers are often blind to the
overall contents of the database, and early database entry errors and anomalies
can easily go unnoticed. Since the reporting tool can take advantage of the
metadata, reporting is available as soon as a form is built or modified. The
developer’s involvement during the early stages of application review is
minimized because the data manager is not dependent on the developer for
support. The same is true for the data analyst who can now easily download any
data of interest without developer assistance. And, when form changes are
necessary in the normal course of a project, the developer does not have to
delve into the reporting library to retrofit any new changes. Of course, canned
reports that reference obsolete data elements must be changed, but often that
is an easy job for the original report creator. Also, for reports that include
the primary key (patientid), there are html links automatically generated for
each data element on the report that links to the form where the reported data
element resides.
Other buttons on the control panel create a quick
report of all patients and a complete database field index. The latter report
provides the field label, field section, database table and other organizational
information that are useful to the user and developer alike.
10. Developer Tools
The best tools are reserved for the developers. In the
typical Metaclinic development environment (using Oracle) there is a schema and
an apache server used only for development. There is also a schema and an
apache server used only for production applications. In our environment, we
actually use the same Oracle database server to support both types of schemas
but we carefully devote our apache servers to only development or only
production systems.
Developers use the application design tool to create
forms by entering the various metadata attributes into the various metatables. Within
the design tool, there are different interfaces for forms, sections, the various
data element types, codeclasses, system defaults, table keys and advanced
reporting entries. It is best to create a new form in an iterative fashion,
adding small portions to a form at a time and then viewing the results from a
browser.
When satisfied with the physical layout of the form,
the developer uses the admin interface to generate or modify the database
tables which house the data associated with the form. After storing some test
data from the new or modified form, the developer is ready to upload the
changes to the corresponding production system. This is easily accomplished by
using the admin tool from the production application system which is linked via
the apache Metaclinic configuration setup to the development database. When the
metatables are uploaded, the production forms become exact copies of the
development forms. Finally, the necessary production database tables are saved,
then generated or modified and reloaded with the original data. The upgraded
database is ready and available to the users. A simple form can easily be
created and tested in an hour or two, although a complex form may sometimes
take longer to complete. The upload process and table creation may take several
seconds. If a table has been modified, usually the primary or patient table,
saving the old data, dropping the old table, and regenerating and reloading the
new table will often take several seconds as well.
An option available from the admin tool creates a script
of SQL statements that will create all database production tables for an
application. There is another option for moving individual forms and their
dependent metadata from one Metaclinic application to another.
11. Import/Export Tools
The most recent addition to the Metaclinic toolbox is
a separate interface for importing and for exporting data. Although both tools
reside on the same interface, they actually work independently of each other.
The export facility is the simpler of the two. It merely
creates and authorizes access to data views that reside on the underlying
database system. An export is defined in two stages. First, the export table definition
or definitions must be created. Then the export object is created and
associated with one or more of the export table definitions.
An export table definition has a name and is
associated with a particular Metaclinic database schema. It also has a list of
data elements that are available from the data view for export. Optionally the
date format for all ‘date’ entries may be specified, but usually it will be the
Metaclinic date default from the defaults metatable.
Once the requisite export table definitions are
created, the developer can group one or more into an export object. An export object
has a name that can be referenced by an import object from another Metaclinic
application. Associated with the export object, there is a list of database
users and the access that each user has to the export. When the export object definition
is finalized, the underlying data views and access lists are created for each
of the export table definitions that are included in the export. If access is
needed for a database user from an external database server, a local proxy user
can be created and granted access to the export views. Access to the Metaclinic
database schema should not be granted directly to export object users.
The import process is inherently more complicated.
Data can arrive from different sources and different source types. In addition,
the quality and consistency of the data may vary greatly from source to source
and so possible errors must be handled. If data is imported into ‘code’ data
elements then it may be necessary to create codeclass mappings for the incoming
data. For versatility, date fields in various formats must be supported. Also, an
import may be generating new rows in a table or it may be updating existing
rows or both! In any event, like the export, an import is defined in the same two
basic steps i.e. define the import source and define the import object in
association with the import source.
An import source has a name and a source type. Currently
the Metaclinic system supports three types of data sources - Metaclinic exports
(described above), standard database views, and delimited flat files. Once the
name and source type are defined, specific information for locating and
accessing the source data is available to the definition process. Metaclinic exports
and standard database views are very similar because both ultimately depend on
access to data from database views. Typical information needed is the name of
Metaclinic export object or view, username (schema) of the owner of the export
object or view, and the DBI source information and proxy username if the
database is located on another server. For delimited file sources, the filename
with its directory and the delimiter character for the file are required. A
checkbox denotes whether or not field names are included as the first line of
the delimited file.
Once the data source is entered, the user can define
the import object. The import object has a name and references a previously
defined import source. Multiple import objects can reference the same import
source. One or more local Metaclinic tables are then selected as the final
repositories for the incoming data. If new entries in the primary table will be
created as part of the import process, then the primary table should be the
first table selected in the list of import tables. Now, for each of the
possible import tables a mapping must be established between the import tables
and the export sources. The connection information from the import source
provides access to locate the exported data elements or fields. Next, each data
element or field that is desired for this import is selected and a mapping
established. As part of the field mapping process, table key maps must also be
designated. Additional mappings must be performed for incoming data that are
mapped to Metaclinic ‘code’ data elements.
At the same time, for each import table the developer
can also set the date format, determine the update authority for the import
(export or import), and designate the type of import (new only, update only, or
both).
After all the preliminary work of defining the import
object, the final import process only requires the push of a button for
completion. But before that happens,
the
importer can view the external data as it exists on the export source or s/he
can perform a preliminary import to determine if the import will go as planned.
During the preliminary import, the export source is examined row by row and candidates
for insertion and update are designated, and import errors are highlighted.
There is also an import summary produced that captures these results.
12. Summary
Clearly, web accessible application systems that are
based on metadata offer major advantages for users, developers and institutions.
Users get a consistent and familiar interface as they move between forms,
reports and applications. They also get timely access to their data. Since the
applications are available from the web, updates and fixes can be delivered
with a minimum of service interruption. Developers get the tools they need for
rapid prototyping and easy maintenance without getting mired in the repetitive
details that can cause database inconsistencies and mistakes. Institutions
benefit because a single developer can support multiple database applications
instead of just one or two. The potential for easily sharing data between disparate
groups within a research facility and even between institutions is now within
their grasps.