Tutorial Proposal for the Metaclinic Database System

An Introduction to the Metaclinic Database System

Author: Tom Caldwell

686 Preston Research Building

Vanderbilt University Cancer Center

Nashville, TN 37232

Phone: 615.936.2740

Fax: 615.936.1790

Keywords: metadata, oracle, medical research, web application, biomedical informatics

Introduction

In today’s modern world, medical research is being conducted on a much larger scale than has ever before been experienced. As research has expanded the need for database systems to house research data has grown and expanded as well. New systems must provide easy access for data entry and analysis and provide tight control over that access. The ubiquitous nature of the world wide web makes it a natural choice for developers of these systems, but the asynchronous transactions that are used by web browsers are more difficult to design and blend into a cohesive application package. In addition, new HIPAA regulations that are now in effect are presenting new security challenges to developers. The need for restricted views of the database is now becoming a standard feature in the modern research environment. Furthermore, new database systems must be able to reach out and connect with other databases. Gone are the days that a single monolithic database can effectively house and control all of the data necessary in comprehensive medical research. And how will distributed databases communicate with each other? No one really knows yet.

The Metaclinic Database System is designed to create patient-centric, normalized databases from metadata that resides in separate database tables (metatables) that are transparent to the average database user. There are many advantages to systems that are based on metadata. From the metadata, web-based forms can be projected onto a standard internet browser for data entry, updates, and browsing. The extra abstraction layer provided by the metadata also facilitates the funneling of all database inserts and updates through a common library of subroutines which greatly increases database consistency and integrity. The same metadata that is used to create the data entry forms can also be used to quickly generate and modify the tables which house the research data associated with those forms. This creates a very tight coupling between the user interface and the backend database system and therefore helps to minimize the chance of programmer error. There is also a built-in reporting interface that uses the metadata. In fact, every data element in the research tables is available for reporting from within this interface. Data managers can easily create simple reports for locating missing data and data entry errors. For the project manager and data analyst, there is also an advanced reporting interface that brings all of the power of the SQL database interface language to their finger tips. And finally, when the need arises to assimilate data from distributed data sources the metadata can be used to ease the complexity and repetitive nature of this task as well.

System Architecture

The Metaclinic Database System is designed to run on an Apache webserver that includes the mod_perl module and standard DBI database interfaces. The backend database system can be Oracle (preferred) or MSAccess. Our current environment includes several Oracle 9i databases running on Sun 220 and 240 servers and apache 1.xx running on linux and Windows 2000 servers. Our Windows servers are being phased out.

Originally, the metaclinic applications ran as standard Perl CGI code. The simplicity of that environment was very attractive but it was obvious that mod_perl would be a better environment due to the caching features that are available. And, indeed, the new architecture takes advantage of an internal memory cache of the metadata to greatly improve performance.

The User Interface

From the end user’s point of view, for each research project there is often just one main interface to understand. This interface is divided into 3 areas of interaction. There is a control panel on the far left with forms displayed in the larger right panel of the browser, and a small panel at the bottom that displays general information and error messages to the user. The control panel allows the user to select and browse data via a series of forms and data keys that have been tailored to a specific research project. Usually these forms will contain data related to a particular area of research as determined by a primary investigator or a team of researchers. Also included on this panel are controls that generate ‘canned’ reports or create new reports from any of the research elements included on any of the forms. If the user decides to create a new report, then s/he selects the fields of interest and any data filtering for the report and proceeds to display the report and/or download the results to a tab delimited file. The report can also be named and saved for later use as a ‘canned’ report. The user also may have access to some special application reports that often involve special or detailed report formatting or data links to other applications. For example, form letters to research participants can be easily generated for subjects that meet a certain criteria.

Form Architecture

The form becomes the basic unit of interaction between the backend database and the users. Now the abstract layer of metadata begins to come into play. Forms are composed of a hierarchy of sections. Each section contains other sections and/or individual data elements.

For each form, there is an entry in the meta_form table that defines and describes it. A form has a form type, form browse order, and a primary section that designates the top of the section hierarchy. Each form also contains a database table name that becomes the default location for any data elements that have a 1:1 relationship with the primary key of the table. This is usually the patient table for most forms. Optionally a form may also contain html header text, a list of javascipt modules that must be loaded when the form is loaded onto the browser, and other code that is also required when the form loads.

The primary section from the meta_form entry acts as the root for the section tree which defines the contents in each form. The metadata for defining sections is divided into 2 metatables – one to describe each section (meta_section table) and one to describe the contents of each section (meta_section_entry table). The most interesting feature of a section is the section type. The simplest type of section is type ‘simple’ which may contain other sections or data elements. Usually the primary section will be of this type and will contain other sections. As each section is traversed and expanded, the features of the form will unfold down the browser page. Sections of this type are also used for grouping related data elements. For instance, a section called ‘patient_name’ might contain the data elements – firstname, mi, and lastname. This section could then be included in a standard form header section that resides on all forms so that the patient name always appears the same to the user and only appears in the metatables as a single section entry. A section of type ‘readonly’ is very similar to ‘simple’ except that the contents of the section and all subsections are presented as readonly data.

In addition to ‘simple’ and ‘readonly’ sections, there are also 3 section types that present data in various repeating formats - ‘multiple’, ‘fixed’ and ‘repeat’. A section that is type ‘multiple’ presents a group of data elements in tabular or grid format. As in the other repeating section types, at least one of the data elements must be specified as ‘required’ which, when submitted to the database, triggers the need to store a row of data into the database. If the required field was previously entered but later removed or erased, then the whole row would be removed from the database. ‘Fixed’ sections are closely tied to a code class, which is just a group of codes. A ‘code’ is one of the basic data element types that will be discussed in more detail later. For now, a code is a system generated number that has a text associated with it, mainly for the purpose of presenting a controlled vocabulary to the user. When associated with a ‘fixed’ section, for each code description in a code class there is a row on a grid. Other data elements in the section complete the rest of the columns. ‘Repeat’ sections are special in that they may contain ‘simple’ sections or one of the other repeating types already mentioned. This allows repeating data to be presented in a non-tabular format or as sets of repeating tabular data. This is particularly useful when the developer is designing a form which will capture the same tabular data on multiple dates and times. It also provides a way to capture and present tabular data that might be unwieldy due to the width of the data elements involved. So rather than forcing a user to scroll horizontally, the information can be displayed in a more browser friendly fashion down the page.

Other fields exist in the meta_section table that enable and define various presentation aspects for sections, particularly the number of table rows and a row numbering scheme may be defined.

Besides other sections, sections are comprised of individual data elements. Data elements have a type of ‘simple’, ‘code’, ‘date’, or ‘label’ depending on the type of data that will be stored in the database. A ‘simple’ field can contain string or numeric data and typically is displayed on the html browser as an input text field, textarea, or checkbox. A ‘code’ field, as mentioned earlier, is a system generated number that has a designated description text field. Several codes are grouped together into a code class that is assigned to a data element of type ‘code’. This has the overall effect of creating a controlled vocabulary for a data element. ‘Code’ elements are typically presented as dropdown lists or radio button on the browser page. Code classes may be reused, that is assigned to more that one data element. A ‘date’ element is used to represent date data. A single format is used to present date elements which gives a consistent feel to the application. For each of these element types there is a separate metatable, but there are some common attributes that transcend them. All have an html text field, a database field name and database data type that are required and optionally a label. ‘Label’ elements are merely text descriptions that are used to enhance the form appearance. No data are actually stored in the research database in connection with these elements.

Database Architecture

From a database user’s point of view, the database that houses the research data should be a normalized database. This provides the most consistent and flexible database possible. The same metatables that are used to build the html forms are used to generate the supporting normalized research database. Since most of the information that researchers want is focused on the patient, the developer can simple specify the patient table as the default table for every form. Thus all elements on a form that are not included on repeating sections are entered as fields in the patient table which acts as a ‘primary’ table for the database. Since fields in the primary table may originate from data elements on multiple forms, form developers must be careful when assigning database field names for the individual data elements. Tables that are defined with repeating sections always have a composite key, part of which is tied to the primary key of the primary table. When a researcher wants to gather data that is not patient centric, whole new interfaces can be developed using the same metatables and programming libraries.

Now, for each table there will also be an entry in the meta_table_key table for each key field that is needed. Composite keys require more than one entry and the field order must be specified. Also, there is usually a key generation facility associated with at least part of the primary key. In Oracle this would be a sequence object, and in Access a data type of autonumber may be used. This allows the systems to automatically generate unique key information as necessary.

But tables are not the only source of data that forms can access and present. Data views into external databases can be created manually from within the backend database system and referenced by forms. This has been an excellent way for our research units to connect to other Oracle databases. For obvious reasons, these manual views are usually ‘readonly’ with access controlled by the external databases. For reporting purposes there is also a metatable designed to relate fields from the local database to fields in the external sources utilizing a common unique key. We have used this feature to connect our clinical research patient databases to our central tissue, MircoArray and Proteomics databases using the tissue id and medical record number as common keys.

And as forms can reach out to other databases for data, conversely other applications can reach into a metaclinic research database to access data. So, a developer can create a separate application that has access to a subset of the data of a metaclinic research database. We have used this approach to create readonly, anonymous views (HIPAA compliant) of various research databases with sensitive information completely unavailable.

Tools, Tools, Tools

The metatables and metadata provide a mother lode of opportunity for creating tools for the application user (typically data manager and analysts) and developer. As mentioned earlier, the major tools for the application users are the data and form browsers and the user reporting tool built into the control panel on the main application page. The key fields for the data browser are determined by the user/developer independently for each application and multiple keys may be specified for data browsing, although only one can be selected at a time. Typical keys would be ‘last name’, ‘medical record number’, or ‘study id’. The form browser determines which form is actually be presented. Used together, the user can move from patient to patient while focusing on a single form.

The built-in reporting tool, also known as user reporting, represents a major time saver for the end user and developer as well. In many systems, reporting often trails weeks or months behind data storage. As a result, data managers are left blind to the overall contents of the database. Since the reporting tool can take advantage of the metadata, reporting is available as soon as a form is built or modified. The developer’s involvement during the early stages of application review is minimized because the data manager is not constantly tugging at the developer’s sleeve for support. The same is true for the data analyst who can now easily download any data of interest without developer assistance. And, when form changes are necessary, the developer does not have to delve into the bowels of a reporting library to retrofit those changes. Of course, canned reports that reference obsolete data elements must be fixed, but often that is an easy job for the original report creator. Also, for reports that include the primary key (patientid), there are html links automatically generated for each data element on the report that links back to the form where the data element resides. There are also buttons on the control panel that create a quick report of all patients and a complete database field index with field label, section, database table and other information. These are useful for the user and developer alike.

But the best tools are reserved for the developers. In the typical metaclinic development environment (using Oracle) there is a schema and an apache server used only for development. There is also a schema and an apache server used only for production applications. In our environment, we actually use the same Oracle database server to support both types of schemas but we carefully devote our apache servers to only development or only production systems. Developers use the application design tool to create forms by entering metadata into the various metatables. It is best to create a new form in an iterative fashion, adding small portions to a form at a time and then viewing the results from a browser. When satisfied with the physical layout of the form, the developer uses the admin interface to generate or modify the database tables which house the data associated with the form. After storing some test data from the new or modified form, the developer is ready to upload the changes to the corresponding production system. This is easily accomplished by using the admin tool from the production application system which is linked via the apache configuration setup to the development database. When the metatables are uploaded, the production form becomes a carbon copy of the development form. Then the necessary production database tables are generated or modified and reloaded as before and the users are back in business. A simple form can easily be created and tested in an hour or two. The upload process and table creation may take several seconds. If a table has been modified, usually the primary or patient table, saving the old data, dropping the old table, and regenerating and reloading the new table will often take several seconds as well.

An option available from the admin tool creates a set of SQL statements that will create all database production tables. Another option creates by form all SQL statements for inserting metadata and thereby duplicates a form with all it’s components into another metaclinic system. And, the most recent addition to the metaclinic toolbox is a separate new application interface for importing and exporting data from the production database. The export interface creates and authorizes access to a data view. The import process can upload new data or merge with existing data from other metaclinic sources or delimited files.

Summary

Clearly, application systems that are based on metadata offer a big win for users and developers. Users get a consistent and familiar interface as they move between applications and they get access to their data. Developers get the tools they need to do their jobs quickly and easily without getting mired in the repetitive details that can create inconsistencies and mistakes. Institutions benefit because a single developer can support many different database applications instead of just one. And the potential for easily sharing data between disparate groups within a research facility and between Institutions is within their grasps.