How to implement OMOP CDM and not die trying?
Real-world data provide an ideal basis for clinical research studies, and large volumes of quality data are also the foundation of any use of AI within an organization. However, selecting and preparing these data entails significant effort for both researchers and IT personnel: it is estimated that 80% of researchers' time and 50% of IT personnel's time is spent on data selection and preparation. Using standards can minimize this effort and give researchers tools for querying data and creating cohorts without the intervention of IT personnel.
One of the most commonly used standardized models for real-world data research today is the Observational Medical Outcomes Partnership Common Data Model (OMOP CDM). In the European EHDEN project alone, there are currently 187 databases normalized to OMOP CDM, 27 of them in Spain and 4 in the Valencian Community.
The OMOP CDM standard provides a relational database model and a standard vocabulary model that together allow data to be harmonized for easier reuse. The main effort lies in transforming the data already recorded in existing EHR systems into the OMOP CDM standard. The necessary steps to obtain a normalized database are as follows:
In the Valencian Community, we have collaborated with several organizations, applying this methodology for the construction and validation of OMOP CDM databases.
| OMOP CDM Database | Organization | Number of patients |
|---|---|---|
| HULAFE | Hospital Universitario La Fe | 2 274 159 |
| Marina Salud Denia | Hospital de Denia Marina Salud | 314 587 |
| ABUCASIS | INCLIVA | 4 014 819 |
| VID-CONSIGN | FISABIO | 1 964 588 |
OMOP CDM Normalized Databases in the Valencian Community
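To make the transformation effort more concrete, the following is a minimal sketch of one small piece of such a process: loading source diagnoses into a simplified version of the OMOP CDM `condition_occurrence` table while mapping source codes to standard concepts. It is not the methodology used for the databases above. The `SOURCE_TO_STANDARD` dictionary, the helper names, and the reduced column set are assumptions made for illustration; in a real ETL the mapping would be read from the OMOP vocabulary tables, and the concept ids shown should be verified against them.

```python
import sqlite3

# Hypothetical source-to-standard mapping (illustrative values only).
# A real ETL would derive this from the OMOP vocabulary tables.
SOURCE_TO_STANDARD = {
    "E11": 201826,  # assumed id for type 2 diabetes mellitus
    "I10": 320128,  # assumed id for essential hypertension
}


def load_conditions(conn, source_rows):
    """Insert (person_id, source_code, start_date) rows into a simplified
    condition_occurrence table, mapping each source code to a concept id."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS condition_occurrence (
               condition_occurrence_id INTEGER PRIMARY KEY,
               person_id               INTEGER NOT NULL,
               condition_concept_id    INTEGER NOT NULL,
               condition_start_date    TEXT    NOT NULL,
               condition_source_value  TEXT
           )"""
    )
    for person_id, code, start_date in source_rows:
        # OMOP convention: concept id 0 marks a value with no standard mapping.
        concept_id = SOURCE_TO_STANDARD.get(code, 0)
        conn.execute(
            "INSERT INTO condition_occurrence "
            "(person_id, condition_concept_id, condition_start_date, "
            " condition_source_value) VALUES (?, ?, ?, ?)",
            (person_id, concept_id, start_date, code),
        )
    conn.commit()


def unmapped_codes(conn):
    """Return the distinct source codes that could not be mapped."""
    rows = conn.execute(
        "SELECT DISTINCT condition_source_value FROM condition_occurrence "
        "WHERE condition_concept_id = 0 ORDER BY condition_source_value"
    ).fetchall()
    return [row[0] for row in rows]
```

Keeping the original code in `condition_source_value` and listing the unmapped codes afterwards is one simple way to measure how much of the source data the vocabulary mapping actually covers.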
The main dataset for populating an OMOP CDM database usually comes from relational databases and structured data (XML or JSON). However, OMOP CDM can also be populated from other data sources, such as free text and medical imaging, and at Veratech we have participated in several projects that address these domains.
One is the ChronicExtract project, in which an OMOP CDM database was populated with information about diabetic patients contained in narrative clinical notes. The project aims to develop a dashboard for diabetic patients in which the OMOP CDM database centralizes all clinical information. Because some relevant data is found exclusively within narrative clinical notes, natural language processing techniques were needed to find mentions of relevant clinical concepts. The mentions found were subsequently represented using OMOP CDM tables and vocabulary.
Another source of information for training predictive models is clinical imaging. OMOP CDM has a radiology extension, which allows observational data from EHRs to be linked with medical image metadata. Veratech has participated in the Tartaglia project, where this extension was used as the basis for training models with image and clinical variables.
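For the free-text case, the sketch below shows how mentions found in a clinical note could be turned into rows shaped like the OMOP CDM `NOTE_NLP` table. The text above does not say which tables or NLP techniques ChronicExtract used, so this is only an assumption: a plain dictionary lookup stands in for a real NLP pipeline, the `LEXICON` entries and concept ids are illustrative, and only a subset of the `NOTE_NLP` columns is modeled.

```python
import re
from dataclasses import dataclass

# Hypothetical lexicon: surface form -> standard concept id.
# The ids are illustrative and should be checked against the vocabulary.
LEXICON = {
    "type 2 diabetes": 201826,
    "metformin": 1503297,
}


@dataclass
class NoteNlpRow:
    """Subset of the NOTE_NLP columns, simplified for illustration."""
    note_id: int
    snippet: str              # text surrounding the mention
    offset: int               # character offset of the mention in the note
    lexical_variant: str      # the mention exactly as written
    note_nlp_concept_id: int  # standard concept the mention maps to


def extract_mentions(note_id, text, window=20):
    """Find lexicon terms in a note and return them as NoteNlpRow objects,
    ordered by their position in the text."""
    rows = []
    for term, concept_id in LEXICON.items():
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            start = max(0, match.start() - window)
            end = min(len(text), match.end() + window)
            rows.append(
                NoteNlpRow(
                    note_id=note_id,
                    snippet=text[start:end],
                    offset=match.start(),
                    lexical_variant=match.group(0),
                    note_nlp_concept_id=concept_id,
                )
            )
    return sorted(rows, key=lambda row: row.offset)
```

Calling `extract_mentions(1, "Patient with Type 2 diabetes, started metformin 850 mg.")` yields two rows, one per mention. A production pipeline would also need to handle negation, temporality, and abbreviations, which simple string matching does not.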
Normalization to OMOP CDM brings advantages to clinical research, such as giving the data clear semantics and improving its quality. The initial effort to perform this normalization is considerable, but once it is done, the advantages are evident: for each new clinical research study, we no longer have to dedicate time to data preparation and cleaning. The OMOP CDM ecosystem includes the ATLAS environment, which allows healthcare professionals to create patient cohorts by applying filters to the information stored in the database, without requiring intervention from IT personnel.
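To illustrate why a common model makes cohort building repeatable, here is a small sketch of the kind of filter a cohort definition expresses, written directly as SQL over simplified `person` and `condition_occurrence` tables. It is not the SQL that ATLAS generates. The demo data, the function names, and the concept ids are assumptions for illustration, and a real definition would normally use concept sets that include descendant concepts.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with two simplified OMOP CDM tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE person (
            person_id     INTEGER PRIMARY KEY,
            year_of_birth INTEGER
        );
        CREATE TABLE condition_occurrence (
            person_id            INTEGER,
            condition_concept_id INTEGER,
            condition_start_date TEXT
        );
        INSERT INTO person VALUES (1, 1950), (2, 1990), (3, 1945);
        INSERT INTO condition_occurrence VALUES
            (1, 201826, '2019-03-01'),
            (1, 201826, '2021-06-15'),
            (2, 201826, '2020-01-10'),
            (3, 320128, '2018-07-07');
        """
    )
    return conn


def cohort(conn, concept_id, born_before):
    """Return (person_id, cohort_start_date) for persons born before the
    given year who have the condition; the start date is the first record."""
    sql = """
        SELECT p.person_id, MIN(co.condition_start_date) AS cohort_start_date
        FROM person AS p
        JOIN condition_occurrence AS co ON co.person_id = p.person_id
        WHERE co.condition_concept_id = ?
          AND p.year_of_birth < ?
        GROUP BY p.person_id
        ORDER BY p.person_id
    """
    return conn.execute(sql, (concept_id, born_before)).fetchall()
```

Because the table and column names are the same in every OMOP CDM database, a query like this can be run unchanged at another site, which is what makes shared cohort definitions practical.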
Normalization to OMOP CDM is also an opportunity to extract existing knowledge from free-text clinical documents and stored images. Processes can be implemented to analyze this data and extract or annotate clinically relevant concepts about patients' health.
Finally, if OMOP CDM expands to more hospitals and care centers, we will have a unique opportunity to create a federated research network on real-world data based on OMOP CDM in the Valencian Community. By sharing the same information base, multicenter clinical studies can be carried out, sharing even the queries and parameter definitions used to build research cohorts. This is possible not only at the regional level: it also allows participation in national and international research with minimal effort for managing clinical data.
Authors