Multimodal, Federation Ready, Integrative Framework for Biomedical Data

May 29

Modern biomedical research is a data-driven enterprise that often requires integrating large-scale data from multiple disparate sources, including genomics and other “omics” data, imaging data from pathology and radiology, and complex demographic and phenotypic data such as EHR records. Storing, integrating and analyzing these large and disparate data sets is a significant burden for researchers even within a single institution, and it becomes substantially more complicated when multiple institutions attempt to share data.

Security and privacy concerns further complicate data management and sharing, especially in clinically focused research using patient data, and create substantial barriers to mainstream use of real-world evidence (RWE) for biomedical research.

The datma.BASE biomedical information management platform is an integrated solution to the storage, integration and analysis challenges facing modern biomedical researchers. Utilizing highly optimized data storage technologies and compute management systems, datma.BASE allows researchers to ingest, access, analyze and even securely collaborate via federation with other researchers over large, integrated omics, imaging, and phenotypic/clinical data sets.

Automated ingestion pipelines built on industry-standard technologies such as Fast Healthcare Interoperability Resources (FHIR), together with a unified API layer, simplify ingestion, access and analysis of data at scales from a single laboratory to an entire institution. This unifies data storage and access processes and eliminates many of the barriers to using near real-time RWE for research.

In addition, datma.BASE can be extended with additional research and clinical use modules such as datma.360; all of these modules use datma.BASE as a core technology, enabling researchers and clinicians alike to utilize the same multi-modal data pool and same transformational translational analytics capabilities for research and patient care.

Overview

datma.BASE consists of three primary data stores (OmicsDS, ImageDS and EhrDS), a management and execution layer that optionally supports federation via the datma.FED extension, and an API layer that provides access to data ingestion, extraction and compute. All the components of datma.BASE are implemented in secured containers to simplify deployment in a customer tenancy, whether in the cloud, on-premises, or in a hybrid deployment model. All data and API access is secured with industry-standard encryption at rest and in motion, and all transactions are logged for future audit.

OmicsDS

OmicsDS is a columnar data store optimized for storing “omics” data derived from sequencing, including genomics, transcriptomics, DNA methylation, copy number variation and other data types. Data can be stored either at a summarized level (e.g. variant calls for genomics, read counts for transcriptomics) or at the level of aligned sequences.

OmicsDS provides secured REST API access to its data, as well as native-level access from several programming environments, including JVM languages (Java/Scala), Python and R. Omics data can be quickly filtered for specific criteria within the data stores during a query (e.g., for only homozygous samples within a particular chromosomal range) and annotated on demand with integrated and customizable annotation support.
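The kind of filter described above can be illustrated with a small, self-contained Python sketch. It does not use the OmicsDS API: the VariantCall record, the function names and the sample data are all hypothetical, and "homozygous" is assumed here to mean homozygous for a non-reference allele. In a real deployment this filtering would happen inside the data store rather than in client code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariantCall:
    """Hypothetical in-memory variant call; not an OmicsDS type."""
    sample_id: str
    chrom: str
    pos: int          # 1-based position on the chromosome
    genotype: tuple   # allele indices, e.g. (0, 1) or (1, 1); 0 = reference


def is_homozygous_alt(call):
    """True when both alleles are the same non-reference allele (an assumption)."""
    first, second = call.genotype
    return first == second and first != 0


def homozygous_samples_in_range(calls, chrom, start, end):
    """Sorted sample IDs with a homozygous non-reference call in chrom:start-end (inclusive)."""
    return sorted({
        c.sample_id
        for c in calls
        if c.chrom == chrom and start <= c.pos <= end and is_homozygous_alt(c)
    })


# Made-up records, for illustration only.
example_calls = [
    VariantCall("S1", "chr7", 140753336, (1, 1)),   # homozygous alt, in range
    VariantCall("S2", "chr7", 140753336, (0, 1)),   # heterozygous
    VariantCall("S3", "chr7", 150000000, (1, 1)),   # outside the range
    VariantCall("S4", "chr1", 140753336, (1, 1)),   # wrong chromosome
    VariantCall("S5", "chr7", 140700000, (0, 0)),   # homozygous reference
]

print(homozygous_samples_in_range(example_calls, "chr7", 140700000, 140800000))
```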

ImageDS

Like OmicsDS, ImageDS is a columnar data store, in this case optimized for pixel/voxel data of any type. While adaptable to other imaging modalities, ImageDS is optimized specifically for pathology-related imaging and similar images generated by digital microscopy, including multichannel images generated by serial or simultaneous immunohistochemistry (IHC) staining. ImageDS’s API layer provides secure tile-based access to images at any scale, from a whole-slide view to 40x magnification or more, and supports image annotation, for example to mark regions of interest.
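To make the idea of tile-based access at multiple scales concrete, the following Python sketch shows the arithmetic behind a generic image pyramid. It is not the ImageDS API. The assumptions are that each pyramid level halves the resolution of the previous one, that tiles are square with a 256-pixel edge, and that regions are given in full-resolution pixel coordinates; the function names are hypothetical.

```python
import math


def level_dimensions(width, height, level):
    """Image size at a pyramid level, assuming each level halves the resolution."""
    scale = 2 ** level
    return math.ceil(width / scale), math.ceil(height / scale)


def tiles_for_region(x, y, w, h, level, tile_size=256):
    """(col, row) indices of the tiles at `level` that cover a region.

    The region (x, y, w, h) is expressed in full-resolution pixels.
    """
    scale = 2 ** level
    first_col = (x // scale) // tile_size
    first_row = (y // scale) // tile_size
    last_col = ((x + w - 1) // scale) // tile_size
    last_row = ((y + h - 1) // scale) // tile_size
    return [
        (col, row)
        for row in range(first_row, last_row + 1)
        for col in range(first_col, last_col + 1)
    ]


# A 512 x 512 pixel region needs four tiles at full resolution,
# but only one tile once the image is downsampled by a factor of two.
print(tiles_for_region(0, 0, 512, 512, level=0))
print(tiles_for_region(0, 0, 512, 512, level=1))
```

A viewer only ever requests the tiles that intersect the current viewport at the current zoom level, which is why a whole-slide image can be browsed without loading it in full.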

Image data can be served to an image viewer (such as that included with datma.360) or formatted for direct consumption by image classification networks, predictive algorithms and other such machine learning tasks.

EhrDS

EhrDS consists of a coordinated system of standard relational databases based on PostgreSQL that are designed to store phenotypic/EHR data in a standardized fashion and to link sequencing and imaging data in OmicsDS and ImageDS with their originating subjects. EhrDS can also serve as a clearinghouse for coordinating and synchronizing access with other EHR systems via FHIR to pull in data that is not mirrored in EhrDS. As with OmicsDS and ImageDS, EhrDS APIs provide fully secured access to the underlying data.

Ingestion

Ingestion of data into datma.BASE and its underlying data stores is achieved through an automated and customizable pipeline that ensures synchronization between originating subjects/samples and their corresponding data in each data store within datma.BASE. Each data store’s API maintains REST endpoints for ingestion of data in various formats, including JSON, YAML, VCF, and other customizable formats. Ingestion tasks can be automated using the Prefect workflow management system, and all ingestion is logged for audit and data verification purposes.
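As a rough illustration of what preparing data for a REST ingestion endpoint might involve, the Python sketch below converts VCF data lines into a JSON payload tied to a subject and sample. The payload shape, field names and function names are assumptions made for this example, not the datma.BASE ingestion schema, and only the eight fixed VCF columns are parsed.

```python
import json

# The eight fixed, tab-separated columns of a VCF data line.
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def vcf_line_to_record(line):
    """Parse one VCF data line into a dict keyed by the fixed column names."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < len(VCF_COLUMNS):
        raise ValueError("expected at least 8 tab-separated VCF columns")
    record = dict(zip(VCF_COLUMNS, fields[:8]))
    record["POS"] = int(record["POS"])        # positions are integers
    record["ALT"] = record["ALT"].split(",")  # multiple ALT alleles are comma-separated
    return record


def build_ingest_payload(subject_id, sample_id, vcf_lines):
    """Build a JSON document linking variants to their subject and sample.

    Header lines (starting with '#') and blank lines are skipped.
    The payload layout is hypothetical.
    """
    variants = [
        vcf_line_to_record(line)
        for line in vcf_lines
        if line.strip() and not line.startswith("#")
    ]
    return json.dumps(
        {"subject_id": subject_id, "sample_id": sample_id, "variants": variants},
        sort_keys=True,
    )


# Made-up input, for illustration only.
example_lines = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t12345\t.\tA\tG,T\t50\tPASS\tDP=10\n",
]
print(build_ingest_payload("subject-001", "sample-001", example_lines))
```

Carrying the subject and sample identifiers in every payload is one simple way to keep ingested records synchronized with their originating subjects across data stores.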

Cohort

Cohort extends datma.BASE with an additional graph-based search database enabling multi-modal cohort searches across all data stores within datma.BASE. Cohort results can be stored for future use, such as retrieval of specific attributes from any or all of the datma.BASE data stores to enable complex multimodal analyses; the API facilitates all these tasks from both programmatic (e.g. notebook) and graphical (e.g. datma.360) user interfaces.
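Conceptually, a multi-modal cohort is the set of subjects that satisfy criteria in every relevant data store. The Python sketch below shows that idea with plain set intersection and a dictionary standing in for stored cohorts. It says nothing about how the Cohort graph database is implemented; the function names, the storage mechanism and the subject IDs are hypothetical.

```python
def build_cohort(*criteria_hits):
    """Intersect the subject-ID collections returned by each per-store criterion."""
    if not criteria_hits:
        return frozenset()
    cohort = set(criteria_hits[0])
    for hits in criteria_hits[1:]:
        cohort &= set(hits)
    return frozenset(cohort)


# A plain dict stands in for persistent cohort storage in this sketch.
saved_cohorts = {}


def save_cohort(name, subject_ids):
    """Store a cohort under a name so it can be reused in later analyses."""
    saved_cohorts[name] = frozenset(subject_ids)
    return saved_cohorts[name]


# Made-up subject IDs matched by one criterion per data store.
omics_hits = {"P01", "P02", "P03", "P07"}    # e.g. carriers of a given variant
image_hits = {"P02", "P03", "P05"}           # e.g. subjects with an annotated slide
clinical_hits = {"P03", "P02", "P09"}        # e.g. subjects with a given diagnosis

cohort = save_cohort("example-cohort", build_cohort(omics_hits, image_hits, clinical_hits))
print(sorted(cohort))
```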

Federation

Secure exchange and pooling of data for analysis is a challenge in the biomedical space, both because RWE derived from patient data raises issues such as HIPAA compliance and because large omics and imaging data sets are difficult to move due to their sheer size. The datma.FED extension mitigates these issues by enabling federated analysis across sites using datma.BASE.

By securely exchanging cohort specifications and verified analysis pipelines, datma.FED allows each site to run cohort queries and analyses on its own local data and transmit only counts and analysis results to the originating site. This enables institutions using datma.FED to run complex multi-modal data analyses on a much larger and more diverse data pool than is available at any single institution, without ever having to deal with the security, privacy or technical concerns involved in centralizing a large multi-modal data set.
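The federated pattern described above can be sketched in a few lines of Python: the cohort specification travels to each site, the query runs against local records, and only an aggregate count travels back. This is a toy model under stated assumptions, not the datma.FED protocol. The function names, the exact-match cohort specification and the site data are hypothetical, and real deployments would add authentication, pipeline verification and transport security.

```python
def run_local_query(local_records, cohort_spec):
    """Run at a participating site: count local subjects matching every criterion.

    Only the aggregate count is returned; no record leaves the site.
    """
    matches = [
        record
        for record in local_records
        if all(record.get(key) == value for key, value in cohort_spec.items())
    ]
    return {"n_subjects": len(matches)}


def federate(sites, cohort_spec):
    """Run at the originating site: send the spec to each site and pool the counts."""
    per_site = {
        name: run_local_query(records, cohort_spec)
        for name, records in sites.items()
    }
    total = sum(result["n_subjects"] for result in per_site.values())
    return {"per_site": per_site, "total": total}


# Made-up local data sets, for illustration only.
example_sites = {
    "site_a": [
        {"diagnosis": "melanoma", "variant": "BRAF V600E"},
        {"diagnosis": "melanoma", "variant": "NRAS Q61R"},
    ],
    "site_b": [
        {"diagnosis": "melanoma", "variant": "BRAF V600E"},
        {"diagnosis": "melanoma", "variant": "BRAF V600E"},
        {"diagnosis": "colorectal", "variant": "BRAF V600E"},
    ],
}

spec = {"diagnosis": "melanoma", "variant": "BRAF V600E"}
print(federate(example_sites, spec))
```

In this sketch the dictionaries returned by run_local_query are the only thing that crosses site boundaries, which mirrors the point that counts and results, rather than patient-level data, are transmitted.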

Applications

datma.BASE is designed to serve as a centralized storage and analysis platform that can scale from the laboratory level to serving entire institutions or networks of institutions connected via datma.FED. Interaction with the platform is similarly scalable, enabling scientists, analysts and clinicians to use the right interface and right amount of compute power for the job via datma.BASE’s interface, API and compute layers. For smaller and exploratory tasks, users can utilize datma.BASE’s APIs and write their own code in a notebook environment such as JupyterLab.

Larger jobs can be deployed to individual compute nodes and/or cluster compute resources such as Apache Spark or MPI resources managed by datma.FED. Local jobs can also be defined as workflows for datma.FED and reproduced on demand from notebooks and scripts; workflows can even be integrated with the datma.360 clinical user interface to enable translation from the virtual bench to the virtual bedside. Approved workflows can also be deployed over datma.FED to take full advantage of the larger data set and resources available over a datma.BASE federated network.

The datma.WHY causal modeling framework extension is one example of such a workflow: it uses federated analysis to generate insights securely across multi-modal, multi-site data.

Summary

Managing “big” multi-modal biomedical data sets efficiently and securely is a very difficult task that causes significant drag on the progress of biomedical research, and the problem only multiplies when true patient-based RWE and multiple institutions are involved.

datma.BASE and its associated datma.FED extension provide a standardized, efficient, secure and federation-ready solution for data management and analysis that scales from lab to population-size data across multiple modalities. This frees scientists, clinicians and analysts to concentrate on performing research, improving patient outcomes and pursuing novel solutions to biomedical problems that would otherwise be impractical to tackle.

datma Team