At the Spring PGRN meeting last week, there were a number of interesting talks about the need for new databases to foster genetics research. One talk was from Scott Weiss on Gene Insight (see here). I gave a talk about our “RA Responder” Crowdsourcing Challenge (complete slide deck here). Here are a few general thoughts about the databases we need for genomics research.
(1) Silos are so last year
Too often, data from one interesting pharmacogenomic study (e.g., GWAS data on treatment response) are completely separate from another dataset that could be used to interpret them (e.g., RNA-sequencing). Yes, the specialized labs that generated the data can integrate them for their own analysis. And yes, they can release individual datasets to the public for others to stitch together. But is this really what we need? Somehow, we need to make data available in a manner that is fully integrated and interoperable. One simple example of the problem is GWAS for autoimmune diseases. Since 2006, a large amount of genetic data has been published. Still, there is no single place to go to see results for all autoimmune diseases, despite the tremendous overlap in the genetic basis of these diseases. Now, a series of manuscripts is being published on the Immunochip for diseases like inflammatory bowel disease (see here) and rheumatoid arthritis (see here). There is no one place to visualize these results or to integrate them with other genomic datasets, which limits the value of these rich genomic datasets. Immunobase comes close (see here), but it has limitations.
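To make the idea of integration concrete, here is a minimal sketch in Python of the kind of cross-disease question a single resource could answer: which variants reach genome-wide significance in more than one autoimmune disease? Everything in it is an assumption for illustration. The variant names and p-values are made-up placeholders, not real results, and the function name `shared_hits` is hypothetical; it does not describe how Immunobase or any other database works.

```python
from collections import defaultdict

# Toy, made-up summary statistics: (variant_id, disease, p_value).
# The identifiers and p-values are placeholders, not real results.
TOY_RESULTS = [
    ("variant_A", "rheumatoid arthritis", 2e-12),
    ("variant_A", "inflammatory bowel disease", 4e-9),
    ("variant_B", "rheumatoid arthritis", 3e-10),
    ("variant_C", "inflammatory bowel disease", 1e-8),
    ("variant_C", "rheumatoid arthritis", 0.2),
]

# Conventional genome-wide significance threshold for GWAS.
GENOME_WIDE_SIGNIFICANCE = 5e-8


def shared_hits(results, threshold=GENOME_WIDE_SIGNIFICANCE):
    """Return {variant: sorted list of diseases} for variants that pass
    the threshold in at least two diseases (hypothetical helper)."""
    by_variant = defaultdict(set)
    for variant, disease, p_value in results:
        if p_value < threshold:
            by_variant[variant].add(disease)
    return {v: sorted(d) for v, d in by_variant.items() if len(d) >= 2}


if __name__ == "__main__":
    for variant, diseases in shared_hits(TOY_RESULTS).items():
        print(variant, "->", ", ".join(diseases))
```

The query itself is trivial once the results sit in one table with shared variant identifiers. The hard part, and the point of this section, is getting the data into that shape in the first place.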
(2) Unlocking clinical trial data
Academics are not the only ones who lock their data in a vault. The same is true of clinical trial data generated in industry. There is a trend to release clinical trial data to the public, as was recently announced by GSK (see here). [Access to this news article is limited to those with a subscription to Nature Reviews Drug Discovery. The irony is not lost on me!] This is a great first step, but if the clinical trial data remain siloed from the genomic data, there will be limited value for discovery research.
(3) Empowering the community
Just because data are open access does not mean they are actually “accessible” to a community of scientists to analyze. Specialized software is required to bring together both the data and the right community. Towards this end, Sage Bionetworks (see here) has developed Synapse – “a collaborative compute space that allows scientists to share and analyze data together.” To demonstrate its power, Sage has hosted a competition on predicting breast cancer survival. As posted on the Synapse website: “The goal of the breast cancer prognosis challenge is to assess the accuracy of computational models designed to predict breast cancer survival based on clinical information about the patient’s tumor as well as genome-wide molecular profiling data including gene expression and copy number profiles.” The manuscript was recently published Open Access in Science Translational Medicine (see here).
Just imagine an integrated database that includes genome sequencing, transcriptional profiling, and clinical outcomes, all made available to the general community for data analysis.
But who will make this happen?