BIG DATA for DISCOVERY SCIENCE
The Big Data for Discovery Science Center (BDDS), composed of leading experts in biomedical imaging, genetics, proteomics, and computer science, is taking an "-ome to home" approach toward streamlining big data management, aggregation, manipulation, integration, and the modeling of biological systems across spatial and temporal scales.
 
 

PheWAS Approach

This project aims to develop an advanced framework, tools, and software to extract and manage big data on specific gene variants and a wide variety of neuroimaging phenotypes, and to implement sophisticated statistical analysis and result visualization for neuroimaging phenome-wide association studies (PheWAS) on big data. The methods and tools will be applied to broad surveys of extensive neuroimaging genomic data to discover true system-level associations between single-nucleotide polymorphisms (SNPs) of interest and the brain.

First, raw phenotype, genetic, and neuroimaging data are loaded into the DERIVA data catalog using an "Extract, Transform, Load" (ETL) process. This process involves ingesting data (the “Extracts”) from one or more tabular data files in CSV format representing phenotypic data for an entire study cohort, one or more tabular files of genetic variant data in VCF format, and MRI neuroimaging files in formats such as DICOM, NIfTI, and MGZ. During this process, various software utilities are used to perform the “Transforms” required to prepare the data for loading into the catalog. Specifically:

  1. Tabular data in CSV format is scanned for structure using csvkit, and corresponding SQL is generated to automate the creation of a database table and the subsequent import of the CSV data into the newly created table.
  2. Genotypes of interest are extracted from VCF files using bcftools and written out to a tabular CSV file in an optimized format.
  3. Image files are processed by a BDDS-developed utility that recursively scans an input directory for images, collects any metadata encountered, and populates an object store (such as a native filesystem, Hatrac, or Amazon S3) with the file data while also creating the relevant Deriva catalog data for referencing the files.
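
The first transform above (inferring a table structure from a CSV file and generating SQL for it) can be illustrated with a minimal sketch. This is not the BDDS tooling and does not call csvkit: it is a standard-library stand-in that uses an in-memory SQLite database, and the function names, the table name, and the sample cohort columns are all hypothetical.

```python
import csv
import io
import sqlite3


def infer_sql_type(values):
    """Return INTEGER, REAL, or TEXT for a column, based on its non-empty values."""
    non_empty = [v for v in values if v != ""]
    for sql_type, cast in (("INTEGER", int), ("REAL", float)):
        try:
            for v in non_empty:
                cast(v)
            return sql_type
        except ValueError:
            continue
    return "TEXT"


def csv_to_sql(csv_text, table):
    """Build a CREATE TABLE statement and return it with the parsed header and rows."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    columns = [
        f'"{name}" {infer_sql_type([row[i] for row in data])}'
        for i, name in enumerate(header)
    ]
    create = f'CREATE TABLE "{table}" ({", ".join(columns)})'
    return create, header, data


def load_csv(conn, csv_text, table):
    """Create the table, then bulk-insert the CSV rows (empty cells become NULL)."""
    create, header, data = csv_to_sql(csv_text, table)
    conn.execute(create)
    placeholders = ", ".join("?" for _ in header)
    conn.executemany(
        f'INSERT INTO "{table}" VALUES ({placeholders})',
        [[cell if cell != "" else None for cell in row] for row in data],
    )
    return create


# Hypothetical cohort file; the column names are illustrative only.
SAMPLE = "subject_id,age,sex,mmse\nS001,71,F,28.5\nS002,68,M,\n"

conn = sqlite3.connect(":memory:")
print(load_csv(conn, SAMPLE, "cohort"))
print(conn.execute('SELECT * FROM "cohort"').fetchall())
```

In practice a tool such as csvkit handles many more cases (dates, booleans, dialect detection), so this sketch only conveys the idea of deriving a schema from the data before import.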

After the transform process is complete, the transformed data is “Loaded” into the Deriva catalog, and then annotated. The annotation step is a Deriva-specific process which facilitates further customization and fine-tuning of how data is presented to the user in the Deriva web-based user interface.

Once the base data has been loaded, the Deriva web-based user interface can be used to query the catalog for subjects with data on the genotype of interest. Neuroimaging data for subjects matching the search criteria are extracted into BDBag-formatted archive files. Next, the neuroimaging data is processed with FreeSurfer's comprehensive volumetric and surface-based analysis on the LONI Pipeline. This produces derived image processing outputs (generally in MGZ format) and volumetric measurements in tabular form, which are loaded back into the data catalog through the same ETL process used for the initial data load.

Finally, target genotype and phenotype data is extracted and output from Deriva in BDBag format. The BDBag includes the following:

  1. Genotype and covariates (e.g., demographic and behavioral data) in .csv format.
  2. The regional phenotypes (namely, one value per region) in .csv format.
  3. The local phenotypes (namely, one value per brain surface vertex) in .mgh format (FreeSurfer).
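
To make the archive layout concrete, here is a minimal sketch of a BagIt-style bag holding the three kinds of files listed above. BDBag builds on the BagIt packaging format, but real BDBags are produced by the BDBag tooling and carry additional metadata that this sketch omits. The file names, the placeholder contents, and the helper functions are assumptions made for illustration only.

```python
import hashlib
import tempfile
from pathlib import Path


def write_minimal_bag(bag_dir, payload):
    """Write payload files under data/ plus a BagIt declaration and SHA-256 manifest.

    payload maps a relative file name to its bytes.
    """
    bag_dir = Path(bag_dir)
    manifest_lines = []
    for rel_name, content in sorted(payload.items()):
        target = bag_dir / "data" / rel_name
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        manifest_lines.append(f"{digest}  data/{rel_name}")
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n", encoding="utf-8"
    )
    (bag_dir / "manifest-sha256.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )
    return bag_dir


def verify_bag(bag_dir):
    """Recompute each payload checksum and compare it with the manifest entry."""
    bag_dir = Path(bag_dir)
    manifest = (bag_dir / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        digest, rel_path = line.split(maxsplit=1)
        actual = hashlib.sha256((bag_dir / rel_path).read_bytes()).hexdigest()
        if actual != digest:
            return False
    return True


# Hypothetical payload mirroring the three items above; contents are placeholders.
PAYLOAD = {
    "genotype_covariates.csv": b"subject_id,rs_example,age,sex\nS001,1,71,F\n",
    "regional_phenotypes.csv": b"subject_id,hippocampus_volume\nS001,3.7\n",
    "surface/lh.thickness.mgh": b"\x00placeholder bytes, not a real MGH file",
}

bag = write_minimal_bag(tempfile.mkdtemp(prefix="bdds_bag_"), PAYLOAD)
print(sorted(p.relative_to(bag).as_posix() for p in bag.rglob("*") if p.is_file()))
print("valid:", verify_bag(bag))
```

The checksum manifest is what lets a downstream tool confirm that the genotype, regional, and surface files arrived unmodified before analysis begins.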

The contents of the BDBag can now be loaded into the NeuroimagingPheWAS Toolbox, which runs a phenome-wide association study to identify genotype-to-phenotype associations.
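
The core idea of a phenome-wide scan, testing one genotype against many phenotypes and correcting for the number of tests, can be sketched as follows. This is a deliberately simplified stand-in and not the method used by the NeuroimagingPheWAS Toolbox: it ignores covariates, uses a Pearson correlation with a Fisher z-transform (normal approximation) for the p-value, and applies a Bonferroni correction. The phenotype names and all data values are made up for illustration.

```python
import math
from statistics import NormalDist, mean


def pearson_r(x, y):
    """Pearson correlation; returns 0.0 if either variable is constant."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    denom = math.sqrt(sxx * syy)
    return sxy / denom if denom > 0 else 0.0


def approx_p_value(r, n):
    """Two-sided p-value from the Fisher z-transform (normal approximation, n > 3)."""
    r = max(min(r, 0.999999), -0.999999)  # keep atanh finite
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def phewas_scan(genotype, phenotypes, alpha=0.05):
    """Test one genotype against every phenotype with a Bonferroni threshold."""
    threshold = alpha / len(phenotypes)
    results = []
    for name, values in phenotypes.items():
        r = pearson_r(genotype, values)
        p = approx_p_value(r, len(genotype))
        results.append((name, r, p, p < threshold))
    return sorted(results, key=lambda row: row[2])  # smallest p-value first


# Made-up data: minor-allele counts for ten subjects and two regional phenotypes.
GENOTYPE = [0, 0, 1, 1, 2, 2, 0, 1, 2, 1]
PHENOTYPES = {
    "hippocampus_volume": [4.1, 4.0, 3.7, 3.6, 3.3, 3.4, 3.9, 3.8, 3.2, 3.7],
    "caudate_volume": [3.5, 3.6, 3.4, 3.7, 3.5, 3.6, 3.6, 3.4, 3.5, 3.5],
}

for name, r, p, significant in phewas_scan(GENOTYPE, PHENOTYPES):
    print(f"{name}: r={r:.3f}, p={p:.3g}, significant={significant}")
```

A real analysis would typically fit a regression model that adjusts for covariates such as age and sex, and would need a correction strategy suited to the very large number of vertex-wise tests.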