Genome-scale footprinting from DNA sequencing of DNase hypersensitivity (DHS) regions is an important step towards understanding gene regulation and functionally annotating variants in different tissues. Several approaches have been developed for identifying footprints, each with its own strengths and limitations. Despite the large amount of DHS data generated and processed by ENCODE, no unified pipeline exists for analyzing all of the samples. Recent advances in footprinting algorithms have also improved the sensitivity and specificity of footprint detection. We created a high-performance analysis pipeline for uniform processing of footprints on GRCh38 using three different methods: Wellington, HINT and PIQ. We compare the results of these three approaches with one another, using ChIP-seq data generated by ENCODE for 75 different transcription factors in lymphoblast cell lines. Using a machine-learning approach, we create a composite score for each footprint. As a community resource, we are generating a database of footprints from all three methods for 22 tissue types (all available ENCODE data). We identify footprints found in a majority of tissue types (>11/22) and show that they are enriched in housekeeping genes. We developed TReNA, which uses footprints from a single tissue type to construct a genome-scale transcriptional regulatory network.
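The majority criterion above (footprints present in more than 11 of 22 tissue types) amounts to a per-footprint count across tissues. The following is a minimal sketch of that count, not the pipeline's own code. It assumes each footprint is represented as a hypothetical `(chrom, start, end, motif)` tuple and that footprints from different tissues match only when these tuples are identical; a real analysis would more likely match footprints by interval overlap. The function name and the toy data are illustrative.

```python
from collections import defaultdict


def majority_footprints(footprints_by_tissue, min_fraction=0.5):
    """Return footprints seen in strictly more than min_fraction of tissues.

    footprints_by_tissue maps a tissue name to an iterable of hashable
    footprint keys, here assumed to be (chrom, start, end, motif) tuples.
    """
    n_tissues = len(footprints_by_tissue)
    counts = defaultdict(int)
    for footprints in footprints_by_tissue.values():
        # A footprint counts at most once per tissue.
        for footprint in set(footprints):
            counts[footprint] += 1
    return {fp for fp, n in counts.items() if n > n_tissues * min_fraction}


# Toy data: 22 hypothetical tissues and three hypothetical footprints.
tissues = [f"tissue_{i:02d}" for i in range(22)]
fp_all = ("chr1", 1000, 1012, "MOTIF_A")      # present in all 22 tissues
fp_twelve = ("chr2", 5000, 5015, "MOTIF_B")   # present in 12 tissues
fp_eleven = ("chr3", 7000, 7010, "MOTIF_C")   # present in 11 tissues

footprints_by_tissue = {}
for index, tissue in enumerate(tissues):
    found = [fp_all]
    if index < 12:
        found.append(fp_twelve)
    if index < 11:
        found.append(fp_eleven)
    footprints_by_tissue[tissue] = found

shared = majority_footprints(footprints_by_tissue)
```

With 22 tissues the strict threshold is 11, so a footprint found in exactly 11 tissues is excluded while one found in 12 is kept, matching the >11/22 rule.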
Using the BDDS tools, we created an easy-to-use interface that enables researchers to create a BDBag of cell line data for a specific cell type (e.g., brain). We also generate a minid for the BDBag and enter the minid into the analysis workflow in the BDDS Galaxy service. For brain, this resulted in a bag of 161 samples from 29 experiments. The service then retrieves all the data and runs the analysis at scale. The resulting footprints are then “bagged” per tissue type. We created a BDBag of all lymphoblast footprints, which is then used for machine learning. The results will be published to the Big Data Repository.
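The composite score described above combines the output of the three footprinting methods into one value per footprint, trained against ChIP-seq evidence. The abstract does not state which model is used, so the sketch below assumes, purely for illustration, a logistic regression fitted by batch gradient descent. The feature layout (one normalized score each from Wellington, HINT and PIQ), the labels (1 if a footprint overlaps a ChIP-seq peak, 0 otherwise) and all numbers are hypothetical toy values, not results.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(rows, labels, learning_rate=0.5, epochs=2000):
    """Fit logistic regression weights and bias by batch gradient descent."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for features, label in zip(rows, labels):
            linear = bias + sum(w * x for w, x in zip(weights, features))
            error = sigmoid(linear) - label
            for j, x in enumerate(features):
                grad_w[j] += error * x
            grad_b += error
        # Average the gradients over the training set before stepping.
        weights = [w - learning_rate * g / len(rows)
                   for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(rows)
    return weights, bias


def composite_score(features, weights, bias):
    """Probability-like score in (0, 1) for one footprint."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))


# Hypothetical normalized scores: (Wellington, HINT, PIQ) per footprint.
train_rows = [
    (0.9, 0.8, 0.7), (0.8, 0.9, 0.6), (0.7, 0.6, 0.9),
    (0.2, 0.1, 0.3), (0.1, 0.3, 0.2), (0.3, 0.2, 0.1),
]
# Hypothetical labels: 1 = overlaps a ChIP-seq peak, 0 = does not.
train_labels = [1, 1, 1, 0, 0, 0]

weights, bias = fit_logistic(train_rows, train_labels)
strong = composite_score((0.85, 0.85, 0.85), weights, bias)
weak = composite_score((0.15, 0.15, 0.15), weights, bias)
```

A footprint that all three methods score highly receives a composite score above 0.5, and one that all three score poorly receives a score below 0.5. Any classifier that outputs a probability could take the place of the logistic model here.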