BIG DATA for DISCOVERY SCIENCE
The Big Data for Discovery Science Center (BDDS), composed of leading experts in biomedical imaging, genetics, proteomics, and computer science, is taking an "-ome to home" approach toward streamlining big data management, aggregation, manipulation, integration, and the modeling of biological systems across spatial and temporal scales.
 
 

Preserving Provenance with Minids

Many bioinformatic analysis workflows are built by running a series of computational steps, often with different software and varying parameters at each step. As we construct such workflows (as in the TReNA approach), we have found it useful to capture our working sessions, keeping logs of the screen output from subprocesses and individual steps. Often, we simply use the Unix command "tee" to append that output to a file, like this:

echo 'hello' | tee -a test.txt

Thus, executing a workflow becomes self-documenting, and the computational provenance of a particular result can be maintained by referring to the appropriate output file. The BDDS MINID client tools make this reference much easier. We can mint a MINID for the output file, share that MINID among the analysts developing a workflow, and include it in a results database, so that we capture the specific workflow applied in a given bioinformatic analysis.
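The logging pattern described above can be sketched as a small shell script. This is only an illustration under stated assumptions: the log file name and the run_step helper are hypothetical, and the checksum step is simply one way to confirm that a shared log is byte-for-byte the file that was captured. Minting the MINID itself is not shown, because the client's command line is not given here.

```shell
#!/usr/bin/env bash
# Sketch: log every workflow step with tee, then fingerprint the log.
# Stop on the first failing step, including failures inside a pipeline.
set -euo pipefail

LOG="workflow_session.log"   # hypothetical log file name

# run_step is a hypothetical helper. It records the command line, then
# appends the step's stdout and stderr to the log while still printing
# both to the screen.
run_step() {
    printf '## %s\n' "$*" | tee -a "$LOG"
    "$@" 2>&1 | tee -a "$LOG"
}

# Stand-ins for real analysis steps.
run_step echo 'hello'
run_step uname -s

# Record a SHA-256 checksum of the finished log (GNU coreutils;
# on macOS, "shasum -a 256" is the usual substitute).
sha256sum "$LOG" | tee "$LOG.sha256"
```

Because tee is called with -a, rerunning the script appends to the existing log rather than overwriting it, so one file can hold the full history of a working session. Anyone who later receives the log can run `sha256sum -c workflow_session.log.sha256` to check that it has not changed.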