Cataloguing Discussion:

Many BD2K groups are generating many intermediate objects, many in CA. How could we make a registry? How to demonstrate best practices for the transition from beta products to releases? Process: how centers process locally -> collaborators -> commons. There are DOIs, provenance tools...

- lifecycle path/policy approach early: curation, not permanent, might not live forever; but still useful to get what's expected when a referrer points to data.
- Sage Bionetworks as model - on shared platform = access controls
- Nature data journal where they host + 4k-5k word manuscript of metadata (works well with bioCADDIE/indexing)

NIH: scientific flow is coolest/funnest, but is that the case that's going to be of most immediate benefit?

Carl: thinking of game changers, low-hanging fruit... pass around identifiers. Also CA digital libraries, lightweight package transport format - BagIt. So, put two files in a bag, generate hash, refer to that. Each version release has its own DOI.

NIH (all the bags could be included in progress report to demonstrate activity!)

NIH: "If you could do this and show it can work, that could be incredibly powerful"

Ben: use hash as ID for in-progress data?

Carl: Must be resolvable, questions of hashing big data sets...

Data Quality? If we know a name that applies to a specific set of bits, and we have rules about versioning names, and bioCADDIE has QC rules, we can then do QC, generate new name when bits are changed.

Digital Objects - tools or data; for tools, it's common to debug and release next version. I really like the idea of treating the data sets the same way. If we see same name, we know we get same bits every time.

NIH - yay, reproducibility!

Carl: not too fancy - get 1/2 million IDs or so, implement in CA centers as example, get bioCADDIE to accept (create?) bags.

Ben: limiting case - short read sequence data generator?
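
Sketch of the "put two files in a bag, generate hash, refer to that" idea from the discussion above. This is only an illustration under stated assumptions, not the group's agreed design: the function names (sha256_of_file, make_bag) and the choice to use the SHA-256 of the manifest as the bag's name are hypothetical. The file layout (bagit.txt, data/, manifest-sha256.txt) follows the BagIt packaging convention.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large payloads are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_bag(bag_dir: Path, files: dict) -> str:
    """Write a minimal bag and return a content-derived name for it.

    `files` maps payload file names to their bytes. The returned name is the
    SHA-256 of the manifest (an assumption made for this sketch), so the same
    bits always give the same name and changed bits give a new one.
    """
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True, exist_ok=True)
    for name, content in files.items():
        (data_dir / name).write_bytes(content)

    # Bag declaration file required by the BagIt layout.
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )

    # One manifest line per payload file: "<checksum>  <path>", in sorted order.
    lines = [
        f"{sha256_of_file(data_dir / name)}  data/{name}"
        for name in sorted(files)
    ]
    manifest = bag_dir / "manifest-sha256.txt"
    manifest.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return sha256_of_file(manifest)


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        bag_name = make_bag(
            Path(tmp) / "bag-v1",
            {"reads.txt": b"ACGT\n", "notes.txt": b"beta release\n"},
        )
        print(bag_name)
```

A hash like this is not resolvable on its own (Carl's point), so in practice it would still need to be registered against a resolvable identifier such as a DOI.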
NIH: talk to Avi at LINCS BD2K - they need to deal with that question.

Carl: bag manifest is an ASCII file, so 10 million lines would not be a good data model - you'd have to think about that.

NIH: doing analysis on a subset of a large continuous data set (like weather) - identifier is data set + algorithm for subset selection. Another format: ARK format has part of the name IDed by the server, with different subsets parseable by the data host.

NIH: Implementation timeframe?

Carl: library use agreements, etc... could have pilot in 3 months, maybe.... (Don't hold me to it!)

Ben: How do IDs get catalogued/publicized? Identifiers in EZID - they can resolve to an institutional landing page, or bioCADDIE could do it? EZID's format enables a pointer to PII, so we can demonstrate work without violating DUAs.

Carl: proposed way forward: Use Aztec/bioCADDIE/BDDS - make working group, set up GitHub repository - Trello card-based task management. A pre-working group activity - want to make something simple and broken, but done - not complex and impossible.

Action items: ID people from each center, set up working group, mailing list, Trello board, some data. Technical issues are small. Mostly plumbing - service hooks here and there.

Name? BD2K Common Data Set Identifiers Pilot? CDSIP?

CEDAR metadata - all starts with very thin metadata, then gets increasingly annotated over time, so this would be useful for intermediate steps. Similarly, metabolomic data aggregates over time.

Would there be an "author" associated? Metadata does that. Perhaps ORCID iD?

Carl - one other NIH question: If we do have to pay for address space, sign agreement, in ~$1k range... want separate prefix for this? Terms of use: someone outside of campus generating IDs in namespace a problem?
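
Sketch of NIH's point that an identifier for a subset of a large continuous data set can be "data set + algorithm for subset selection". This is an illustration only: the function name subset_identifier, the "/subset-" suffix format, and the example data set name are all made up for this sketch and are not part of any identifier scheme discussed in the meeting.

```python
import hashlib
import json


def subset_identifier(dataset_id: str, selection: dict) -> str:
    """Combine a data set identifier with a description of the selection rule.

    The selection is serialised as canonical JSON (sorted keys, no extra
    whitespace) so the same rule always yields the same suffix regardless of
    the order in which its parameters were written down.
    """
    canonical = json.dumps(selection, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    # Hypothetical format: base identifier plus a short hash of the rule.
    return f"{dataset_id}/subset-{digest[:16]}"


if __name__ == "__main__":
    # Hypothetical data set name and selection rule.
    rule = {"algorithm": "time-window", "start": "2015-01-01", "end": "2015-01-31"}
    print(subset_identifier("example:weather-archive", rule))
```

The base data set is named once, and each analysis only records the small selection rule, which avoids creating a multi-million-line manifest per subset.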