GDAN_GDC_Notes
This page is a central reference of information and notes for computational biologists and software engineers in the CGA group as we transition from the TCGA era of GDAC to the GDC/GDAN era. As of July 2016, the Genomic Data Commons (GDC) has replaced the TCGA Data Coordinating Center as the repository not only for TCGA data but also for other existing genomics projects (such as TARGET) and for future genomics projects.
Reference Data
For GDAN pipelines we will store on-premises reference data in
/xchip/cga/reference/GDAN.
This is analogous to the TCGA reference directory
/xchip/cga/reference/tcga
but is not TCGA-centric. In addition to holding hg38 reference data, the GDAN reference tree will gather the bits and pieces of "hidden data" that for expedience have been accidentally squirreled away in less-than-ideal locations.
The first entry in the GDAN reference tree is ./GDAN/miR/miRSeqpreprocess/mature.21.fa.gz, which is used in the miRSeq preprocessor.
- Ideally the reference directory will be migrated to a cloud bucket and referenced in cloud-based analysis pipelines, but that will take time.
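One way to prepare for that migration is to keep the reference root in a single setting, so pipelines never hard-code the on-premises path. The following is a minimal sketch only: the helper `gdan_reference_path` and the environment variable `GDAN_REFERENCE_ROOT` are hypothetical names introduced for illustration, not existing conventions.

```python
import os
from pathlib import PurePosixPath

# On-premises location of the GDAN reference tree, as given above.
DEFAULT_GDAN_ROOT = "/xchip/cga/reference/GDAN"


def gdan_reference_path(relative_path, root=None):
    """Return the full location of a file in the GDAN reference tree.

    GDAN_REFERENCE_ROOT is a hypothetical environment variable (an
    assumption for this sketch) that overrides the on-premises default,
    for example with a bucket URL once the tree has been migrated.
    """
    if root is None:
        root = os.environ.get("GDAN_REFERENCE_ROOT", DEFAULT_GDAN_ROOT)
    rel = PurePosixPath(relative_path)
    # Reject absolute paths and parent-directory escapes so that every
    # result stays inside the reference tree.
    if rel.is_absolute() or ".." in rel.parts:
        raise ValueError(
            f"expected a path relative to the reference root: {relative_path}"
        )
    return root.rstrip("/") + "/" + rel.as_posix()
```

A pipeline would then call `gdan_reference_path("miR/miRSeqpreprocess/mature.21.fa.gz")` instead of spelling out the full path, and only the root setting would change when the tree moves.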
Pipeline Construction and Algorithm Coding Guidelines
We have a window of opportunity between the end of TCGA and the beginning of GDAN to return to best practices for pipeline and software development.
- Review this paper http://www.americanscientist.org/issues/pub/wheres-the-real-bottleneck-in-scientific-computing, which was the basis for the Software Carpentry workshop series of best coding practices for scientists.
- BIG Takeaway: by adopting 10% of the daily habits of experienced SWEs, scientists and their collaborators can become MUCH more productive and confident in their results.
- To that end we are going to resume SWE/CB pair-programming efforts.
- After a pipeline has been installed to gdc_devworkspace, configured, and successfully tested:
- Schedule a review of the algorithm code (e.g. R, Python, Matlab) with a SWE on staff
- More discussion is needed, but other ideas include:
- Create templates for R/Python/Matlab, with pre-defined sections (e.g. description of algorithm, description of inputs, description of outputs)
- With an eye towards being able to EXTRACT those descriptions programmatically INTO the output reports that are later generated
- Every pipeline should write a provenance.txt file describing inputs and outputs:
- Input_1_<param_name> = ...
- Input_2_<param_name> = ...
- ...
- Output_1_ = ...
- Output_2_ = ...
- More to come ...
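The provenance.txt layout above could be produced by a small shared helper. This is a minimal sketch, not an agreed implementation: the function names `format_provenance` and `write_provenance` are hypothetical, and it assumes inputs and outputs are supplied as ordered (name, value) pairs. An output with an empty name yields the `Output_1_ = ...` form shown in the list.

```python
import os


def format_provenance(inputs, outputs):
    """Build the text of provenance.txt from ordered (name, value) pairs.

    Entries are numbered from 1 in the order given, following the
    Input_<n>_<param_name> / Output_<n>_ layout described above.
    """
    lines = []
    for index, (name, value) in enumerate(inputs, start=1):
        lines.append(f"Input_{index}_{name} = {value}")
    for index, (name, value) in enumerate(outputs, start=1):
        lines.append(f"Output_{index}_{name} = {value}")
    return "\n".join(lines) + "\n"


def write_provenance(directory, inputs, outputs):
    """Write provenance.txt into `directory` and return the file's path."""
    path = os.path.join(directory, "provenance.txt")
    with open(path, "w") as handle:
        handle.write(format_provenance(inputs, outputs))
    return path
```

Keeping the formatting in one function means every pipeline emits the same layout, which in turn makes the file easy to parse when output reports are generated later.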