Rationale
Born of the desire to systematize analyses from The Cancer Genome Atlas pilot and scale their execution to the dozens of remaining diseases to be studied, GDAC Firehose now sits atop ~55 terabytes of analysis-ready TCGA data and reliably executes thousands of pipelines per month.
The Broad Institute TCGA GDAC Firehose Provides
Version-stamped, standardized datasets
Precursor to automated analyses: aggregates all available sample batches into a single, uniformly-formatted bolus (one per disease X datatype), which can be immediately fed to algorithmic codes without further data preparation
Version-stamped packages of standard scientific analysis results
Automatically generated for dozens of algorithms: GISTIC, MutSig, Clustering, Correlation, ...
Version-stamped, biologist-friendly reports
Encapsulating analysis results in a form accessible to a wide audience, online for public browsing, and citable in the literature through DOIs
Version-stamped custom runs for TCGA analysis working groups
Performed by request in support of TCGA marker paper analysis, on a much shorter timescale than the monthly data runs and quarterly analysis runs.
These can be explored & retrieved interactively through our data dashboard and analysis dashboard, or downloaded en masse with firehose_get.
Towards the aim of reproducibility, our online suite of reports provides thousands of pages of documentation for the analyses performed; in addition, extensive release notes are available for each versioned dataset and analysis package release. Finally, more information is available in many of the talks and posters we've presented and our online FAQ.
For a discussion of Firehose in the broader context of Big Cancer Data, see Nature Methods 10, 293–297 (2013) doi:10.1038/nmeth.2410
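As a rough illustration of the aggregation idea described under "Version-stamped, standardized datasets" (many sample batches merged into one uniformly formatted, version-stamped table per disease X datatype), here is a minimal Python sketch. It is not Firehose's actual implementation: the function name, record layout, sample identifiers, and version string are all hypothetical and chosen only to make the grouping step concrete.

```python
from collections import defaultdict


def aggregate_batches(batches, version):
    """Merge per-batch sample lists into one table per (disease, datatype).

    Hypothetical record layout: each batch is a dict with the keys
    'disease', 'datatype', and 'samples' (a list of sample identifiers).
    """
    merged = defaultdict(list)
    for batch in batches:
        key = (batch["disease"], batch["datatype"])
        merged[key].extend(batch["samples"])
    # Sort the rows so every table has the same layout, then attach the
    # version stamp so downstream analyses can record which data they used.
    return {
        key: {"version": version, "samples": sorted(rows)}
        for key, rows in merged.items()
    }


# Illustrative input only: three made-up batches across two diseases.
batches = [
    {"disease": "BRCA", "datatype": "mutation", "samples": ["S3", "S1"]},
    {"disease": "BRCA", "datatype": "mutation", "samples": ["S2"]},
    {"disease": "GBM", "datatype": "copy_number", "samples": ["S9"]},
]
result = aggregate_batches(batches, version="example_version")
```

Keying the merged tables on the (disease, datatype) pair mirrors the "one per disease X datatype" layout described above, and carrying the version stamp alongside the data is one simple way to keep downstream results traceable to a specific dataset release.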