Some tasks in the GDAC Firehose domain have both required and optional inputs, such as the
Aggregate_Molecular_Subtype_Clusters node in the Analyses workflow:
Firehose will not launch a task that has optional inputs until two constraints are met:
(1) All of its required inputs are available.
(2) No more of its optional inputs will ever become available.
To satisfy constraint (2), Firehose looks recursively backward through the DAG and delays launching a downstream task until it determines that no optional upstream task remains that could still be launched or could still run to successful completion. As described below, this feature has been very helpful when running the GDAC analysis workflow. FireCloud lacks it: FireCloud requires only that constraint (1) be satisfied before launching a task. The current GDAC Analyses workflow therefore cannot be ported “as is” from Firehose to FireCloud; to run on FireCloud, the workflow and/or the tasks within it will need to be modified, perhaps substantially. To understand how constraint (2) arose, it helps to examine (a) the structure of the data on which the GDAC workflow operates and (b) how that data generally becomes available over time.
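The launch rule above can be sketched in a few lines of Python. This is only an illustration of the two constraints, not Firehose's actual implementation: the `Task` class, the `has_data` flag, and the function names are all hypothetical, and "could still succeed" is approximated as "is pending, has data, and every required upstream task could still succeed."

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class State(Enum):
    PENDING = "pending"        # has not run (or finished) yet
    SUCCEEDED = "succeeded"
    FAILED = "failed"


@dataclass
class Task:
    # Hypothetical model of a DAG node; not a Firehose data structure.
    name: str
    has_data: bool = True      # does the cohort hold data of the type this task needs?
    state: State = State.PENDING
    required: List["Task"] = field(default_factory=list)
    optional: List["Task"] = field(default_factory=list)


def can_still_succeed(task: Task) -> bool:
    """Recursive lookback: could this task ever deliver an output?"""
    if task.state is State.SUCCEEDED:
        return True
    if task.state is State.FAILED or not task.has_data:
        return False
    # A pending task is viable only if all of its required upstream tasks are.
    return all(can_still_succeed(up) for up in task.required)


def ready_to_launch(task: Task) -> bool:
    # Constraint (1): every required input is available.
    required_ok = all(up.state is State.SUCCEEDED for up in task.required)
    # Constraint (2): no optional upstream task is pending and still viable.
    optional_settled = all(
        up.state is not State.PENDING or not can_still_succeed(up)
        for up in task.optional
    )
    return required_ok and optional_settled
```

Under this model, an optional upstream task with no data does not hold up the downstream task, whereas one that has data but has not yet finished does. Checking constraint (1) alone, as the text says FireCloud does, corresponds to returning `required_ok` by itself.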
We use TCGA to make things concrete, but a similar pattern holds for the operational phase of many data-intensive scientific projects. First, the workflow is run for 38 independent disease cohorts (sets of patient samples), and each sample within a cohort may be characterized in as many as 10 different ways (yielding up to 10 distinct kinds of data, or data types, for each patient sample). The table at http://gdac.broadinstitute.org/runs/stddata__latest/ingested_data.png shows both the disease cohorts (rows) and the data types (columns) in the corpus of TCGA data. Notice the heterogeneity in this data, both in the sizes of the disease cohorts and in the data types each of them offers.
What the table does not show, because TCGA has finished collecting data, is that at any given moment during the TCGA project many of the cells in this table were empty. In general, then, when we executed our analysis workflow on a given cohort, some of the input arcs to the Aggregate task would execute (because data aliquots of that type were available for that cohort) and some would not. The size of the data also grew incrementally but unpredictably over time. In January there might be 0 expression aliquots for breast cancer (BRCA), so none of the expression analysis tasks would execute (which is why they are optional to the Aggregate task). Then in March 100 expression aliquots might appear in the BRCA cohort, so that the upstream expression analysis tasks would now be runnable, and Firehose would delay the Aggregate task until they either complete or fail. (Before March, Firehose would know that those tasks could not be run, for lack of data, and would therefore not postpone launching the Aggregate task for them.)
If no more expression aliquots are added to BRCA until August, then from April until August we expect all upstream expression analysis tasks to job avoid whenever we execute the Analyses workflow (as long as the versions of those tasks and the data both remain the same). Note that job avoidance in upstream tasks does not mandate that Aggregate would also avoid, because other arcs on the DAG might have become populated with data in the meantime (e.g. 231 copy number aliquots might have appeared, causing the copy number analysis tasks to be executed, et cetera).
We hope it is now clear that the processes described here would be labor intensive and error prone to manage manually, and thus why job avoidance and recursive lookback optionality have been valuable features for automating high-throughput GDAC analysis. We expect this to continue as new projects emerge in the Genome Data Analysis Network (GDAN, beginning in the fall of 2016), which collectively will be significantly larger than TCGA, more diverse in their goals, and more heterogeneous in their data types and availability.
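The job avoidance rule described above (skip a task when neither its version nor its input data has changed since its last successful run) can be sketched as follows. This is a minimal illustration under stated assumptions, not how Firehose implements it: the `AvoidanceCache` class, the hash-based fingerprint, and the task and aliquot names in the usage note are all hypothetical.

```python
import hashlib
import json


def fingerprint(task_version, input_ids):
    """Digest of a task version plus the (order-independent) set of its inputs."""
    payload = json.dumps({"version": task_version, "inputs": sorted(input_ids)})
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class AvoidanceCache:
    """Hypothetical record of the last successful run of each (task, cohort)."""

    def __init__(self):
        self._last_success = {}   # (task name, cohort) -> fingerprint

    def should_avoid(self, task, cohort, task_version, input_ids):
        # Avoid only if an identical run (same version, same inputs) succeeded.
        previous = self._last_success.get((task, cohort))
        return previous == fingerprint(task_version, input_ids)

    def record_success(self, task, cohort, task_version, input_ids):
        self._last_success[(task, cohort)] = fingerprint(task_version, input_ids)
```

For example, after `record_success("expression_clustering", "BRCA", "v1", ["expr-001", "expr-002"])`, a later call to `should_avoid` with the same version and aliquots returns `True`, while a new task version or a newly arrived aliquot makes it return `False`. The same check applied to the Aggregate task shows why it need not avoid when its expression inputs are unchanged: a new copy number input changes its input set, and therefore its fingerprint.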
Note that when job avoidance is possible in FireCloud we would in principle be able to use brute force to execute the complete GDAC workflow, even without optionality. This approach would entail running the workflow over and over, in as many passes as needed, until everything that can be run has been run. For example, in Pass1 perhaps 60% of the workflow might execute “as is,” with downstream tasks launched as soon as their required inputs are available (no waiting for optional inputs that are not yet available). In Pass2 at most 60% of the workflow might job avoid, and more upstream tasks might complete for the first time, making a larger set of optional inputs available to downstream tasks, so that perhaps 80% of the entire workflow might complete. Finally, in Pass3 up to 80% of the workflow might be avoided, and if all optional upstream artifacts that could be generated had in fact been generated, the remaining 20% of the downstream workflow could complete.
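A toy simulation of this brute-force approach may help. It is a sketch under simplifying assumptions, not a model of FireCloud's scheduler: each pass decides what to launch from the outputs that existed when the pass began, a task is avoided when its set of available inputs matches its previous run, and all task names are hypothetical.

```python
def run_pass(tasks, done, seen_inputs):
    """Run one pass over the workflow; return the tasks executed in it.

    tasks maps a task name to (required upstream names, optional upstream names).
    """
    snapshot = set(done)   # outputs available when this pass starts
    executed = []
    for name, (required, optional) in tasks.items():
        if not set(required) <= snapshot:
            continue       # a required input is missing: cannot launch yet
        # Launch with whatever optional inputs happen to exist right now.
        inputs = frozenset(required) | (frozenset(optional) & snapshot)
        if seen_inputs.get(name) == inputs:
            continue       # job avoidance: same inputs as the last run
        seen_inputs[name] = inputs
        done.add(name)
        executed.append(name)
    return executed


def run_until_stable(tasks, max_passes=10):
    """Repeat passes until a pass executes nothing (everything job avoids)."""
    done, seen_inputs, history = set(), {}, []
    for _ in range(max_passes):
        executed = run_pass(tasks, done, seen_inputs)
        if not executed:
            break
        history.append(executed)
    return history


# Hypothetical workflow: Aggregate requires clinical; the other two are optional.
workflow = {
    "Aggregate": (["clinical"], ["expression", "copy_number"]),
    "clinical": ([], []),
    "copy_number": ([], []),
    "normalize": ([], []),
    "expression": (["normalize"], []),
}
history = run_until_stable(workflow)
```

In this example `history` holds three passes, and Aggregate appears in both the second and the third: it first runs without the expression input and must be rerun once that input exists. Optionality with recursive lookback would avoid that extra execution by delaying Aggregate until the expression task had settled.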
As stated in our grant review, the NIH has expressed considerable interest in the use of FireCloud as a Global Platform for Collaborative Extreme-Scale Analysis (as well as in its potential for solving forevermore the reproducibility problem for computational analyses). We therefore expect workflows of the type we run now to become much more than "just for the Broad GDAC": something used directly by numerous analysis working groups across the cancer research community. In addition, we are in the final stages of receiving funding for a GDAC-like center for proteomics (within the CPTAC), and are also using Firehose/FireCloud for GDAC-style analyses in the GTEx consortium. Altogether this means that our GDAC-style workflows have the potential for global impact across multiple scientific communities, going far beyond one group at one institute, and effectively helping realize the original vision of the GDR and Prometheus at the Broad.