AWG Runs

The Broad Institute GDAC will gladly coordinate with TCGA analysis working groups (AWGs) to provide custom Firehose runs tailored to their specific needs.  This represents an evolution of Firehose, beyond its original mission of monthly runs intended for archival storage at the DCC and wide public consumption beyond TCGA, by providing in-depth support to ongoing analysis efforts within TCGA.   This provides several "realtime value-added benefits" to AWGs:

  • Currency: pipelines can be run on the latest daily snapshot of data from the DCC, avoiding the time & sample lag of monthly runs
  • Flexibility:  additional runs can be easily performed on AWG-curated disease subtypes, and even include custom analyses
  • Speed:  custom AWG runs can be executed in only a few days (excluding computationally intensive algorithms that may take >1 week to run)
  • Familiarity:  runs use the same internal Firehose machinery, external-facing dashboards, Nozzle reports, etc., already known to the community
  • Scope:  a stepping stone toward an open-access Firehose that can be manipulated directly by TCGA researchers, instead of having runs curated only at the Broad Institute

Our custom AWG data runs can also be used to define a baseline AWG data freeze. These freeze products are ideally suited for sharing data across the various centers participating in a given AWG. Furthermore, all of the output archives produced by our custom AWG runs are easily obtained with firehose_get, in the same manner as the monthly runs. For a more in-depth look at what we provide, see our Feb 2013 presentation to the Lung Adenocarcinoma AWG.  Please contact us at gdac@broadinstitute.org or visit gdac.broadinstitute.org for more details. Our AWG runs are also reflected on the TCGA wiki.


See this TCGA Wiki page for ProgressSummary: TumorStatusReport Spreadsheet (more up-to-date and comprehensive than our internal chart below)

Internal Broad Staffing Prioritization Spreadsheet

 TCGA calendar for AWG telecons



  1. What constitutes an AWG run?   As summarized in our AWG run checklist, the products in an AWG freeze are:

    a. A YYYY_MM_DD version stamp (e.g. 2012_10_24), denoting when a data snapshot (to be frozen) was obtained from the DCC
        • This can optionally contain a runcode suffix, such as _00 or _01, to denote additions/deletions to the base snapshot

    b. A sample list representing all of the data available in the snapshot

      A tab-delimited table (readable in a plain-text editor such as Emacs as well as in Excel) containing 1 row per aliquot, with at least the following columns:
        • TCGA identifier (preferably full aliquot, and potentially UUID in the future)
        • The corresponding Firehose identifier
        • Sample type (per TCGA standard, TP=tumor primary, NB=normal blood, etc)
        • The platform of origination on which the given aliquot data were collected (e.g. genome_wide_SNP_6 for copy number data)
        • URL to the source file archived at the DCC in which that datatype-specific aliquot can be found
        • URL to the corresponding SDRF for the given DCC archive file

      A heatmap representing platform coverage (per datatype per sample) will also be provided and linked on the AWG run dashboard.

    c. The results of the corresponding stddata workflow executed upon the freeze samples

    d. The results of the corresponding Analyses workflow executed upon the freeze samples, updated with annotations from the stddata workflow

    e. A set of web-browsable Nozzle reports (descriptions, figures, tables, etc) for (d)

      The individual Nozzle reports for each analysis task run against a given freeze sample cohort are aggregated into a single comprehensive report by the gdac_reports tool.

    f. An online dashboard providing simple, central access to all of the above
        • Located at /xchip/gdac_data/runs/awg_<disease_name>__YYYY_MM_DD on the internal Broad filesystem
        • Which corresponds to the online URL http://gdac.broadinstitute.org/runs/awg_<disease_name>__YYYY_MM_DD
        • The gdac_status tool is used to obtain the pass/fail/running/not-run status of all tasks in (d)

    g. A set of DCC-submission-ready archives containing (c), (d), and (e), retrievable from the dashboard with firehose_get

      The gdac2public tool is used to provision these DCC archives from the internal submission tree (into which they are packaged by Firehose) to the online dashboard location mentioned in (f)
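
As an illustration of the version stamp and dashboard locations described above, here is a minimal Python sketch. It is not part of the GDAC tooling: the function names are hypothetical, the runcode suffix is assumed to be exactly two digits, and it assumes the suffix does not appear in the run directory name (the path templates above show only YYYY_MM_DD).

```python
import re
from datetime import datetime

# YYYY_MM_DD, optionally followed by a two-digit runcode suffix such as _00 or _01.
# (The two-digit width is an assumption based on the examples in the text.)
STAMP_RE = re.compile(r"(\d{4}_\d{2}_\d{2})(?:_(\d{2}))?")


def parse_version_stamp(stamp):
    """Split a version stamp into (date, runcode); runcode is None when absent."""
    match = STAMP_RE.fullmatch(stamp)
    if match is None:
        raise ValueError(f"not a YYYY_MM_DD[_NN] version stamp: {stamp!r}")
    date_part, runcode = match.groups()
    # strptime raises ValueError for impossible dates such as 2012_13_40.
    date = datetime.strptime(date_part, "%Y_%m_%d").date()
    return date, runcode


def run_locations(disease_name, stamp):
    """Return (internal path, online URL) following the templates in the text."""
    date, _runcode = parse_version_stamp(stamp)
    # Assumption: the directory name carries only the date, not the runcode.
    run_dir = f"awg_{disease_name}__{date:%Y_%m_%d}"
    return (
        f"/xchip/gdac_data/runs/{run_dir}",
        f"http://gdac.broadinstitute.org/runs/{run_dir}",
    )
```

Calling run_locations with a disease name and a stamp such as 2012_10_24 returns the filesystem path and the dashboard URL as a pair; the exact spelling and case of the disease name used in real run directories is not specified here and should be checked against an existing run.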

  2. How do we generate these freeze products?
  3. Again, in the ideal case we'd simply use the canonical stddata & analyses runs produced on a monthly basis.  But conflicting delivery dates and algorithm versions mean that is not always possible, so for PANCAN8 Mike prototyped FISS and Python scripts that were also used for SKCM and THCA, and which Dan has since generalized into a new Python tool.
    1. The gdac_freeze tool (/wiki/spaces/GDAC/pages/844333857), in the same GDAC bin location as fiss et al.  (you will need to log in to Confluence to see the gdac_freeze page)
    2. Can be run by anyone

  4. Who generates these freeze products?

    After the AWG chair appoints an analysis champion (AC) and data coordinator (DC), the AC and DC work in tandem to guide the freeze products through the process.  The DC is responsible for curating the sample list and communicating such to the AWG at large, while the AC is responsible for seeing that the analyses from the freeze list are sufficiently vetted, and communicating their availability to the AWG at large.  The GDAC engineering team will provide appropriate support as required.

  5. How can I cross-reference the samples in the Firehose AWG run I'm shepherding with those in the freeze list maintained by my AWG?

    1. Interactively: by inspection of the latest samples report.  If your AWG run has a different YYYY_MM_DD version stamp than the latest dicing, then look at the list of all sample reports.
    2. Programmatically:  use the gdac_counts and/or gdac_data CLI tools (use -h or -help options for usage instructions);  additional details can also be obtained with 'fiss sample_list ...'  or 'fiss annot_get ...'
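
For a quick check outside of the GDAC tools, the comparison between a run's sample list and an AWG freeze list can also be sketched in plain Python. This is only an illustration: it assumes both lists are tab-delimited files with a header row, and the column name tcga_id is hypothetical and should be replaced with whatever header your files actually use.

```python
import csv


def read_ids(path, column):
    """Return the set of non-empty values in one column of a tab-delimited file."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        if column not in (reader.fieldnames or []):
            raise KeyError(f"{path} has no column named {column!r}")
        ids = set()
        for row in reader:
            # Short rows yield None for missing fields, so guard before stripping.
            value = (row[column] or "").strip()
            if value:
                ids.add(value)
        return ids


def cross_reference(run_tsv, freeze_tsv, column="tcga_id"):
    """Compare two sample lists by identifier; 'tcga_id' is a hypothetical header."""
    run_ids = read_ids(run_tsv, column)
    freeze_ids = read_ids(freeze_tsv, column)
    return {
        "only_in_run": sorted(run_ids - freeze_ids),
        "only_in_freeze": sorted(freeze_ids - run_ids),
        "in_both": sorted(run_ids & freeze_ids),
    }
```

Samples listed under only_in_freeze are the ones to chase first, since they are expected by the AWG but absent from the Firehose run.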

  6. What do we do when something more is needed?

    Inevitably one of the analyses will be incorrect, or will use an older version of code, or some samples will need to be added and/or deleted.  Or, after seeing an analysis result, one would like to rerun parts of the workflow on a newly identified subtype or subset of the data (see below).  The need to address both of these flexibly is the strongest argument for NOT using canonical GDAC monthly runs, in favor of AWG-specific workspaces as created by the gdac_freeze tool (/wiki/spaces/GDAC/pages/844333857).

  7. What if I want to do subtype-based analyses?

    All of the same machinery is used: create a fresh sample set that lists the samples in your subtype (e.g. as output from a clustering algorithm);  for example, consider this TSV file which defines a small set of 8 samples (note that multiple subtypes may also be specified within a single TSV file).  The sample set will be named LGG-astrocytoma in Firehose (per the sample_set_id column), and can be loaded via this fiss command at the Unix prompt

                  %  fiss  sset_import <your_work_space> LGG-astrocytoma.tsv

or via the GUI.  This sample set can then be fed to all of the tasks in our data and analyses workflows, just as the complete LGG sample set (as downloaded from the TCGA DCC) would.  To make the purpose of your work clearer to downstream consumers, please follow the naming convention 

             <TCGA_Disease_Abbreviation>-<Subtype_Name>

strictly when naming subtypes or additional sample (sub)sets in the GDAC Firehose.  Here

             <TCGA_Disease_Abbreviation>   is drawn from this list of TCGA disease abbreviations.  Aggregates, like COADREAD, are permissible

             <Subtype_Name>             should be alphanumeric, optionally using either CamelCase or underscore <_> to denote additional semantics

Note that because dash <-> and period <.> have important semantics in parsing and naming files for DCC submission, they should not be used in your <Subtype_Name>.  Some examples are:

              IDH_Wild_Type                good

              IDH-Wild-Type                  bad, contains dash

              Freeze_2.33                      bad, contains period

              Freeze_233                       good
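
The naming rule and the examples above can be checked mechanically. The following is a minimal Python sketch, not an official GDAC validator: the function names are hypothetical, and the set of permitted disease abbreviations must be supplied by the caller from the TCGA list.

```python
import re

# Subtype names: letters, digits and underscores only, so no dash and no period.
SUBTYPE_RE = re.compile(r"[A-Za-z0-9_]+")


def is_valid_subtype_name(name):
    """True when the subtype name is non-empty and free of dash/period."""
    return SUBTYPE_RE.fullmatch(name) is not None


def is_valid_sample_set_name(name, disease_abbreviations):
    """Check <TCGA_Disease_Abbreviation>-<Subtype_Name> against a caller-supplied list."""
    # Split on the FIRST dash only; any further dash lands in the subtype and fails.
    disease, dash, subtype = name.partition("-")
    return (
        bool(dash)
        and disease in disease_abbreviations
        and is_valid_subtype_name(subtype)
    )
```

For instance, with {"LGG", "COADREAD"} as the permitted abbreviations, LGG-IDH_Wild_Type passes while LGG-IDH-Wild-Type and LGG-Freeze_2.33 are rejected, matching the table above.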

Clear names for your subtypes/subsets will not only make the purpose of your work clearer to everyone who looks at it later; following these conventions will also foster the (re)use, for AWG runs, of the existing GDAC machinery for standard monthly runs (stddata and Analyses).  This includes the tools to initiate runs, check workflow status, package results, validate packaging, generate reports, publish via firehose_get, and submit to the DCC.
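
Returning to the subtype TSV file described earlier, one way to produce such a file from clustering output is sketched below in Python. This is an assumption-laden illustration: only the sample_set_id column is named in the text above, so the second column header (sample_id) and the two-column layout are guesses that should be checked against a known-good file before importing with fiss sset_import. The function name is hypothetical.

```python
import csv


def write_sample_sets(path, subtypes):
    """Write one tab-delimited row per (sample set, sample) pair.

    subtypes maps a sample set name to an iterable of sample identifiers, so
    several subtypes can share a single TSV file.
    """
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        # 'sample_set_id' comes from the text; 'sample_id' is an assumed header.
        writer.writerow(["sample_set_id", "sample_id"])
        for set_name, sample_ids in subtypes.items():
            for sample_id in sample_ids:
                writer.writerow([set_name, sample_id])
```

The mapping passed in would typically have one key per subtype, for example LGG-astrocytoma, with the Firehose sample identifiers belonging to that subtype as its values.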