Download

firehose_get version 0.4.13 (released 2018_07_31)


Please note that downloading data from the Broad TCGA GDAC site constitutes agreement to this data usage policy.

To help simplify retrieval of TCGA data and analysis results, we've introduced firehose_get. To use it, download the latest zip file from here, then perform these two steps from a Unix-compatible command line:

  unix%  unzip firehose_get_latest.zip
  unix%  ./firehose_get

and follow the instructions (documentation excerpt below). If you are missing wget, please look here for links to pre-built versions for your system, or search for it online. Finally, rather than leaving firehose_get in the directory where you downloaded and unzipped it, put it somewhere along your $PATH so that it can be found whenever you want to use it again, whatever directory you are working in.
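
As a minimal sketch of that last suggestion, the function below copies a script into a personal bin directory and adds that directory to $PATH for the current session. The name install_on_path and the default location $HOME/bin are assumptions chosen for illustration; firehose_get itself does not require either.

```shell
# Hypothetical helper (not shipped with firehose_get): copy a script into
# a bin directory and make sure that directory is on $PATH.
install_on_path() {
    src="$1"
    dest_dir="${2:-$HOME/bin}"      # assumed default location
    mkdir -p "$dest_dir" || return 1
    cp "$src" "$dest_dir/" || return 1
    chmod +x "$dest_dir/$(basename "$src")" || return 1
    case ":$PATH:" in
        *":$dest_dir:"*) ;;         # already on PATH, nothing to do
        *) PATH="$dest_dir:$PATH"; export PATH ;;
    esac
}

# Usage, assuming firehose_get was unzipped into the current directory:
#   install_on_path ./firehose_get
```

The PATH change above lasts only for the current shell; to make it permanent, add the directory to $PATH in your shell startup file.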

Examples
  • firehose_get analyses latest
    Retrieves: every result, for every disease cohort, in the latest GDAC Firehose run
  • firehose_get -tasks mutsig gistic  analyses latest brca ucec
    Retrieves: only Gistic and MutSig results for breast and uterine cancer
  • firehose_get -tasks mut analyses latest prad
    Retrieves: all results which have "mut" in their name, such as MutSig, Mutation_Assessor, and any correlations to mutation data

  • firehose_get -tasks rna clinical stddata 2013_05_23
    Retrieves: any data package with (case-insensitive) "rna" or "clinical" in its name, from the May 23, 2013 data run
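
The Date argument in these examples is either latest or a run date such as 2013_05_23. If you call firehose_get from your own scripts, a quick check of that form can catch typos before any download starts. The sketch below is an illustration only and not part of firehose_get: valid_run_date is a hypothetical helper, and it checks only the YYYY_MM_DD shape, not whether a run with that date actually exists (the -runs flag lists those).

```shell
# Hypothetical helper: accept 'latest' or a date in YYYY_MM_DD form.
valid_run_date() {
    case "$1" in
        latest)
            return 0
            ;;
        [0-9][0-9][0-9][0-9]_[0-9][0-9]_[0-9][0-9])
            month=${1#*_}; month=${month%_*}    # middle field
            day=${1##*_}                        # last field
            # Strip one leading zero so the comparisons see plain decimals.
            month=${month#0}; day=${day#0}
            [ "$month" -ge 1 ] && [ "$month" -le 12 ] &&
            [ "$day" -ge 1 ] && [ "$day" -le 31 ]
            ;;
        *)
            return 1
            ;;
    esac
}

# Usage: guard a download with the check.
#   valid_run_date "$run_date" && firehose_get stddata "$run_date"
```
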
Documentation
%  firehose_get --help

firehose_get : retrieve open-access results of Broad Institute TCGA GDAC runs
Version: 0.4.1 (Author: Michael S. Noble)

Usage: firehose_get [flags]  RunType  Date  [disease_cohort, ... ]

Two arguments are required; the first must be one of
	analyses  awg_gbm  awg_hnsc  awg_lgg  
	awg_luad  awg_pancan8  awg_skcm  awg_stad  
	awg_test  awg_thca  stddata  

while the second must EITHER be a date (in YYYY_MM_DD form) of an
existing GDAC run of the given type OR 'latest'; use the -runs flag
to discern what RunType+Date combinations are available.  An optional
3rd, 4th etc argument may be specified to prune the retrieval, given
as a subset of these case-insensitive TCGA disease cohort names:

	ACC  BLCA  BRCA  CESC  COAD  COADREAD  DLBC  ESCA  
	GBM  HNSC  KICH  KIRC  KIRP  LAML  LGG  LIHC  
	LUAD  LUSC  OV  PAAD  PANCANCER  PANCAN8  PANCAN12  PRAD  
	READ  SARC  SKCM  STAD  THCA  UCEC  UCS  

(taken from https://tcga-data.nci.nih.gov/datareports/codeTablesReport.htm)
Note that as a convenience 'analysis' and 'data' are accepted as
synonyms for the 'analyses' and 'stddata' run types

Flags:
  -a | -auth [cred]   authorize the retrieval of password-protected
                      results; the optional cred[entials] parameter
                      must be one of
                              1) a username:password string
                              2) /a/path/to/a/wgetrc/file
                              3) the empty string
                      If no credentials are supplied (empty string),
                      then FHGETRC will be used if it is set in the
                      environment and points to a regular file (which
                      must be in WGETRC-conformant syntax); otherwise
                      a username:password prompt will be issued.  If
                      both $FHGETRC is set in the environment AND a
                      username:password parameter is specified here,
                      then $FHGETRC will be ignored
  -b | -batch         do not prompt: assume YES to all YES/NO queries
  -c | -cohorts       list available disease cohorts
  -e | -echo          show commands that would be run, but do nothing
  -h | -help | --help this message
  -l | -log           write output to log file, instead of stdout
  -o | -only <list>   further prune the set of archives retrieved, by
                      INCLUDING ONLY results of pipelines whose names
                      match any of the given space-delimited list
                      of patterns; matching is performed with glob-style
                      wildcards, and is case-insensitive; prepending
                      a tilde (i.e. ~) to a task name will cause it
                      to be EXCLUDED from download; when no pattern
                      list is given firehose_get will display all tasks in
                      the selected run.
                      NOTE: not all tasks will execute for all disease
                            cohorts; what tasks are run depends upon the
                            data available for that disease cohort
  -p | -platforms     list data platforms available in Firehose runs
                      (not implemented yet)
  -r | -runs          list available Firehose runs
  -t | -tasks <list>  same as -o|-only flag (kept for back-compatibility)
  -v                  display the version of firehose_get
  -x                  debugging: turn on bash set -x (warning: very verbose)
Broad GDAC website:   http://gdac.broadinstitute.org
Broad GDAC email  :   gdac@broadinstitute.org
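
To make the -only/-tasks description above concrete, here is a rough sketch of the documented matching rules: glob-style, case-insensitive, with a leading tilde meaning exclusion. It is not the actual firehose_get code. The function name task_selected is hypothetical, and two behaviors are assumptions: that a pattern may match anywhere in a task name (suggested by the "mut" example), and that an exclusion match always wins over an inclusion match.

```shell
# Illustrative only: mimic the documented -only/-tasks semantics.
# Usage: task_selected TASK_NAME PATTERN...
# Returns 0 if the task would be kept, 1 if it would be skipped.
task_selected() {
    task=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    shift
    includes=0      # number of include patterns seen
    matched=0       # set to 1 once an include pattern matches
    for pat in "$@"; do
        pat=$(printf '%s' "$pat" | tr '[:upper:]' '[:lower:]')
        case "$pat" in
            "~"*)
                # Exclusion pattern: strip the tilde, skip task on match.
                excl=${pat#"~"}
                case "$task" in *$excl*) return 1 ;; esac
                ;;
            *)
                includes=$((includes + 1))
                case "$task" in *$pat*) matched=1 ;; esac
                ;;
        esac
    done
    # With no include patterns, everything not excluded is kept (assumption).
    [ "$includes" -eq 0 ] || [ "$matched" -eq 1 ]
}

# Example with task names taken from the examples earlier on this page:
task_selected MutSig mut "~assessor" && echo "keep MutSig"
task_selected Mutation_Assessor mut "~assessor" || echo "skip Mutation_Assessor"
```
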
Change Log
v0.4.13:
    Enable retrieval of GDAN internal analysis run for 2018_07_13 snapshot
v0.4.11 & 0.4.12:
    Enable retrieval of GDAN internal runs from 2017_09_19 and 2018_01_18
    (requires GDAN awg credentials)
v0.4.10:
    Enable download of .zip files from Firecloud runs
v0.4.9: 2017_08_22
    Patch submitted by gennady.margolin@nih.gov, to ensure post-processing
    completes even when dirnames contain spaces
v0.4.8: 2016_11_17
    Support new GDC-style names for TCGA disease cohorts, which prepend TCGA-
    to the name; e.g. the ACC cohort in legacy TCGA becomes TCGA-ACC in GDC.
    This applies to all runs conducted after 2016_07_15, the formal end date
    of TCGA.
v0.4.7: 2016_09_28
    Make tool slightly friendlier by filtering out more uninformative noise
    from wget, while iterating through downloads
v0.4.6: 2016_05_18
    firehose_get was originally developed on an ancient CentOS Linux system,
    but since then newer OS versions have become more stringent in enforcing
    security protocols; this release acknowledges that by referencing the
    GDAC site url as https:// instead of http://
v0.4.5: 2015_09_01
    Nicer feedback when some runs do not include data/results for all cohorts
v0.4.4: 2015_01_23
    When PANCAN12 was discontinued as a cohort, it broke -tasks listing here,
    so use THCA cohort instead
v0.4.3: 2013_11_08
    When possible, output better diagnostic message for wget errors
v0.4.2:     2013_09_26
    bugfix: recent edits broke -tasks list when run for "all" disease cohorts
v0.4.1:     2013_09_03
    Turn off 'set -o pipefail' while ascertaining which downloader to employ,
    to better inform users when wget/etc cannot be found in $PATH
v0.4.0:     2013_08_25
    introduce -auth option, for retrieval of password protected (e.g. AWG) runs
v0.3.13:    2013_06_24
    introduce -only option, as clearer equivalent to -tasks
v0.3.12:    2013_06_04
    perform fewer round trips to GDAC site to determine valid run type/date
    warn user earlier when input does not map to an existing GDAC firehose run
    give "kinder" feedback when nothing was downloaded
v0.3.11:    2013_02_14
    help output is now shorter by default, when no args given
    -tasks flag now case insensitive
    tiny clarifications & improvements to help docs
v0.3.10:    2013_01_31
    remove hardcoded disease names, in favor of downloading from Broad site
    enhanced firehose_get_scan output (as was done for -runs below)
v0.3.9:     2012_12_22
    use firehose_get_scan tool (on Broad servers), and download its output
    to client sites, to speed up discovery for -runs flag, etc
v0.3.8:     2012_11_16
    support potentially any AWG with generic awg_<disease> run type
v0.3.7:     2012_10_21
    support PANCAN8 analysis working group (AWG) runs
v0.3.6:     2012_09_20
    -runs considers ONLY those GDAC runs with ./data subdir
v0.3.5:     2012_09_12
    discontinue use of static run lists, in favor of dynamically querying GDAC
    site to display list of runs, what kinds of runs, etc
    support EXCLUDE in -tasks with tilde/~ prefix
v0.3.4:     2012_09_07
    tweak date regex to correctly detect October months
v0.3.3:     2012_07_12
    fix printf msg emitted when nothing downloaded
    employ --cache=off, so that most up-to-date run lists are always retrieved
v0.3.2:     2012_06_08
    added -b/-batch for headless use
    'latest' now translated to date prior to download
    be less compulsive when cleaning up
v0.3.1:     2012_05_02
    accept --version, too
    use tumor types to subset list of tasks returned, too
    warn user when subsetted runs return nothing for the given tumor(s)
v0.3.0:     2012_04_22
    tweak awkward wording of -tasks help
    allow --help, too
    -runs flag to display list of available runs
    -tasks flag to subset by glob-pattern matching against task names
Copyright and Disclaimer
#===============================================================================
# This software and its documentation are copyright 2012-2013 by the
# Broad Institute/Massachusetts Institute of Technology. All rights reserved.
#
# This software is supplied without any warranty or guaranteed support whatsoever.
# Neither the Broad Institute nor MIT can be responsible for its use, misuse, or
# functionality.
#===============================================================================
