Download

firehose_get   version 0.4.13 (released 2018_07_31)


Please note that downloading data from the Broad TCGA GDAC site constitutes agreement to the site's data usage policy.

To simplify retrieval of TCGA data and analysis results, we've introduced firehose_get. To use it, download the latest zip file (firehose_get_latest.zip), then run these two commands from a Unix-compatible command line

  unix%  unzip firehose_get_latest.zip
  unix%  ./firehose_get

and follow the instructions (documentation excerpt below). If wget is missing from your system, install a pre-built version for your platform first. Finally, rather than leaving firehose_get in the directory where you downloaded and unzipped it, move it to a directory on your $PATH, so that it can be run from whatever directory you happen to be working in.
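
As a minimal sketch of the $PATH advice above, the two helper functions below copy the script into a directory of your choosing and check whether that directory is already on $PATH. The function names and the $HOME/bin location in the comments are illustrative assumptions; they are not part of firehose_get.

```shell
# Hypothetical helpers, not part of firehose_get itself.

# Copy the unzipped script into a target directory and make it executable.
install_firehose_get() {
    src="$1"
    dest_dir="$2"
    mkdir -p "$dest_dir" || return 1
    cp "$src" "$dest_dir/firehose_get" || return 1
    chmod +x "$dest_dir/firehose_get"
}

# Succeed if the given directory is already listed in $PATH.
on_path() {
    case ":$PATH:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example use (assumes $HOME/bin is where you keep personal tools):
#   install_firehose_get ./firehose_get "$HOME/bin"
#   on_path "$HOME/bin" || export PATH="$HOME/bin:$PATH"
```

To make the change permanent, the export line would normally go in your shell's startup file.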

Examples
  • firehose_get analyses latest
    Retrieves: every result, for every disease cohort, in the latest GDAC Firehose run

  • firehose_get -tasks mutsig gistic  analyses latest brca ucec
    Retrieves: only Gistic and MutSig results for breast and uterine cancer

  • firehose_get -tasks mut analyses latest prad
    Retrieves: all results which have "mut" in their name, such as MutSig, Mutation_Assessor, and any correlations to mutation data

  • firehose_get -tasks rna clinical stddata 2013_05_23
    Retrieves: any data package with (case-insensitive) "rna" or "clinical" in its name, from the May 23, 2013 data run
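
The matching behaviour in these examples can be illustrated with a small shell function. This is only a sketch of the documented rules (case-insensitive, glob-style, tilde prefix to exclude), not firehose_get's own code. It assumes that a bare pattern such as "mut" matches anywhere in the task name, and that a list containing only exclusions selects everything else. The function name and the sample task names, taken from the examples above, are illustrative.

```shell
# Hypothetical illustration of the -tasks/-only matching rules.
# Usage: task_selected TASK_NAME PATTERN...
# Returns 0 if the task would be kept, 1 if it would be skipped.
task_selected() {
    name=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    shift
    selected=1      # shell "false" until an include pattern matches
    includes=0      # counts whether any include pattern was given
    for pat in "$@"; do
        pat=$(printf '%s' "$pat" | tr '[:upper:]' '[:lower:]')
        case "$pat" in
            "~"*)
                # Leading tilde: a match excludes the task outright.
                case "$name" in *${pat#"~"}*) return 1 ;; esac
                ;;
            *)
                includes=1
                case "$name" in *$pat*) selected=0 ;; esac
                ;;
        esac
    done
    # Assumption: with exclusions only, everything not excluded is kept.
    [ "$includes" -eq 0 ] && return 0
    return "$selected"
}

# Print which of a few sample task names the pattern "mut" would keep.
for task in MutSig Mutation_Assessor Gistic; do
    if task_selected "$task" mut; then
        echo "keep: $task"
    fi
done
```

For the authoritative list of task names in a given run, use the -tasks flag with no pattern list, as described in the documentation below.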

Documentation
%  firehose_get --help

firehose_get : retrieve open-access results of Broad Institute TCGA GDAC runs
Version: 0.4.1 (Author: Michael S. Noble)

Usage: firehose_get [flags]  RunType  Date  [disease_cohort, ... ]

Two arguments are required; the first must be one of
	analyses  awg_gbm  awg_hnsc  awg_lgg  
	awg_luad  awg_pancan8  awg_skcm  awg_stad  
	awg_test  awg_thca  stddata  

while the second must EITHER be a date (in YYYY_MM_DD form) of an
existing GDAC run of the given type OR 'latest'; use the -runs flag
to discern what RunType+Date combinations are available.  An optional
3rd, 4th etc argument may be specified to prune the retrieval, given
as a subset of these case-insensitive TCGA disease cohort names:

	ACC  BLCA  BRCA  CESC  COAD  COADREAD  DLBC  ESCA  
	GBM  HNSC  KICH  KIRC  KIRP  LAML  LGG  LIHC  
	LUAD  LUSC  OV  PAAD  PANCANCER  PANCAN8  PANCAN12  PRAD  
	READ  SARC  SKCM  STAD  THCA  UCEC  UCS  

(taken from https://tcga-data.nci.nih.gov/datareports/codeTablesReport.htm)
Note that as a convenience 'analysis' and 'data' are accepted as
synonyms for the 'analyses' and 'stddata' run types

Flags:
  -a | -auth [cred]   authorize the retrieval of password-protected
                      results; the optional cred[entials] parameter
                      must be one of
                              1) a username:password string
                              2) /a/path/to/a/wgetrc/file
                              3) the empty string
                      If no credentials are supplied (empty string),
                      then FHGETRC will be used if it is set in the
                      environment and points to a regular file (which
                      must be in WGETRC-conformant syntax); otherwise
                      a username:password prompt will be issued.  If
                      both $FHGETRC is set in the environment AND a
                      username:password parameter is specified here,
                      then $FHGETRC will be ignored
  -b | -batch         do not prompt: assume YES to all YES/NO queries
  -c | -cohorts       list available disease cohorts
  -e | -echo          show commands that would be run, but do nothing
  -h | -help | --help this message
  -l | -log           write output to log file, instead of stdout
  -o | -only <list>   further prune the set of archives retrieved, by
                      INCLUDING ONLY results of pipelines whose names
                      match any of the given space-delimited list
                      of patterns; matching is performed with glob-style
                      wildcards, and is case-insensitive; prepending
                      a tilde (i.e. ~) to a task name will cause it
                      to be EXCLUDED from download; when no pattern
                      list is given firehose_get will display all tasks in
                      the selected run.
                      NOTE: not all tasks will execute for all disease
                            cohorts; what tasks are run depends upon the
                            data available for that disease cohort
  -p | -platforms     list data platforms available in Firehose runs
                      (not implemented yet)
  -r | -runs          list available Firehose runs
  -t | -tasks <list>  same as -o|-only flag (kept for back-compatibility)
  -v                  display the version of firehose_get
  -x                  debugging: turn on bash set -x (warning: very verbose)
Broad GDAC website:   http://gdac.broadinstitute.org
Broad GDAC email  :   gdac@broadinstitute.org
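
For scripted (headless) use, the help text above gives two rules that can be checked before firehose_get is invoked at all: the Date argument must be either 'latest' or in YYYY_MM_DD form, and the -b flag suppresses the YES/NO prompts. The sketch below checks the date form and prints the command line it would run. Both function names are hypothetical, and the check covers the form of the date only, not whether such a run exists (use -runs for that).

```shell
# Hypothetical wrapper helpers, not part of firehose_get.

# Succeed if the argument is 'latest' or looks like YYYY_MM_DD.
valid_run_date() {
    case "$1" in
        latest) return 0 ;;
        [0-9][0-9][0-9][0-9]_[0-1][0-9]_[0-3][0-9]) return 0 ;;
        *) return 1 ;;
    esac
}

# Print a batch-mode command line: fh_batch_cmd RunType Date [cohort ...]
fh_batch_cmd() {
    run_type="$1"
    run_date="$2"
    shift 2
    if ! valid_run_date "$run_date"; then
        echo "invalid date '$run_date': expected YYYY_MM_DD or latest" >&2
        return 2
    fi
    # Rebuild the argument list so the command prints without stray spaces.
    set -- firehose_get -b "$run_type" "$run_date" "$@"
    echo "$*"
}

# Example: show the command for the May 23, 2013 data run, two cohorts.
fh_batch_cmd stddata 2013_05_23 brca ucec
```

Printing the command rather than executing it keeps the sketch side-effect free; in a real script the printed line could be run directly, optionally with -e first to preview what firehose_get itself would do.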

v0.4.13: Enable retrieval of GDAN internal analysis run for 2018_07_13 snapshot
v0.4.11 & 0.4.12: Enable retrieval of GDAN internal runs from 2017_09_19 and 2018_01_18 (requires GDAN awg credentials)
v0.4.10: Enable download of .zip files from Firecloud runs
v0.4.9: 2017_08_22 Patch submitted by gennady.margolin@nih.gov, to ensure post-processing completes even when dirnames contain spaces
v0.4.8: 2016_11_17: Support new GDC-style names for TCGA disease cohorts, which prepend TCGA- to the name; e.g. the ACC cohort in legacy TCGA becomes TCGA-ACC in GDC. This applies to all runs conducted after 2016_07_15, the formal end date of TCGA.
v0.4.7: 2016_09_28 Make tool slightly friendlier by filtering out more uninformative noise from wget, while iterating through downloads
v0.4.6: 2016_05_18 firehose_get was originally developed on an ancient CentOS Linux system, but since then newer OS versions have become more stringent in enforcing security protocols; this release acknowledges that by referencing the GDAC site url as https:// instead of http://
v0.4.5: 2015_09_01 Nicer feedback when some runs do not include data/results for all cohorts
v0.4.4: 2015_01_23 When PANCAN12 was discontinued as a cohort, it broke -tasks listing here, so use THCA cohort instead
v0.4.3: 2013_11_08 When possible, output better diagnostic message for wget errors
v0.4.2: 2013_09_26 bugfix: recent edits broke -tasks list when run for "all" disease cohorts
v0.4.1: 2013_09_03 Turn off 'set -o pipefail' while ascertaining which downloader to employ, to better inform users when wget/etc cannot be found in $PATH
v0.4.0: 2013_08_25 introduce -auth option, for retrieval of password protected (e.g. AWG) runs
v0.3.13: 2013_06_24 introduce -only option, as clearer equivalent to -tasks
v0.3.12: 2013_06_04
    perform round trip to GDAC site less to determine valid run type/date
    warn user earlier when input does not map to an existing GDAC firehose run
    give "kinder" feedback when nothing was downloaded
v0.3.11: 2013_02_14
    help output is now shorter by default, when no args given
    -tasks flag now case insensitive
    tiny clarifications & improvements to help docs
v0.3.10: 2013_01_31
    remove hardcoded disease names, in favor of downloading from Broad site
    enhanced firehose_get_scan output (as was done for -runs below)
v0.3.9: 2012_12_22 use firehose_get_scan tool (on Broad servers), and download its output to client sites, to speed up discovery for -runs flag, etc
v0.3.8: 2012_11_16 support potentially any AWG with generic awg_<disease> run type
v0.3.7: 2012_10_21 support PANCAN8 analysis working group (AWG) runs
v0.3.6: 2012_09_20 -runs considers ONLY those GDAC runs with ./data subdir
v0.3.5: 2012_09_12
    discontinue use of static run lists, in favor of dynamically querying GDAC site to display list of runs, what kinds of runs, etc
    support EXCLUDE in -tasks with tilde/~ prefix
v0.3.4: 2012_09_07 tweak date regex to correctly detect October months
v0.3.3: 2012_07_12
    fix printf msg emitted when nothing downloaded
    employ --cache=off, so that most up-to-date run lists are always retrieved
v0.3.2: 2012_06_08
    added -b/-batch for headless use
    'latest' now translated to date prior to download
    be less compulsive when cleaning up
v0.3.1: 2012_05_02
    accept --version, too
    use tumor types to subset list of tasks returned, too
    warn user when subsetted runs return nothing for the given tumor(s)
v0.3.0: 2012_04_22
    tweak awkward wording of -tasks help
    allow --help, too
    -runs flag to display list of available runs
    -tasks flag to subset by glob-pattern matching against task names

#===============================================================================
# This software and its documentation are copyright 2012-2013 by the
# Broad Institute/Massachusetts Institute of Technology. All rights reserved.
#
# This software is supplied without any warranty or guaranteed support whatsoever.
# Neither the Broad Institute nor MIT can be responsible for its use, misuse, or
# functionality.
#===============================================================================