Documentation

Table Of Contents

Downloading Firehose Data and Analysis Results

Searching our Analysis and Stddata Run Results

Adding Custom Data to Firehose: from external sources like other TCGA centers

Output Archive Nomenclature

Analyses Workflow in Firehose

  • Directed graph of GDAC Firehose analysis tasks: the name on each node corresponds to the pipeline task(s) executed at that step, and is also reflected in the output archive for that task (see Nomenclature above). Here is a live version of the same graph, in which clicking on a graph node will bring you to the Nozzle report for the respective analysis task.

 

Expression Microarray processing

  • Raw data (level 1), probe-level data (level 2), and gene-level data (level 3) of mRNA and miRNA expression data were downloaded from the DCC.
  • The data processing is described in the TCGA OV paper.

Clinical Data Processing

  • Historically, TCGA clinical data elements (CDEs) have been collected and distributed in XML form
  • But many bioinformatics algorithms are coded to operate upon 2D tables instead of XML tree structures
  • So to foster algorithmic analysis, Firehose transforms the CDEs from XML into 2D tables
  • These 2D tables are available in the Merge_Clinical set of archives created by our stddata runs
    • And should reflect the entire range of 700-800 CDEs available in TCGA (unioned across all disease cohorts)
  • To simplify downstream correlation analyses, Firehose then selects ("picks") a subset of approximately 60-80 CDEs
  • This subset of CDEs is available in the Clinical_Pick_Tier1 set of archives created by our stddata runs
  • The process by which this is done is described in our Clinical data processing workflow
  • Here is an interactive heatmap showing the current set of picked CDEs, broken down by disease cohort.
  • Here's an Excel spreadsheet showing the same thing.
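As a rough illustration of the XML-to-table step, the sketch below flattens a small XML tree into one row of a 2D table, naming each column by its dotted path from the root. The sample XML, the `flatten` function, and the use of "NA" for empty elements are assumptions made for this example only; real TCGA clinical XML uses namespaces and repeated elements, which this sketch does not handle.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified clinical record (not real TCGA XML).
SAMPLE_XML = """
<patient>
  <gender>female</gender>
  <stage_event>
    <pathologic_stage>stage ii</pathologic_stage>
  </stage_event>
  <race></race>
</patient>
"""


def flatten(element, prefix=""):
    """Map each leaf element to a dotted path, e.g. 'patient.gender'."""
    path = f"{prefix}.{element.tag}" if prefix else element.tag
    children = list(element)
    if not children:
        text = (element.text or "").strip()
        # Using "NA" for empty leaves is an assumption of this sketch.
        return {path: text if text else "NA"}
    row = {}
    for child in children:
        # Repeated sibling tags would overwrite each other in this sketch.
        row.update(flatten(child, path))
    return row


row = flatten(ET.fromstring(SAMPLE_XML.strip()))
for name, value in row.items():
    print(f"{name}\t{value}")
```

One such row per patient, with columns unioned across patients, gives the kind of 2D table that tabular algorithms can consume.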

     Click here to see how Firehose CDEs were improved in August 2015 (and why some CDE names changed)

    In August of 2015 we greatly expanded the set of clinical data offered by FireBrowse, so that it reflects the entire range of 700-800 CDEs collected by TCGA (instead of only the 60-80 CDEs picked by Firehose for automated analysis). As a result, the Clinical_Pick_Tier1 archives now bundle 2 forms of values:

    1. Entire set of TCGA CDEs, verbatim (in new All_CDES.txt file): adding over 700 additional clinical parameters

    2. The CDE subset normalized by Firehose for downstream analyses (in <cohort>.clin.merged.picked.txt file)

    For example, to date the ACC picked file has contained fewer than 20 CDEs while All_CDEs.txt now contains more than 100.

    In addition, the following improvements were made:

    1. Followup values are merged, when available, to yield the most up-to-date values per CDE
    2. Corrected problem wherein some True/False values for regimen_indication CDE were erroneously swapped
    3. Created an interactive CDE heatmap, which on a single page shows exactly what CDEs are selected for analyses in Firehose for all disease cohorts. 
    4. Updated the FireBrowse clinical samples API to reference this new CDE table
    5. Enhanced the Merge_Clinical pipeline to leverage auxiliary CDEs when available (COAD, READ, ESCA): 
      1. for all primary CDEs that also have a value in the aux CDE file (e.g. MSI), we now replace the primary value if it is NA and the aux value is not NA
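
    The auxiliary-CDE rule above can be sketched as follows. This is a minimal illustration, not the Merge_Clinical code: the function name `fill_from_aux`, the dictionary layout, and the literal "NA" marker are assumptions.

```python
NA = "NA"  # assumed missing-value marker


def fill_from_aux(primary, aux, na=NA):
    """Return a copy of `primary` with NA values replaced by non-NA aux values.

    Only CDEs already present in the primary record are considered, and a
    primary value that is not NA is never overwritten.
    """
    merged = dict(primary)
    for cde, aux_value in aux.items():
        if cde in merged and merged[cde] == na and aux_value != na:
            merged[cde] = aux_value
    return merged


# Hypothetical records for one patient.
primary = {"msi": "NA", "gender": "female", "stage": "NA"}
aux = {"msi": "msi-h", "gender": "male", "stage": "NA", "extra": "x"}
result = fill_from_aux(primary, aux)
print(result)
```

    Here only `msi` changes: `gender` already has a primary value, `stage` is NA in both sources, and `extra` is not a primary CDE.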
     Click here for more internal details on the clinical data processing

    How the GDAC clinical picker pipeline processes clinical parameters from XML data:
    Each clinical parameter name is concatenated with its parent node names in the XML data, using '.' as the separator.
    For all CDEs, the pipeline:

    1. splits each name on '.' and takes the last element as the parameter name.
    2. defines a list of parameters to process.
      • takes all CDEs starting with 'patient.*' and filters out parameters starting with 'admin.*', 'patient.samples.*', 'patient.clinical_cqcf.*' and 'patient.biospecimen_cqcf.*'.
    3. for each parameter name, generates a matrix of all clinical data having that parameter name.
      • These matrices are saved under /each_param/ and are useful for locating a related parameter set having the same name.
    4. for each parameter name under /each_param/, processes the data and saves the result in All_CDEs_*.txt:
      1. if the parameter has multiple followup values, they are reduced to one parameter holding the latest values.
      2. if the parameter has multiple values that are not followup data, its additional event data are saved under /EXTRA/.
    5. uses All_CDEs_*.txt as the input for generating *.clin.merged.picked.txt with the selectionFileGenerator.
      • *.clin.merged.picked.txt still has a small set of parameters suggested by pathologists for clinical correlation analysis. However, more parameters, which are not in *.clin.merged.picked.txt, are available in All_CDEs_*.txt, and you can add them to *.clin.merged.picked.txt for your clinical correlation tests.
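
    The name handling described above can be sketched as below. The dotted names in the example are illustrative, the function names are hypothetical, and treating "the latest values" as the most recent non-NA followup value is an assumption; the actual picker may resolve followups differently.

```python
EXCLUDED_PREFIXES = (
    "admin.",
    "patient.samples.",
    "patient.clinical_cqcf.",
    "patient.biospecimen_cqcf.",
)


def select_parameters(full_names):
    """Keep 'patient.*' names that do not fall under an excluded prefix."""
    return [
        name for name in full_names
        if name.startswith("patient.") and not name.startswith(EXCLUDED_PREFIXES)
    ]


def short_name(full_name):
    """Take the last '.'-separated element as the parameter name."""
    return full_name.rsplit(".", 1)[-1]


def group_by_short_name(full_names):
    """Collect all selected dotted names that share a parameter name."""
    groups = {}
    for name in select_parameters(full_names):
        groups.setdefault(short_name(name), []).append(name)
    return groups


def latest_value(values_oldest_first, na="NA"):
    """Assumed followup rule: the most recent value that is not NA."""
    for value in reversed(values_oldest_first):
        if value != na:
            return value
    return na


# Illustrative names only.
names = [
    "patient.vital_status",
    "patient.follow_ups.follow_up.vital_status",
    "patient.samples.sample.sample_type",
    "admin.batch_number",
    "patient.clinical_cqcf.histological_type",
]
groups = group_by_short_name(names)
print(groups)
print(latest_value(["alive", "NA", "dead"]))
```

    Grouping by the short name mirrors the role of /each_param/: every dotted path that ends in the same parameter name lands in one group, which is then reduced to a single column.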

     Click here for information on clinical data from NCI

Clustering Pipelines

  • CNMF
  • iClusterPlus
    • Preprocessor for TCGA Broad GDAC input data
      • For the input MAF file, the preprocessor generates a sample-by-gene aberration matrix and filters out genes with lower mutation rates.
      • For the input expression data generated by the mRNAseq preprocessor in the stddata run, the preprocessor filters out genes of lower variance and generates a sample-by-gene matrix.
      • For the input copy number seg file, the preprocessor filters out the duplicate regions and generates a sample-by-gene matrix.
      • For each sample-by-gene matrix above, it generates a matrix restricted to the samples shared across the different platform data.
      • Further details are available in the Nozzle report.
    • Publication
    • Reproducibility of the result
      • According to the author of the R package, reproducing exactly the same percentEV plot is not guaranteed due to the randomness in MCMC-EM simulation.
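
Two of the iClusterPlus preprocessing steps described above, variance filtering and restriction to shared samples, can be sketched as follows. This is an illustration under stated assumptions: matrices are nested dictionaries (sample, then gene), every sample has a value for every gene, population variance is used, and the variance cutoff is a made-up number. The actual preprocessor's thresholds and data structures are documented in the Nozzle report, not here.

```python
from statistics import pvariance


def filter_low_variance(matrix, min_variance):
    """Drop genes whose variance across samples is below `min_variance`."""
    genes = sorted({gene for row in matrix.values() for gene in row})
    keep = [
        gene for gene in genes
        if pvariance([matrix[sample][gene] for sample in matrix]) >= min_variance
    ]
    return {sample: {g: row[g] for g in keep} for sample, row in matrix.items()}


def intersect_samples(*matrices):
    """Restrict every matrix to the samples present in all of them."""
    shared = set(matrices[0])
    for matrix in matrices[1:]:
        shared &= set(matrix)
    ordered = sorted(shared)
    return [{sample: matrix[sample] for sample in ordered} for matrix in matrices]


# Hypothetical toy data: expression and mutation matrices.
expression = {
    "S1": {"GENE_A": 1.0, "GENE_B": 5.0},
    "S2": {"GENE_A": 1.0, "GENE_B": 9.0},
    "S3": {"GENE_A": 1.0, "GENE_B": 7.0},
}
mutation = {"S1": {"TP53": 1}, "S2": {"TP53": 0}, "S4": {"TP53": 1}}

filtered = filter_low_variance(expression, min_variance=0.5)
expr_shared, mut_shared = intersect_samples(filtered, mutation)
print(sorted(expr_shared), sorted(mut_shared))
```

GENE_A is constant across samples and is dropped, and only S1 and S2 survive the intersection because S3 and S4 each appear in just one platform.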

RNAseq Pipelines

  • Clustering
  • RNAseq_RSEM_value
  • mRNAseq_preprocessor: Picks the "normalized_count" (quantile normalized RSEM) value from the illumina hiseq/ga2 mRNAseq level_3 (v2) data set and builds the log2-transformed mRNAseq matrix used for downstream analysis. To maximize sample counts we include both HiSeq and GA2 aliquots in each cohort dataset, but if a given patient has both HiSeq and GA2 aliquots the HiSeq aliquot will take precedence (to avoid double-counting a patient during analysis). The pipeline also creates a log2-transformed RPKM matrix from the hiseq/ga2 mRNAseq level 3 (v1) data set.
  • Z score calculation of RSEM/RPKM data:
    Z = ((expression in single tumor sample) - (mean expression in all tumor samples)) / (standard deviation of expression in all tumor samples)
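
A minimal sketch of two steps from this section: giving HiSeq precedence over GA2 when a patient has both, and computing per-gene Z scores as the difference from the cohort mean divided by the standard deviation. The platform labels, tuple layout, and function names are hypothetical, and the use of the sample (n-1) standard deviation is an assumption, since the text does not say which estimator is used.

```python
from statistics import mean, stdev


def pick_one_aliquot_per_patient(aliquots):
    """Keep one aliquot per patient; a HiSeq aliquot replaces a GA2 one.

    `aliquots` is an iterable of (patient_id, platform, aliquot_id) tuples.
    """
    chosen = {}
    for patient, platform, aliquot in aliquots:
        current = chosen.get(patient)
        if current is None or (platform == "hiseq" and current[0] != "hiseq"):
            chosen[patient] = (platform, aliquot)
    return chosen


def z_scores(values):
    """Z = (value - mean of all tumor samples) / (standard deviation)."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (assumption)
    return [(value - mu) / sigma for value in values]


# Hypothetical aliquots and expression values for one gene.
aliquots = [("P1", "ga2", "a1"), ("P1", "hiseq", "a2"), ("P2", "ga2", "a3")]
picked = pick_one_aliquot_per_patient(aliquots)
scores = z_scores([2.0, 4.0, 6.0])
print(picked)
print(scores)
```

A gene with identical expression in every sample has zero standard deviation, so `z_scores` would raise a ZeroDivisionError; a real pipeline would need to decide how to report such genes.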

miRseq Pipelines

  • miRseq_preprocess: Picks the "RPM" (reads per million miRNA precursor reads) value from the illumina hiseq/ga mirnaseq Level_3 data set and builds the log2-transformed matrix. The preprocessor removes all records with NA values, which may lower the number of miRs utilized & reported during pipeline execution.
  • miRseq_mature_preprocess: Generates a matrix with the mature strand value "reads per million miRNA mature reads" from the illumina hiseq/ga mirnaseq Level_3 data set. Mature strands have a MIMAT accession in the annotation; the pipeline uses the annotation to collect all isoforms of each mature strand, sums their RPM values (one sum per mature strand per sample), merges the sums into one table, and applies a log2 transform.
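
The mature-strand aggregation can be sketched as below: isoform RPM values are summed per mature strand and sample, then log2 transformed. The record layout, the function name, and the handling of zero sums (reported as None, since log2 of zero is undefined) are assumptions for illustration, and the accession strings are used only as example keys.

```python
import math
from collections import defaultdict


def mature_strand_matrix(isoform_records):
    """Sum isoform RPM per (mature strand, sample), then log2 transform.

    `isoform_records` is an iterable of (sample_id, mimat_id, rpm) tuples.
    Returns {mimat_id: {sample_id: log2(sum of RPM) or None}}.
    """
    sums = defaultdict(lambda: defaultdict(float))
    for sample, mimat, rpm in isoform_records:
        sums[mimat][sample] += rpm
    return {
        mimat: {
            sample: math.log2(total) if total > 0 else None
            for sample, total in per_sample.items()
        }
        for mimat, per_sample in sums.items()
    }


# Hypothetical isoform-level records.
records = [
    ("S1", "MIMAT0000062", 3.0),
    ("S1", "MIMAT0000062", 5.0),
    ("S2", "MIMAT0000062", 4.0),
    ("S1", "MIMAT0000063", 0.0),
]
matrix = mature_strand_matrix(records)
print(matrix)
```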

mRNA Pipelines

  • mRNA_Preprocess_Median: Picks the matrix for the platform (Affymetrix HG U133, Affymetrix Exon Array, or Agilent gene expression) with the largest number of samples and writes it out.
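
The platform choice can be sketched in a few lines. The platform keys and the dictionary layout are hypothetical, and the tie-breaking behavior (first platform encountered wins) is a property of this sketch rather than something the text specifies.

```python
def pick_largest_platform(matrices):
    """Return the platform whose matrix covers the most samples.

    `matrices` maps a platform name to {sample_id: expression values}.
    """
    return max(matrices, key=lambda platform: len(matrices[platform]))


# Hypothetical per-platform matrices keyed by sample.
matrices = {
    "affy_hg_u133": {"S1": [1.2], "S2": [0.7]},
    "affy_exon": {"S1": [1.1]},
    "agilent": {"S1": [1.0], "S2": [0.9], "S3": [1.4]},
}
chosen = pick_largest_platform(matrices)
print(chosen)
```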

Methylation

  • Preprocessor (includes recent recommendations for improvement)

Mutation Pipelines

Copy Number Pipelines

RPPA Pipelines

Pathway Pipelines

Feature Table: Aggregate_AnalysisFeatures

The purpose of this pipeline is to aggregate the most important findings across ALL pipelines in the GDAC Firehose analysis workflow, into a single feature table. At present the feature table represents the samples by selected significant events (copy number alterations, somatic mutations, marker genes in each mRNAseq clustering subtype, clinical features and clustering results). The first column of the table is the sample id, with the remaining columns representing the analysis features as described here:

  • Clinical features: start with “CLI_” followed by feature name. The clinical file (*.merged.txt) comes from the Append_CustomClinical pipeline.
  • Clustering results: start with “CLUS_” followed by platform_method (e.g. CLUS_mRNAseq_cHierarchical). The cluster file (*.mergedcluster.txt) comes from the Aggregate_Molecular_Subtype_Clusters pipeline.
  • Somatic mutation genes: start with “SMG_” followed by version number (mutsig2.0, cv, 2cv)_gene name (e.g. SMG_mutsig.2CV_FAM47C), as taken from the significant gene list (*.sig_genes.txt) produced by Mutsig2CV. The numbers in each row of a given SMG column indicate the type of mutation (with 0 denoting that no mutation was detected):

    1. Synonymous

    2. In-frame INDEL

    3. Other Non-synonymous

    4. Missense

    5. Splice Site

    6. Frameshift

    7. Nonsense

  • Somatic mutation gene expression: start with “SMG_” followed by gene name_mRNA (e.g. SMG_KRT3_mRNA). The mRNA expression (*.uncv2.mRNAseq_RSEM_normalized_log2.txt) comes from the mRNAseq_preprocessor pipeline.
  • Mutation rate: rate_non (non synonymous) and rate_sil (synonymous). The mutation rate (patient_counts_and_rates.txt) comes from Mutsig2CV.
  • Marker genes in each mRNAseq clustering subtype: start with “mRNA_” followed by CNMF_gene name_difference_cluster number (e.g. mRNA_CNMF_FAM66E_.0.6_2). In each cluster, the top 5 up regulated and top 5 down regulated genes were selected.

  • Significant copy number alterations as reported by GISTIC:
    • copy number focal amplifications: start with “Amp_” followed by cytoband (e.g. Amp_1q32.1)
    • focal deletion: start with “Del_” followed by cytoband (e.g. Del_1p36.32)
    • Arm level amplification: start with “CN_” followed by arm_Amp (e.g. CN_10p_Amp)
    • Arm level deletion: start with “CN_” followed by arm_Del (e.g. CN_10p_Del)
    • Copy number alteration gene with expression: start with “Amp/Del_” followed by gene name_cytoband_mRNA (e.g. Amp_SOX2_3q26.32_mRNA and Del_PARK2_6q24.3_mRNA)
      • Supplemented with copy number altered genes in our master list built from PANCANCER CNVs in Zack et al 2013 and COSMIC
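
When reading the feature table, the column naming conventions above can be used to sort columns by kind. The sketch below is one hedged way to do that: the category labels and the `classify_feature` function are hypothetical, and the rules only cover the prefixes and suffixes listed in this section.

```python
# Codes used in SMG columns, as listed above (0 means no mutation detected).
MUTATION_TYPES = {
    0: "No mutation detected",
    1: "Synonymous",
    2: "In-frame INDEL",
    3: "Other Non-synonymous",
    4: "Missense",
    5: "Splice Site",
    6: "Frameshift",
    7: "Nonsense",
}


def classify_feature(column):
    """Guess the feature kind from a column name's prefix and suffix."""
    if column in ("rate_non", "rate_sil"):
        return "mutation rate"
    if column.startswith("CLI_"):
        return "clinical feature"
    if column.startswith("CLUS_"):
        return "clustering result"
    if column.startswith("SMG_"):
        # SMG_<gene>_mRNA holds expression; other SMG_ columns hold mutation codes.
        return "SMG mRNA expression" if column.endswith("_mRNA") else "somatic mutation"
    if column.startswith("mRNA_"):
        return "mRNAseq marker gene"
    if column.startswith(("Amp_", "Del_")):
        if column.endswith("_mRNA"):
            return "copy number gene with expression"
        return "focal amplification" if column.startswith("Amp_") else "focal deletion"
    if column.startswith("CN_"):
        if column.endswith("_Amp"):
            return "arm-level amplification"
        if column.endswith("_Del"):
            return "arm-level deletion"
    return "unknown"


for name in ["CLI_gender", "SMG_mutsig.2CV_FAM47C", "SMG_KRT3_mRNA",
             "Amp_1q32.1", "Amp_SOX2_3q26.32_mRNA", "CN_10p_Del"]:
    print(name, "->", classify_feature(name))
```

The `_mRNA` suffix is checked before the bare prefix so that expression columns are not mistaken for mutation or focal-event columns.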

Adding New Codes To Firehose

QC Pipelines