Nomenclature

Archive Nomenclature

As of 2017, our archives follow the new nomenclature given below:

<DiseaseCohortName>.<TaskName>.<RunCode>.<Revision>


Element

Description of Permissible Values

DiseaseCohortName

A string of the form

    <project_name>-<disease_specification>[-<sample_type>]

for example: TCGA-ACC-TP.

The <disease_specification> most often refers to a single disease study given by its disease abbreviation, such as GBM for Glioblastoma Multiforme, but may also refer to an aggregate of multiple diseases, such as PANCAN12 (a cohort of 12 diseases created to study pan-cancer trends) or COADREAD (which combines the single diseases COAD and READ into one cohort).

The optional <sample_type> suffix consists of a literal dash followed by a sample type code designating the tissue sample type; for example, the suffix "-TP" indicates that the given archive contains results based upon primary tumor data. As a further example, here is how sample type codes would most commonly map to sample sets in Firehose, for a single disease study:

Sample Set Name    Description
BLCA               all tumor and normal samples for Bladder Urothelial Carcinoma (union of everything below)
BLCA-TP            only primary tumor samples
BLCA-TM            only metastatic tumor samples (if any)
BLCA-TR            only tumor recurrence samples (if any)
BLCA-NT            only tissue normal samples (if any)
BLCA-NB            only blood normal samples (if any)

TaskName

Tasks should be named as

    <Datatype>_<AlgorithmName>

For example: CopyNumber_Gistic2. The datatypes correspond to columns 2-12 in any of our sample data tables, with several types spelled out in longer form for clarity, as follows:

Short Form    Long Form            Description
CN            CopyNumber           SNP6 copy number data
LowP          CopyNumberLowPass    Low pass DNASeqC copy number data
MAF           Mutation             mutation calls

RunCode

Eight numeric characters representing the date that the data was mirrored from GDC. For example, 20170807 indicates August 7, 2017.

Revision

A small integer (usually single digit) indicating how many times the given <TaskName> was successfully run in the given pass.
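
The naming scheme above can be sketched as a small parser. This is only an illustration, not part of Firehose: the names parse_archive_name and ArchiveName are hypothetical, and the code assumes that <project_name> and <disease_specification> contain no dash, that <Datatype> contains no underscore, and that no element contains a period. The archive name in the usage line is assembled from the separate examples given above and may not correspond to a real archive.

```python
import re
from datetime import date, datetime
from typing import NamedTuple, Optional

# <DiseaseCohortName>.<TaskName>.<RunCode>.<Revision>
ARCHIVE_RE = re.compile(
    r"(?P<cohort>[^.]+)\.(?P<task>[^.]+)\.(?P<run_code>[0-9]{8})\.(?P<revision>[0-9]+)"
)
# <project_name>-<disease_specification>[-<sample_type>]
# Assumption: neither the project name nor the disease specification contains a dash.
COHORT_RE = re.compile(
    r"(?P<project>[^-]+)-(?P<disease>[^-]+)(?:-(?P<sample_type>[^-]+))?"
)


class ArchiveName(NamedTuple):
    project: str
    disease: str
    sample_type: Optional[str]  # None when the suffix is absent
    datatype: str
    algorithm: str
    run_date: date              # date the data was mirrored from GDC
    revision: int


def parse_archive_name(name: str) -> ArchiveName:
    """Split an archive name into its elements; raise ValueError if malformed."""
    m = ARCHIVE_RE.fullmatch(name)
    if m is None:
        raise ValueError(f"not a valid archive name: {name!r}")
    c = COHORT_RE.fullmatch(m["cohort"])
    if c is None:
        raise ValueError(f"not a valid DiseaseCohortName: {m['cohort']!r}")
    # <Datatype>_<AlgorithmName>; assumption: split at the first underscore.
    datatype, sep, algorithm = m["task"].partition("_")
    if not (datatype and sep and algorithm):
        raise ValueError(f"not a valid TaskName: {m['task']!r}")
    # strptime rejects impossible dates such as 20171345 with a ValueError.
    run_date = datetime.strptime(m["run_code"], "%Y%m%d").date()
    return ArchiveName(
        c["project"], c["disease"], c["sample_type"],
        datatype, algorithm, run_date, int(m["revision"]),
    )


# Hypothetical archive name built from the examples in the text.
parsed = parse_archive_name("TCGA-ACC-TP.CopyNumber_Gistic2.20170807.0")
```

Returning the run code as a date rather than a string makes it straightforward to sort or compare archives from different mirror dates.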


Legacy TCGA (prior to 2017)

DCC Archive Submissions

Each pipeline executed by the BROAD TCGA GDAC Firehose results in a set of 6 files being submitted to the DCC: primary results in the Level_* archive; auxiliary data (e.g. debugging information) in the aux archive; tracking information in the mage-tab archive; and an MD5 checksum file for each. In most cases you will only need the primary results in the Level_* archives. Microsoft Windows users can use the WinRAR utility to unpack the archive files, while Unix and Apple Mac OS X users can use the gzip and/or tar utilities. As of January 2013 our archives follow the nomenclature given below; an older version of this nomenclature was used before then.

<Domain>_<DiseaseCohortName>.<TaskName>.<DataLevel>.<Runcode>.<Revision>.0

 

Element

Description of Permissible Values

Domain

the literal string gdac.broadinstitute.org

DiseaseCohortName

a string of the form

<disease_specification>[<sample_type>]

The disease specification most often refers to a single disease study given by its TCGA disease abbreviation, such as GBM for Glioblastoma Multiforme, but may also refer to an aggregate of multiple diseases, such as PANCAN12 (a cohort of 12 diseases created to study pan-cancer trends) or COADREAD (which combines the single diseases COAD and READ into one cohort).

The optional <sample_type> suffix consists of a literal dash followed by a TCGA short letter code designating the tissue sample type; for example, the suffix "-TP" indicates that the given archive contains results based upon primary tumor data. As a further example, here is how TCGA short letter codes would most commonly map to sample sets in Firehose, for a single disease study:

Sample Set Name    Description
BLCA               all tumor and normal samples for Bladder Urothelial Carcinoma (union of everything below)
BLCA-TP            only primary tumor samples
BLCA-TM            only metastatic tumor samples (if any)
BLCA-TR            only tumor recurrence samples (if any)
BLCA-NT            only tissue normal samples (if any)
BLCA-NB            only blood normal samples (if any)

TaskName

Tasks should be named as

<Datatype>_<AlgorithmName>

For example: CopyNumber_Gistic2. The datatypes correspond to columns 2-12 in any of our sample data tables, with several types spelled out in longer form for clarity, as follows:

Short Form    Long Form            Description
CN            CopyNumber           SNP6 copy number data
LowP          CopyNumberLowPass    Low pass DNASeqC copy number data
MAF           Mutation             mutation calls

DataLevel

the literal strings Level_2 or Level_3 for stddata tasks, or Level_4 for analyses tasks

Runcode

10 alphanumeric characters representing the date and a unique "pass" identifier, such as 2011072800 to indicate "pass 0" over the July 28, 2011 data snapshot, or 2011072801 to indicate "pass 1" over the same dated snapshot

Revision

a small integer (usually single digit) indicating how many times the given TaskName was successfully run in the given pass
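
The legacy scheme can be sketched in the same way. Again this is only an illustration under stated assumptions: parse_legacy_archive_name and LegacyArchiveName are hypothetical names; the Runcode is assumed to be an 8-digit date followed by a 2-digit pass number, as in the examples above; the disease specification is assumed to contain no dash; and only the primary Level_* archives are handled, not the aux or mage-tab ones. The archive name in the usage line is assembled from the examples in the text and may not correspond to a real submission.

```python
import re
from datetime import date, datetime
from typing import NamedTuple, Optional

# <Domain>_<DiseaseCohortName>.<TaskName>.<DataLevel>.<Runcode>.<Revision>.0
LEGACY_RE = re.compile(
    r"gdac\.broadinstitute\.org_"       # Domain is a literal string
    r"(?P<cohort>[^.]+)\."
    r"(?P<task>[^.]+)\."
    r"Level_(?P<level>[234])\."
    # Assumption: Runcode = YYYYMMDD snapshot date + 2-digit pass number.
    r"(?P<snapshot>[0-9]{8})(?P<run_pass>[0-9]{2})\."
    r"(?P<revision>[0-9]+)\.0"
)


class LegacyArchiveName(NamedTuple):
    disease: str
    sample_type: Optional[str]  # None when the suffix is absent
    task: str
    data_level: int             # 2 or 3 for stddata tasks, 4 for analyses tasks
    snapshot: date
    run_pass: int
    revision: int


def parse_legacy_archive_name(name: str) -> LegacyArchiveName:
    """Split a legacy Level_* archive name; raise ValueError if malformed."""
    m = LEGACY_RE.fullmatch(name)
    if m is None:
        raise ValueError(f"not a valid legacy archive name: {name!r}")
    # <disease_specification>[<sample_type>], where the suffix starts with a dash.
    disease, _, sample_type = m["cohort"].partition("-")
    return LegacyArchiveName(
        disease=disease,
        sample_type=sample_type or None,
        task=m["task"],
        data_level=int(m["level"]),
        snapshot=datetime.strptime(m["snapshot"], "%Y%m%d").date(),
        run_pass=int(m["run_pass"]),
        revision=int(m["revision"]),
    )


# Hypothetical archive name built from the examples in the text.
legacy = parse_legacy_archive_name(
    "gdac.broadinstitute.org_BLCA-TP.CopyNumber_Gistic2.Level_4.2011072801.0.0"
)
```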