Nomenclature

Archive Nomenclature

As of 2017, our archives follow the new nomenclature given below:

<DiseaseCohortName>.<TaskName>.<RunCode>.<Revision>

Element

Description of Permissible Values

DiseaseCohortName

A string of the form

    <project_name>-<disease_specification>[-<sample_type>]

for example: TCGA-ACC-TP.

The <disease_specification> most often refers to a single disease study given by its disease abbreviation, such as GBM for Glioblastoma Multiforme; but it may also refer to an aggregate of multiple diseases, such as PANCAN12 (which refers to a cohort of 12 diseases created to study pan-cancer trends) or COADREAD (which combines the single diseases COAD and READ into one cohort).

The optional <sample_type> suffix consists of a literal dash followed by a sample type code designating the tissue sample type; for example, the suffix "-TP" indicates that the given archive contains results based upon primary tumor data. As a final example, here's how sample type codes would most commonly map to sample sets in Firehose, for a single disease study:

Sample Set Name	Description
BLCA	all tumor and normal samples for Bladder Urothelial Carcinoma (union of everything below)
BLCA-TP	only primary tumor samples
BLCA-TM	only metastatic tumor samples (if any)
BLCA-TR	only tumor recurrence samples (if any)
BLCA-NT	only tissue normal samples (if any)
BLCA-NB	only blood normal samples (if any)
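
As an illustration of the pattern above, here is a minimal sketch that splits a DiseaseCohortName into its parts. The function name `parse_cohort` and the regular expression are assumptions made for this example, not part of Firehose; in particular, the sketch assumes that neither the project name nor the disease specification contains a dash, and it only knows the sample type codes listed in the table above.

```python
import re

# Sample type codes copied from the table above; other codes may exist.
SAMPLE_TYPES = {
    "TP": "primary tumor",
    "TM": "metastatic tumor",
    "TR": "tumor recurrence",
    "NT": "tissue normal",
    "NB": "blood normal",
}

# Assumption: <project_name> and <disease_specification> contain no dashes.
COHORT_RE = re.compile(
    r"^(?P<project>[A-Za-z0-9]+)"
    r"-(?P<disease>[A-Za-z0-9]+)"
    r"(?:-(?P<sample_type>[A-Z]+))?$"
)


def parse_cohort(name):
    """Split <project_name>-<disease_specification>[-<sample_type>] into a dict."""
    match = COHORT_RE.match(name)
    if match is None:
        raise ValueError(f"not a recognizable cohort name: {name!r}")
    return match.groupdict()


cohort = parse_cohort("TCGA-ACC-TP")
# Look up a human-readable description; None means "all samples".
description = SAMPLE_TYPES.get(cohort["sample_type"], "all tumor and normal samples")
```

When the optional suffix is absent, as in TCGA-BLCA, `sample_type` comes back as `None`, which corresponds to the union row in the table.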

TaskName

Tasks should be named as

    <Datatype>_<AlgorithmName>

For example: CopyNumber_Gistic2. The datatypes correspond to columns 2-12 in any of our sample data tables, with several types spelled out in longer form for clarity as follows:

Short Form	Long Form	Description
CN	CopyNumber	SNP6 copy number data
LowP	CopyNumberLowPass	Low pass DNASeqC copy number data
MAF	Mutation	mutation calls
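
The short-to-long mapping can be sketched as a simple lookup. The helper `task_name` is hypothetical, and passing any datatype not listed in the table through unchanged is an assumption of this example.

```python
# Short forms and long forms copied from the table above.
LONG_FORM = {
    "CN": "CopyNumber",
    "LowP": "CopyNumberLowPass",
    "MAF": "Mutation",
}


def task_name(datatype, algorithm):
    """Build <Datatype>_<AlgorithmName>, expanding known short forms."""
    # Assumption: datatypes without a listed long form are used as-is.
    return f"{LONG_FORM.get(datatype, datatype)}_{algorithm}"


name = task_name("CN", "Gistic2")  # "CopyNumber_Gistic2"
```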

RunCode

Eight numeric characters representing the date that the data was mirrored from GDC. For example, 20170807 indicates August 7, 2017.
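
A RunCode can be turned into a calendar date with the standard library. This is only a sketch; the function name `parse_run_code` is made up for the example.

```python
import re
from datetime import date, datetime


def parse_run_code(run_code):
    """Convert an 8-digit RunCode (YYYYMMDD) into a datetime.date."""
    if not re.fullmatch(r"[0-9]{8}", run_code):
        raise ValueError(f"RunCode must be exactly 8 digits: {run_code!r}")
    # strptime also rejects impossible dates such as 20171340.
    return datetime.strptime(run_code, "%Y%m%d").date()


mirrored_on = parse_run_code("20170807")  # date(2017, 8, 7)
```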

Revision

A small integer (usually single digit) indicating how many times the given <TaskName> was successfully run in the given pass.
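
Putting the four elements together, a full archive name can be taken apart by splitting on the dots. The sketch below assumes that none of the elements contains a dot itself; `ArchiveName` and `parse_archive_name` are hypothetical names, and the archive name used at the end is constructed for illustration rather than taken from a real run.

```python
from typing import NamedTuple


class ArchiveName(NamedTuple):
    cohort: str
    task: str
    run_code: str
    revision: int


def parse_archive_name(name: str) -> ArchiveName:
    """Split <DiseaseCohortName>.<TaskName>.<RunCode>.<Revision>."""
    parts = name.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected 4 dot-separated elements: {name!r}")
    cohort, task, run_code, revision = parts
    if len(run_code) != 8 or not (run_code.isascii() and run_code.isdigit()):
        raise ValueError(f"RunCode must be 8 digits: {run_code!r}")
    if not (revision.isascii() and revision.isdigit()):
        raise ValueError(f"Revision must be an integer: {revision!r}")
    return ArchiveName(cohort, task, run_code, int(revision))


# Illustrative name only, assembled from the examples given above.
parsed = parse_archive_name("TCGA-ACC-TP.CopyNumber_Gistic2.20170807.0")
```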

Legacy TCGA (prior to 2017)

DCC Archive Submissions

Each pipeline executed by the Broad TCGA GDAC Firehose results in a set of 6 files being submitted to the DCC: primary results in the Level_* archive; auxiliary data (e.g. debugging information) in the aux archive; tracking information in the mage-tab archive; and an MD5 checksum file for each. In most cases you will only need the primary results in the Level_* archives. Microsoft Windows users can unpack the archive files with the WinRAR utility, while Unix and Apple Mac OS X users can use the gzip and/or tar utilities. As of January 2013 our archives follow the nomenclature given below; archives created before then follow an older version of this nomenclature.
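
For scripted downloads, checking an archive against its MD5 file before unpacking might look like the sketch below. It rests on two assumptions that are not stated on this page: that the checksum file holds the hex digest as its first whitespace-separated token, and that the archive is a tar file that Python's `tarfile` module can read (gzip-compressed or not). The function names are made up for this example.

```python
import hashlib
import tarfile
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_and_unpack(archive, md5_file, dest):
    """Unpack `archive` into `dest` only if its checksum matches `md5_file`."""
    # Assumption: the digest is the first token in the checksum file.
    expected = Path(md5_file).read_text().split()[0].lower()
    actual = md5_of(archive)
    if actual != expected:
        raise ValueError(f"MD5 mismatch for {archive}: {actual} != {expected}")
    Path(dest).mkdir(parents=True, exist_ok=True)
    # "r:*" lets tarfile detect the compression transparently.
    with tarfile.open(archive, "r:*") as tar:
        tar.extractall(dest)
```

As with any archive, only unpack files obtained from a source you trust.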

<Domain>_<DiseaseCohortName>.<TaskName>.<DataLevel>.<Runcode>.<Revision>.0

Element

Description of Permissible Values

Domain

the literal string gdac.broadinstitute.org

DiseaseCohortName

a string of the form

<disease_specification>[<sample_type>]

The disease specification most often refers to a single disease study given by its TCGA disease abbreviation, such as GBM for Glioblastoma Multiforme; but may also refer to an aggregate of multiple diseases, such as PANCAN12 (which refers to a cohort of 12 diseases created to study pan-cancer trends) or COADREAD (which combines the single diseases COAD and READ into one cohort).

The optional <sample_type> suffix consists of a literal dash followed by a TCGA short letter code designating the tissue sample type; for example, the suffix "-TP" indicates that the given archive contains results based upon primary tumor data. As a final example, here's how TCGA short letter codes would most commonly map to sample sets in Firehose, for a single disease study:

Sample Set Name	Description
BLCA	all tumor and normal samples for Bladder Urothelial Carcinoma (union of everything below)
BLCA-TP	only primary tumor samples
BLCA-TM	only metastatic tumor samples (if any)
BLCA-TR	only tumor recurrence samples (if any)
BLCA-NT	only tissue normal samples (if any)
BLCA-NB	only blood normal samples (if any)

TaskName

Tasks should be named as

    <Datatype>_<AlgorithmName>

For example: CopyNumber_Gistic2. The datatypes correspond to columns 2-12 in any of our sample data tables, with several types spelled out in longer form for clarity as follows:

Short Form	Long Form	Description
CN	CopyNumber	SNP6 copy number data
LowP	CopyNumberLowPass	Low pass DNASeqC copy number data
MAF	Mutation	mutation calls

DataLevel

the literal strings Level_2 or Level_3 for stddata tasks, or Level_4 for analyses tasks

Runcode

10 alphanumeric characters representing the date and a unique "pass" identifier, such as 2011072800 to indicate "pass 0" over the July 28, 2011 data snapshot, or 2011072801 to indicate "pass 1" over the same dated snapshot
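
A legacy Runcode can be split into its snapshot date and pass number as sketched below. Both examples above consist of digits only, so the sketch assumes an 8-digit date followed by a 2-digit pass; the function name `parse_legacy_runcode` is hypothetical.

```python
import re
from datetime import date, datetime


def parse_legacy_runcode(runcode):
    """Split a 10-character legacy Runcode into (snapshot date, pass number)."""
    # Assumption: all 10 characters are digits, as in the examples above.
    if not re.fullmatch(r"[0-9]{10}", runcode):
        raise ValueError(f"Runcode must be exactly 10 digits: {runcode!r}")
    snapshot = datetime.strptime(runcode[:8], "%Y%m%d").date()
    return snapshot, int(runcode[8:])


snapshot, pass_number = parse_legacy_runcode("2011072801")
# snapshot is date(2011, 7, 28) and pass_number is 1
```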

Revision

a small integer (usually single digit) indicating how many times the given TaskName was successfully run in the given pass
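
Because the Domain element itself contains dots, a legacy archive name is easier to take apart with a regular expression than by splitting on dots. The pattern below is a sketch under stated assumptions: it only accepts the Level_2, Level_3 and Level_4 values described under DataLevel, it assumes that cohort and task names contain no dots, and the archive name used at the end is constructed for illustration only.

```python
import re

LEGACY_RE = re.compile(
    r"^(?P<domain>gdac\.broadinstitute\.org)_"
    r"(?P<cohort>[^.]+)\."
    r"(?P<task>[^.]+)\."
    r"(?P<data_level>Level_[234])\."
    r"(?P<runcode>[0-9]{10})\."
    r"(?P<revision>[0-9]+)\.0$"
)


def parse_legacy_name(name):
    """Split a legacy archive name into a dict of its elements."""
    match = LEGACY_RE.match(name)
    if match is None:
        raise ValueError(f"not a recognizable legacy archive name: {name!r}")
    fields = match.groupdict()
    fields["revision"] = int(fields["revision"])
    return fields


# Illustrative name only, assembled from the examples given above.
legacy = parse_legacy_name(
    "gdac.broadinstitute.org_BLCA-TP.CopyNumber_Gistic2.Level_4.2011072800.0.0"
)
```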
