CPTAC Meeting Agendas & Notes

Oct 29, 2018:

  1. Just a reminder that PDC has issued an alpha of its MVP release.  Would be nice to have more Broadies take a look.  Presently supports only CPTAC2 data.

  2. iCoMut:
    1. undergoing update for UCEC AWG
      1. FYI:  HG38 mutation calls underway by WashU (slated for 11/8 V2 freeze)
    2. waiting for Broad example data to add new proteomics tracks
  3. HG38:  progress, GDCtools config file needed
    1. CN data:  


    2. But what is more biologically important:  WGS for CN or HG38?

      1. Because at present only VCFs available for WGS CN data

      2. Feedback:  Mani says WES is fine for Broad PG pipeline

      3. Karl needs WGS for some of his investigatory work (non-canonical ORFs, from HLA), but that's not ready for prime time sharing with other PGDACs yet, so it doesn't impact our G-CDAP

  4. LUAD: new proteomic data "in progress," but in "new location" @ DCC:
    1. per this email exchange
    2. Changed from/to:

      1. 8_CPTAC3_Lung_Adeno_Carcinoma/LUAD_mzML/

      2. 8_CPTAC3_Lung_Adeno_Carcinoma/A_CPTAC3_hg38_CDAP/LUAD_mzML/

    3. Per Nathan: "The files under A_CPTAC3_hg38_CDAP represent the re-run of the CDAP
      analysis for files analyzed before the release of Broad's new
      RefSeq/hg38 protein sequence database and any new ones going forward."
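
For scripts that still carry the old LUAD location, the from/to change in item 4.2 can be expressed as a simple prefix rewrite. This is only a minimal sketch: the two prefixes are copied from the notes above, while the function name `relocate` and the sample file name are hypothetical and used purely for illustration.

```python
# Prefixes taken verbatim from the "Changed from/to" item above.
OLD_PREFIX = "8_CPTAC3_Lung_Adeno_Carcinoma/LUAD_mzML/"
NEW_PREFIX = "8_CPTAC3_Lung_Adeno_Carcinoma/A_CPTAC3_hg38_CDAP/LUAD_mzML/"


def relocate(path):
    """Map a path under the old LUAD mzML location to the new one.

    Paths that do not start with the old prefix are returned unchanged.
    (Hypothetical helper, not part of any DCC utility.)
    """
    if path.startswith(OLD_PREFIX):
        return NEW_PREFIX + path[len(OLD_PREFIX):]
    return path


if __name__ == "__main__":
    # "example_run.mzML" is a made-up file name for illustration only.
    print(relocate(OLD_PREFIX + "example_run.mzML"))
```

Paths already under A_CPTAC3_hg38_CDAP pass through untouched, so the rewrite is safe to apply more than once.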

Sept 17, 2018:  

  1. Upcoming milestones:

    1. TCGA Legacy symposium (9/26-9/29, posters starting this week)
    2. CPTAC F2F (Oct 15-18):
      1. for F2F our only licensing concern is access to the CPTAC community
      2. so each user who wants access sends email to us, and is granted, b/c they are members of CPTAC consortium
    3. Paper on proteomics pipeline (Dec 2018)
  2. These all imply that pipelines should be made public-ready in near future

    1. Definition: releasing pipeline "in FC" is treated effectively as releasing source code
    2. In part because docs usually live in repo wiki, and so it must be opened to public (unless docs are duplicated elsewhere)
    3. Do we need to have an "I Agree" dialogue box for codes which require licenses?  For example, on codes which come from other labs.
    4. At the very least, for EXTERNAL codes used in our pipeline, or OUR codes which need commercial tools (like MATLAB REs)
    5. For proteomic pipeline, Mani says it's ALL OK, everything can be run, for whatever reason, by anyone
    6. But we need to go over EVERY task in GDAC genomic pipeline to ensure all components (especially external codes) are releasable
      1. Including APOBEC, GSEA (and also tools like Philosopher from proteomic pipeline)
      2. We may want to release "genomic lite" pipeline, with certain tools omitted
    7. Possible interim solution:  until "I Agree" is possible at runtime in FC, to prevent commercial entities from profiting:
    8. Karl made strong point about re-distribution being what most method developers want to control/prevent, but how?
    9. Mike pointed out that while preventing re-distrib is important, mere "runnability" also needs to be monitored/protected:
    10. otherwise commercial entities can take our labor/tools (or even re-brand as their own) & then profit by analyzing/etc their data
    11. Chet pointed out that GenePattern team wrestled with this years ago, eventually taking the approach:  "We can't enforce good behavior," but we can record in a database whether LICENSE terms have been acknowledged, as well as a record of each instance when any user ran any tool.  Then let the lawyers fight it out if need be.
    12. Mike suggested that enforcing acknowledgement of LICENSE terms can be done with a boolean attribute or config parameter, which defaults to False, but user sets to True to signify that LICENSE terms have been read and/or accepted.  This facilitates "I Agree" dialog box suggested above for FireCloud UI, but also covers the programmatic use case.
  3. Is it time for us to Hydrant-ize our GDAN configs, which would benefit both genomic & proteomic pipelines (GDAN & CPTAC)

    1. Mani & Karsten considered this for proteo-genomic portion of pipelines, and decided NO, not until after Oct 2018 F2F (and possibly Dec paper)

    2. CGA then adopted similar position.

  4. Prioritization:
    1. This is substantial effort: really need extra SWE effort
    2. Also: we do not have staffing to respond to wide-ranging GISTIC, MutSig etc user questions
  5. Lastly:  policy on sharing methods/workspaces with collaborators, e.g. Kwiatkowski lab
    1. if the pipeline is instantiated in our GDAN (or CPTAC workflow) 
    2. and we provide access only through FireCloud
    3. then the method (and/or method configuration) is essentially for public consumption 
    4. and may be shared through FC with essentially anyone?
    5. but, in particular, Broadie collaborators (as a first step)
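
Items 2.11 and 2.12 above (record LICENSE acknowledgement and each tool run in a database; gate execution on a boolean that defaults to False) could be sketched roughly as follows. This is an illustration only, not the GenePattern or FireCloud implementation: the names `accept_license`, `run_tool`, `open_audit_db`, the `audit` table layout, and the use of SQLite are all assumptions made for the example.

```python
import sqlite3
from datetime import datetime, timezone


class LicenseNotAccepted(Exception):
    """Raised when a tool is invoked without acknowledging its LICENSE."""


def open_audit_db(path=":memory:"):
    # Hypothetical audit store; a real deployment would use a shared database.
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS audit "
        "(ts TEXT, user TEXT, tool TEXT, event TEXT)"
    )
    return conn


def record(conn, user, tool, event):
    # Append one timestamped row per event (acknowledgement, run, refusal).
    conn.execute(
        "INSERT INTO audit VALUES (?, ?, ?, ?)",
        (datetime.now(timezone.utc).isoformat(), user, tool, event),
    )
    conn.commit()


def run_tool(conn, user, tool, accept_license=False):
    """Run `tool` only if the caller explicitly set accept_license=True.

    The flag defaults to False, so both an interactive "I Agree" box and a
    programmatic config parameter must opt in deliberately.
    """
    if not accept_license:
        record(conn, user, tool, "refused")
        raise LicenseNotAccepted(
            f"Set accept_license=True after reading the LICENSE for {tool}"
        )
    record(conn, user, tool, "license_acknowledged")
    record(conn, user, tool, "run")
    # Placeholder for the actual tool invocation.
    return f"{tool} started for {user}"


if __name__ == "__main__":
    db = open_audit_db()
    print(run_tool(db, "example_user", "example_tool", accept_license=True))
```

As the notes say, this does not enforce good behavior; it only leaves a record of who acknowledged the terms and who ran what.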

August 6, 2018

  1. Budget alert: 50% gone?

  2. Methylation data progress
  3. SOP slides
  4. Anybody hear of "persistently slow Broad Institute portal access" from abroad, e.g. China?
  5. Action: reconnect with PDC, to see 
    1. current progress towards v1 alpha release (Fall 2018)
    2. chat about when will ingest occur for CPTAC3 data?

June 11, 2018

  1. Latest data?

  2. Progress report: content due soon

  3. Upcoming site visit: July 17; content can mostly come from expanding the length/depth of the F2F presentation (which was abbreviated to only 10 mins)

May 7, 2018

  1. Recap of last week's F2F: 

    1. Good FireCloud workshop attendance & feedback

    2. Proteomic:

    3. Genomic: 

      1. GDC has made progress automating their pipelines

      2. CPTAC genomic data will be HG38 going forward; 

    4. Phased delivery CCRC & UCEC first, to allow publications to be in pipeline by next funding cycle (Gantt chart)

  2. Two special journals in Fall 2018:
    1. Special issue Mol. Cell Proteomics: tools, algorithms & computational methods
      1. Submission deadline will be in November, 2018
    2. Journal of Proteome Research: Software Tools and Resources
      1. Submission deadline September 14, 2018
  3. FOA: Sustained Support for Informatics Resources for Cancer Research and Management (U24)
    1. Due June 14
    2. Submit LOI by May 14
    3. Letters of support: gathering now
  4. Time Permitting:  genomic pathway analyses
    1. GDAN Lung pathifier analyses
    2. iCluster: Hailei

April 2, 2018 

  1. Access to the GDAC bucket for reference files
    1. egress pay: need to turn on requestor_pays bit 
    2. authorization domain
    3. proxy groups and how to keep track of them in a bucket: currently have 2 proxy groups
  2. CPTAC3 data in FC may help entice new CPTAC users, but it is also akin to replicating DCC
    1. so, let's wait until potential CPTAC users make explicit requests
  3. TODO:
    1. Mani will explore using auth domain for new CPTAC FC users
    2. Mike will ping NCI about FireCloud SW session in May F2F
    3. Mike/Sam will price physical hardware & compare to Google VMs, as potential spend for $25K FC disbursement

March 19, 2018

  1. Review draft agenda for May F2F
  2. Decide additional attendees
  3. (Fire)Cloud costs for CPTAC-wide usage:  $25K seems to have effectively been reduced
  4. Batches 1 and 2 of genomic data are located at   /xchip/gdac_data/cptac3/genomic_data_mirror

    1. WashU has apparently re-submitted batch1 RNASeq data, using MapSplice to map transcripts to gene level

    2. So we should be able, in principle, to proceed with our mRNA pipelines

  5. Mike has a short, unifying wrapper to all 3 of the DCC upload/download utilities, and can install to Unix server upon request
  6. Integrative Analyses:
    1. Karsten:  proteomic pathways ... 
    2. Mani: map genomic CN data to LINCs, correlate w/ RNAseq signatures?
    3. Possibly: multi-omics clustering
  7. TODO:
    1. FireCloud workshop scheduling
    2. Mike push cptac wrapper script to Unix servers

March 5, 2018

  1.  Summation of HG19  WashU/GDC workaround & potential recommendations to CPTAC leadership & collaborators.

    1. Consider: combing the GDC website to see if Dockers are available for their pipelines, and whether these could be instantiated in FC
    2. Consider: running local MOAP-style pipeline on WXS data, to generate CN, mutation, RNASeq
      1. Decision:  wait for now, it's not fully baked yet
  2. TODO:  
    1. open edit permissions (on this page) to all viewers
    2. Identify 2-3 integrative proteo-genomic analyses: but must be on CPTAC3 data
    3. Combined into iCoMut output
  3. Planning F2F in May 1,2,3:
    1. Quilts in FC?
    2. Although CustomProDB (from Baylor/Bing Zhang group) does similar work to Quilts and is already in FC (from Karsten)
    3. FireCloud workshop
    4. Karsten & Mani: current instantiation of proteomic pipeline 
      1. On prospective BRCA data

Update to Jan 22, 2018 entry:

  1. Genomic data for CPTAC3 downloaded to:  /xchip/gdac_data/cptac3/2018_02_02_genomic_data

  2. Only 2 cohorts (CRCC and UCEC, i.e. kidney and endometrial) have genomic data available so far
  3. The 3rd cohort (LUAD, lung) proteomic data not submitted by Broad yet, so WashU has apparently not processed the genomic either

Jan 22, 2018

  1. Brief review of items missed from last meeting
  2. Proteogenomic Data Commons Steering Committee: 
    1. Held 2nd advisory meeting call last Wed
  3. WashU/GDC workaround: summary of discussion & decisions from 1/19 call

  4. New science: degradome?

Jan 8, 2018

  1. Welcome Yifat Geffen, newest member of CGA
  2. Brief review of latest suite of genomic run reports (total of 830)
  3. Whither pathology image browser in CPTAC?  The GTEX pathology browser was authored here (and we have strong knowledge of cancer path viewer), so we have a good deal of expertise & code that could in principle be leveraged.  I've drafted a suggestion for PAAD dwg here.
  4. NMF clustering module question (auto-selection of K) from Mani?
  5. FireCloud hosting of CPTAC data (as partial workaround to lack of CPTAC genomic data at GDC)
  6. Medblast paper

Dec 11, 2017

  1.  Items from 11/27 meeting that was cancelled
  2. GDC and CPTAC:  summary notes from week of  2017_12_06
    1. Original plan (and data products) given here
    2. Impact to CGDAC (the CGA part of proteomics GDAC) sketched below
    3. Initial data generation will be shifting from GDC to WashU
          Mutation calls (both WES and WGS)
          CN generation
          RNAseq calls
      WashU products deposited to Georgetown DCC
      Broad download & remap names as needed/appropriate

    4. What's next?
      FireCloud (as a trusted partner) now being considered as a distribution point
      Per Chris Kinsinger feeler conversation on 2017_12_01

    5. So, because these data will be HG19 ... our CGA/GDAC in CPTAC may be better utilized by shifting gears, from running existing FireCloud HG38 genomic pipelines on HG19 data (which leads to broken results) ... to loading these HG19 data products from WashU into FireCloud so that it can serve as a distribution point

    6. Side Q: why Georgetown DCC not considered for this? Scale? Absence of trusted partner status?

  3. Status on $25K to fund use of FireCloud across entire CPTAC?
    1. any progress: NO, there was an attempt to issue as AWS credits ... currently stuck w/r/t GoogleCredits ... stay tuned
    2. billing project?

Nov 27, 2017

  1.  Timeline for LUAD, UCEC and KIRC projects:  given here

Oct 30, 2017: tentative agenda

  1. Discuss CPTAC-wide use of FireCloud: how to allot funding, make billing projects, add users etc
  2. Recall supervisor mode in FISSFC:
  3. Update on DSDE collaboration:
    1. Show recent CGA/DSDE collaboration proposal
    2. FISS backbone of Jupyter notebooks in FireCloud
    3. Code generator progress:
      1. standalone tool
      2. works on GTEX
      3. Swagger2 / FireCloud proof of concept has been done
      4. Full Swagger2 support is next
  4. Discussion for Wed 11/1 AWG telecon:
    1. Thoughts omitted from F2F talk, for time constraint: slides 21-39
  5. Chet: recent CPTAC2 workspaces ... where to go next?

July 2017:  FYI on proteomics deliverables from FireCloud CGA team

Per Chet Birger/D.R. Mani meeting:
  • FireCloud data workspaces
    • one (possibly two - see below) for each of the three CPTAC AWGs (breast, ovarian, colon)
    • The workspaces will contain, at a minimum, the end results (protein level quantification) produced by each AWG.
    • We may also include the raw MS files, and/or the standardized mzML files.  But all of the pipelines used for analyzing these files rely on Windows-based software, and so cannot be run on FireCloud. 
    • We will include the TCGA genomic, clinical and biospecimen data as well - this will help researchers who want to conduct correlative analyses.  It will mean, however, that we'll want to create both open and controlled access versions of these workspaces, as the BAMs and VCF files are controlled access. 
    • We may also include the outputs of the CDAP pipeline, which are published on the CPTAC data portal.  
    • We will aim to get these workspaces in place by the end of August
  • Workflows
    • Since all of the workflows that run on either the raw MS files or the mzML files (CDAP) include Windows-based tasks, they cannot be run on FireCloud.
    • Mani and Mike's teams are developing workflows for correlative analysis; we agreed to touch base with them at the end of August to see how far along any of these pipelines are and whether they could be included in our deliverables.  If not, so be it....I'm hoping that NCI will see the value in the data workspaces for the future development of workflows.

May 31, 2017 On-Site (Broad Institute, Cambridge MA)

  • Mike's slides:  here

April 4, 2017 Face-to-Face (Bethesda, MD)

  • Mike's slides:  here