The problem

...

The “cromwell” metrics, however, are particularly painful: the only way to link them back to (run, cellWell) is to track down the “symbolicLink” in the relevant “inputs” folder, whose path is riddled with random UUIDs. This requires a fair amount of Linux voodoo magic, which significantly slows down new development.

...

Mission statement

It would be great if all teams (Analytics, lab, DSP, Mercury, etc.) could query the metrics from our PACBIO datamart in a streamlined way. Software engineers would merely use SQL/XML to extract the fields they need in a very declarative way.

...

  • All metrics stored in the PACBIO datamart are in JSON format. Metrics that arrive as XML files are converted into JSON.

  • For each digested metrics file, a special “domain” field is generated. It allows similar metrics to be grouped and queried via SQL later on.

  • The examples shown are for the v11 installation on “sodium”. Once “skywalker” is operational, the switchover should be relatively easy.

  • The ANALYTICS.PACBIO datamart (along with the relevant views) is located in this Oracle instance:

    Code Block
    db.analytics.url="jdbc:oracle:thin:@//seqprod.broadinstitute.org:1521/seqprod.broadinstitute.org"

    username: REPORTING

  • The "ANALYTICS.PACBIO_STAR" view demonstrates how to merge multiple files (ccs_report, loading, etc.) into a flat datasource keyed per (run, cell_well). It is based on SmrtLink v10 hydrogen data (site_id=1), but the techniques it uses remain fully valid.

  • Surgically extract fields from metrics-JSON via Oracle JSON

  • Progress of the Sodium PacBio flattened-metrics ETL can be checked on the ETL dashboard.

  • “Per-barcode” metrics are supported by converting multiple “consensusreadset.xml” files into JSON documents and then merging these into a single “synthetic JSON array”. Such records can be recognized by a trailing “*” at the end of the “domain” field.
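
The per-barcode merge described in the last bullet can be sketched roughly as follows. This is only an illustration, not the actual ETL code: the helper names (xml_to_dict, merge_per_barcode, is_per_barcode), the "@"/"#text" key convention, the "metrics" key, and the simplified sample XML are all assumptions made for the example.

```python
import json
import xml.etree.ElementTree as ET


def xml_to_dict(element):
    """Convert an XML element into (tag, dict). Hypothetical convention:
    attributes get an "@" prefix, element text goes under "#text",
    and child elements are collected into lists keyed by tag name."""
    # Drop any "{namespace}" prefix so the keys stay readable.
    tag = element.tag.split("}", 1)[-1]
    node = {"@" + key: value for key, value in element.attrib.items()}
    text = (element.text or "").strip()
    if text:
        node["#text"] = text
    for child in element:
        child_tag, child_node = xml_to_dict(child)
        node.setdefault(child_tag, []).append(child_node)
    return tag, node


def merge_per_barcode(xml_documents, domain):
    """Merge several XML documents into one synthetic JSON array record."""
    merged = []
    for xml_text in xml_documents:
        tag, node = xml_to_dict(ET.fromstring(xml_text))
        merged.append({tag: node})
    # The trailing "*" marks the record as a synthetic JSON array.
    return {"domain": domain + "*", "metrics": json.dumps(merged)}


def is_per_barcode(domain):
    """Recognize synthetic per-barcode records by the trailing "*"."""
    return domain.endswith("*")


if __name__ == "__main__":
    # Simplified, made-up stand-ins for two consensusreadset.xml files.
    samples = [
        '<ConsensusReadSet Name="bc1001"><NumRecords>10</NumRecords></ConsensusReadSet>',
        '<ConsensusReadSet Name="bc1002"><NumRecords>20</NumRecords></ConsensusReadSet>',
    ]
    record = merge_per_barcode(samples, "consensusreadset")
    print(record["domain"])
    print(record["metrics"])
```

Keeping one array element per source file means a consumer can still tell the barcodes apart after the merge, while the “*” suffix lets a query filter array-shaped records from ordinary single-document ones.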

...