Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Table of Contents
minLevel1
maxLevel7

...

Code Block
SELECT a.run_name, a.cell_well, "etl.dataset", c."rowid", 
    c."Read Length (bp)" "Read Length (bp) RAW",    -- for exploration purposes only 
    --REPLACE(c."Read Length (bp)", CHR(191), '>=') "Read Length (bp)", -- '>=' UTF8 e2 89 a5
    "Reads", "Reads (%)" ,"YieldDECODE(rawtohex(c."Read Length (bp)"),
"Yield (%)"    FROM pacbio a, json_table(DATA, '$[*]'
COLUMNS(BF2030'         "etl.dataset" path, '$."etl.dataset">= 0',
    NESTED PATH '$."etl.ccs2.hifi_length_summary"[*]' COLUMNS(
    'BF20352C303030'  , '>= 5000',
     "rowid"   PATH'BF2031302C303030', '$.rowid>= 10000', 
       "Read Length (bp)"  'BF2031352C303030', '>= 15000',
       PATH '$."ccs2.hifi_length_summary.read_length"BF2032302C303030', '>= 20000',
      "Reads"  'BF2032352C303030', '>= 25000',
        NUMBER PATH '$."ccs2.hifi_length_summary.n_reads"'BF2033302C303030', '>= 30000',
        "Reads (%)" 'BF2033352C303030', '>= 35000',
       NUMBER PATH '$."ccs2.hifi_length_summary.reads_pct"'BF2034302C303030', '>= 40000',
        "Yieldrawtohex(c."Read Length (bp)")  -- catch everything else
   NUMBER PATH) '$."ccs2.hifi_length_summary.yield"'"Read Length (bp)",
    "Reads", "Reads (%)" ,"Yield (bp)", "Yield (%)" 
FROM pacbio a,
    NUMBERjson_table(DATA, '$[*]'
COLUMNS(
    "etl.dataset" path '$."etl.dataset"',
    NESTED PATH '$."etl.ccs2.hifi_length_summary.yield_pct""[*]' COLUMNS(
    ) )) AS c WHERE site_id=3 AND a.domain='CROMWELL/sl_dataset_reports/*/call-import_dataset_reports/execution/ccs.report.json*' 
AND a.run_name='r64020e_20220519_191246' AND a.cell_well='1_B01'

...

"rowid"  PATH '$.rowid',
        "Read Length (bp)"         PATH '$.read_length',
        "Reads"             NUMBER PATH '$.n_reads',
        "Reads (%)"         NUMBER PATH '$.reads_pct',
        "Yield (bp)"        NUMBER PATH '$.yield',
        "Yield (%)"         NUMBER PATH '$.yield_pct'
    )
)) AS c
WHERE site_id=6 AND a.domain='CROMWELL/sl_dataset_reports/*/call-import_dataset_reports/execution/ccs.report.json*'
--AND rawtohex(c."Read Length (bp)") = 'BF2033302C303030' -- filter bucket >= 30000
AND a.run_name='r64218e_20221021_195314' AND a.cell_well='2_B01'

...

Keep in mind that UTF8 characters (like ‘>=’) - nicely rendered in Chrome - may have variable-length bytes representation and therefore Oracle’s rawtohex function is necessary.
https://www.utf8-chartable.de/unicode-utf8-table.pl?start=8704&number=128&names=-

REPLACE(c."Read Length (bp)", CHR(191), ‘>=') "Read Length (bp)", -- '>=' UTF8 e2 89 a5
seems to do the trick but DECODE expression gives you more control.

UPDATE: this is even less cryptic way to deal with non UTF-8 characters

Code Block
ASCIISTR("HiFi LenSum read_length") = '\00BF 10,000'

Metrics stored in “attributes“ JSON-array

...

“per-barcode” metrics are supported by converting multiple “consensusreadset.xml“ files into JSONs and then merging these into a single “synthetic JSON-array“. These can be recognized by checking for trailing “*” at the end of “domain” field.

...

This is special domain derived from the “apiRoot:/runs/UUID” API, where the “dataModel” field is extracted (turns out it’s an XML), converted into JSON and recorded in PACBIO datamart as “API/runDataModel“ domain. This data is also available in the “DATAROOT/*/*/*.run.metadata.xml“ domain however it would show up there later when cell “movies” start, etc.

...

“API/runDataModel/RecordedEvents” domain

Bunch of intrihuing “recorded events” were unearthed from PacBio’s dataModel. These are captured into the new “API/runDataModel/RecordedEvents” domain.

...

Particularly interesting is the "AcquisitionInitializeInfo" event which apparently provides "reagent info" among others (see below)

...

How files are scraped from the file system - the linux voodoo magic

...

  • API-domains are derived via API-calls which appear sensitive to reinstalls. So, API-domains in SODIUM from before Jun-2022 are not available due to SMRTLink reinstall.

  • Not all workflows are triggered for all runs (for example cromwell ones). You might have to OUTER JOIN things to deal with this uncertainty.

  • This framework is tightly coupled to PacBio’s internal file-structure (unfortunately and inevitably). So, next time PacBio change their SMRTLink version, this solution may have to be fixed accordingly.

  • All metrics stored in PACBIO datamart are in JSON format. Metrics in XML files are converted into JSON

  • for each digested metrics file, a special “domain” field is generated - it allows for similar metrics to be grouped and queried via SQL later on

  • examples shown are for v11 installation on “sodium”. Once “skywalker” is operational switch over should be relatively easy.

  • ANALYTICS.PACBIO datamart (along with relevant views) is located in this Oracle instance

    Code Block
    db.analytics.url="jdbc:oracle:thin:@//seqprod.broadinstitute.org:1521/seqprod.broadinstitute.org"

    username: REPORTING

  • "ANALYTICS.PACBIO_STAR" view demonstrates how to merge together multiple files (ccs_report, loading, etc) in a flat per (run,cell_well) datasource. It is based on SmrtLink v10, hydrogen data (site_id=1) but techniques used are 100% legit.

  • Surgically extract fields from metrics-JSON via Oracle JSON

  • progress of Sodium PacBio flattened metrics ETL can be checked here ETL dashboard

  • rollback-protection is implemented so that ETL-run is cancelled if seen-before files are removed