Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Table of Contents
minLevel1
maxLevel7

...

Code Block
SELECT a.run_name, a.cell_well, a.movie, c."raw_data_report.insert_n50"
FROM pacbio a,
json_table(DATA, '$[*]'
COLUMNS(
 "raw_data_report.insert_n50"  NUMBER PATH '$."etl.attributes"."raw_data_report.insert_n50".value'
)) AS c
WHERE site_id=3 AND a.domain='CROMWELL/sl_dataset_reports/*/call-import_dataset_reports/execution/raw_data.report.json'
AND a.run_name='r64386e_20220523_180557' AND a.cell_well='4_D01'

“per-bacrode” support

“per-barcode” metrics are supported by converting multiple “consensusreadset.xml“ files into JSONs and then merging these into a single “synthetic JSON-array“. These can be recognized by checking for trailing “*” at the end of “domain” field.

...

For a given domain, if ETL comes across multiple files then it will naturally merge these into JSON-array.
However this logic is not sufficient if there is only 1 barcode registered per cell - therefore a list of exemption file-types (ccs.report.json) is kept to instruct the ETL to always merge these into JSON-array regardless of number of files.

Technical caveats

  • This framework is tightly coupled to PacBio’s internal file-structure (unfortunately and inevitably). So, next time PacBio change their SMRTLink version, this solution may have to be fixed accordingly.

  • All metrics stored in PACBIO datamart are in JSON format. Metrics in XML files are converted into JSON

  • for each digested metrics file, a special “domain” field is generated - it allows for similar metrics to be grouped and queried via SQL later on

  • examples shown are for v11 installation on “sodium”. Once “skywalker” is operational switch over should be relatively easy.

  • ANALYTICS.PACBIO datamart (along with relevant views) is located in this Oracle instance

    Code Block
    db.analytics.url="jdbc:oracle:thin:@//seqprod.broadinstitute.org:1521/seqprod.broadinstitute.org"

    username: REPORTING

  • "ANALYTICS.PACBIO_STAR" view demonstrates how to merge together multiple files (ccs_report, loading, etc) in a flat per (run,cell_well) datasource. It is based on SmrtLink v10, hydrogen data (site_id=1) but techniques used are 100% legit.

  • Surgically extract fields from metrics-JSON via Oracle JSON

  • progress of Sodium PacBio flattened metrics ETL can be checked here ETL dashboard

  • rollback-protection is implemented so that ETL-run is cancelled if seen-before files are removed

    “per-barcode” metrics are supported by converting multiple “consensusreadset.xml“ files into JSONs and then merging these into a single “synthetic JSON-array“. These can be recognized by checking for trailing “*” at the end of “domain” field.

    Image Removed