Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

The problem

...

Code Block
languagesql
SELECT a.run_name, a.cell_well, c.*
FROM pacbio a,
json_table(DATA, '$[*]'
COLUMNS(
 "DNABarcode"           PATH '$.DNABarcode',
 "BioSample"            PATH '$.BioSample',
 "HiFi Reads"           PATH '$.attributes[*]?(@.id=="ccs2.number_of_ccs_reads").value',
 "HiFi Yield (bp)"      NUMBER   PATH '$.attributes[*]?(@.id=="ccs2.total_number_of_ccs_bases").value',
 "HiFi Read Length (mean, bp)"  NUMBER          PATH '$.attributes[*]?(@.id=="ccs2.mean_ccs_readlength").value',
 "HiFi Read Quality (median) accuracy"          PATH '$.attributes[*]?(@.id=="ccs2.median_accuracy").value',
 "HiFi Read Quality (median)"   NUMBER         PATH '$.attributes[*]?(@.id=="ccs2.median_qv").value'
)) AS c
WHERE site_id=3 AND a.domain='CROMWELL/sl_dataset_reports/*/call-import_dataset_reports/execution/ccs.report.json*''
AND a.run_name='r64386e_20220523_180557' AND a.cell_well='4_D01'

Metrics stored in “JSON-tables”

Bunch of interesting metrics (for example ccs2.hifi_length_summary.read_length) are stored in JSON-”tables”. Unfortunately they are organized in “column-based” fashion making it nearly impossible to extract metrics from DB later on. Therefore a new synthetic twin tables are created where metrics are organized in “row-based” fashion (in other words things are “transposed”)

...

As a result, straightforward JSON-extraction becomes possible

...

Technical caveats

  • This framework is tightly coupled to PacBio’s internal file-structure (unfortunately and inevitably). So, next time PacBio change their SMRTLink version, this solution may have to be fixed accordingly.

  • All metrics stored in PACBIO datamart are in JSON format. Metrics in XML files are converted into JSON

  • for each digested metrics file, a special “domain” field is generated - it allows for similar metrics to be grouped and queried via SQL later on

  • examples shown are for v11 installation on “sodium”. Once “skywalker” is operational switch over should be relatively easy.

  • ANALYTICS.PACBIO datamart (along with relevant views) is located in this Oracle instance

    Code Block
    db.analytics.url="jdbc:oracle:thin:@//seqprod.broadinstitute.org:1521/seqprod.broadinstitute.org"

    username: REPORTING

  • "ANALYTICS.PACBIO_STAR" view demonstrates how to merge together multiple files (ccs_report, loading, etc) in a flat per (run,cell_well) datasource. It is based on SmrtLink v10, hydrogen data (site_id=1) but techniques used are 100% legit.

  • Surgically extract fields from metrics-JSON via Oracle JSON

  • progress of Sodium PacBio flattened metrics ETL can be checked here ETL dashboard

  • “per-barcode” metrics are supported by converting multiple “consensusreadset.xml“ files into JSONs and then merging these into a single “synthetic JSON-array“. These can be recognized by checking for trailing “*” at the end of “domain” field.

...