Page Comparison

The problem

...

Code Block

language	sql

SELECT a.run_name, a.cell_well, c.*
FROM pacbio a,
json_table(DATA, '$[*]'
COLUMNS(
 "DNABarcode"           PATH '$.DNABarcode',
 "BioSample"            PATH '$.BioSample',
 "HiFi Reads"           PATH '$.attributes[*]?(@.id=="ccs2.number_of_ccs_reads").value',
 "HiFi Yield (bp)"      NUMBER   PATH '$.attributes[*]?(@.id=="ccs2.total_number_of_ccs_bases").value',
 "HiFi Read Length (mean, bp)"  NUMBER          PATH '$.attributes[*]?(@.id=="ccs2.mean_ccs_readlength").value',
 "HiFi Read Quality (median) accuracy"          PATH '$.attributes[*]?(@.id=="ccs2.median_accuracy").value',
 "HiFi Read Quality (median)"   NUMBER         PATH '$.attributes[*]?(@.id=="ccs2.median_qv").value'
)) AS c
WHERE site_id=3 AND a.domain='CROMWELL/sl_dataset_reports/*/call-import_dataset_reports/execution/ccs.report.json*''
AND a.run_name='r64386e_20220523_180557' AND a.cell_well='4_D01'

Metrics stored in “JSON-tables”

Bunch of interesting metrics (for example ccs2.hifi_length_summary.read_length) are stored in JSON-”tables”. Unfortunately they are organized in “column-based” fashion making it nearly impossible to extract metrics from DB later on. Therefore a new synthetic twin tables are created where metrics are organized in “row-based” fashion (in other words things are “transposed”)

...

As a result, straightforward JSON-extraction becomes possible

...

Technical caveats

This framework is tightly coupled to PacBio’s internal file-structure (unfortunately and inevitably). So, next time PacBio change their SMRTLink version, this solution may have to be fixed accordingly.
All metrics stored in PACBIO datamart are in JSON format. Metrics in XML files are converted into JSON
for each digested metrics file, a special “domain” field is generated - it allows for similar metrics to be grouped and queried via SQL later on
examples shown are for v11 installation on “sodium”. Once “skywalker” is operational switch over should be relatively easy.
ANALYTICS.PACBIO datamart (along with relevant views) is located in this Oracle instance
Code Block
db.analytics.url="jdbc:oracle:thin:@//seqprod.broadinstitute.org:1521/seqprod.broadinstitute.org"
username: REPORTING
"ANALYTICS.PACBIO_STAR" view demonstrates how to merge together multiple files (ccs_report, loading, etc) in a flat per (run,cell_well) datasource. It is based on SmrtLink v10, hydrogen data (site_id=1) but techniques used are 100% legit.
Surgically extract fields from metrics-JSON via Oracle JSON
progress of Sodium PacBio flattened metrics ETL can be checked here ETL dashboard
“per-barcode” metrics are supported by converting multiple “consensusreadset.xml“ files into JSONs and then merging these into a single “synthetic JSON-array“. These can be recognized by checking for trailing “*” at the end of “domain” field.

...

Version	Old Version 6	New Version 7
Changes made by	Atanas Mihalev	Atanas Mihalev
Saved on	2022-06-17	2022-06-21

Versions Compared

Key

The problem

Metrics stored in “JSON-tables”

Technical caveats