Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

The problem

...

Code Block
SELECT a.run_name, a.cell_well, "etl.dataset", c."rowid", 
    REPLACE(c."Read Length (bp)", CHR(191), '>=') "Read Length (bp)", -- '>=' UTF8 e2 89 a5
    "Reads", "Reads (%)" ,"Yield (bp)", "Yield (%)" 
FROM pacbio a,
json_table(DATA, '$[*]'
COLUMNS(
    "etl.dataset" path '$."etl.dataset"',
    NESTED PATH "etl.ccs2.hifi_length_summary"[*] COLUMNS(
        "rowid"  PATH '$.rowid',
        "Read Length (bp)"  PATH '$."Read Length (bp)"',
        "Reads"             NUMBER PATH '$.Reads',
        "Reads (%)"         NUMBER PATH '$."Reads (%)"',
        "Yield (bp)"        NUMBER PATH '$."Yield (bp)"',
        "Yield (%)"         NUMBER PATH '$."Yield (%)"'
    )
)) AS c
WHERE site_id=3 AND a.domain='CROMWELL/sl_dataset_reports/*/call-import_dataset_reports/execution/ccs.report.json*' 
AND a.run_name='r64020e_20220519_191246' AND a.cell_well='1_B01'

...

...

Metrics stored in “attributes“ JSON-array

Other metrics are stored in “attributes” JSON-array (on the left side). A new synthetic “etl.attributes“ JSON-object is added to allow more natural JSON-extraction from the DB.

...

Code Block
SELECT a.run_name, a.cell_well, "etl.dataset", "HiFi Reads", "HiFi Yield (bp)", "HiFi Read Length (mean, bp)"
FROM pacbio a,
json_table(DATA, '$[*]'
COLUMNS(
    "HiFi Reads"                    NUMBER PATH '$."etl.attributes"."ccs2.number_of_ccs_reads".value',
    "HiFi Yield (bp)"               NUMBER PATH '$."etl.attributes"."ccs2.total_number_of_ccs_bases".value',
    "HiFi Read Length (mean, bp)"   NUMBER PATH '$."etl.attributes"."ccs2.mean_ccs_readlength".value',
    "etl.dataset" path '$."etl.dataset"'
)) AS c
WHERE site_id=3 AND a.domain='CROMWELL/sl_dataset_reports/*/call-import_dataset_reports/execution/ccs.report.json*' 
AND a.run_name='r64020e_20220519_191246' AND a.cell_well='1_B01'

...

Technical caveats

  • This framework is tightly coupled to PacBio’s internal file-structure (unfortunately and inevitably). So, next time PacBio change their SMRTLink version, this solution may have to be fixed accordingly.

  • All metrics stored in PACBIO datamart are in JSON format. Metrics in XML files are converted into JSON

  • for each digested metrics file, a special “domain” field is generated - it allows for similar metrics to be grouped and queried via SQL later on

  • examples shown are for v11 installation on “sodium”. Once “skywalker” is operational switch over should be relatively easy.

  • ANALYTICS.PACBIO datamart (along with relevant views) is located in this Oracle instance

    Code Block
    db.analytics.url="jdbc:oracle:thin:@//seqprod.broadinstitute.org:1521/seqprod.broadinstitute.org"

    username: REPORTING

  • "ANALYTICS.PACBIO_STAR" view demonstrates how to merge together multiple files (ccs_report, loading, etc) in a flat per (run,cell_well) datasource. It is based on SmrtLink v10, hydrogen data (site_id=1) but techniques used are 100% legit.

  • Surgically extract fields from metrics-JSON via Oracle JSON

  • progress of Sodium PacBio flattened metrics ETL can be checked here ETL dashboard

  • “per-barcode” metrics are supported by converting multiple “consensusreadset.xml“ files into JSONs and then merging these into a single “synthetic JSON-array“. These can be recognized by checking for trailing “*” at the end of “domain” field.

...