Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

The problem

...

Code Block
SELECT a.run_name, a.cell_well, "etl.dataset", c."rowid", 
    REPLACE(c."Read Length (bp)", CHR(191), '>=') "Read Length (bp)", -- '>=' UTF8 e2 89 a5
    "Reads", "Reads (%)" ,"Yield (bp)", "Yield (%)" 
FROM pacbio a,
json_table(DATA, '$[*]'
COLUMNS(
    "etl.dataset" path '$."etl.dataset"',
    NESTED PATH '$."etl.ccs2.hifi_length_summary"[*]' COLUMNS(
        "rowid"  PATH '$.rowid',
        "Read Length (bp)"         PATH '$."Read Length (bp)ccs2.hifi_length_summary.read_length"',
        "Reads"             NUMBER PATH '$.Reads"ccs2.hifi_length_summary.n_reads"',
        "Reads (%)"         NUMBER PATH '$."Reads (%)ccs2.hifi_length_summary.reads_pct"',
        "Yield (bp)"        NUMBER PATH '$."Yield (bp)ccs2.hifi_length_summary.yield"',
        "Yield (%)"         NUMBER PATH '$."Yield (%)ccs2.hifi_length_summary.yield_pct"'
    )
)) AS c
WHERE site_id=3 AND a.domain='CROMWELL/sl_dataset_reports/*/call-import_dataset_reports/execution/ccs.report.json*' 
AND a.run_name='r64020e_20220519_191246' AND a.cell_well='1_B01'

...

  • This framework is tightly coupled to PacBio’s internal file-structure (unfortunately and inevitably). So, next time PacBio change their SMRTLink version, this solution may have to be fixed accordingly.

  • All metrics stored in PACBIO datamart are in JSON format. Metrics in XML files are converted into JSON

  • for each digested metrics file, a special “domain” field is generated - it allows for similar metrics to be grouped and queried via SQL later on

  • examples shown are for v11 installation on “sodium”. Once “skywalker” is operational switch over should be relatively easy.

  • ANALYTICS.PACBIO datamart (along with relevant views) is located in this Oracle instance

    Code Block
    db.analytics.url="jdbc:oracle:thin:@//seqprod.broadinstitute.org:1521/seqprod.broadinstitute.org"

    username: REPORTING

  • "ANALYTICS.PACBIO_STAR" view demonstrates how to merge together multiple files (ccs_report, loading, etc) in a flat per (run,cell_well) datasource. It is based on SmrtLink v10, hydrogen data (site_id=1) but techniques used are 100% legit.

  • Surgically extract fields from metrics-JSON via Oracle JSON

  • progress of Sodium PacBio flattened metrics ETL can be checked here ETL dashboard

  • “per-barcode” metrics are supported by converting multiple “consensusreadset.xml“ files into JSONs and then merging these into a single “synthetic JSON-array“. These can be recognized by checking for trailing “*” at the end of “domain” field.

...