Skip to end of metadata
Go to start of metadata

You are viewing an old version of this page. View the current version.

Compare with Current View Version History

« Previous Version 9 Next »

Motivation

Historically, the way Analytics gets qc-metrics has been by “pulling” the files from file system, parsing fields and load them into appropriate tables. This method has few disadvantages:

  • files must be made accessible to “analytics” VM. Often this requires respective teams to be manually moving files into “Tableau_Files” shared folder on regular basis

  • each filetype (vvp, smartseq,etc) requires different parsing logic and dedicated table

  • often imperfect data (like NaN, malformed timestamps) is able to derail entire ETL pipeline

Wouldn’t it be great to have a generic “Metrics web-service + FilePush“ solution where:

  • qc-files can be left at their original location (no copy required)

  • file-system sharing is not required, tcp/ip is enough (so that if Oppy needs to send some qc-metrics from Mars it would be able to do so)

  • no need to create new customized tables for each application - everything (vvp, smartes, etc) would be transformed into JSON and stored in METRICS table

  • resilience - at most imperfect data would lead to individual fields showing up as blank

  • straightforward JSON extraction in Tableau by using Oracle’s JSON constructs

Setup

  1. Register your new application (for example “vvp”) with the “Metrics” service (contact Atanas Mihalev) and specify if files are “csv” or “tsv”. You will be assigned a “SITE_ID” for your app.

  2. setup a “FilePusher” solution to push the files on regular basis

What is “FilePusher”

It’s a tiny Scala/Java app which wakes up on some schedule, grabs the new “delta” files and push them to “Metrics” web service over tcp/ip. It can be deployed wherever your files are.

check if your server has tcp/ip connectivity to “analytics” VM

ping analytics.broadinstitute.org

make sure you have Java installed.

java -version

You might have to specify the full path to your “java” executable if PATH-variable is not updated.
Tool has been tested with Java8.

Copy the “remote-assembly-1.0.jar" (provided by analytics) locally on your server

Try to “push” one file like this (make sure to adjust parameters for your application)

/broad/software/free/Linux/redhat_6_x86_64/pkgs/jdk1.8.0_121/bin/java -cp /home/unix/analytics/TigerETL3/remote-assembly-1.0.jar analytics.tiger.remote.FilePusher 'dryrun=false' 'url=https://analytics.broadinstitute.org/Metrics?type=vvp' 'filepathRegex=(.*)/(.*AspAllOutputQC.csv)' 'delta=FILES_CSV /seq/tableau_files/VVPVolumeQC/20221205_RACK_QC_083303_AspAllOutputQC.csv'

you should get output like this

--- PARAMS START ---
dryrun = false
url = https://analytics.broadinstitute.org/Metrics?type=vvp
filepathRegex = (.*)/(.*AspAllOutputQC.csv)
delta = FILES_CSV /seq/tableau_files/VVPVolumeQC/20221205_RACK_QC_083303_AspAllOutputQC.csv
--- PARAMS END ---

Upload started
FILE: /seq/tableau_files/VVPVolumeQC/20221205_RACK_QC_083303_AspAllOutputQC.csv
outer_columns INFERENCE: List($0=/seq/tableau_files/VVPVolumeQC, $1=20221205_RACK_QC_083303_AspAllOutputQC.csv)
COMMAND: curl -F file="@/seq/tableau_files/VVPVolumeQC/20221205_RACK_QC_083303_AspAllOutputQC.csv" "https://analytics.broadinstitute.org/Metrics?type=vvp"
UPLOAD STARTS
OK
Map(type -> vvp, filename -> 20221205_RACK_QC_083303_AspAllOutputQC.csv)
plugin found: analytics.tiger.web.Metrics$$anon$5
20221205_RACK_QC_083303_AspAllOutputQC.csv => OK
---------------------------------

Upload ended
1 files processed

“Metrics” web-service accepts the CSV/TSV file, converts it into JSON and store it in METRICS datamart

Go to ANALYTICS on SEQPROD db and check if your file/metrics are in METRICS datamart

SELECT 
    a.site_id, 
    a.filename,
    --
    "Position",
    "V1" 
FROM metrics a ,
json_table(DATA, '$.records[*]'
COLUMNS(
    "Position"   PATH '$.Position',
    "V1"        NUMBER PATH '$.V1'
)) AS c
WHERE a.site_id =1

“FilePusher” can be customized using these parameters:

  • url=https://analytics.broadinstitute.org/Metrics?type=vvp
    where the file should be sent to

  • filepathRegex=(.*)/(.*AspAllOutputQC.csv)
    which file-names should be picked

  • outer_columns=field1=$1,field2=$2,field3=XYZ
    parse fields out of the filepathRegex and injects them in the JSON (useful when bits of data is encoded in the filename)

  • dryrun=true
    run all the delta capturing and regex-parsing and show the data without actually pushing the file

  • delta=<delta specification>
    specifies ”how” files to be picked

    • pick 1 (or multiple) specific files
      delta=FILES_CSV /seq/tableau_files/VVPVolumeQC/20221205_RACK_QC_083303_AspAllOutputQC.csv

    • pick files timestamped between this and that timestamp
      delta=FILES IN FOLDER /seq/tableau_files/VVPVolumeQC TIMESTAMPED BETWEEN 2019-10-19 13:11:46 AND 2019-10-19 13:11:46

    • pick files timestamped in last 10 hours
      delta=FILES IN FOLDER /seq/tableau_files/VVPVolumeQC TIMESTAMPED BETWEEN -10h AND NOW

    • pick files timestamped between persisted-in-file-timestamp and now (production setup)
      delta=FILES IN FOLDER /seq/tableau_files/VVPVolumeQC TIMESTAMPED BETWEEN /seq/tableau_files/VVPVolumeQC/VVP_etl_timestamp.txt AND NOW
      The file merely contains a timestamp(2019-10-19 13:11:46) and should be created manually

For testing purposes, an individual file can also be manually pushed by using curl:
on Linux

curl -F file="@/seq/tableau_files/VVPVolumeQC/20221205_RACK_QC_083303_AspAllOutputQC.csv" "https://analytics.broadinstitute.org/Metrics?type=vvp"

on Windows

curl -F file="@c:\myfolder\20221205_RACK_QC_083303_AspAllOutputQC.csv" "https://analytics.broadinstitute.org/Metrics?type=vvp"

Final production-grade command would look like this

/broad/software/free/Linux/redhat_6_x86_64/pkgs/jdk1.8.0_121/bin/java -cp /home/unix/analytics/TigerETL3/remote-assembly-1.0.jar analytics.tiger.remote.FilePusher 'dryrun=false' 'url=https://analytics.broadinstitute.org/Metrics?type=vvp' 'filepathRegex=(.*)/(.*AspAllOutputQC.csv)' 'delta=FILES IN FOLDER /seq/tableau_files/VVPVolumeQC TIMESTAMPED BETWEEN /seq/tableau_files/VVPVolumeQC/VVP_etl_timestamp.txt AND NOW'

  • No labels