FilePusher - push tsv/csv metrics-files to Analytics


Motivation

Historically, Analytics has gotten qc-metrics by “pulling” files from the file system, parsing fields and loading them into appropriate tables. This method has a few disadvantages:

  • files must be made accessible to the “analytics” VM. Often this requires the respective teams to manually move files into the “Tableau_Files” shared folder on a regular basis

  • each filetype (vvp, smartseq, etc.) requires different parsing logic and a dedicated table

  • imperfect data (like NaN or malformed timestamps) can often derail the entire ETL pipeline

Wouldn’t it be great to have a generic “Metrics web-service + FilePusher” solution where:

  • qc-files can be left at their original location (no copy required)

  • file-system sharing is not required, TCP/IP is enough (so that if Oppy needed to send some qc-metrics from Mars, it could do so)

  • no need to create new customized tables for each application - everything (vvp, smartseq, etc.) would be transformed into JSON and stored in the METRICS table

  • resilience - at worst, imperfect data would lead to individual fields showing up as blank

  • straightforward JSON extraction in Tableau by using Oracle’s JSON constructs

Setup

  1. Register your new application (for example “vvp”) with the “Metrics” service (contact @Atanas Mihalev) and specify whether files are “csv” or “tsv”. You will be assigned a “SITE_ID” for your app.

  2. Set up a “FilePusher” job to push the files on a regular basis

What is “FilePusher”

It’s a tiny Scala/Java app which wakes up on a schedule, grabs the new “delta” files, and pushes them to the “Metrics” web service over TCP/IP. It can be deployed wherever your files are.

Check that your server has TCP/IP connectivity to the “analytics” VM:

ping analytics.broadinstitute.org

Make sure you have Java installed:

java -version

You might have to specify the full path to your “java” executable if the PATH variable has not been updated.
The tool has been tested with Java 8.

Copy the “remote-assembly-1.0.jar” (provided by Analytics) onto your server.

Try to “push” one file like this (make sure to adjust the parameters for your application):

/broad/software/free/Linux/redhat_6_x86_64/pkgs/jdk1.8.0_121/bin/java -cp /home/unix/analytics/TigerETL3/remote-assembly-1.0.jar analytics.tiger.remote.FilePusher 'dryrun=false' 'url=https://analytics.broadinstitute.org/Metrics?type=vvp' 'filepathRegex=(.*)/(.*AspAllOutputQC.csv)' 'delta=FILES_CSV /seq/tableau_files/VVPVolumeQC/20221205_RACK_QC_083303_AspAllOutputQC.csv'

You should get output like this:

 

The “Metrics” web service accepts the CSV/TSV file, converts it into JSON, and stores it in the METRICS datamart.

 

Go to ANALYTICS on the SEQPROD db and check that your file/metrics are in the METRICS datamart.

“FilePusher” can be customized using these parameters:

  • url=https://analytics.broadinstitute.org/Metrics?type=vvp
    where the file should be sent

  • filepathRegex=(.*)/(.*AspAllOutputQC.csv)
    which file-names should be picked

  • outer_columns=field1=$1,field2=$2,field3=XYZ
    parses fields out of the filepathRegex capture groups and injects them into the JSON (useful when bits of data are encoded in the filename)

  • dryrun=true
    runs all the delta-capturing and regex-parsing and shows the data without actually pushing the file

  • delta=<delta specification>
    specifies how files are to be picked

    • pick 1 (or multiple) specific files
      delta=FILES_CSV /seq/tableau_files/VVPVolumeQC/20221205_RACK_QC_083303_AspAllOutputQC.csv

    • pick files timestamped between this and that timestamp
      delta=FILES IN FOLDER /seq/tableau_files/VVPVolumeQC TIMESTAMPED BETWEEN 2019-10-19 13:11:46 AND 2019-10-20 13:11:46

    • pick files timestamped in last 10 hours
      delta=FILES IN FOLDER /seq/tableau_files/VVPVolumeQC TIMESTAMPED BETWEEN -10h AND NOW

    • pick files timestamped between persisted-in-file-timestamp and now (production setup)
      delta=FILES IN FOLDER /seq/tableau_files/VVPVolumeQC TIMESTAMPED BETWEEN /seq/tableau_files/VVPVolumeQC/VVP_etl_timestamp.txt AND NOW
      The file merely contains a timestamp (e.g. 2019-10-19 13:11:46) and should be created manually

    • adjust the delta to mitigate clock-discrepancy problems
      delta=FILES IN FOLDER /seq/tableau_files/VVPVolumeQC TIMESTAMPED BETWEEN /seq/tableau_files/VVPVolumeQC/VVP_etl_timestamp.txt MINUS 5 MINUTES AND NOW
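
To see how filepathRegex and outer_columns fit together, here is a small Python illustration (the actual FilePusher is Scala, so this is only a sketch of the documented behavior; the sample path and field names come from the examples above):

```python
import re

# sample file path from the examples above
filepath = "/seq/tableau_files/VVPVolumeQC/20221205_RACK_QC_083303_AspAllOutputQC.csv"

# filepathRegex=(.*)/(.*AspAllOutputQC.csv) -- decides whether the file is picked
m = re.fullmatch(r"(.*)/(.*AspAllOutputQC.csv)", filepath)

# outer_columns=field1=$1,field2=$2,field3=XYZ -- $1/$2 refer to the capture
# groups above; anything else (like XYZ) is passed through as a literal
outer_columns = {
    "field1": m.group(1),  # folder part
    "field2": m.group(2),  # file-name part
    "field3": "XYZ",       # literal value
}
print(outer_columns)
```

These key/value pairs are what ends up injected into the JSON alongside the parsed file contents.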

 

The final production-grade command would look like this (the same invocation as above, with the single-file delta replaced by the persisted-timestamp delta):

/broad/software/free/Linux/redhat_6_x86_64/pkgs/jdk1.8.0_121/bin/java -cp /home/unix/analytics/TigerETL3/remote-assembly-1.0.jar analytics.tiger.remote.FilePusher 'dryrun=false' 'url=https://analytics.broadinstitute.org/Metrics?type=vvp' 'filepathRegex=(.*)/(.*AspAllOutputQC.csv)' 'delta=FILES IN FOLDER /seq/tableau_files/VVPVolumeQC TIMESTAMPED BETWEEN /seq/tableau_files/VVPVolumeQC/VVP_etl_timestamp.txt MINUS 5 MINUTES AND NOW'

For testing purposes, an individual file can also be pushed manually using curl
(ATTN: LAST_MODIFIED will not be populated in this case)
on Linux

on Windows


Alternative ways of automated “push” are also possible - for example, pushing directly from within a Perl script using the “Requests” library - all necessary parameters can be inferred from the curl commands above.
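
As an illustration, here is a minimal Python sketch using only the standard library. The POST method and Content-Type below are assumptions - mirror whatever your working curl command does:

```python
import os
import tempfile
import urllib.request
from pathlib import Path

def build_push_request(url: str, path: str) -> urllib.request.Request:
    """Build (but do not send) an HTTP request carrying the raw file body.

    Assumes the Metrics service accepts the file contents as a plain POST
    body; adjust method/headers to match the curl invocation for your app.
    """
    body = Path(path).read_bytes()
    return urllib.request.Request(
        url, data=body, method="POST",
        headers={"Content-Type": "text/csv"})

# demo with a throwaway file standing in for a real qc-metrics csv
fd, tmp = tempfile.mkstemp(suffix=".csv")
os.close(fd)
Path(tmp).write_text("sample_id,volume\nS1,12.5\n")

req = build_push_request(
    "https://analytics.broadinstitute.org/Metrics?type=vvp", tmp)
# urllib.request.urlopen(req)  # uncomment to perform the actual push
```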

Caveats

  • JSON-queries appear to be sensitive to the Oracle 19.8.0 Bug 31532339 - ORA-600 [koksccda1]
    DBAs are working to address this by upgrading SEQPROD to v19.15.0

  • “analytics” VM is in our private network, so it can’t be directly accessed from Google Cloud.
    However, an “onprem” script can easily read from GC and push to the “analytics” VM

  • True: the cost of JSON parsing on Tableau’s side would be higher compared to old-fashioned Oracle tables. However, for small-scale applications this cost would be negligible.