/
FilePusher - push tsv/csv metrics-files to Analytics

FilePusher - push tsv/csv metrics-files to Analytics

Motivation

Historically, the way Analytics gets qc-metrics has been by “pulling” the files from file system, parsing fields and load them into appropriate tables. This method has few disadvantages:

  • files must be made accessible to “analytics” VM. Often this requires respective teams to be manually moving files into “Tableau_Files” shared folder on regular basis

  • each filetype (vvp, smartseq,etc) requires different parsing logic and dedicated table

  • often imperfect data (like NaN, malformed timestamps) is able to derail entire ETL pipeline

Wouldn’t it be great to have a generic “Metrics web-service + FilePush“ solution where:

  • qc-files can be left at their original location (no copy required)

  • file-system sharing is not required, tcp/ip is enough (so that if Oppy needs to send some qc-metrics from Mars it would be able to do so)

  • no need to create new customized tables for each application - everything (vvp, smartseq, etc) would be transformed into JSON and stored in METRICS table

  • resilience - at most imperfect data would lead to individual fields showing up as blank

  • straightforward JSON extraction in Tableau by using Oracle’s JSON constructs

Setup

  1. Register your new application (for example “vvp”) with the “Metrics” service (contact @Atanas Mihalev) and specify if files are “csv” or “tsv”. You will be assigned a “SITE_ID” for your app.

  2. setup a “FilePusher” solution to push the files on regular basis

What is “FilePusher”

It’s a tiny Scala/Java app which wakes up on some schedule, grabs the new “delta” files and push them to “Metrics” web service over tcp/ip. It can be deployed wherever your files are.

check if your server has tcp/ip connectivity to “analytics” VM

ping analytics.broadinstitute.org

make sure you have Java installed.

java -version

You might have to specify the full path to your “java” executable if PATH-variable is not updated.
Tool has been tested with Java8.

Copy the “remote-assembly-1.0.jar" (provided by analytics) locally on your server

Try to “push” one file like this (make sure to adjust parameters for your application)

/broad/software/free/Linux/redhat_6_x86_64/pkgs/jdk1.8.0_121/bin/java -cp /home/unix/analytics/TigerETL3/remote-assembly-1.0.jar analytics.tiger.remote.FilePusher 'dryrun=false' 'url=https://analytics.broadinstitute.org/Metrics?type=vvp' 'filepathRegex=(.*)/(.*AspAllOutputQC.csv)' 'delta=FILES_CSV /seq/tableau_files/VVPVolumeQC/20221205_RACK_QC_083303_AspAllOutputQC.csv'

you should get output like this

 

“Metrics” web-service accepts the CSV/TSV file, converts it into JSON and store it in METRICS datamart

 

Go to ANALYTICS on SEQPROD db and check if your file/metrics are in METRICS datamart

“FilePusher” can be customized using these parameters:

  • url=https://analytics.broadinstitute.org/Metrics?type=vvp
    where the file should be sent to

  • filepathRegex=(.*)/(.*AspAllOutputQC.csv)
    which file-names should be picked

  • outer_columns=field1=$1,field2=$2,field3=XYZ
    parse fields out of the filepathRegex and injects them in the JSON (useful when bits of data is encoded in the filename)

  • dryrun=true
    run all the delta capturing and regex-parsing and show the data without actually pushing the file

  • delta=<delta specification>
    specifies ”how” files to be picked

    • pick 1 (or multiple) specific files
      delta=FILES_CSV /seq/tableau_files/VVPVolumeQC/20221205_RACK_QC_083303_AspAllOutputQC.csv

    • pick files timestamped between this and that timestamp
      delta=FILES IN FOLDER /seq/tableau_files/VVPVolumeQC TIMESTAMPED BETWEEN 2019-10-19 13:11:46 AND 2019-10-19 13:11:46

    • pick files timestamped in last 10 hours
      delta=FILES IN FOLDER /seq/tableau_files/VVPVolumeQC TIMESTAMPED BETWEEN -10h AND NOW

    • pick files timestamped between persisted-in-file-timestamp and now (production setup)
      delta=FILES IN FOLDER /seq/tableau_files/VVPVolumeQC TIMESTAMPED BETWEEN /seq/tableau_files/VVPVolumeQC/VVP_etl_timestamp.txt AND NOW
      The file merely contains a timestamp(2019-10-19 13:11:46) and should be created manually

    • adjust delta so that you can mitigate clock-discrepancy problems
      delta=FILES IN FOLDER /seq/tableau_files/VVPVolumeQC TIMESTAMPED BETWEEN /seq/tableau_files/VVPVolumeQC/VVP_etl_timestamp.txt MINUS 5 MINUTES AND NOW

 

Final production-grade command would look like this


For testing purposes, an individual file can also be manually pushed by using curl
(ATTN: LAST_MODIFIED will not be populated in this case)
on Linux

on Windows


Alternative ways of automated “push” are also possible - for example push directly from within Perl script using the "Requests" library - all necessary parameters can be inferred from the curl commands above.

Caveats

  • JSON-queries appear to be sensitive to the Oracle 19.8.0 Bug 31532339 - ORA-600 [koksccda1]
    DBAs are working to address this by upgrading SEQPROD to v19.15.0

  • “analytics” VM is in our private network so it can’t be directly accessed from Google Cloud.
    However an “onprem”-script can easily read from GC and push to “analytics” VM

  • True: Cost of JSON-parsing from Tableau’s side would be more compared to old-fashioned Oracle tables. However for small scale applications this cost would be negligible.

 

Related content

File Pusher (FlatFileMetrics helper)
File Pusher (FlatFileMetrics helper)
More like this
FsMetrics VirtualView - make a TSV/CSV metrics file in the FileSystem look like an Oracle view
FsMetrics VirtualView - make a TSV/CSV metrics file in the FileSystem look like an Oracle view
More like this
ANALYTICS Web Server
ANALYTICS Web Server
More like this
Tableau REST API on steroids (Scala wrapper)
Tableau REST API on steroids (Scala wrapper)
More like this
Gmetrics VirtualView - make a Google TSV metrics file look like an Oracle view
Gmetrics VirtualView - make a Google TSV metrics file look like an Oracle view
More like this
Jira Extraction API v2 (JQL-ETL Edition)
Jira Extraction API v2 (JQL-ETL Edition)
More like this