FilePusher - push TSV/CSV metrics files to Analytics
Motivation
Historically, Analytics has obtained qc-metrics by “pulling” files from the file system, parsing their fields, and loading them into the appropriate tables. This method has a few disadvantages:
files must be made accessible to the “analytics” VM; often this requires the respective teams to manually move files into the “Tableau_Files” shared folder on a regular basis
each filetype (vvp, smartseq, etc.) requires different parsing logic and a dedicated table
imperfect data (like NaN or malformed timestamps) can derail the entire ETL pipeline
Wouldn’t it be great to have a generic “Metrics web-service + FilePusher” solution where:
qc-files can be left at their original location (no copy required)
file-system sharing is not required; TCP/IP is enough (so that if Oppy needs to send some qc-metrics from Mars, it would be able to do so)
no need to create new customized tables for each application - everything (vvp, smartseq, etc.) is transformed into JSON and stored in the METRICS table
resilience - at worst, imperfect data leads to individual fields showing up as blank
straightforward JSON extraction in Tableau by using Oracle’s JSON constructs
Setup
Register your new application (for example “vvp”) with the “Metrics” service (contact @Atanas Mihalev) and specify whether your files are “csv” or “tsv”. You will be assigned a “SITE_ID” for your app.
set up a “FilePusher” solution to push the files on a regular basis
What is “FilePusher”
It’s a tiny Scala/Java app which wakes up on a schedule, grabs the new “delta” files, and pushes them to the “Metrics” web service over TCP/IP. It can be deployed wherever your files are.
check that your server has TCP/IP connectivity to the “analytics” VM
ping analytics.broadinstitute.org
make sure you have Java installed.
java -version
You might have to specify the full path to your “java” executable if your PATH variable is not up to date.
The tool has been tested with Java 8.
Copy the “remote-assembly-1.0.jar” (provided by analytics) locally onto your server
Try to “push” one file like this (make sure to adjust the parameters for your application)
/broad/software/free/Linux/redhat_6_x86_64/pkgs/jdk1.8.0_121/bin/java -cp /home/unix/analytics/TigerETL3/remote-assembly-1.0.jar analytics.tiger.remote.FilePusher 'dryrun=false' 'url=https://analytics.broadinstitute.org/Metrics?type=vvp' 'filepathRegex=(.*)/(.*AspAllOutputQC.csv)' 'delta=FILES_CSV /seq/tableau_files/VVPVolumeQC/20221205_RACK_QC_083303_AspAllOutputQC.csv'
you should get output like this
The “Metrics” web service accepts the CSV/TSV file, converts it into JSON, and stores it in the METRICS datamart
Go to ANALYTICS on the SEQPROD db and check that your file/metrics are in the METRICS datamart
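The CSV-to-JSON conversion the service performs can be pictured with a small Python sketch (illustrative only; the actual JSON layout the service stores in METRICS may differ):

```python
import csv
import io
import json

def csv_to_json_records(text, delimiter=","):
    """Turn a CSV/TSV payload into one JSON document per data row,
    keyed by the header line (a rough picture of what the service does)."""
    rows = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return [json.dumps(row) for row in rows]

# hypothetical two-row qc file
sample = "well,volume\nA1,49.5\nA2,NaN\n"
records = csv_to_json_records(sample)
print(records[0])  # {"well": "A1", "volume": "49.5"}
```

Note how this also illustrates the resilience point from the Motivation section: a NaN value simply ends up as the string "NaN" inside one JSON field instead of derailing the whole load.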
“FilePusher” can be customized using these parameters:
url=https://analytics.broadinstitute.org/Metrics?type=vvp
    where the file should be sent to
filepathRegex=(.*)/(.*AspAllOutputQC.csv)
    which file-names should be picked
outer_columns=field1=$1,field2=$2,field3=XYZ
    parses fields out of the filepathRegex and injects them into the JSON (useful when bits of data are encoded in the filename)
dryrun=true
    runs all the delta capturing and regex parsing and shows the data without actually pushing the file
delta=<delta specification>
    specifies “how” files are to be picked:
    pick 1 (or multiple) specific files
        delta=FILES_CSV /seq/tableau_files/VVPVolumeQC/20221205_RACK_QC_083303_AspAllOutputQC.csv
    pick files timestamped between one timestamp and another
        delta=FILES IN FOLDER /seq/tableau_files/VVPVolumeQC TIMESTAMPED BETWEEN 2019-10-19 13:11:46 AND 2019-10-19 13:11:46
    pick files timestamped in the last 10 hours
        delta=FILES IN FOLDER /seq/tableau_files/VVPVolumeQC TIMESTAMPED BETWEEN -10h AND NOW
    pick files timestamped between a timestamp persisted in a file and now (production setup)
        delta=FILES IN FOLDER /seq/tableau_files/VVPVolumeQC TIMESTAMPED BETWEEN /seq/tableau_files/VVPVolumeQC/VVP_etl_timestamp.txt AND NOW
        (the file merely contains a timestamp, e.g. 2019-10-19 13:11:46, and should be created manually)
    adjust the delta to mitigate clock-discrepancy problems
        delta=FILES IN FOLDER /seq/tableau_files/VVPVolumeQC TIMESTAMPED BETWEEN /seq/tableau_files/VVPVolumeQC/VVP_etl_timestamp.txt MINUS 5 MINUTES AND NOW
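The interplay of filepathRegex and outer_columns described above can be sketched in a few lines of Python (a hypothetical mimic for illustration, not the actual Scala implementation):

```python
import json
import re

def extract_outer_columns(filepath, filepath_regex, outer_columns):
    """Mimic how FilePusher might derive extra JSON fields from a file path.

    outer_columns is a spec like "field1=$1,field2=$2,field3=XYZ":
    $N refers to the N-th capture group of filepath_regex,
    anything else is taken as a literal value.
    """
    m = re.fullmatch(filepath_regex, filepath)
    if m is None:
        return None  # file name does not match, so the file is not picked
    fields = {}
    for pair in outer_columns.split(","):
        name, value = pair.split("=", 1)
        if value.startswith("$"):
            value = m.group(int(value[1:]))  # substitute the capture group
        fields[name] = value
    return fields

path = "/seq/tableau_files/VVPVolumeQC/20221205_RACK_QC_083303_AspAllOutputQC.csv"
fields = extract_outer_columns(
    path,
    r"(.*)/(.*AspAllOutputQC.csv)",
    "field1=$1,field2=$2,field3=XYZ",
)
print(json.dumps(fields))
# {"field1": "/seq/tableau_files/VVPVolumeQC", "field2": "20221205_RACK_QC_083303_AspAllOutputQC.csv", "field3": "XYZ"}
```

Here $1 captures the folder and $2 the file name, which is handy when, as the description says, bits of data (rack id, dates) are encoded in the filename.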
A final production-grade command would look like this (combining the pieces above):
/broad/software/free/Linux/redhat_6_x86_64/pkgs/jdk1.8.0_121/bin/java -cp /home/unix/analytics/TigerETL3/remote-assembly-1.0.jar analytics.tiger.remote.FilePusher 'dryrun=false' 'url=https://analytics.broadinstitute.org/Metrics?type=vvp' 'filepathRegex=(.*)/(.*AspAllOutputQC.csv)' 'delta=FILES IN FOLDER /seq/tableau_files/VVPVolumeQC TIMESTAMPED BETWEEN /seq/tableau_files/VVPVolumeQC/VVP_etl_timestamp.txt MINUS 5 MINUTES AND NOW'
For testing purposes, an individual file can also be manually pushed by using curl
(ATTN: LAST_MODIFIED will not be populated in this case)
on Linux
on Windows
Alternative ways of automated “push” are also possible - for example, pushing directly from within a Perl script using the "Requests" library; all necessary parameters can be inferred from the curl commands above.
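As a concrete sketch of such a scripted push (in Python rather than Perl, and assuming a raw POST of the file body - the exact headers and flags the Metrics service expects should be copied from the curl commands above):

```python
import tempfile
import urllib.request
from pathlib import Path

def build_push_request(url, filepath):
    """Build (but do not send) an HTTP POST pushing one metrics file.

    The payload format is an assumption for illustration; mirror the
    real curl invocation when adapting this.
    """
    data = Path(filepath).read_bytes()
    return urllib.request.Request(
        url, data=data, headers={"Content-Type": "text/csv"}, method="POST"
    )

# demo with a throwaway CSV, since the real /seq/... paths only exist on the Broad filesystem
demo = Path(tempfile.mkdtemp()) / "demo_AspAllOutputQC.csv"
demo.write_text("well,volume\nA1,49.5\n")
req = build_push_request("https://analytics.broadinstitute.org/Metrics?type=vvp", demo)
print(req.get_method(), req.full_url)
# urllib.request.urlopen(req) would actually send it (requires connectivity to the analytics VM)
```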
Caveats
JSON-queries appear to be sensitive to the Oracle 19.8.0 Bug 31532339 - ORA-600 [koksccda1]
DBAs are working to address this by upgrading SEQPROD to v19.15.0.
The “analytics” VM is in our private network, so it can’t be directly accessed from Google Cloud. However, an “onprem” script can easily read from GC and push to the “analytics” VM.
True: the cost of JSON parsing on Tableau’s side is higher than with old-fashioned Oracle tables; however, for small-scale applications this cost is negligible.