Motivation
...
qc-files can be left at their original location (no copy required)
file-system sharing is not required, tcp/ip is enough (so that if Oppy needs to send some qc-metrics from Mars it would be able to do so)
no need to create new customized tables for each application - everything (vvp, smartessmartseq, etc) would be transformed into JSON and stored in METRICS table
resilience - at most imperfect data would lead to individual fields showing up as blank
straightforward JSON extraction in Tableau by using Oracle’s JSON constructs
...
url=https://analytics.broadinstitute.org/Metrics?type=vvp
where the file should be sent tofilepathRegex=(.*)/(.*AspAllOutputQC.csv)
which file-names should be pickedouter_columns=field1=$1,field2=$2,field3=XYZ
parse fields out of the filepathRegex and injects them in the JSON (useful when bits of data is encoded in the filename)dryrun=true
run all the delta capturing and regex-parsing and show the data without actually pushing the filedelta=<delta specification>
specifies ”how” files to be pickedpick 1 (or multiple) specific files
delta=FILES_CSV /seq/tableau_files/VVPVolumeQC/20221205_RACK_QC_083303_AspAllOutputQC.csvpick files timestamped between this and that timestamp
delta=FILES IN FOLDER /seq/tableau_files/VVPVolumeQC TIMESTAMPED BETWEEN 2019-10-19 13:11:46 AND 2019-10-19 13:11:46pick files timestamped in last 10 hours
delta=FILES IN FOLDER /seq/tableau_files/VVPVolumeQC TIMESTAMPED BETWEEN -10h AND NOWpick files timestamped between persisted-in-file-timestamp and now (production setup)
delta=FILES IN FOLDER /seq/tableau_files/VVPVolumeQC TIMESTAMPED BETWEEN /seq/tableau_files/VVPVolumeQC/VVP_etl_timestamp.txt AND NOW
The file merely contains a timestamp(2019-10-19 13:11:46) and should be created manually
For testing purposes, an individual file can also be manually pushed by using curl:
on Linux
...
breakoutMode | wide |
---|
...
adjust delta so that you can mitigate clock-discrepancy problems
delta=FILES IN FOLDER /seq/tableau_files/VVPVolumeQC TIMESTAMPED BETWEEN /
...
on Windows
...
breakoutMode | wide |
---|
...
seq/tableau_files/VVPVolumeQC/VVP_etl_timestamp.txt MINUS 5 MINUTES AND NOW
Final production-grade command would look like this
Code Block | ||
---|---|---|
| ||
/broad/software/free/Linux/redhat_6_x86_64/pkgs/jdk1.8.0_121/bin/java -cp /home/unix/analytics/TigerETL3/remote-assembly-1.0.jar analytics.tiger.remote.FilePusher 'dryrun=false' 'url=https://analytics.broadinstitute.org/Metrics?type=vvp' 'filepathRegex=(.*)/(.*AspAllOutputQC.csv)' 'delta=FILES IN FOLDER /seq/tableau_files/VVPVolumeQC TIMESTAMPED BETWEEN /seq/tableau_files/VVPVolumeQC/VVP_etl_timestamp.txt AND NOW' |
...
For testing purposes, an individual file can also be manually pushed by using curl
(ATTN: LAST_MODIFIED will not be populated in this case)
on Linux
Code Block | ||
---|---|---|
| ||
curl -F file="@/seq/tableau_files/VVPVolumeQC/20221205_RACK_QC_083303_AspAllOutputQC.csv" "https://analytics.broadinstitute.org/Metrics?type=vvp" |
on Windows
Code Block | ||
---|---|---|
| ||
curl -F file="@c:\myfolder\20221205_RACK_QC_083303_AspAllOutputQC.csv" "https://analytics.broadinstitute.org/Metrics?type=vvp" |
...
Alternative ways of automated “push” are also possible - for example push directly from within Perl script using the "Requests" library - all necessary parameters can be inferred from the curl commands above.
Caveats
JSON-queries appear to be sensitive to the Oracle 19.8.0 Bug 31532339 - ORA-600 [koksccda1]
DBAs are working to address this by upgrading SEQPROD to v19.15.0“analytics” VM is in our private network so it can’t be directly accessed from Google Cloud.
However an “onprem”-script can easily read from GC and push to “analytics” VMTrue: Cost of JSON-parsing from Tableau’s side would be more compared to old-fashioned Oracle tables. However for small scale applications this cost would be negligible.