
Motivation

Jira data extracted via JqlTableauExtractService is currently recorded into Tableau hyper extracts. Unfortunately, Tableau’s ability to deal properly with delta-driven updates is very limited (it only works on datasets with a timeUpdated field), which has encouraged Analytics engineers to implement “grab-everything” jobs that have NO time-based filter. As a result these jobs keep growing in size and runtime every day, and Jira might start showing signs of (unnecessary) overload.

It would be great if the Jira extract were rerouted into a regular datamart in the DB, which offers many options for advanced delta-driven ETL handling.

Step 1: Use Einstein to cook up your NormalizedUrl (aka jqlLink)

Step 2: Create a corresponding table in the DB

CREATE TABLE ANALYTICS.BQMS_EXPLODED_SAMPLES
  (
  KEY                 VARCHAR2(100),
  SUMMARY             VARCHAR2(100),
  STATUS              VARCHAR2(50),
  REPORTER            VARCHAR2(60),
  PRODUCT_AND_PROCESS VARCHAR2(100),
  SAMPLE_ID           VARCHAR2(50),
  UPDATED             DATE
  )
/

CREATE INDEX ANALYTICS.BQMS_EXPLODED_SAMPLES_IDX1 ON ANALYTICS.BQMS_EXPLODED_SAMPLES(KEY ASC)
/

GRANT ALL ON ANALYTICS.BQMS_EXPLODED_SAMPLES TO analyticsetl
/

Step 3: Register a new JqlTask in “/home/unix/analytics/TigerETL3/jqlTasks.conf” file

Step 4: Test your ETL in manual mode

RUN3 is a variable you should have defined in your TigerEnvironment (if it gives you trouble, you can simply replace it with /home/unix/analytics/TigerETL3/runEtlAgent.sh).

Step 5: Prepare a delta-tracker

Run the following INSERT statement and commit. Make sure you plug your TASK_NAME into the appropriate place. This tracker will drive your ETL in automatic delta-driven mode. Pick the timestamp from which you want your ETL to start off.

INSERT INTO cognos.etl_property VALUES('analytics.tiger.agents.JqlTask.YOUR_TASK_NAME_HERE','2000-Jan-01 00:00:00')
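To make the delta-driven mode concrete, here is a minimal sketch of the cycle this tracker enables: read the stored timestamp, pull only tickets updated since then, then advance the timestamp. The etl_property column names (name/value), the toy tickets, and the fetch callback are illustrative assumptions (not TigerETL's actual internals), and sqlite3 stands in for the Oracle datamart.

```python
# Sketch of a delta-driven ETL cycle keyed off an etl_property watermark.
# All names here are illustrative; sqlite3 is a stand-in for Oracle.
import sqlite3

TASK = "analytics.tiger.agents.JqlTask.YOUR_TASK_NAME_HERE"

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE etl_property (name TEXT PRIMARY KEY, value TEXT)")
# Seed the tracker, like the INSERT above (ISO-ish timestamps sort as strings).
db.execute("INSERT INTO etl_property VALUES (?, ?)", (TASK, "2000-01-01 00:00:00"))

def run_delta_cycle(fetch_updated_since):
    """One cycle: read the watermark, pull only newer tickets, advance it."""
    (since,) = db.execute(
        "SELECT value FROM etl_property WHERE name = ?", (TASK,)).fetchone()
    rows = fetch_updated_since(since)          # e.g. JQL: updated >= since
    if rows:
        high_water = max(r["updated"] for r in rows)
        db.execute("UPDATE etl_property SET value = ? WHERE name = ?",
                   (high_water, TASK))
    db.commit()
    return rows

# Toy source data standing in for Jira.
tickets = [
    {"key": "BQMS-1", "updated": "2019-05-01 10:00:00"},
    {"key": "BQMS-2", "updated": "2021-02-03 09:30:00"},
]
pulled = run_delta_cycle(lambda since: [t for t in tickets if t["updated"] > since])
print(len(pulled))   # both tickets are newer than the 2000-01-01 seed -> 2
```

Run a second time, the same call returns nothing: the watermark has advanced past both tickets, which is exactly what spares Jira from "grab-everything" queries.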

Step 6: Test your ETL in delta-driven mode

Step 7: Schedule a cronjob to reach full automation

In order to avoid interference with production, you might want to put it first in your private crontab:

crontab -e

HOST=analytics
MAILTO=atanas@broadinstitute.org
PATH=/bin:/broad/software/free/Linux/redhat_7_x86_64/pkgs/jdk1.8.0_121/bin:$PATH
TIGER_HOME=/home/unix/analytics/TigerETL3
SPARK_HOME=/local/spark-2.3.1-bin-hadoop2.7
RUN=/home/unix/analytics/TigerETL/runEtlAgent2
RUN3=/home/unix/analytics/TigerETL3/runEtlAgent.sh
# +--------- Minute (0-59)                    | Output Dumper: >/dev/null 2>&1
# | +------- Hour (0-23)                      | Multiple Values Use Commas: 3,12,47
# | | +----- Day Of Month (1-31)              | Do every X intervals: */X  -> Example: */15 * * * *  Is every 15 minutes
# | | | +--- Month (1 -12)                    | Aliases: @reboot -> Run once at startup; @hourly -> 0 * * * *;
# | | | | +- Day Of Week (0-6) (Sunday = 0)   | @daily -> 0 0 * * *; @weekly -> 0 0 * * 0; @monthly ->0 0 1 * *;
# | | | | |                                   | @yearly -> 0 0 1 1 *;
# * * * * * COMMAND
#
YOUR-CRONJOB-HERE

Have it running like this for a couple of days, and then move your cronjob to the production crontab.

HAPPY END (smile)

Some thoughts:

  1. The so-called “JQL explosion” (splitting a given field, say “SampleIDs”, into items and combining them with the rest of the fields for a given ticket) seems convenient, however it is very wasteful: all non-exploded fields are duplicated as many times as there are samples. This could lead to performance problems.

An alternative, normalized approach worth looking into is having 2 ETL tasks:

  • The 1st ETL would take care of all non-exploded fields. There would be no explosion (and hence no duplication), and “key” would naturally be a PrimaryKey.

  • The 2nd ETL would include only 2 fields (key and Sample). This would produce a very lean table holding all (key → Sample) links.

Both ETLs could be placed in the same task in the jqlTasks.conf file.

These 2 tables would be nicely equipped with indices/PKs, so JOINs should perform fast.
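The two-table idea can be sketched as follows, with sqlite3 standing in for the datamart; the table and column names (tickets, ticket_samples, ticket_key, etc.) are illustrative, not the actual schema.

```python
# Normalized alternative to the exploded table: one row per ticket, plus a
# lean (ticket -> sample) link table, joined on demand.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE tickets (
  ticket_key TEXT PRIMARY KEY,   -- one row per ticket, no duplication
  summary    TEXT,
  status     TEXT
);
CREATE TABLE ticket_samples (
  ticket_key TEXT REFERENCES tickets(ticket_key),
  sample_id  TEXT
);
CREATE INDEX ticket_samples_idx1 ON ticket_samples(ticket_key);
""")

db.execute("INSERT INTO tickets VALUES ('BQMS-1', 'Exploded sample run', 'Open')")
db.executemany("INSERT INTO ticket_samples VALUES (?, ?)",
               [("BQMS-1", s) for s in ("SM-001", "SM-002", "SM-003")])

# The exploded view is recovered with an indexed JOIN; the wide ticket
# fields are stored once instead of once per sample.
rows = db.execute("""
SELECT t.ticket_key, t.summary, t.status, s.sample_id
FROM tickets t JOIN ticket_samples s ON s.ticket_key = t.ticket_key
ORDER BY s.sample_id
""").fetchall()
for row in rows:
    print(row)   # three (ticket, sample) rows
```

Compared with the exploded table, the wide ticket fields (summary, status, …) are stored once per ticket rather than once per sample, and the index on the link table keeps the JOIN cheap.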

2.
