
Qlik Replicate: You’re trapped in a Docker container now!

In Qlik Replicate, tasks can be left unable to resume after a nasty server failure (for instance, the CrowdStrike outage in July 2024).

The only tasks impacted are those replicating from an RDBMS to a cloud storage system like AWS S3 or GCS.

In the task log, the error message takes the form:

00002396: 2022-08-26T15:21:14 [AT_GLOBAL ]E: Json doesn't start with '{' [1003001] (at_cjson.c:1773)
00002396: 2022-08-26T15:21:14 [AT_GLOBAL ]E: Cannot parse json: [1000251] (at_protobuf.c:1420)

This error gives us real problems. I can’t resume the task, as the error reappears. I can’t even start it from the stream position, and must rely on restarting the QR task from a timestamp – which is dangerous, with the chance of missing data in that split second.

I suspect the problem is that the “staging” file on the QR server gets corrupted mid-write when the server fails, and on resume QR can’t parse it.
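To see why a half-written staging file would produce exactly that parse error, here is a minimal sketch of the theory; the payload below is illustrative, not Qlik’s actual staging format.

```python
import json

# Illustrative payload only - NOT Qlik's real staging file format.
record = {"stream_position": "0071a49f:000f8e09:001c", "data": {"ACCOUNT_ID": 1}}
payload = json.dumps(record)

# Simulate the server dying halfway through the write.
half_written = payload[: len(payload) // 2]

try:
    json.loads(half_written)
except json.JSONDecodeError as err:
    print("Cannot parse json:", err)
```

Any truncation that cuts the payload mid-object leaves invalid JSON behind, which lines up with the “Cannot parse json” entries in the task log.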

But recreating the problem in a safe environment to diagnose it is tricky.  Our DTL environment doesn’t generate enough traffic to trigger the issue, and I don’t want to be abruptly turning off our DTL QR servers and interrupting other people’s testing.  As for recreating the problem in production – the pain of all the red tape is not worth the effort.

I needed a safer space to work in: a space where I could pump large volumes of data through QR and kick the QR service around trying to provoke the error.  Armed with my little Linux VM, Docker containers were the answer.

CentOS? Why CentOS?

My goal was to build a Docker container with Qlik Replicate and the Postgres drivers so I could use it on my Linux VM.

Under its support articles, Qlik has a guide on how to run Qlik Replicate in a Docker container.

Following the instructions, I ran into some initial problems.  The first major one was the CentOS Docker image.  I must use the packages in my company’s Artifactory and not external packages; and although Artifactory had the CentOS base image, there were no other CentOS packages available to update and install.  Since my VM cannot reach http://vault.centos.org, the CentOS image was a lame duck.

With CentOS off the cards, I had to use the Red Hat image that my company provided.  With Red Hat, the Artifactory had all the packages I needed.

The second problem was that I wanted to use the 2023.11 image to match our environment.  With 2023.11 there are some extra steps needed in the Dockerfile compared to 2024.05.  The differences are noted in Qlik’s support article.

The Dockerfile

Here is the Dockerfile:

FROM my.companys.repo/redhat/ubi9


ENV QLIK_REPLICATE_BASE_DIR=/opt/attunity/replicate/
ENV ReplicateDataFolder=/replicate/data
ENV ReplicateAdminPassword=AB1gL0ngPa33w0rd
ENV ReplicateRestPort=3552
ENV LicenseFile=/tmp/replicate_license_exp2025-06-29_ser60038556.txt

# Copy across installation packages and licenses
ADD postgresql*.rpm /tmp/
ADD areplicate-*.rpm /tmp/
ADD systemctl /usr/sbin
ADD replicate_license_exp2025-06-29_ser60038556.txt /tmp/

# Update packages
RUN dnf -y update
RUN dnf makecache

# To get ps command
RUN dnf -y install procps-ng
RUN dnf -y install unixODBC unzip
RUN dnf -y install libicu.x86_64
RUN rm -f /etc/odbcinst.ini

# Install the Postgres packages
RUN rpm -ivh /tmp/postgresql13-libs-13.9-1PGDG.rhel9.x86_64.rpm
RUN rpm -ivh /tmp/postgresql13-odbc-13.02.0000-2PGDG.rhel9.x86_64.rpm
RUN rpm -ivh /tmp/postgresql13-13.9-1PGDG.rhel9.x86_64.rpm

ADD odbcinst.ini /etc/

# Installing Qlik Replicate
RUN systemd=no yum -y install /tmp/areplicate-2023.11.0-468.x86_64.rpm
RUN yum clean all
RUN rm -f /tmp/areplicate-*.rpm

# RUN export does not persist between layers; set the variable with ENV instead
ENV LD_LIBRARY_PATH=/opt/attunity/replicate/lib
RUN echo "export LD_LIBRARY_PATH=/usr/pgsql-13/lib:\$LD_LIBRARY_PATH" >> /opt/attunity/replicate/bin/site_arep_login.sh

ADD start_replicate.sh /opt/attunity/replicate/bin/start_replicate.sh
RUN chmod 775 /opt/attunity/replicate/bin/start_replicate.sh
RUN chown attunity:attunity /opt/attunity/replicate/bin/start_replicate.sh
RUN echo "source $QLIK_REPLICATE_BASE_DIR/bin/arep_login.sh" >> ~attunity/.bash_profile
ENTRYPOINT /opt/attunity/replicate/bin/start_replicate.sh ${ReplicateDataFolder} ${ReplicateAdminPassword} ${ReplicateRestPort} ${LicenseFile} ; tail -f /dev/null

The postgres packages can be obtained from https://download.postgresql.org/pub/repos/yum/13/redhat/rhel-9-x86_64/

The file odbcinst.ini content is:

[PostgreSQL]
Description = ODBC for PostgreSQL
Driver      = /usr/lib/psqlodbcw.so
Setup       = /usr/lib/libodbcpsqlS.so
Driver64    = /usr/pgsql-13/lib/psqlodbcw.so
Setup64     = /usr/lib64/libodbcpsqlS.so
FileUsage   = 1

The systemctl file is a stub that replaces the real systemctl, since there is no systemd inside the container for the Qlik RPM to talk to:

# Run LS command - remove this line 
ls

And of course you need the RPM for Qlik Replicate and your license file.

Once the Dockerfile and supporting files are collated in a directory, build the image with:

docker build --no-cache -t ccc/replicate:2023.11 .

If all goes well, a Docker image will be built and ready to be used.

Docker Compose

To make running the Docker image easier, create a Docker Compose file:

version: '3.3'

services:
  replicate:
    image: docker.io/ccc/replicate:2023.11
    container_name: replicate_2023_11
    ports: 
      - "3552:3552"

    environment:
      - ReplicateRestPort=3552
      - TZ=Australia/Melbourne

    volumes:
      - /dockermount/data/replicate/data:/replicate/data

    extra_hosts:
      - host.docker.internal:host-gateway

volumes:
  replicate:

Save the docker-compose.yml in a directory and, from that directory, start the container with:

docker-compose up -d

If everything is working, docker ps will show the running container:

docker ps

So far, so good. Further confirmation can be had by connecting into the container and observing the QR processes:

docker exec -it qr_container_id bash
ps -aux

There should be two main processes, plus a process for each individual QR task running.

With everything confirmed, the QR console can be accessed from a browser:

https://127.0.0.1:3552/attunityreplicate/

Qlik Replicate – Quickly backing up the Data directory

Just a short post on a trick I learnt when upgrading a whole lot of QR boxes on Windows.

As part of the upgrade process, you need to back up the data directory in case you need to back out the upgrade.

The Data directory by default is located in:

C:\Program Files\Attunity\Replicate\data\

Using the inbuilt Windows compression tool seems (to me) slow and sometimes unreliable, especially if a large amount of data was staged when QR was turned off for the upgrade.

Then I remembered that 7-Zip can be used from the command line.  An added feature is that 7-Zip can exclude certain directories, so the often large “logs” directory can be skipped:

7z a -r -t7z "C:\Program Files\Attunity\Replicate\data.7z" "C:\Program Files\Attunity\Replicate\data\*.*" -xr!logs

This command can also be used when automating the QR upgrade.
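For cross-checking the archive logic off the Windows box, here is a rough Python equivalent of the exclude-the-logs idea using only the standard library; the paths and the helper name are mine, not part of any upgrade tooling.

```python
import os
import zipfile

def archive_data_dir(src: str, dest_zip: str, exclude_dirs: set[str]) -> None:
    """Zip up src recursively, skipping any directory named in exclude_dirs."""
    with zipfile.ZipFile(dest_zip, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, dirs, files in os.walk(src):
            # Pruning dirs in place stops os.walk() descending into them,
            # mirroring 7-Zip's -xr!logs switch.
            dirs[:] = [d for d in dirs if d not in exclude_dirs]
            for name in files:
                full = os.path.join(root, name)
                zf.write(full, os.path.relpath(full, src))

# e.g. archive_data_dir(r"C:\Program Files\Attunity\Replicate\data",
#                       "data.zip", {"logs"})
```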

Running Postgres and pgadmin through Docker on a Dodgy Linux VM

In the organisation I work in, I have a good old Windows 10 ThinkPad that has been my workhorse for many years.

It does the job, works happily with our on-prem apps, and I can do most of my role’s development on it.

But there are areas where the workhorse cannot help me out.  For instance, I needed to develop a JavaScript function on a Postgres database to split a field into different elements.  I do not have access to develop on the target database, so I turned to Docker to run containerised versions of Postgres and pgadmin as a safe area to play in.

The dreaded Linux VM

The “cool” developers have access to Macs to run their DevOps tools on. 

I have a Linux VM running Ubuntu 20.04.

It loads slowly, it runs slowly, and the supporting VM application hogs a significant amount of the available memory, leaving little for me.

But it does allow me to run Docker containers.

The first container I created broke the VM.  The VM support team speculated that a port used by Postgres or pgadmin clashed with the organisation’s VM ports.  They rebuilt my VM and I tried again.

docker-compose.yml

This is my docker-compose.yml file for Postgres and pgadmin:

version: '3.3'

services:
  db:
    #image: postgres
    image: clkao/postgres-plv8
    container_name: local_pgdb
    restart: always
    ports:
      - "9432:5432"
    environment:
      - POSTGRES_PASSWORD=verystrongpassword
      - POSTGRES_USER=jonny 
      - POSTGRES_DB=work

    volumes:
      #- ~/apps/postgres:/var/lib/postgresql/data
      - ~/apps/postgres-plv8:/var/lib/postgresql/data
    
  pgadmin:
    image: dpage/pgadmin4
    container_name: pgadmin4_container
    restart: always
    ports:
      - "9888:80"
      - "9443:443"
   
    environment:
       PGADMIN_DEFAULT_EMAIL: jonny@craftcookcode.com
       PGADMIN_DEFAULT_PASSWORD: verystrongpassword
       
       # Fix for IPv6-disabled systems https://stackoverflow.com/questions/68766411/pgadmin-4-in-docker-failed-with-gunicorn-server-start-error
       PGADMIN_LISTEN_ADDRESS: 0.0.0.0

    volumes:
      - ~/apps/pg_admin/pgadmin-data:/var/lib/pgadmin

volumes:
  local_pgdata:

There are a couple of changes from the boilerplate docker-compose.yml files on the internet:

  • The ports are mapped to non-standard ports. This is to avoid any potential problems with ports conflicting with the VM software
  • I had to change the volumes to my home drive due to security settings on my VM

Errors, Problems and Issues (Oh my)

When initially running docker-compose, I got the following error and pgadmin wouldn’t start:

pgadmin4_container | ERROR  : Failed to create the directory /var/lib/pgadmin/sessions:
pgadmin4_container |            [Errno 13] Permission denied: '/var/lib/pgadmin/sessions'
pgadmin4_container | HINT   : Create the directory /var/lib/pgadmin/sessions, ensure it is writeable by
pgadmin4_container |          'pgadmin', and try again, or, create a config_local.py file
pgadmin4_container |          and override the SESSION_DB_PATH setting per
pgadmin4_container |          https://www.pgadmin.org/docs/pgadmin4/8.9/config_py.html

This issue was resolved, via a Stack Overflow thread, by changing the ownership of the pgadmin volume directory to UID 5050:

sudo chown -R 5050:5050 ~/apps/pg_admin

The next error I hit was a “Can’t connect to (‘::’, 80)” error in pgadmin:

pgadmin4_container | [2024-07-09 05:25:58 +0000] [1] [INFO] Starting gunicorn 22.0.0
pgadmin4_container | [2024-07-09 05:25:58 +0000] [1] [ERROR] Retrying in 1 second.
pgadmin4_container | [2024-07-09 05:25:59 +0000] [1] [ERROR] Retrying in 1 second.
pgadmin4_container | [2024-07-09 05:26:00 +0000] [1] [ERROR] Retrying in 1 second.
pgadmin4_container | [2024-07-09 05:26:01 +0000] [1] [ERROR] Retrying in 1 second.
pgadmin4_container | [2024-07-09 05:26:02 +0000] [1] [ERROR] Retrying in 1 second.
pgadmin4_container | [2024-07-09 05:26:03 +0000] [1] [ERROR] Can't connect to ('::', 80)

Again Google and Stack Overflow came to the rescue. The issue can be caused by IPv6 being disabled on the VM. I added the following line to the yml file:

PGADMIN_LISTEN_ADDRESS: 0.0.0.0

This resolved the issue and now pgadmin could start up.

Inside pgadmin

When I got inside pgadmin, for the life of me I couldn’t connect to the Postgres database.

I could see that the Postgres container was running with no errors. I could see the remapped ports. I could connect to Postgres with psql. Why couldn’t I connect to Postgres in pgadmin?

Frustrated and tired after a long day of work, I walked away from the computer.

When I got back after a walk around the block and a cup of tea – I could now see the problem and the solution:

Initially I entered 127.0.0.1 as the “Hostname” and 9432 as the port, because that is how I would connect to a Postgres database running locally.

But because pgadmin accesses Postgres from within the Docker network, it needs the container’s internal port 5432 instead of 9432, and the container name local_pgdb instead of 127.0.0.1.

From outside Docker, I would use localhost and port 9432. For instance, I imported some data to develop off:

psql --host localhost --port 9432 --username jonny -d work -f ~/some_postgres_data.dmp

Once I entered the right details, pgadmin connected fine and I could develop the Postgres function in a safe area.

Qlik Replicate – Fight of the filters. Who will prevail?

Background

The business is doing their best to keep me on my toes with Qlik Replicate, finding new ways to bend and stretch the system, and consequently my sanity.

The initial request was, “Can we overwrite this field in a Qlik Replicate task with a SOURCE_LOOKUP?”

OK – we can do this.  I abhor putting ETL logic in Qlik Replicate tasks, wanting to keep them as simple as possible and let the power and flexibility of the downstream systems manipulate the data.

But project timelines were pressing, and I complied with their request.

Later, they came back and requested a filter on the derived field in question.

And that led to me and our tech business analyst scratching our heads.

If we apply a filter to our focus field, will Qlik use the raw field that is in the table? Or will it use the new lookup field with the same name?

Testing the filters – setting up

To start with, some simple tables in MS-SQL:

CREATE TABLE dbo.TEST_LOOKUP
(
	ACCOUNT_ID INT PRIMARY KEY,
	FRUIT_ID INT,
	FRUIT_NAME VARCHAR(100),
	SOURCE_NAME VARCHAR(100)
);

GO

CREATE TABLE dbo.FRUITS
(
	FRUIT_ID INT PRIMARY KEY,
	NEW_FRUIT_NAME VARCHAR(100)
)

GO

INSERT INTO dbo.FRUITS VALUES(1, 'NEW APPLES');
INSERT INTO dbo.FRUITS VALUES(2, 'NEW ORANGES');

A simple Qlik Replicate task was created to replicate from the table dbo.TEST_LOOKUP.

All columns were brought across except FRUIT_NAME, which will be overwritten with the source lookup:

source_lookup('NO_CACHING','dbo','FRUITS','NEW_FRUIT_NAME','FRUIT_ID =?',$FRUIT_ID)

To test, a simple insert was run to ensure that the source lookup was working correctly:

INSERT INTO dbo.TEST_LOOKUP VALUES(1, 1, 'OLD APPLES', 'Truck');

Result:

{
    "magic": "atMSG",
    "type": "DT",
    "headers": null,
    "messageSchemaId": null,
    "messageSchema": null,
    "message": {
        "data": {
            "ACCOUNT_ID": 1,
            "FRUIT_ID": 1,
            "SOURCE_NAME": "Truck",
            "FRUIT_NAME": "NEW APPLES"
        },
        "beforeData": null,
        "headers": {
            "operation": "INSERT",
            "changeSequence": "20240529060703760000000000000000005",
            "timestamp": "2024-05-29T06:07:03.767",
            "streamPosition": "0071a49f:000f8e09:001c",
            "transactionId": "6EDBA1FA0E0000000000000000000000",
            "changeMask": "0F",
            "columnMask": "0F",
            "transactionEventCounter": 1,
            "transactionLastEvent": true
        }
    }
}

Everything is working correctly; FRUIT_NAME got overwritten with “NEW APPLES” in the JSON output.

Testing the filters – placing bets

In the CDC task, a new filter was added:

$FRUIT_NAME == 'NEW ORANGES'

And the following SQL statement was run on the source system:

INSERT INTO dbo.TEST_LOOKUP VALUES(2, 2, 'OLD ORANGES', 'Fridge');

So – if Qlik Replicate filters on the base table’s field, the change WILL NOT be replicated through.

Conversely, if Qlik Replicate uses the new derived field for the filter, the change WILL come through.

And the results are…

{
    "magic": "atMSG",
    "type": "DT",
    "headers": null,
    "messageSchemaId": null,
    "messageSchema": null,
    "message": {
        "data": {
            "ACCOUNT_ID": 2,
            "FRUIT_ID": 2,
            "SOURCE_NAME": "Fridge",
            "FRUIT_NAME": "NEW ORANGES"
        },
        "beforeData": null,
        "headers": {
            "operation": "INSERT",
            "changeSequence": "20240529061434050000000000000000065",
            "timestamp": "2024-05-29T06:14:34.050",
            "streamPosition": "0071a49f:000f901a:0005",
            "transactionId": "28DCA1FA0E0000000000000000000000",    
            "changeMask": "17",
            "columnMask": "17",
            "transactionEventCounter": 1,
            "transactionLastEvent": true
        }
    }
}

Qlik will use the derived source lookup field over the original field in the table.

Conclusion

Once again this highlights the danger of putting ETL code into Qlik Replicate tasks. It obscures business rules and can lead to confusion in operations like the scenario above.

It is best to use Qlik Replicate to get the data out of the source database as quickly and as pure as possible and then use the power of the downstream systems to manipulate the data.

Qlik Replicate – The saga of replicating to AWS Part 5 – Is MS-SQL the answer?

For those who are following along at home…

We have been toiling at replicating to AWS RDS Postgres with Qlik Replicate for the past two months, trying to achieve a baseline of 300 TPS.

After many suggestions, tuning and tweaks, conference calls, benchmarks, learning experiences and prayers, we couldn’t get our TPS close to our baseline.

You can read the main findings in the following pages:

Things that were suggested to us that I did not add to this blog series:

  • Try Async commits on RDS Postgres.  Tried it and got negligible increases.
  • Shifting the QR server to AWS.  Bad idea, as there would be even more traffic from the busy DB2 database going across for QR to consume.
  • Use Amazon Aurora instead of RDS.  The downstream developers did not have the appetite to try Aurora; especially with the issues leaning towards network speed.
  • Use GCP version of Postgres instead of AWS.  The downstream developers did not want to commit to another cloud provider.

The problem is how the network behaves with the Postgres ODBC driver and the round trips it must make between our location and the AWS data centre.  We can try – but we are bound by the laws of physics and the speed of light.

Decisions to be made.

All through our benchmarking and investigation, we had been replicating from the same DB2 source to a development MS-SQL database in the data centre in parallel, to give us an idea of what speed we could potentially reach.  Without triggers on the MS-SQL destination table, we were easily hitting 300 TPS.  Ramping the changes up, we could hold 1K TPS with no creep in latency.

We were happy with these results, especially as the MS-SQL database was just a small underpowered shared dev machine, not a full-blown dedicated server.

It took a brave solution architect to propose shifting from AWS RDS Postgres to an on-prem MS-SQL server, especially when our senior management strategy is to push everything to the cloud and reduce the number of on-prem servers.

In the end, with all our evidence on performance and the project’s willingness to push on with the proposed solution, the stakeholders agreed to move the destination to an on-prem database.

They initially wanted us to go with an on-prem Postgres database; but since all our database administrators are either Oracle or MS-SQL experts, and we have no Postgres experts, we went with good old MS-SQL.

It worked; but…damn triggers.

I volunteered to convert the Postgres SQL code into T-SQL, as I have worked with T-SQL for the past decade.  The conversion went smoothly, and I took the opportunity to optimise several sections of the code to make the solution more maintainable and faster.

With our new MS-SQL database all coded up and the triggers turned off, the SVT (stress and volume testing) ran at the TPS we were aiming for.

But when we turned the triggers on, performance absolutely crashed.

I was mortified – was my coding shot?  Had the additional changes I made worsened the performance?

I checked the triggers.  I checked the primary keys and the joins.  I checked the horizontal partitioning.  I checked the database server stats for CPU and memory usage. 

Nothing – I could not locate the performance problem.

I went back to Qlik Replicate and examined the log files.

Ahh – here is something: the log file was full of entries like this:

00012472: 2024-05-27T16:03:54 [TARGET_APPLY ]W: Source changes that would have had no impact were not applied to the target database. Refer to the 'attrep_apply_exceptions' table for details (endpointshell.c:7632)

Looking inside attrep_apply_exceptions, there were corresponding entries like:

UPDATE [dbo].[TEST_DESTINATION] 
SET	[ACCOUNT_ID] = 2,
	[DATA_1] = 'Updated', 
	[DATA_2] = 'Data' 
WHERE 
	[ACCOUNT_ID] = 2;
	-- 0 rows affected

This was confusing: I checked the destination table, and the update had been applied.  Why was this update deemed a failure and logged to the attrep_apply_exceptions table?  It had to be an error in the trigger.

The cause of the problem

Our code can be paraphrased like this:

CREATE TABLE dbo.TEST_DESTINATION
(
	ACCOUNT_ID int NOT NULL,
	DATA_1 varchar(100) NULL,
	DATA_2 varchar(100) NULL,
	PRIMARY KEY CLUSTERED 
	(
		ACCOUNT_ID ASC
	)
);

GO

CREATE TABLE dbo.TEST_MERGE_TABLE
(
	ACCOUNT_ID int NOT NULL,
	DATA_1 varchar(100) NULL,
	DATA_2 varchar(100) NULL,
	DATA_3 varchar(100) NULL,
	PRIMARY KEY CLUSTERED 
	(
		ACCOUNT_ID ASC
	)
)

GO

CREATE OR ALTER TRIGGER dbo.TR_TEST_DESTINATION__INSERT
ON dbo.TEST_DESTINATION
AFTER INSERT 
AS
	INSERT INTO dbo.TEST_MERGE_TABLE
	SELECT
		ACCOUNT_ID,
		DATA_1,
		DATA_2,
		'TRIGGER INSERT' AS DATA_3
	FROM INSERTED;

GO

CREATE OR ALTER TRIGGER [dbo].[TR_TEST_DESTINATION__UPDATE]
ON [dbo].[TEST_DESTINATION]
AFTER UPDATE 
AS
	UPDATE dbo.TEST_MERGE_TABLE
	SET ACCOUNT_ID = X.ACCOUNT_ID,
		DATA_1 = X.DATA_1,
		DATA_2 = X.DATA_2,
		DATA_3 = 'TRIGGER DATA'
	FROM dbo.TEST_MERGE_TABLE T
	JOIN INSERTED X
	ON X.ACCOUNT_ID = T.ACCOUNT_ID
	WHERE
		1 = 0;  -- This predicate can be either true or false.  For example we set it false

GO

The problem is in how the trigger TR_TEST_DESTINATION__UPDATE behaves if it affects 0 rows.  This can be a legitimate occurrence, depending on the join in the trigger.

If I run a simple update like:

UPDATE dbo.TEST_DESTINATION
SET DATA_1 = 'Trigger Update'
WHERE
	ACCOUNT_ID = 1;

The SQL engine returns:

(0 row(s) affected)    -- Returned from the trigger

(1 row(s) affected)    -- Returned from updating dbo.TEST_DESTINATION

My theory: when Qlik Replicate reads the row counts returned from executing the SQL statement on the destination server, it only considers the first count to determine whether the change succeeded.  Since the first count is the output from the trigger, with 0 rows affected, Qlik considers the update a failure and logs it to the attrep_apply_exceptions table.
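The hypothesis can be sketched as a toy model; the function below is my guess at the applier’s behaviour, not Qlik’s documented logic.

```python
def change_applied(rowcounts: list[int]) -> bool:
    """Hypothetical success check: only the FIRST 'rows affected' count is read."""
    return rowcounts[0] > 0

# Without the trigger, the UPDATE's own row count comes back first.
print(change_applied([1]))     # applied as expected

# With the AFTER UPDATE trigger matching no rows, the trigger's count (0)
# is emitted first, before the real UPDATE's count (1).
print(change_applied([0, 1]))  # treated as a failure
```

Under this model the change is wrongly classed as having no impact and ends up in attrep_apply_exceptions, even though the destination row was updated.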

Apart from this being incorrect – the trigger code is logically working correctly – Qlik Replicate must make another trip to write to the exception table.  This resulted in drastically increased latency.

Fixing the issue

The fix (once the problem is known) is relatively straightforward: suppress the row counts returned from the trigger. For example:

CREATE OR ALTER TRIGGER [dbo].[TR_TEST_DESTINATION__UPDATE]
ON [dbo].[TEST_DESTINATION]
AFTER UPDATE 
AS
BEGIN
	SET NOCOUNT ON;  -- Suppress returning the row count

	UPDATE dbo.TEST_MERGE_TABLE
	SET ACCOUNT_ID = X.ACCOUNT_ID,
		DATA_1 = X.DATA_1,
		DATA_2 = X.DATA_2,
		DATA_3 = 'TRIGGER DATA'
	FROM dbo.TEST_MERGE_TABLE T
	JOIN INSERTED X
	ON X.ACCOUNT_ID = T.ACCOUNT_ID
	WHERE
		1 = 0;  -- This predicate can be either true or false
END

When the update statement is run again, the following is returned:

(1 row(s) affected)

Qlik Replicate will now consider the update as a success and not log it as an exception.

Qlik Replicate – The saga of replicating to AWS Part 4 – Does stream size matter?

Continuing our ongoing Qlik Replicating story of trying to replicate a DB2/zOS database to AWS RDS Postgres.

We made small improvements, but nothing substantial enough to reach the TPS we were aiming for.  I was at the end of what I knew, and decided to reach out for professional help.

We have a support relationship with IBT, who helped us with the initial setup of QR in our organisation.  Recently we had been resolving our own problems and had not been using their help, but now was a suitable time to ask for it.

IBT has always been helpful when we have asked for assistance.  Another handy aspect of the relationship is that IBT has a direct support line to Qlik: if they don’t know the answer, they can get it easily from Qlik.

IBT asked us to collect the usual data: diagnostic logs, source and target DB metrics, and QR server core metrics.  Nothing looked under duress, so IBT dived into the nitty-gritty details of the diagnostic packs.

Their techs noticed that our outgoing stream buffers were full.  This means that changes were coming in faster than they were being sent out to the destination.  IBT suggested increasing the size of the outgoing stream.

Without going through the details of this step; here is a Qlik knowledge base article on Increasing the outgoing stream queue.

“She’s breaking up – I can’t hold her”

We started off with:

"stream_buffers_number" : 5,
"stream_buffer_size" : 10,

It was a marginal improvement – a thimbleful of extra performance.  Still nowhere near the TPS we needed.

IBT asked us to increase the two variables in small increments: “stream_buffers_number” + 5 and “stream_buffer_size” + 10.  With each increase there was a minuscule improvement.
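For illustration, the first couple of steps in that sequence would look like this (values only; the surrounding task settings are unchanged):

```
"stream_buffers_number" : 10,
"stream_buffer_size" : 20,

"stream_buffers_number" : 15,
"stream_buffer_size" : 30,
```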

But more worryingly, with each increase the QR task used more memory, to the point that increasing the buffer sizes was unsustainable with the resources on the server.  Even if the relationship between the buffer variables and the TPS gained were linear, we would need a very beefy server to reach 300 TPS.

So again, it was a little gain; and with all the “little gains” added over the past few fix iterations, we were still no closer to our needed 300 TPS.

Increasing the buffer variables might be helpful if you are close to your TPS target and trying to get over the last hurdle.  But since we were so far behind, we had to look for another solution.

Qlik Replicate – The saga of replicating to AWS Part 3 – Wireshark!

Continuing on the story

After concluding that the low TPS was not the result of poor query performance, our attention turned to the network latency between our on-prem Qlik system and the AWS RDS database.

First, I asked the networks team if there were any suspect networking components between our on-premises Qlik server and the AWS DB – anything like IPS, QoS or bandwidth-limiting components that could explain the slowdown.

I also asked the cloud team if they could find anything as well.

It was a long shot; but since they are the SMEs in the area, it was worth asking the question.

As expected, they did not find anything.

But the network team did come back with a couple of pieces of information:

  • The network bandwidth to AWS was wide enough, and we were not reaching its capacity.
  • It is a 16ms – 20ms round trip from our data centre to the AWS data centre.

Location… Location…

Physically, the distance to the AWS data centre is 700km.

AWS has set up a closer data centre in the past few years, only 130km away; unfortunately, we are not currently set up to use this new region.

The network team gave me permission to install Wireshark on our on-prem Qlik server and our AWS EC2 Qlik server.

From both servers I connected to the AWS RDS database with psql and updated one row, capturing the traffic with Wireshark.

I lined up the two results from the different servers to see if there was anything obvious.

Wireshark results

The Wireshark display filter used:

(ip.src == ip.of.qlik.server and ip.dst == ip.of.aws.rds) or (ip.src == ip.of.aws.rds and ip.dst == ip.of.qlik.server)

SEQ | Source      | Destination | Proto   | Len  | Info                                                                                     | On Prem→RDS (s) | EC2→RDS (s) | Diff (s) | % diff
1   | Qlik server | RDS DB      | TCP     | 66   | 58313 > 5432 [SYN, ECE, CWR] Seq=0 Win=64240 Len=0 MSS=1460 WS=256 SACK_PERM             | 0.000 | 0.000 | 0.000  | 0%
2   | RDS DB      | Qlik server | TCP     | 66   | 5432 > 58313 [SYN, ACK] Seq=0 Ack=1 Win=26883 Len=0 MSS=1460 SACK_PERM WS=8              | 0.019 | 0.001 | 0.018  | 10%
3   | Qlik server | RDS DB      | TCP     | 54   | 58313 > 5432 [ACK] Seq=1 Ack=1 Win=262656 Len=0                                          | 0.000 | 0.000 | 0.000  | 0%
4   | Qlik server | RDS DB      | PGSQL   | 62   | >?                                                                                       | 0.000 | 0.005 | -0.005 | -3%
5   | RDS DB      | Qlik server | TCP     | 60   | 5432 > 58313 [ACK] Seq=1 Ack=9 Win=26888 Len=0                                           | 0.018 | 0.000 | 0.018  | 10%
6   | RDS DB      | Qlik server | PGSQL   | 60   | <                                                                                        | 0.001 | 0.001 | 0.000  | 0%
7   | Qlik server | RDS DB      | TLSv1.3 | 343  | Client Hello                                                                             | 0.004 | 0.004 | 0.001  | 0%
8   | RDS DB      | Qlik server | TLSv1.3 | 220  | Hello Retry Request                                                                      | 0.021 | 0.001 | 0.021  | 12%
9   | Qlik server | RDS DB      | TLSv1.3 | 455  | Change Cipher Spec, Client Hello                                                         | 0.003 | 0.001 | 0.002  | 1%
10  | RDS DB      | Qlik server | TLSv1.3 | 566  | Server Hello, Change Cipher Spec                                                         | 0.023 | 0.005 | 0.019  | 11%
11  | RDS DB      | Qlik server | TCP     | 1514 | 5432 > 58313 [ACK] Seq=680 Ack=699 Win=29032 Len=1460 [TCP segment of a reassembled PDU] | 0.000 | 0.000 | 0.000  | 0%
12  | RDS DB      | Qlik server | TCP     | 1514 | 5432 > 58313 [ACK] Seq=2140 Ack=699 Win=29032 Len=1460 [TCP segment of a reassembled PDU]| 0.000 | 0.000 | 0.000  | 0%
13  | RDS DB      | Qlik server | TCP     | 1514 | 5432 > 58313 [ACK] Seq=3600 Ack=699 Win=29032 Len=1460 [TCP segment of a reassembled PDU]| 0.000 | 0.000 | 0.000  | 0%
14  | RDS DB      | Qlik server | TLSv1.3 | 394  | Application Data                                                                         | 0.000 | 0.000 | 0.000  | 0%
15  | Qlik server | RDS DB      | TCP     | 54   | 58313 > 5432 [ACK] Seq=699 Ack=5400 Win=262656 Len=0                                     | 0.000 | 0.000 | 0.000  | 0%
16  | Qlik server | RDS DB      | TLSv1.3 | 112  | Application Data                                                                         | 0.003 | 0.002 | 0.001  | 1%
17  | Qlik server | RDS DB      | TLSv1.3 | 133  | Application Data                                                                         | 0.000 | 0.000 | 0.000  | 0%
18  | RDS DB      | Qlik server | TCP     | 60   | 5432 > 58313 [ACK] Seq=5400 Ack=836 Win=29032 Len=0                                      | 0.018 | 0.000 | 0.018  | 10%
19  | RDS DB      | Qlik server | TLSv1.3 | 142  | Application Data                                                                         | 0.001 | 0.008 | -0.007 | -4%
20  | RDS DB      | Qlik server | TLSv1.3 | 135  | Application Data                                                                         | 0.006 | 0.003 | 0.003  | 2%
21  | Qlik server | RDS DB      | TCP     | 54   | 58313 > 5432 [ACK] Seq=836 Ack=5569 Win=262400 Len=0                                     | 0.000 | 0.001 | -0.001 | 0%
22  | Qlik server | RDS DB      | TLSv1.3 | 157  | Application Data                                                                         | 0.005 | 0.007 | -0.002 | -1%
23  | RDS DB      | Qlik server | TLSv1.3 | 179  | Application Data                                                                         | 0.018 | 0.001 | 0.018  | 10%
24  | Qlik server | RDS DB      | TLSv1.3 | 251  | Application Data                                                                         | 0.011 | 0.000 | 0.011  | 6%
25  | RDS DB      | Qlik server | TLSv1.3 | 147  | Application Data                                                                         | 0.018 | 0.000 | 0.018  | 11%
26  | RDS DB      | Qlik server | TLSv1.3 | 433  | Application Data, Application Data                                                       | 0.000 | 0.000 | 0.000  | 0%
27  | RDS DB      | Qlik server | TLSv1.3 | 98   | Application Data                                                                         | 0.000 | 0.000 | 0.000  | 0%
28  | Qlik server | RDS DB      | TCP     | 54   | 58313 > 5432 [ACK] Seq=1136 Ack=6210 Win=261888 Len=0                                    | 0.000 | 0.000 | 0.000  | 0%
29  | Qlik server | RDS DB      | TLSv1.3 | 93   | Application Data                                                                         | 0.001 | 0.001 | 0.001  | 0%
30  | RDS DB      | Qlik server | TLSv1.3 | 148  | Application Data                                                                         | 0.020 | 0.001 | 0.018  | 11%
31  | RDS DB      | Qlik server | TLSv1.3 | 98   | Application Data                                                                         | 0.000 | 0.000 | 0.000  | 0%
32  | Qlik server | RDS DB      | TCP     | 54   | 58313 > 5432 [ACK] Seq=1175 Ack=6348 Win=261632 Len=0                                    | 0.000 | 0.000 | 0.000  | 0%
33  | Qlik server | RDS DB      | TLSv1.3 | 81   | Application Data                                                                         | 0.000 | 0.000 | 0.000  | 0%
34  | Qlik server | RDS DB      | TLSv1.3 | 78   | Application Data                                                                         | 0.000 | 0.000 | 0.000  | 0%
35  | Qlik server | RDS DB      | TCP     | 54   | 58313 > 5432 [FIN, ACK] Seq=1226 Ack=6348 Win=261632 Len=0                               | 0.000 | 0.000 | 0.000  | 0%
36  | RDS DB      | Qlik server | TCP     | 60   | 5432 > 58313 [ACK] Seq=6348 Ack=1226 Win=30104 Len=0                                     | 0.019 | 0.000 | 0.018  | 11%
37  | RDS DB      | Qlik server | TCP     | 60   | 5432 > 58313 [FIN, ACK] Seq=6348 Ack=1227 Win=30104 Len=0                                | 0.000 | 0.000 | 0.000  | 0%
38  | Qlik server | RDS DB      | TCP     | 54   | 58313 > 5432 [ACK] Seq=1227 Ack=6349 Win=261632 Len=0                                    | 0.000 | 0.000 | 0.000  | 0%

The data from the two captures showed a couple of things:

Firstly, both systems had the same number of events captured by Wireshark.  This indicates that no networking component between source and destination is dropping traffic or doing anything unexpected to the packets.

I cannot say for sure whether anything is timing out on the return trip from the AWS side.

Also, when taking the difference between the on-prem and the EC2 captures, a difference of 18ms keeps popping up.  I believe this is the round trip of the connection.  Since this happens multiple times, our latency compounds into quite a significant value.
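To get a feel for how that 18ms compounds, here is a back-of-envelope sketch; the 18ms figure is from the capture above, while the round-trips-per-change counts are assumptions for illustration.

```python
RTT_SECONDS = 0.018  # observed on-prem <-> AWS round trip

def max_tps(round_trips_per_change: int) -> float:
    """Best-case changes/second if each change waits on N sequential round trips."""
    return 1.0 / (RTT_SECONDS * round_trips_per_change)

# A single synchronous round trip per change already caps throughput at
# around 55 TPS; extra trips (commits, exception logging) push it lower.
print(f"1 trip:  {max_tps(1):.1f} TPS")
print(f"3 trips: {max_tps(3):.1f} TPS")
```

Even the best case sits well under a 300 TPS target, which is why cutting round trips (or moving closer) matters more than tuning.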

What’s next?

I am not a network engineer, so I do not have the knowledge to dive deeper into the Wireshark packets. 

It would be interesting to try the closer AWS data centre to see if reducing the physical distance helps the latency.  But doing this would require effort from the cloud team, and the project budget wouldn't stretch to this piece of work.

Our other option is to reduce the number of round trips from our OnPrem server to the AWS datacentre as much as possible.
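To put numbers on that option, here is a hypothetical sketch: if multiple changes can share one round trip (batching is my assumption here, not something we had implemented), the latency cost is divided by the batch size:

```python
# Latency-bound throughput when batch_size changes share one network round trip.
def batched_tps(round_trip_s: float, batch_size: int) -> float:
    return batch_size / round_trip_s

for batch in (1, 10, 100):
    print(batch, round(batched_tps(0.018, batch)))
# 1   -> ~56 tps
# 10  -> ~556 tps (comfortably past a 300tps target)
# 100 -> ~5556 tps
```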

Qlik Replicate – The saga of replicating to AWS Part 2 – Lions and Tigers and Primary Keys (Oh My!)

The Story so far:

For those who are following from home:

We deduced that replicating through the AWS RDS proxy was a bad idea for performance; but even going directly to the AWS RDS Postgres database from our on-prem system did not give us any noticeable performance gains.

It was time to look in a different area.

Raw SQL

In the squad investigating the problem, someone asked, “What SQL is Qlik running on the Postgres server?  Is it poorly optimised?”

To find out what Qlik Replicate was running, I bumped the TARGET_APPLY logging component up to Verbose, let transactions run through for a couple of minutes and then assessed the log file.  Since 99% of the changes on the source database are updates (we know this from a production job running against the same source database), our logs were full of update statements like this:

UPDATE dbo.my_dest_table
SET	key_1 = ?,
	key_2 = ?,
	key_3 = ?,
	field_1 = ?,
	field_2 = ?,
	field_3 = ?
WHERE
	key_1 = ? AND
	key_2 = ? AND
	key_3 = ?;

The first thing I noticed: Qlik was updating ALL the columns, even if only one column's data had changed on the source database.

This also included the Primary key.

Updating primary key columns is computationally expensive compared to updating your regular day-to-day columns.  As an example, I wrote a simple POC in T-SQL:

/*
CREATE TABLE dbo.my_dest_table
(
	key_1 SMALLDATETIME,
	key_2 INT,
	key_3 VARCHAR(10),

	field_1 INT,
	field_2 VARCHAR(256),
	field_3 SMALLDATETIME,
	PRIMARY KEY(key_1, key_2, key_3)
)
*/

/*
UPDATE dbo.my_dest_table
SET	key_1 = ?,
	key_2 = ?,
	key_3 = ?,
	field_1 = ?,
	field_2 = ?,
	field_3 = ?
WHERE
	key_1 = ? AND
	key_2 = ? AND
	key_3 = ?;
*/

/*
INSERT INTO dbo.my_dest_table VALUES ('2024-01-01', 1, 'key 3', 1, 2, '2023-03-03');
INSERT INTO dbo.my_dest_table VALUES ('2024-01-01', 2, 'key 3', 2, 3, '2024-04-04');
*/

SET STATISTICS TIME ON;

UPDATE dbo.my_dest_table
SET key_1 = '2024-01-01',
	key_2 = 1,
	key_3 = 'key 3',

	field_1 = 10,
	field_2 = 2,
	field_3 = '2023-03-03'
WHERE
	key_1 = '2024-01-01' AND
	key_2 = 1 AND
	key_3 = 'key 3';
/*
SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 0 ms.

 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 41 ms.  <------- Oh dear

(1 row affected)
*/

UPDATE dbo.my_dest_table
SET field_1 = 10,
	field_2 = 3,
	field_3 = '2024-04-04'
WHERE
	key_1 = '2024-01-01' AND
	key_2 = 2 AND
	key_3 = 'key 3';
/*
SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 0 ms.

 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 6 ms.

(1 row affected)
*/

As demonstrated, updating the PK columns (even with the same values) was roughly seven times more expensive (41ms vs 6ms elapsed) than updating only the regular data columns.

Asking for help

I asked for help on the Qlik Community forum and got a response from the specialists: there is a setting on Qlik's endpoints that stops it updating the primary key.

The setting is:

$info.query_syntax.pk_segments_not_updateable

Set to TRUE

I made this change in the endpoint, restarted the task and captured a few more update SQL samples. With the setting in place, the generated updates no longer set the PK columns.

BUT…

This is a helpful setting and worth noting for the future, but in this case it did not really help us out.  After making the change we saw no noticeable difference in our TPS throughput, further confirming it was a network issue.

I could also see other problems with this approach.

The source system we read from is exceptionally large, has been entrenched in our organisation for 30 years, and there is constant background work as other core systems are merged into it.  As a downstream user, we cannot guarantee the rules around updating primary keys.  The gut feeling from day-to-day operation is that once the PKs are set, they are set in stone in the database.  But basing the data integrity of critical reporting on a "gut feeling" is not a good data integrity model.

Who is to say the owners of the source system won't update a PK to fix an incident, or that a new source system consolidation won't include PK updates in its steps?  Our organisation's change management is not tight and detailed enough for us to be forewarned of changes like this.

If data integrity were not so critical, we could run a full refresh periodically to reset the data to a baseline and capture PK updates.  But that would require significant rework of the solution to manage these periodic refreshes.

So; we did learn.  We did assess.  But this didn’t help us out.

Qlik Replicate – The saga of replicating to AWS Part 1 – pg_bench

If you are new to this story; pick up the introduction from here 

50 TPS – that’s not good…

Our Stress and Volume Testing (SVT) did not turn out that great, which is an understatement. We were aiming for a minimum of 300tps but were only achieving 50tps.

Because we could not run the SVT process within business hours (the DB2 databases for production and development share the same hardware), I had to write a script to simulate the changes on an MS-SQL server.

This meant we could use an idle Dev MS-SQL box and run SVT testing whenever we needed it. With the new testing script, I tried various iterations of settings under Change Processing Tuning to see if there was an optimal combination. I could not get above 60tps.
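The script itself is not shown here, but the concept was simply to generate a steady stream of UPDATE statements for QR to capture. A minimal Python sketch of the idea, with hypothetical table and column names (the real script ran the statements against the Dev MS-SQL database in a timed loop):

```python
import random

# Hypothetical change-traffic generator: emits UPDATE statements for
# Qlik Replicate to pick up via CDC. Table/column names are made up.
def make_update(txn_id: int) -> str:
    key = random.randint(1, 100_000)            # spread updates across rows
    balance = round(random.uniform(0, 9999), 2)
    return (
        "UPDATE dbo.my_source_table "
        f"SET balance = {balance}, last_txn = {txn_id} "
        f"WHERE account_id = {key};"
    )

# One second of traffic at a 300tps target
batch = [make_update(i) for i in range(300)]
print(batch[0])
```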

Meanwhile, the rest of the team contacted AWS support to see if they could spot anything obvious.

AWS Support recommended running the Postgres utility "pg_bench" from various locations to work out if there was a bottleneck between OnPrem and AWS RDS.

They recommend running the benchmark for 15 minutes with different numbers of clients.

Armed with the pg_bench utility, I started performing benchmark runs against the AWS RDS database from various locations in our network, trying to work out where the bottleneck was.

Running the tests


Our network had several locations from which we could run pg_bench to assess the tps to AWS RDS; to be more specific, to an AWS RDS Proxy.

To start off, we had to learn how to use the command. The benchmark database must be initialised with some tables and data, which is done with the following command:

pgbench --host rds-database-proxy.aws.com --port 5432 --username qlikreplicate -i -s 50 dest_db

Once done; a 15min benchmark was run with the following command:

pgbench --host rds-database-proxy.aws.com --port 5432 --username qlikreplicate -l -c 2 -j 2 -T 900 dest_db

Parameters:

Parameter | Description
--host | Address of the database/proxy
--port | Database port
--username | Username to connect with. Must have permissions to create objects and insert data; since it was a Dev database I used the Qlik Replicate username
-l | Write information about each transaction to a log. Not really useful; in future I would omit it
-c | Number of clients to simulate. I used 2 to replicate a couple of QR tasks
-j | Number of threads to use
-T | Time in seconds to run the test. 900 seconds = 15 minutes

Four tests were conducted:

  1. The OnPrem Qlik Replicate server to AWS RDS Proxy
  2. An EC2 machine to AWS RDS Proxy
  3. The OnPrem Qlik Replicate server to an OnPrem Postgres database (actually our Enterprise Manager analytics database, so not a fancy, highly optimised database)
  4. An EC2 machine directly to the AWS RDS database

Why an AWS RDS Proxy?  I'm not sure of the story behind why there was a proxy between Qlik and RDS; rumour has it the security team didn't initially allow our OnPrem services to connect directly to a cloud database and insisted on a proxy between the services and the databases.

Results

The pg_bench tests gave the team several learnings, but unfortunately no answers to our QR latency problem.

It was a grim sort of relief that they proved QR was not the cause of our latency problem; some component between us and the AWS RDS database is causing the issue.

Even more unfortunately, when I raised the issue with the network team, thinking there might be an IPS or QoS device slowing us down, they could not find anything. I was informed that the bandwidth between our OnPrem network and AWS is broad and fast, and there should not be anything obviously choking the tps. Grumpily I mused that households can stream multiple 4K videos at home without dropping a frame, while we can't send a few database changes from our datacentre to an AWS database at a reasonable speed.

Interestingly, the RDS proxy turned out to be a significant bottleneck once we started questioning its purpose in the pipeline. Even the AWS support team scratched their heads at its purpose in our use case. Later on, I tried OnPrem directly to RDS, but that made no difference to the tps.

Lessons learnt

  • Validate poor latency through Qlik Replicate with another tool.  In this instance pg_bench took QR right out of the equation
  • Don't use an AWS RDS proxy for Qlik Replicate unless you have a critical use case, and performance test your design choices thoroughly

Qlik Replicate – The saga of replicating to AWS (Intro)

I am exhausted from the past couple of weeks (closer to a month now) of working on a new Qlik Replicate task that we are having no luck with, performance-wise.

The upside is that I have learned many new concepts and techniques in Qlik Replicate, Postgres, MS-SQL and networking.
The downside is that (at the time of writing) we have not solved the problem, and a lot of effort has not returned anything tangible for the business to use.

This post is an introduction to the problem; the next few posts will dive into the individual areas of QR quirks, diagnostic methods and performance tuning that I used.

Background

We have a big, expensive DB2 Z/OS database running on premise that hosts our core account data. A logical account is divided across two tables: slowly changing data (like open date and close date) in one table, and constantly changing data (like account balance) in another.

Downstream users want these two tables stitched together and “Change Data Capture” implemented on this combined table. The initial thought was to create triggers in DB2 so when data is updated on the “Slowly Changing Data Table” (SCDT) or the “Constant Changing Data Table” (CCDT); the triggers act on a “Combined Table” (CT).

This solution was deemed too expensive by the business; the triggers would drag down the performance of an already stressed DB2 system, and purchasing more processing power was not viable.

The architects thought, “OK – why don’t we replicate the SCDT and CCDT to AWS RDS Postgres and build the triggers in there? We can then scale the Postgres database cheaply.”

Sound thinking. We did a quick POC confirming that triggers in a downstream database fire as data flows into its tables from Qlik Replicate, and it worked as expected.

The project jumped into build mode and soon we had development done and testing on the way.

Then it was time to do the “Stress and Volume” testing. 

That is when things started to unravel…

High Hopes but low TPS

Since the Development DB2 Z/OS database shares the same hardware as Production, we had to conduct our Stress and Volume Testing (SVT) on a weekend so our customers were not adversely affected.

Two test runs were conducted: one SVT run without the database triggers enabled and another with them enabled.

The results were not good. Our SVT SME wrote in their summary:

We ran two load test on Dec 9th Saturday – first test with database trigger OFF, second test with trigger ON.

Each test ran for about 1 hour with total 700k+ transactions at throughput 170tps to 190tps.

  1. During first test with trigger OFF, Qlik process T007 had throughput at 50tps, well below other Qlik processes at 180tps. This suggests the performance bottleneck might not be on the database trigger. During second test with trigger ON, Qlik process T007/T008 had a slightly lower throughput at 40tps, which means the database trigger adds a little bit more latency.

    In the middle of 2nd test, Qlik Enterprise Manager became broken and did not display performance graphs properly, hence I can’t provide you graphs till Jon fixes that.

  2. After the 1-hour test stopped, it took another 3.5 hours for Qlik process T007/T008 to finish its backlog. That means T007/T008 need 4.5 hour in total to complete DB2 data produced in each hour of test. Maths tells us 40:180 = 1:4.5. In other words, the “producer” (DB2 Database) works 4.5 times faster than the “consumer” (Qlik T007/T008).
  3. I notice the memory usage of DB2 related Qlik processes kept going up during test. Not sure if that is the expected behaviour.

Since the bare-minimum baseline we had decided on internally was 300tps, this was not looking good.
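The SME's ratio is worth making explicit, because the backlog arithmetic from the summary checks out exactly:

```python
producer_tps = 180   # DB2 change rate during the 1-hour test
consumer_tps = 40    # Qlik T007/T008 apply rate
test_hours = 1

# Changes produced vs consumed during the test window
produced = producer_tps * 3600 * test_hours   # 648,000 changes
consumed = consumer_tps * 3600 * test_hours   # 144,000 changes
backlog = produced - consumed                 # 504,000 changes queued

# Time to drain the backlog once the producer stops
drain_hours = backlog / (consumer_tps * 3600)
print(drain_hours)                 # 3.5 hours - matching the observed catch-up time
print(test_hours + drain_hours)    # 4.5 hours total, i.e. the 180:40 = 4.5:1 ratio
```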

What followed was a journey of performance diagnosis with many trials, tribulations, dead ends and, finally, a way forward.

Header Illustration Credit