Qlik Replicate

Qlik Replicate – The saga of replicating to AWS Part 4 – Does stream size matter?

Continuing our ongoing Qlik Replicating story of trying to replicate a DB2/zOS database to AWS RDS Postgres.

We made small improvements; but nothing substantial to reach the TPS for which we were aiming. I was at my experience end of what I knew and decided to reach for professional help.

We have a support relationship with IBT; who helped us out with the initial set up QR in our organisation. But recently we have been self-resolving our own problems and have not been using their help. Now this was suitable time to ask for their help.

IBT has always been helpful when we have asked for assistance. Another handy aspect with the relationship is that IBT has a quick support relationship with Qlik. If they don’t know the answer; they can get the answer easily from Qlik.

IBT asked us to collect the usual data; diagnostic logs, source and target DB metrics and QR server core metrics. Nothing looked under duress, so IBT dived into the nitty and gritty details of the diagnostic packs.

Their techs noticed that our outgoing stream buffers were full. This means that the changes were coming in faster than were getting sent out to the destination. IBT suggested to try increasing the size of the outgoing stream.

Without going through the details of this step; here is a Qlik knowledge base article on Increasing the outgoing stream queue.

“She’s breaking up – I can’t hold her”

We started off with:

"stream_buffers_number" : 5,
"stream_buffer_size" : 10,

It was a marginal improvement. Measured in a thimble full of performance improvement. Still nowhere near the TPS we needed.

IBT asked us to increase the two variables in small increments of “stream_buffers_number” + 5 and “stream_buffer_size” + 10. With each increase there was a minuscule improvement.

But more worrying with each increase; the QR task was using more memory to the point that increasing the buffer size was unsustainable with the resources on the server. Even if increasing the buffer variables and the gained TPS was linear relationship; we would need a very beefy server to reach 300 TPS.

So again, it was a little gain; and with all the added “Little gains” over the past few fix iterations we were still no closer to our needed 300 TPS.

Increasing the buffer variables might be helpful if you are close to your TPS and trying to get over the last hurdle. But since we’re so far behind; we had to look for another solution.

May 15, 2024 by jonny.donker@gmail.com Qlik Replicate 0

Qlik Replicate – The saga of replicating to AWS Part 3 – Wireshark!

Continuing on the story

After concluding that the low TPS is not resulting from poor query performance; our attention was turned to the network latency between our OnPrem Qlik system and the AWS RDS database.

First, I asked the networks team if there were any suspect networking components between our on-premise’s Qlik server and the AWS DB. Anything like IPS, QOS, bandwidth limitation components that could explain the slowdown.

I also asked the cloud team if they can find anything as well.

It was a high hope for them to find anything; but since they are the SMEs in the area, it was worth asking the question.

As expected, they did not find anything.

But the Network team did come back with a couple of pieces of information:

The network bandwidth to the AWS was wide enough and we were not reaching its capacity.
It is a 16ms – 20ms round trip from our Data centre to the AWS data centre.

Loaction… Location…

Physically the distance to the AWS data centre is 700Km.

Unfortunately, AWS set up a closer data centre in the past few years, which is only 130Km away. We are not currently set up to use this new region yet.

The Network team gave me permission to install wire shark on our OnPrem Qlik server and our AWS EC2 Qlik server.

From both servers with psql I connected to the AWS RDS database and updated one row; capturing the traffic using Wireshark.

I lined up the two results from the different servers to see if there was anything obvious

Wireshark results

(ip.src == ip.of.qlik.server and ip.dst == ip.of.aws.rds) or (ip.src == ip.of.aws.rds and ip.dst == ip.of.qlik.server)

SEQ	Source	Destination	Protocol	Length	Info	On Prem 2 RDS	EC2 2 RDS	Difference (sec)	% of difference
1	Qlik server	RDS DB	TCP	66	58313 > 5432 [SYN, ECE, CWR] Seq=0 Win=64240 Len=0 MSS=1460 WS=256 SACK_PERM	0	0	0.000	0%
2	RDS DB	Qlik server	TCP	66	5432 > 58313 [SYN, ACK] Seq=0 Ack=1 Win=26883 Len=0 MSS=1460 SACK_PERM WS=8	0.019	0.001	0.018	10%
3	Qlik server	RDS DB	TCP	54	58313 > 5432 [ACK] Seq=1 Ack=1 Win=262656 Len=0	0.000	0.000	0.000	0%
4	Qlik server	RDS DB	PGSQL	62	>?	0.000	0.005	-0.005	-3%
5	RDS DB	Qlik server	TCP	60	5432 > 58313 [ACK] Seq=1 Ack=9 Win=26888 Len=0	0.018	0.000	0.018	10%
6	RDS DB	Qlik server	PGSQL	60	<	0.001	0.001	0.000	0%
7	Qlik server	RDS DB	TLSv1.3	343	Client Hello	0.004	0.004	0.001	0%
8	RDS DB	Qlik server	TLSv1.3	220	Hello Retry Request	0.021	0.001	0.021	12%
9	Qlik server	RDS DB	TLSv1.3	455	Change Cipher Spec, Client Hello	0.003	0.001	0.002	1%
10	RDS DB	Qlik server	TLSv1.3	566	Server Hello, Change Cipher Spec	0.023	0.005	0.019	11%
11	RDS DB	Qlik server	TCP	1514	5432 > 58313 [ACK] Seq=680 Ack=699 Win=29032 Len=1460 [TCP segment of a reassembled PDU]	0.000	0.000	0.000	0%
12	RDS DB	Qlik server	TCP	1514	5432 > 58313 [ACK] Seq=2140 Ack=699 Win=29032 Len=1460 [TCP segment of a reassembled PDU]	0.000	0.000	0.000	0%
13	RDS DB	Qlik server	TCP	1514	5432 > 58313 [ACK] Seq=3600 Ack=699 Win=29032 Len=1460 [TCP segment of a reassembled PDU]	0.000	0.000	0.000	0%
14	RDS DB	Qlik server	TLSv1.3	394	Application Data	0.000	0.000	0.000	0%
15	Qlik server	RDS DB	TCP	54	58313 > 5432 [ACK] Seq=699 Ack=5400 Win=262656 Len=0	0.000	0.000	0.000	0%
16	Qlik server	RDS DB	TLSv1.3	112	Application Data	0.003	0.002	0.001	1%
17	Qlik server	RDS DB	TLSv1.3	133	Application Data	0.000	0.000	0.000	0%
18	RDS DB	Qlik server	TCP	60	5432 > 58313 [ACK] Seq=5400 Ack=836 Win=29032 Len=0	0.018	0.000	0.018	10%
19	RDS DB	Qlik server	TLSv1.3	142	Application Data	0.001	0.008	-0.007	-4%
20	RDS DB	Qlik server	TLSv1.3	135	Application Data	0.006	0.003	0.003	2%
21	Qlik server	RDS DB	TCP	54	58313 > 5432 [ACK] Seq=836 Ack=5569 Win=262400 Len=0	0.000	0.001	-0.001	0%
22	Qlik server	RDS DB	TLSv1.3	157	Application Data	0.005	0.007	-0.002	-1%
23	RDS DB	Qlik server	TLSv1.3	179	Application Data	0.018	0.001	0.018	10%
24	Qlik server	RDS DB	TLSv1.3	251	Application Data	0.011	0.000	0.011	6%
25	RDS DB	Qlik server	TLSv1.3	147	Application Data	0.018	0.000	0.018	11%
26	RDS DB	Qlik server	TLSv1.3	433	Application Data, Application Data	0.000	0.000	0.000	0%
27	RDS DB	Qlik server	TLSv1.3	98	Application Data	0.000	0.000	0.000	0%
28	Qlik server	RDS DB	TCP	54	58313 > 5432 [ACK] Seq=1136 Ack=6210 Win=261888 Len=0	0.000	0.000	0.000	0%
29	Qlik server	RDS DB	TLSv1.3	93	Application Data	0.001	0.001	0.001	0%
30	RDS DB	Qlik server	TLSv1.3	148	Application Data	0.020	0.001	0.018	11%
31	RDS DB	Qlik server	TLSv1.3	98	Application Data	0.000	0.000	0.000	0%
32	Qlik server	RDS DB	TCP	54	58313 > 5432 [ACK] Seq=1175 Ack=6348 Win=261632 Len=0	0.000	0.000	0.000	0%
33	Qlik server	RDS DB	TLSv1.3	81	Application Data	0.000	0.000	0.000	0%
34	Qlik server	RDS DB	TLSv1.3	78	Application Data	0.000	0.000	0.000	0%
35	Qlik server	RDS DB	TCP	54	58313 > 5432 [FIN, ACK] Seq=1226 Ack=6348 Win=261632 Len=0	0.000	0.000	0.000	0%
36	RDS DB	Qlik server	TCP	60	5432 > 58313 [ACK] Seq=6348 Ack=1226 Win=30104 Len=0	0.019	0.000	0.018	11%
37	RDS DB	Qlik server	TCP	60	5432 > 58313 [FIN, ACK] Seq=6348 Ack=1227 Win=30104 Len=0	0.000	0.000	0.000	0%
38	Qlik server	RDS DB	TCP	54	58313 > 5432 [ACK] Seq=1227 Ack=6349 Win=261632 Len=0	0.000	0.000	0.000	0%

The data from the two captures showed a couple of things:

Firstly, both systems had the same number of events captured by Wireshark. This gives me an indication that there are no networking components from source to destination that is dropping traffic; or doing anything extra unexpected actions to the packet requests.

I cannot say for sure what is happening on the return trip if there is anything timing out from the AWS side back.

Also, when taking the difference between the OnPrem vs the EC2 server I can see the difference of 18ms keep popping up. I believe this is the round trip of the connection. Since this happens multiple times; our latency is compounded into quite a significant value.

What’s next?

I am not a network engineer, so I do not have the knowledge to dive deeper into the Wireshark packets.

It would be interesting to try the closer AWS data centre to see if the physical distance can help the latency. But to do this will require effort from the cloud team and the project budget wouldn’t extend to this piece of work.

Our other option is to reduce the number of round trips from our OnPrem server to the AWS datacentre as much as possible.

May 6, 2024 by jonny.donker@gmail.com Qlik Replicate 0

Qlik Replicate – The saga of replicating to AWS Part 2 – Lions and Tigers and Primary Keys (Oh My!)

The Story so far:

For those who are following from home:

We deduced that replicating through the AWS RDS proxy was a bad idea for performance; but even going directly to the AWS RDS Postgres database from our on-prem system did not give us any noticeable performance gains.

It was time to try looking in different area.

Raw SQL

In the squad investigating the problem, someone asked, “What SQL is Qlik running on the Postgres server? Is it poorly optimised?”

To find out what Qlik Replicate is running; I bumped the TARGET_APPLY up to Verbose, let transactions run through for a couple of minutes and then assessed the log file. Since 99% of the changes are updates on the source database (we know this from a production job that is running on the same source database); our logs were full of update statements like this:

UPDATE dbo.my_dest_table
SET	key_1 = ?,
	key_2 = ?,
	key_3 = ?,
	field_1 = ?,
	field_2 = ?,
	field_3 = ?
WHERE
	key_1 = ? AND
	key_2 = ? AND
	key_3 = ?;

The first thing I noticed was – Qlik was updating ALL the columns, even if only one column’s data changed on the source database.

This also included the Primary key.

Primary keys are computationally expensive updating compared to just your regular day to day columns. As for an example, I wrote a simple POC in t-sql:

/*
CREATE TABLE dbo.my_dest_table
(
	key_1 SMALLDATETIME,
	key_2 INT,
	key_3 VARCHAR(10),

	field_1 INT,
	field_2 VARCHAR(256),
	field_3 SMALLDATETIME,
	PRIMARY KEY(key_1, key_2, key_3)
)
*/

/*
UPDATE dbo.my_dest_table
SET	key_1 = ?,
	key_2 = ?,
	key_3 = ?,
	field_1 = ?,
	field_2 = ?,
	field_3 = ?
WHERE
	key_1 = ? AND
	key_2 = ? AND
	key_3 = ?;
*/

/*
INSERT INTO dbo.my_dest_table VALUES ('2024-01-01', 1, 'key 3', 1, 2, '2023-03-03');
INSERT INTO dbo.my_dest_table VALUES ('2024-01-01', 2, 'key 3', 2, 3, '2024-04-04');
*/

SET STATISTICS TIME ON;

UPDATE dbo.my_dest_table
SET key_1 = '2024-01-01',
	key_2 = 1,
	key_3 = 'key 3',

	field_1 = 10,
	field_2 = 2,
	field_3 = '2023-03-03'
WHERE
	key_1 = '2024-01-01' AND
	key_2 = 1 AND
	key_3 = 'key 3';
/*
SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 0 ms.

 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 41 ms.  <------- Oh dear

(1 row affected)
*/

UPDATE dbo.my_dest_table
SET field_1 = 10,
	field_2 = 3,
	field_3 = '2024-04-04'
WHERE
	key_1 = '2024-01-01' AND
	key_2 = 2 AND
	key_3 = 'key 3';
/*
SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 0 ms.

 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 6 ms.

(1 row affected)
*/

As demonstrated updating the PK columns (even with the same data) is eight times more expensive than just updating the other data columns.

Asking for help

I asked for help on the Qlik Community forum. I got a response back from the specialists that there is a setting in Qlik’s endpoints that will NOT update the Primary key.

The setting is:

$info.query_syntax.pk_segments_not_updateable

Set to TRUE

I made this change in the endpoint, restarted the task and captured a few more update sql samples. With the change in the settings; the updates now discarded the updates to the PK.

BUT…

This might be a helpful solution and worthy to note for the future; but in this case the solution did not really help us out. After making the change we noticed no noticeable change in our TPS throughput; confirming increasingly it was a network issue.

I could see also other problems with this.

The source system that we are reading from is exceptionally large, have been entrenched in our organisation for 30 years and there is constant work done in the background as other core system are merged into it. We as a downstream user; cannot guarantee the rules around updating Primary Keys. Gut feeling in day-to-day operation that once the PKs are set, they are set in stone in the database. But to base our data integrity for critical reporting on accuracy of data on a “Gut feeling” is not a good data integrity model.

Who to say that the owners of the source system perform an update to a PK to fix an incident; or a new source system consolidation have updates in the steps? Our change management in the organisation is not tight and detailed enough for us to be forewarned of changes like this.

If data integrity were not so critical; we could run a full refresh periodically to reset the data to a baseline to capture updates of PK. But this will require significant rework of the solution to manage these periodical refreshes.

So; we did learn. We did assess. But this didn’t help us out.

May 5, 2024 by jonny.donker@gmail.com Qlik Replicate 0

Qlik Replicate – The saga of replicating to AWS Part 1 – pg_bench

If you are new to this story; pick up the introduction from here

50 TPS – that’s not good…

Our Stress and Volume Testing (SVT) did not turn out that great; which was an understatement. We were aiming for a minimum of 300tps, but we were only achieving 50tps.

Because we could not run the SVT process within business hours as the DB2 databases for production and development shares the same hardware; I had to write a script to simulate the changes on a MS-SQL server.

This means we could use an idle Dev MS-SQL box and run SVT testing whenever we needed it. With the new testing script, I tried various iterations of settings under Change Processing Tuning to see if there was an optimal setting. I could not get above 60tps.

Meanwhile the rest of the team contacted AWS support to see if they can see anything obvious.

AWS Support recommended using the Postgres utility “pg_bench” from various locations to work out if there is a bottle neck between On Prem and AWS RDS.

They recommend running the benchmark for 15 minutes with different numbers of clients.

Armed with the pg_bench utility; I started performing benchmark runs against the AWS RDS database from various locations in our network, trying to work out where the bottle neck is.

Running the tests

Our network had several locations that we could run pg_bench on to assess the tps to AWS RDS. To be more specific; to an AWS RDS Proxy.

To start off we had to learn how to use the command first. The benchmark database must be initialised with some tables and data. This is done with the following command:

pgbench --host rds-database-proxy.aws.com --port 5432 --username qlikreplicate -i -s 50 dest_db

Once done; a 15min benchmark was run with the following command:

pgbench --host rds-database-proxy.aws.com --port 5432 --username qlikreplicate -l -c 2 -j 2 -T 900 dest_db

Parameters:

Parameter	Description
`--host`	Address of the database/proxy
`--port`	Database port
`--username`	Username to connect. Must have permissions to create objects and insert data. Since it was a Dev database I used the Qlik Replicate username
`-l`	Write information about each transaction to a log. not really useful and in future I would omit it
`-c`	Number of clients to simulate. I used 2 to replicate a couple of QR tasks
`-j`	Number of threads to use
`-T`	Time in seconds to run the test for. 900 seconds = 15 minutes

Four tests were conducted:

The OnPrem Qlik Replicate server to AWS RDS Proxy
An EC2 machine to AWS RDS Proxy
The OnPrem Qlik Replicate server to an OnPrem Postgres database. (Actually to our Enterprise manager analytic database. So it is not a fancy highly optimised database)
An EC2 machine directly to the AWS RDS database

Why a AWS RDS Proxy? I’m not sure of the story behind why there was a proxy between Qlik and RDS; rumours it was because the security team didn’t initially allow our On Prem services to connect directly to cloud database. They insisted that a proxy between the services and databses.

Results

The pg_bench tests revealed several learnings to the team. But unfortunately, there were no answers to our QR latency problem.

It was a bit of grim relief it proved that QR was not the cause of our latency problem; there is a component between us and AWS RDS database that is causing the issue.

Even more unfortunate when I raise the issue to the network team thinking that there might be an IPS, QOS device that might be slowing us down; they could not find anything. I have been informed that the bandwidth between our On Prem and AWS is broad and fast and there should not be any obvious chocking the tps. Grumpily I mused that households can now stream multiple 4K videos at home without a frame dropped while we can’t send a few database changes from our datacentre to an AWS database at reasonable speeds

Interestingly the RDS proxy was a significant bottleneck when we started questioning its purpose in the pipeline. Even the AWS support team was scratching their heads of the purpose of it in the use case we had. Later on, I tried OnPrem to the direct RDS; but that made no difference to the tps.

Lessons learnt

Validate poor latency through Qlik Replicate via another tool. In this instance pg_bench took QR right out of the equation
Don’t use AWS RDS proxy for Qlik Replicate unless you have a critical use case. Performance test thoroughly your design choices

February 24, 2024 by jonny.donker@gmail.com Qlik Replicate 0

Qlik Replicate – The saga of replicating to AWS (Intro)

I am exhausted from the past couple of weeks; even getting close to a month of working on a new Qlik Replicate tasks that we are getting no luck with performance.

The upside I have learned many new concepts and techniques in Qlik Replicate, Postgres, MS-SQL and networking.
The downside is that (at the time of writing) we have not solved the problem, and a lot of effort hasn’t returned anything tangible for the business to use.

This post will be an introduction to the problem and the next few posts will dive into the individual areas of QR quirks, diagnostic methods and performance turning that I used.

Background

We have a big expensive DB2 Z/OS database running on premise that have our core account data in hosted in it. A logical account will be divided to two different tables; slowly changing data (like open date, close date) in one table and constant changing data (like account balance) in another table.

Downstream users want these two tables stitched together and “Change Data Capture” implemented on this combined table. The initial thought was to create triggers in DB2 so when data is updated on the “Slowly Changing Data Table” (SCDT) or the “Constant Changing Data Table” (CCDT); the triggers act on a “Combined Table” (CT).

This solution was deemed too expensive by the business; the triggers would drag down the performance of an already stressed DB2 system and to purchase more processing power was not in the viable.

The architects thought, “OK – why don’t we replicate the SCDT and CCDT to AWS RDS Postgres and build the triggers in there? We can then scale the Postgres database cheaply.”

Sound thinking. We did a quick POC that a downstream database can be triggered by data flowing into tables from Qlik Replicate and it worked as expected.

The project jumped into build mode and soon we had development done and testing on the way.

Then it was time to do the “Stress and Volume” testing.

That is when things started to unravel…

High Hopes but low TPS

Since the Development DB2 Z/OS database shares the same hardware as Production; we had to conduct our Stress and Volume Testing (SVT) on a weekend, so our customers were not adversely affected.

Two test runs were conducted; on SVT run without the database triggers enabled and another with them enabled.

The results were not good. Our SVT SME wrote in their summary:

We ran two load test on Dec 9^th Saturday – first test with database trigger OFF, second test with trigger ON.

Each test ran for about 1 hour with total 700k+ transactions at throughput 170tps to 190tps.

During first test with trigger OFF, Qlik process T007 had throughput at 50tps, well below other Qlik processes at 180tps. This suggests the performance bottleneck might not be on the database trigger.During second test with trigger ON, Qlik process T007/T008 had a slightly lower throughput at 40tps, which means the database trigger adds a little bit more latency.
In the middle of 2^nd test, Qlik Enterprise Manager became broken and did not display performance graphs properly, hence I can’t provide you graphs till Jon fixes that.

After the 1-hour test stopped, it took another 3.5 hours for Qlik process T007/T008 to finish its backlog. That means T007/T008 need 4.5 hour in total to complete DB2 data produced in each hour of test. Maths tells us 40:180 = 1:4.5. In other words, the “producer” (DB2 Database) works 4.5 times faster than the “consumer” (Qlik T007/T008).

I notice the memory usage of DB2 related Qlik processes kept going up during test. Not sure if that is the expected behaviour.

Since our bare baseline that we decided internally was at 300tps; this was not looking good.

What lead next was a journey of performance diagnosing with may trials, tribulations, dead ends and a final way forward.

Header Illustration Credit

February 23, 2024 by jonny.donker@gmail.com Qlik Replicate 3

Qlik Replicate – Where’s my slots gone?

Qlik Replicate is growing rapidly in our organisation.

Starting with one “Proof of concept” server in DTL; we are now up to seventeen. A mixture of development, test, performance testing servers on OnPrem and in the AWS and GCS cloud; depending where the source database is located.

With trying to keep things neat and organised; we are constantly moving QR tasks from one server to another.

Anyway the Juniors came to me with the following error message when they were shifting a QR task with a AWS RDS postgres source:

2024-01-31T16:39:38 [SOURCE_CAPTURE ]E: Can't resume task after replication slot was dropped. [1020101] (postgres_endpoint_capture.c:444)

From my understanding (I wasn’t babysitting their process); they exported the old task out with Endpoints and then reimported it into the new server. They re-entered the passwords for the end points and then try to resume from the stream position of the old task.

I couldn’t fault that logic.

As a test we cloned that task on the same server and made the target endpoint as a NULL.

With this test task we could

Start the task from scratch
Resume it from the stream position that the old task got up to

This proved that the new server had no issues contacting the database and resuming from a position on the postgres database.

In the end I guessed that there was a problem with the replication slot (or lack of) for that task on the postgres server; not sure specifically but it was looking for something that it didn’t exist.

I worked out the following steps to resolve the problem:

Change the task’s target endpoint to a NULL connector.
Start the task with the “Reload Target” option.
Wait until the task has done the “Full load” and has entered CDC mode.
Stop the task and change the target endpoint back to the original end point.
Start the task again with “Advance Run Options” and resume from the “Source change position” of the original taks.

The task should now be happy

January 31, 2024 by jonny.donker@gmail.com Qlik Replicate 0

Qlik Replicate – The million row challenge with

It seems every new project; the project teams come with new and innovative ways to make my life “interesting” while managing Qlik Replicate.

The latest idea they have is to pretty much request that one of our major Qlik Replicate tasks that is sucking data our of a DB2 database and delivering data to AWS S3 be duplicated and deliver pretty much the same data to GCS.

When I got wind of this proposal it was already too far into the solution design to propose another solution; like copying the data from GCS to AWS S3. So, it was left up to me to go and tell the owners of the DB2 system that there is another massive feed will be built of their database – basically doing the same thing.

With some persuasion I could get internal funding to research and implement Log Streaming to consolidate our reads from the DB2 system. Hardware wise this required adding more memory and disk space to the QR servers to handle the log streaming.

As for the Log stream component of the solution, I started looking into the settings. Since we are drawing a lot of data from DB2 – I am concerned about disk space and what performance impact the compression setting has on the task.

To test the impact; I set up the following test scenario:

The "Million row" challenge

To test – I created a very wide table in a test MS-SQL (a database I had access to); which contained integers and varchars.

With a python script, a million rows were inserted into the table in batches of 10,000. In the end the million rows equated to about 1.3Gb on the database.

I did ten test runs, increasing the compression in the log stream endpoint with each individual run.

Using Qlik Replicate’s analytics I could then get the metrics for the Log Stream’s task; and a task that is reading from the log stream and outputting the data to a NULL endpoint.

I also grabbed the file sizes that the log stream created on the server’s drive.

Here are the results:

	Log Stream File Size (Mb)	LS – CPU %	LS – Memory (MB)	LS – Source latency (sec)	LS – Target latency (sec)	Target – CPU %	Target – Memory (MB)	Target – Source latency (sec)	Target – Target latency (sec)
No compression	1,974	7	134	1	3	4	126	3	4
Level 2	718	15	169	2	4	4	105	3	3
Level 3	690	14	156	0	3	4	118	2	2
Level 4	683	16	155	1	2	6	127	2	2
Level 5	675	21	281	1	5	6	156	3	3
Level 6	665	24	224	1	3	5	176	4	4
Level 7	663	23	182	1	3	4	160	4	4
Level 8	657	38	394	1	9	4	181	9	9
Level 9	657	46	345	1	10	6	189	12	12

Some notes on the results

The run was done on a shared server; so other processes on the server running at the same time could impact the CPU and memory metrics
Analytics only takes a snapshot of the metrics every 5min and the million rows are inserted into the database in less than <10min so we can only get one snapshot of the metrics for the run. Depending where the run was up to; this could impact the metrics

Conclusions

Like most things in IT; there is no magical setting that will cover all scenarios. Everything is a compromise, and the settings will need to be adjusted to meet one’s particular source endpoint, nature of the data being replicated, and the latency can be tolerated to disk space usage.

Here is what I learnt.

If you require absolute minimum latency as possible – log steaming is not your answer for a couple of reasons.
1. There is an inherited one second delay writing batches from the source system to the log stream.
2. All the data is written to disk and then read by the target QR task. This means that the changes do not reside in memory and need to be read from disk.
Level 2 significantly reduces disk space usage with little compromise to speed and resource usage.
Level 4 is the best compromise of disk space, speed, and resource usage. Anything above level 4 starts to strain the speed and resource usage with little improvement to disk space.

Implementing Log streaming plan

For my scenario – I think Level 2 is the way to go and will stem my testing out from there.

My implementation plan is to implement the log streaming task into production and let it run by itself for a couple of weeks.

During that time, I will monitor how the log stream task behaves with large batches and how much data it writes to disk as we cannot generate that amount data in our performance testing environments.

Once the log stream task is settled; then I will start migrating the existing Qlik Replicate tasks to read from the log stream task stream instead of directly from the database.

November 19, 2023 by jonny.donker@gmail.com Qlik Replicate 0

Qlik Replicate Expressions – Wait. That’s Illegal

A new team will be taking over for out of hours oncall for Qlik Replicate and I tasked with training up a new team to diagnose and resolve problems with QR.

I thought it was a good idea to create some practice QR tasks in our development environment; intentionally put some problems in the tasks and let the new team diagnose the problem in a safe and no pressure scenarios.

I did the obvious ones; like put an incorrect password in an end point, put a typo in a filter criteria.

Looking for another scenario I thought – “Nothing better than a divide by zero error.”

Wait. That’s Illegal – Part 1

I created a simple table; with a field of CURR_BAL = 1000 and ASSET_VALUE = 0 and added a new field call LVR with the expression of:

$CURR_BALANCE / $ASSET_VALUE

Ran a full load expecting lots of nice and nasty error messages for the new team to investigate.

Hmm – no errors.

That’s not what I expected.

All the rows were transferred across and there was no errors or warnings in the log file, no data errors or anything.

I bumped up all the logging to “Trace” and reran the full load. Once again, no error messages.

Redirecting the output from the task to a kafka topic; I notice the records that had an intentional div/0 scenario was coming through as null.

{
  "magic": "atMSG",
  "type": "DT",
  "headers": null,
  "messageSchemaId": null,
  "messageSchema": null,
  "message": {
    "data": {
      "ACCOUNT_ID": 5,
      "ACCOUNT_NAME": "Elissa",
      "CURR_BALANCE": "10000.00",
      "ASSET_VALUE": "0.00",
      "LVR": null
    },
    "beforeData": null,
    "headers": {
      "operation": "REFRESH",
      "changeSequence": "",
      "timestamp": "",
      "streamPosition": "",
      "transactionId": "",
      "changeMask": null,
      "columnMask": null,
      "transactionEventCounter": null,
      "transactionLastEvent": null
    }
  }
}

Then I remembered – CDC expressions are built on sqlite syntax. Looking up how sqlite handles div/0 scenarios; by default sqlite will return a null instead of failing the query on a div/0.

Quoting from Sqlite – Difference Between engines

SQLite does not raise arithmetic exceptions (eg. divide by zero, 1/0). SQLite returns a NULL value for 1/0.

Wait. That’s Illegal – Part 2

I got past my div/0 conundrum after the explanation of what was happening and had a look at a “valid” record.

{
    "magic": "atMSG",
    "type": "DT",
    "headers": null,
    "messageSchemaId": null,
    "messageSchema": null,
    "message": {
        "data": {
            "ACCOUNT_ID": 1,
            "ACCOUNT_NAME": "Alice",
            "CURR_BALANCE": "12345.00",
            "ASSET_VALUE": "500000.00",
            "LVR": 0        &lt;------------- Huh?
        },
        "beforeData": null,
        "headers": {
            "operation": "REFRESH",
            "changeSequence": "",
            "timestamp": "",
            "streamPosition": "",
            "transactionId": "",
            "changeMask": null,
            "columnMask": null,
            "transactionEventCounter": null,
            "transactionLastEvent": null
        }
    }
}

Again our LVR value was wrong.

12345.00 / 500000.00 should be 0.02469; instead of 0

A couple of things popped to mind.

It looked like the classic integer division issue that always catches out new (and experienced) programmer.

But that didn’t make any sense since the source fields were both numeric(18,2).

I checked the destination field that I added.

It was set to numeric(18,2), which isn’t precise enough for the LVR value but we still should have got 0.2 instead 0.

OK – QR must not like the “custom” precision of numeric(18,2) for the LVR field.

I tried forcing the LVR field data type to REAL(4).

Still got the incorrect value for LVR

Maybe for some reason QR /sqlite treats the values coming in as an integer; even though the source data type is numeric(18,2) on the database.

Let’s give this a go:

CAST($CURR_BALANCE AS NUMERIC) / $ASSET_VALUE

^^ Nope

CAST($CURR_BALANCE AS NUMERIC) / CAST($ASSET_VALUE AS NUMERIC)

^^ Long shot but nope

CAST($CURR_BALANCE AS REAL) / $ASSET_VALUE

^^ At last success

{
    "magic": "atMSG",
    "type": "DT",
    "headers": null,
    "messageSchemaId": null,
    "messageSchema": null,
    "message": {
        "data": {
            "ACCOUNT_ID": 1,
            "ACCOUNT_NAME": "Alice",
            "CURR_BALANCE": "12345.00",
            "ASSET_VALUE": "500000.00",
            "LVR": "0.0246"  &lt;------------- Finally!
        },
        "beforeData": null,
        "headers": {
            "operation": "REFRESH",
            "changeSequence": "",
            "timestamp": "",
            "streamPosition": "",
            "transactionId": "",
            "changeMask": null,
            "columnMask": null,
            "transactionEventCounter": null,
            "transactionLastEvent": null
        }
    }
}

It works – but forcing a precise of numeric value to a floating point data type REAL leaves me uneasy; especially in a finance world were precision is needed.

I haven’t researched or tested whether this is a risk of generating an imprecise number through a division or not; something I will need to put on my todo list.

Conclusion

My design principle for Qlik Replicate tasks is to keep them as simple as possible.

Grab the data off the source and get it to the destination as quickly as possible.

By design principle I am against putting expressions in QR tasks:

It is not as feature rich as other programming languages
It hides business rules that adds a black box and complexity to the pipeline

Point 2 was the big one for me. I didn’t want downstream users messaging me every time there was an abnormality in the data. With no expressions in the QR tasks – I could rightfully say, “QR just grabs what it finds on the source database – go and talk to them”

Now with the discovery of these little “features” that could come from expressions; and more importantly no error or warnings about these “features”; it gives me even more incentive not to program in exceptions.

QR in my eyes is not an “ETL” tool – it is just a “EL” tool.

If your designers want complex expressions and transformations programmed into their data pipelines – it should be the downstream developers task.

That’s why they get paid the big bucks.

September 6, 2023 by jonny.donker@gmail.com Qlik Replicate 0

Qlik Replicate – Simple trick for Source file connectors

Quick post on a trick I use in Qlik Replicate

Qlik Replicate has a Source connector to consume flat files from a drive.

More details can be found on the Qlik website

For change data capture of these files; Qlik will only consume files that have a newer “Last Change Date” than when the task is started.

So if for some reason you have to stop and reload files that are already present; you have to change the “Last Modified Date”

A simple Powershell command that I got from itechtics.com can be used to reset all the dates in a directory of your choice.

Get-ChildItem -force PathToItem * | ForEach-Object{$_.LastWriteTime = ("yyyy-mm-dd hh24:mi:ss")}

So to change all the dates of the files in the directory C:tempload to the current time (as of writing):

Get-ChildItem -force c:\temp\load * | ForEach-Object{$_.LastWriteTime = ("2023-06-08 11:05:00")}

If a Qlik Replicate task is running and watching this directory; it will reload the file my_load.file.csv.

June 8, 2023 by jonny.donker@gmail.com Qlik Replicate 0

Qlik Enterprise Manager – Upgrading from 2021.05 to 2022.11

As a business requirement of developers wanting to take advantage of new features in Qlik Enterprise Manager (QEM); we had to upgrade our system from 2021.05 to 2022.11.

This was not a smooth process and encountered several issues on the way.

First Attempt – Straight Upgrade

Simple to upgrade?

Just download the latest installer from Qlik’s website and kick it off?

Unfortunately when I gave developers keys to our dev QEM box to do the upgrade before Christmas as I was busy with other tasks.

They gave it a go without a backout plan.

Couple of days later I tried logging into the QEM console and couldn’t get on.

The AttunityEnterpriseManager service kept restarting every minute and the following errors were in the C:\Program Files\Attunity\Enterprise Manager\data\logs\EnterpriseManager.log

[START: 2023-03-01 09:40:18 Attunity EnterpriseManager Version: 2022.11.0.335]
1 2023-01-02 09:40:18 [Command ] [INFO ] Executing ServiceRunCommand command.
6 2023-01-02 09:40:19 [Host ] [INFO ] Setting up web host
6 2023-01-02 09:40:20 [Host ] [INFO ] The server will listen on the following location: http://xxxxxxx.xxx.xxx/attunityenterprisemanager
6 2023-01-02 09:40:20 [Host ] [INFO ] The server will listen on the following location: https://xxxxxx.xxx.xxx/attunityenterprisemanager
6 2023-01-02 09:40:20 [Host ] [ERROR] Object reference not set to an instance of an object.
6 2023-01-02 09:40:20 [Host ] [INFO ] Stopping service on error...

I tried the usual desperation tricks of restarting the service and then restarting the box.

In the end I had to uninstall QEM and reinstall 2021.05 as there was no backout plan.

The developers were then subsequently banned from doing anymore work on the QEM upgrade

Second attempt – Two jump upgrade

Next attempt was upgrading to 2022.05 and then jumping to 2022.11.

This time I took a backup of C:Program FilesAttunityEnterprise Managerdata so I could easily backout the change if I had problems.

The upgrade from 2021.05 to 2022.05 went smoothly; but when jumping from 2022.05 to 2022.11 the same error occurred.

I then raised a case with Qlik and also posted on their forum asking if anyone else had come across this problem.

Final Success

With help from Qlik support we developed a upgrade work around. Qlik said that there is a patch in the works as other users have had the same problem.

If you follow along – note that this method might differ for your system. In the three upgrades I have done internally; they all differed slightly.

Make sure you TEST and HAVE A BACKOUT PLAN

Upgrade Perquisites

This upgrade was done on a Windows 2019 server.

You will need:

Admin rights to QEM’s server
Licenses for QEM and Analytics
Username/Password that AEM connects to the analytics database
Privilege user account for to the analytics database
Password for the user that you use to connect QR nodes to QEM
A SQLite database client
Handy to have – an editor that can validate json format

Method

Log on to QEM console
For each server under the server tab – Right click and go to “Stop Monitoring”
Remote desktop to the QEM server
Stop the AttunityEnterpriseManager service
Back up C:\Program Files\Attunity\Enterprise Manager\data
Run the upgrade to QEM 2022.05
Once the upgrade is done; log onto QEM and ensure that version 2022.05 is working correctly
Back up C:Program FilesAttunityEnterprise Managerdata again
Stop the AttunityEnterpriseManager service again

Run following command from the command line:

cd "C:\Program Files\Attunity\Enterprise Manager\bin"
aemctl repository export -f C:\temp\qem_repo_202205.json -w

Script out the postgres database with the following commands:

cd C:\Program Files\PostgreSQL10\bin
pg_dump -U <username> -s -C -c <analyticdbname> > C:\temp\analyticdbname.sql

Edit C:\temp\cdcanalyticsp.sql. Find and replace “<analyticdbname>” with “cdcanalytics_2022_11_p”. Save the file
Run the following command to create a new temporary database on the postgres server:
```
psql -d postgres -U <username> -f C:\temp\cdcanalyticsp.sql
```
This is where the problem lies in our instance. Open up C:tempqem_repo_202205.json
Find any code blocks with “type”: “AttunityLicense” in it and remove the whole code block.
1. On our dev system it was the section with “name”: “Replication Analytics”, but in prod it was section with “name”: “Replication Management”.
2. You may need to experiment to find the correct section
Go to Add/Remove and uninstall Enterprise manager
Once remove – delete the the folder C:\Program Files\Attunity\Enterprise Manager\data
Install QEM 2022.11. When prompted do not install Postgres
Log into QEM and check that it is up and running correctly.
Stop the AttunityEnterpriseManager service again

Run the following command:

aemctl repository import -f C:\temp\qem_repo_202205_Fix.json -w

Start the AEM service and check that it is running correctly
Re-add the QEM and Analytics license
Go to Settings -> Repository connections.
Enter in username password set the database as cdcanalytics_2022_11_p
Intialize the repository and start the connector
Stop the AttunityEnterpriseManager service again
Open up C:\Program Files\Attunity\Enterprise Manager\data\Java\GlobalRepository.sqlite in an SQLite editor
Browse the table objects
Look for name = “AnalyticsConfiguration” and modify the json value “cdcanalytics_2022_11_p” to the original analytics database name
Save SQLite database
Start the AEM service and check that it is running correctly
Go to the server tab. Edit each server and re-add the password. Click on the button “Monitor this server’s tasks and messages”. Test the connection and click OK
PVT QEM

If things go wrong

Backout process

Uninstall QEM
Delete the folder C:\Program Files\Attunity\Enterprise Manager\data
Reinstall either QEM 2021.05 or QEM 2022.05
Stop the AttunityEnterpriseManager service
Restore the appropriate backed up data folder to C:\Program Files\Attunity\Enterprise Manager\data
Restart the AEM service

March 2, 2023 by jonny.donker@gmail.com Qlik Replicate 0