May 2024

Qlik Replicate – Fight of the filters. Who will prevail?

Background

The business is doing their best to keep me on my toes with Qlik Replicate; finding new was to bend and stretch the system and consequently my sanity.

The initial request was, “Can we overwrite this field in a Qlik Replicate task with a SOURCE_LOOKUP?”

OK – we can do this.  I abhor putting ETL logic in Qlik Replicate tasks and wanting to keep them as simple as possible and allow the power and the flexibility of the downstream systems to manipulate data.

But project timelines were pressing, and I complied with their request.

Later, they came back to me and requested a to add a filter to the derived field in question.

And that led to me and our Tech Business analyst scratching our heads. 

If we apply a filter to our focus field; will it use the raw field that is in the table? Or will is use the new lookup field with the same name to base the filter on?

Testing the filters – setting up

To start with; some simple tables in MS-SQL:

CREATE TABLE dbo.TEST_LOOKUP
(
	ACCOUNT_ID INT PRIMARY KEY,
	FRUIT_ID INT,
	FRUIT_NAME VARCHAR(100),
	SOURCE_NAME VARCHAR(100)
);

GO

CREATE TABLE dbo.FRUITS
(
	FRUIT_ID INT PRIMARY KEY,
	NEW_FRUIT_NAME VARCHAR(100)
)

GO

INSERT INTO dbo.FRUITS VALUES(1, 'NEW APPLES');
INSERT INTO dbo.FRUITS VALUES(2, 'NEW ORANGES');

A simple Qlik Replicate task was created to replicate from the table dbo.TEST_LOOKUP.

All columns were brought across instead of FRUIT_NAME. FRUIT_NAME will be overwritten with the source lookup:

source_lookup('NO_CACHING','dbo','FRUITS','NEW_FRUIT_NAME','FRUIT_ID =?',$FRUIT_ID)

To test; a simple insert was added to ensure that the source lookup is working correctly:

INSERT INTO dbo.TEST_LOOKUP VALUES(1, 1, 'OLD APPLES', 'Truck');

Result:

{
    "magic": "atMSG",
    "type": "DT",
    "headers": null,
    "messageSchemaId": null,
    "messageSchema": null,
    "message": {
        "data": {
            "ACCOUNT_ID": 1,
            "FRUIT_ID": 1,
            "SOURCE_NAME": "Truck",
            "FRUIT_NAME": "NEW APPLES"
        },
        "beforeData": null,
        "headers": {
            "operation": "INSERT",
            "changeSequence": "20240529060703760000000000000000005",
            "timestamp": "2024-05-29T06:07:03.767",
            "streamPosition": "0071a49f:000f8e09:001c",
            "transactionId": "6EDBA1FA0E0000000000000000000000",
            "changeMask": "0F",
            "columnMask": "0F",
            "transactionEventCounter": 1,
            "transactionLastEvent": true
        }
    }
}

Everything is working correctly; FRUIT_NAMES got overwritten with “NEW APPLES” in the json output.

Testing the filters – placing bets

In the CDC task; a new filter was added:

$FRUIT_NAME == 'NEW ORANGES'

And the following SQL statement was run on the source system:

INSERT INTO dbo.TEST_LOOKUP VALUES(2, 2, 'OLD ORANGES', 'Fridge');

So – If Qlik Replicate filters on the base table’s field; the change WILL NOT be replicated through.

Likewise if Qlik Replicate is using the new derived field for filter; the change WILL come through.

And the results are…

{
    "magic": "atMSG",
    "type": "DT",
    "headers": null,
    "messageSchemaId": null,
    "messageSchema": null,
    "message": {
        "data": {
            "ACCOUNT_ID": 2,
            "FRUIT_ID": 2,
            "SOURCE_NAME": "Fridge",
            "FRUIT_NAME": "NEW ORANGES"
        },
        "beforeData": null,
        "headers": {
            "operation": "INSERT",
            "changeSequence": "20240529061434050000000000000000065",
            "timestamp": "2024-05-29T06:14:34.050",
            "streamPosition": "0071a49f:000f901a:0005",
            "transactionId": "28DCA1FA0E0000000000000000000000",    
            "changeMask": "17",
            "columnMask": "17",
            "transactionEventCounter": 1,
            "transactionLastEvent": true
        }
    }
}

Qlik will use the derived source lookup field over the original field in the table.

Conclusion

Once again this highlights the danger of putting ETL code into Qlik Replicate tasks. It obscures business rules and can lead to confusion in operations like the scenario above.

It is best to use Qlik Replicate to get the data out of the source database as quickly and as pure as possible and then use the power of the downstream systems to manipulate the data.

Qlik Replicate – The saga of replicating to AWS Part 5 – Is MS-SQL the answer?

For those who are following along at home…

We have been toiling on replicating to AWS Postgres RDS with Qlik Replicate for the past two months; trying to achieve a baseline of 300tps.

After many suggestions, tuning and tweaks, conference calls, benchmarks learning experiences, prayers; we couldn’t get our tps close to our baseline.

You can read the main findings in the following pages:

Things that were suggested to us that I did not add to this blog series:

  • Try Async commits on RDS Postgres.  Tried it and got negligible increases.
  • Shifting the QR server to AWS.  Bad idea as there will be even more traffic from the busy DB2 database going across for QR to consume.
  • Use Amazon Aurora instead of RDS.  The downstream developers did not have the appetite to try Aurora; especially with the issues leaning towards network speed.
  • Use GCP version of Postgres instead of AWS.  The downstream developers did not want to commit to another cloud provider.

The problem is how the network connectivity behaves with the Postgres ODBC and the round trips it must do between our location and the AWS data centre.  We can try – but we are bound by the laws of physics and the speed of light.

Decisions to be made.

All though our benchmarking and investigation; we have been replicating to a Development MS-SQL database in the data centre as the DB2 database in parallel to give us an idea of what speed we could potentially reach.  Without triggers on the MS-SQL destination table; we were easily hitting 300tps.  Ramping the changes up; we can hold at 1K tps with no creep in latency. 

We were happy with these results; especially with the MS-SQL database was just a small underpowered shared Dev machine; not a full-blown dedicated server.

It took a brave solution architect to propose that we shift from AWS RDS Postgres to an on prem MS-SQL server; especially when our senior management strategy is to push everything to the cloud to reduce the number of on prem servers.

In the end with all our evidence on the performance and the project’s willingness to push on with the proposed solution, the solution stakeholders agreed to move the destination to an on prem database.

They initially wanted us to go with a on prem Postgres database; but since all our Database Administrators are either Oracle or MS-SQL experts and we have no Postgres experts – we went to good old MS-SQL.

It worked; but…damn triggers.

I volunteered to convert the Postgres SQL code into T-SQL as I have worked with T-SQL for the past decade.  The conversion went smoothly, and I took the opportunity to optimise several sections of the code to make the solution more maintainable and to run faster.

With our new MS-SQL database all coded up and the triggers turned off; the SVT (stress and volume testing) ran at the TPS for which we were aiming.

But when we turned on the triggers; the performance absolutely crashed.

I was mortified – was it my coding shot and the additional changes that I made the performance worse?

I checked the triggers.  I checked the primary keys and the joins.  I checked the horizontal partitioning.  I checked the database server stats for CPU and memory usage. 

Nothing – could not locate the performance problem.

I went back to Qlik Replicate and examined the log files.

Ahh – here is something. The log file was full of entries like this:

00012472: 2024-05-27T16:03:54 [TARGET_APPLY ]W: Source changes that would have had no impact were not applied to the target database. Refer to the 'attrep_apply_exceptions' table for details (endpointshell.c:7632)

Looking inside the attrep_apply_exceptions there corresponding entries like:

UPDATE [dbo].[TEST_DESTINATION] 
SET	[ACCOUNT_ID] = 2,
	[DATA_1] = 'Updated', 
	[DATA_2] = 'Data' 
WHERE 
	[ACCOUNT_ID] = 2;
	-- 0 rows affected

Which was confusing; I checked the destination table, and the update was applied.  Why was this update deemed a failure and logged to the attrep_apply_exceptions table? It must be an error in the trigger.

The cause of the problem

Our code can be paraphrased like:

CREATE TABLE dbo.TEST_DESTINATION
(
	ACCOUNT_ID int NOT NULL,
	DATA_1 varchar(100) NULL,
	DATA_2 varchar(100) NULL,
	PRIMARY KEY CLUSTERED 
	(
		ACCOUNT_ID ASC
	)
);

GO

CREATE TABLE dbo.TEST_MERGE_TABLE
(
	ACCOUNT_ID int NOT NULL,
	DATA_1 varchar(100) NULL,
	DATA_2 varchar(100) NULL,
	DATA_3 varchar(100) NULL,
	PRIMARY KEY CLUSTERED 
	(
		ACCOUNT_ID ASC
	)
)

GO

CREATE OR ALTER TRIGGER dbo.TR_TEST_DESTINATION__INSERT
ON dbo.TEST_DESTINATION
AFTER INSERT 
AS
	INSERT INTO dbo.TEST_MERGE_TABLE
	SELECT
		ACCOUNT_ID,
		DATA_1,
		DATA_2,
		'TRIGGER INSERT' AS DATA_3
	FROM INSERTED;

GO

CREATE OR ALTER TRIGGER [dbo].[TR_TEST_DESTINATION__UPDATE]
ON [dbo].[TEST_DESTINATION]
AFTER UPDATE 
AS
	UPDATE dbo.TEST_MERGE_TABLE
	SET ACCOUNT_ID = X.ACCOUNT_ID,
		DATA_1 = X.DATA_1,
		DATA_2 = X.DATA_2,
		DATA_3 = 'TRIGGER DATA'
	FROM dbo.TEST_MERGE_TABLE T
	JOIN INSERTED X
	ON X.ACCOUNT_ID = T.ACCOUNT_ID
	WHERE
		1 = 0;  -- This predicate can be either true or false.  For example we set it false

GO

The problem is in how the trigger TR_TEST_DESTINATION__UPDATE behaves if it returns 0 rows.  This can be a legitimate occurrence depending on a join in the trigger.

If I run a simple update like:

UPDATE dbo.TEST_DESTINATION
SET DATA_1 = 'Trigger Upate'
WHERE
	ACCOUNT_ID = 1;

The SQL engine returns:

(0 row(s) affected)    -- Returned from the trigger

(1 row(s) affected)    -- Returned from updating dbo.TEST_DESTINATION

My theory is that Qlik Replicate when reading the rows returned from executing the SQL statement on the destination server; only considers the first row to determine if the change was a success or not.  Since the first row is an output from the trigger with 0 rows affected; Qlik considers that the update was a failure and therefore logs it into the attrep_apply_exceptions table.

Apart from this been incorrect as the trigger code is logically working correctly; Qlik Replicate must make another trip to write to the exception table.  This resulted in drastically increased latency.

Fixing the issue

The fix (once the problem is known) is relatively straight forward. Any rows returned needs to be supressed from the trigger. For example:

CREATE OR ALTER TRIGGER [dbo].[TR_TEST_DESTINATION__UPDATE]
ON [dbo].[TEST_DESTINATION]
AFTER UPDATE 
AS
BEGIN
	SET NOCOUNT ON;  -- Supress returning the row count

	UPDATE dbo.TEST_MERGE_TABLE
	SET ACCOUNT_ID = X.ACCOUNT_ID,
		DATA_1 = X.DATA_1,
		DATA_2 = X.DATA_2,
		DATA_3 = 'TRIGGER DATA'
	FROM dbo.TEST_MERGE_TABLE T
	JOIN INSERTED X
	ON X.ACCOUNT_ID = T.ACCOUNT_ID
	WHERE
		1 = 0;  -- This predicate can be either true or false
END

When the update statement is run again; the following is returned:

(1 row(s) affected)

Qlik Replicate will now consider the update as a success and not log it as an exception.

Qlik Replicate – The saga of replicating to AWS Part 4 – Does stream size matter?

Continuing our ongoing Qlik Replicating story of trying to replicate a DB2/zOS database to AWS RDS Postgres.

We made small improvements; but nothing substantial to reach the TPS for which we were aiming.  I was at my experience end of what I knew and decided to reach for professional help.

We have a support relationship with IBT; who helped us out with the initial set up QR in our organisation.  But recently we have been self-resolving our own problems and have not been using their help.  Now this was suitable time to ask for their help.

IBT has always been helpful when we have asked for assistance.  Another handy aspect with the relationship is that IBT has a quick support relationship with Qlik.  If they don’t know the answer; they can get the answer easily from Qlik.

IBT asked us to collect the usual data; diagnostic logs, source and target DB metrics and QR server core metrics.  Nothing looked under duress, so IBT dived into the nitty and gritty details of the diagnostic packs.

Their techs noticed that our outgoing stream buffers were full.  This means that the changes were coming in faster than were getting sent out to the destination.  IBT suggested to try increasing the size of the outgoing stream.

Without going through the details of this step; here is a Qlik knowledge base article on Increasing the outgoing stream queue.

“She’s breaking up – I can’t hold her”

We started off with:

"stream_buffers_number" : 5,
"stream_buffer_size" : 10,

It was a marginal improvement.  Measured in a thimble full of performance improvement.  Still nowhere near the TPS we needed. 

IBT asked us to increase the two variables in small increments of “stream_buffers_number” + 5 and “stream_buffer_size” + 10.  With each increase there was a minuscule improvement.

 But more worrying with each increase; the QR task was using more memory to the point that increasing the buffer size was unsustainable with the resources on the server.  Even if increasing the buffer variables and the gained TPS was linear relationship; we would need a very beefy server to reach 300 TPS.

So again, it was a little gain; and with all the added “Little gains” over the past few fix iterations we were still no closer to our needed 300 TPS.

Increasing the buffer variables might be helpful if you are close to your TPS and trying to get over the last hurdle.  But since we’re so far behind; we had to look for another solution.

Qlik Replicate – The saga of replicating to AWS Part 3 – Wireshark!

Continuing on the story

After concluding that the low TPS is not resulting from poor query performance; our attention was turned to the network latency between our OnPrem Qlik system and the AWS RDS database.

First, I asked the networks team if there were any suspect networking components between our on-premise’s Qlik server and the AWS DB.  Anything like IPS, QOS, bandwidth limitation components that could explain the slowdown.

I also asked the cloud team if they can find anything as well.

It was a high hope for them to find anything; but since they are the SMEs in the area, it was worth asking the question. 

As expected, they did not find anything.

But the Network team  did come back with a couple of pieces of information:

  • The network bandwidth to the AWS was wide enough and we were not reaching its capacity.
  • It is a 16ms – 20ms round trip from our Data centre to the AWS data centre. 

Loaction… Location…

Physically the distance to the AWS data centre is 700Km. 

Unfortunately, AWS set up a closer data centre in the past few years, which is only 130Km away.  We are not currently set up to use this new region yet.

The Network team gave me permission to install wire shark on our OnPrem Qlik server and our AWS EC2 Qlik server. 

From both servers with psql I connected to the AWS RDS database and updated one row; capturing the traffic using Wireshark.

I lined up the two results from the different servers to see if there was anything obvious

Wireshark results

(ip.src == ip.of.qlik.server and ip.dst == ip.of.aws.rds) or (ip.src == ip.of.aws.rds and ip.dst == ip.of.qlik.server)
SEQSourceDestinationProtocolLengthInfoOn Prem 2 RDSEC2 2 RDSDifference (sec)% of difference
1Qlik serverRDS DBTCP6658313 > 5432 [SYN, ECE, CWR] Seq=0 Win=64240 Len=0 MSS=1460 WS=256 SACK_PERM000.0000%
2RDS DBQlik serverTCP665432 > 58313 [SYN, ACK] Seq=0 Ack=1 Win=26883 Len=0 MSS=1460 SACK_PERM WS=80.0190.0010.01810%
3Qlik serverRDS DBTCP5458313 > 5432 [ACK] Seq=1 Ack=1 Win=262656 Len=00.0000.0000.0000%
4Qlik serverRDS DBPGSQL62>?0.0000.005-0.005-3%
5RDS DBQlik serverTCP605432 > 58313 [ACK] Seq=1 Ack=9 Win=26888 Len=00.0180.0000.01810%
6RDS DBQlik serverPGSQL60<0.0010.0010.0000%
7Qlik serverRDS DBTLSv1.3343Client Hello0.0040.0040.0010%
8RDS DBQlik serverTLSv1.3220Hello Retry Request0.0210.0010.02112%
9Qlik serverRDS DBTLSv1.3455Change Cipher Spec, Client Hello0.0030.0010.0021%
10RDS DBQlik serverTLSv1.3566Server Hello, Change Cipher Spec0.0230.0050.01911%
11RDS DBQlik serverTCP15145432 > 58313 [ACK] Seq=680 Ack=699 Win=29032 Len=1460 [TCP segment of a reassembled PDU]0.0000.0000.0000%
12RDS DBQlik serverTCP15145432 > 58313 [ACK] Seq=2140 Ack=699 Win=29032 Len=1460 [TCP segment of a reassembled PDU]0.0000.0000.0000%
13RDS DBQlik serverTCP15145432 > 58313 [ACK] Seq=3600 Ack=699 Win=29032 Len=1460 [TCP segment of a reassembled PDU]0.0000.0000.0000%
14RDS DBQlik serverTLSv1.3394Application Data0.0000.0000.0000%
15Qlik serverRDS DBTCP5458313 > 5432 [ACK] Seq=699 Ack=5400 Win=262656 Len=00.0000.0000.0000%
16Qlik serverRDS DBTLSv1.3112Application Data0.0030.0020.0011%
17Qlik serverRDS DBTLSv1.3133Application Data0.0000.0000.0000%
18RDS DBQlik serverTCP605432 > 58313 [ACK] Seq=5400 Ack=836 Win=29032 Len=00.0180.0000.01810%
19RDS DBQlik serverTLSv1.3142Application Data0.0010.008-0.007-4%
20RDS DBQlik serverTLSv1.3135Application Data0.0060.0030.0032%
21Qlik serverRDS DBTCP5458313 > 5432 [ACK] Seq=836 Ack=5569 Win=262400 Len=00.0000.001-0.0010%
22Qlik serverRDS DBTLSv1.3157Application Data0.0050.007-0.002-1%
23RDS DBQlik serverTLSv1.3179Application Data0.0180.0010.01810%
24Qlik serverRDS DBTLSv1.3251Application Data0.0110.0000.0116%
25RDS DBQlik serverTLSv1.3147Application Data0.0180.0000.01811%
26RDS DBQlik serverTLSv1.3433Application Data, Application Data0.0000.0000.0000%
27RDS DBQlik serverTLSv1.398Application Data0.0000.0000.0000%
28Qlik serverRDS DBTCP5458313 > 5432 [ACK] Seq=1136 Ack=6210 Win=261888 Len=00.0000.0000.0000%
29Qlik serverRDS DBTLSv1.393Application Data0.0010.0010.0010%
30RDS DBQlik serverTLSv1.3148Application Data0.0200.0010.01811%
31RDS DBQlik serverTLSv1.398Application Data0.0000.0000.0000%
32Qlik serverRDS DBTCP5458313 > 5432 [ACK] Seq=1175 Ack=6348 Win=261632 Len=00.0000.0000.0000%
33Qlik serverRDS DBTLSv1.381Application Data0.0000.0000.0000%
34Qlik serverRDS DBTLSv1.378Application Data0.0000.0000.0000%
35Qlik serverRDS DBTCP5458313 > 5432 [FIN, ACK] Seq=1226 Ack=6348 Win=261632 Len=00.0000.0000.0000%
36RDS DBQlik serverTCP605432 > 58313 [ACK] Seq=6348 Ack=1226 Win=30104 Len=00.0190.0000.01811%
37RDS DBQlik serverTCP605432 > 58313 [FIN, ACK] Seq=6348 Ack=1227 Win=30104 Len=00.0000.0000.0000%
38Qlik serverRDS DBTCP5458313 > 5432 [ACK] Seq=1227 Ack=6349 Win=261632 Len=00.0000.0000.0000%

The data from the two captures showed a couple of things:

Firstly, both systems had the same number of events captured by Wireshark.  This gives me an indication that there are no networking components from source to destination that is dropping traffic; or doing anything extra unexpected actions to the packet requests. 

I cannot say for sure what is happening on the return trip if there is anything timing out from the AWS side back.

Also, when taking the difference between the OnPrem vs the EC2 server I can see the difference of 18ms keep popping up.  I believe this is the round trip of the connection.  Since this happens multiple times; our latency is compounded into quite a significant value.

What’s next?

I am not a network engineer, so I do not have the knowledge to dive deeper into the Wireshark packets. 

It would be interesting to try the closer AWS data centre to see if the physical distance can help the latency.  But to do this will require effort from the cloud team and the project budget wouldn’t extend to this piece of work.

Our other option is to reduce the number of round trips from our OnPrem server to the AWS datacentre as much as possible.

Qlik Replicate – The saga of replicating to AWS Part 2 – Lions and Tigers and Primary Keys (Oh My!)

The Story so far:

For those who are following from home:

We deduced that replicating through the AWS RDS proxy was a bad idea for performance; but even going directly to the AWS RDS Postgres database from our on-prem system did not give us any noticeable performance gains.

It was time to try looking in different area.

Raw SQL

In the squad investigating the problem, someone asked, “What SQL is Qlik running on the Postgres server?  Is it poorly optimised?”

To find out what Qlik Replicate is running; I bumped the TARGET_APPLY up to Verbose, let transactions run through for a couple of minutes and then assessed the log file.  Since 99% of the changes are updates on the source database (we know this from a production job that is running on the same source database); our logs were full of update statements like this:

UPDATE dbo.my_dest_table
SET	key_1 = ?,
	key_2 = ?,
	key_3 = ?,
	field_1 = ?,
	field_2 = ?,
	field_3 = ?
WHERE
	key_1 = ? AND
	key_2 = ? AND
	key_3 = ?;

The first thing I noticed was – Qlik was updating ALL the columns, even if only one column’s data changed on the source database. 

This also included the Primary key.

Primary keys are computationally expensive updating compared to just your regular day to day columns.  As for an example, I wrote a simple POC in t-sql:

/*
CREATE TABLE dbo.my_dest_table
(
	key_1 SMALLDATETIME,
	key_2 INT,
	key_3 VARCHAR(10),

	field_1 INT,
	field_2 VARCHAR(256),
	field_3 SMALLDATETIME,
	PRIMARY KEY(key_1, key_2, key_3)
)
*/

/*
UPDATE dbo.my_dest_table
SET	key_1 = ?,
	key_2 = ?,
	key_3 = ?,
	field_1 = ?,
	field_2 = ?,
	field_3 = ?
WHERE
	key_1 = ? AND
	key_2 = ? AND
	key_3 = ?;
*/

/*
INSERT INTO dbo.my_dest_table VALUES ('2024-01-01', 1, 'key 3', 1, 2, '2023-03-03');
INSERT INTO dbo.my_dest_table VALUES ('2024-01-01', 2, 'key 3', 2, 3, '2024-04-04');
*/

SET STATISTICS TIME ON;

UPDATE dbo.my_dest_table
SET key_1 = '2024-01-01',
	key_2 = 1,
	key_3 = 'key 3',

	field_1 = 10,
	field_2 = 2,
	field_3 = '2023-03-03'
WHERE
	key_1 = '2024-01-01' AND
	key_2 = 1 AND
	key_3 = 'key 3';
/*
SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 0 ms.

 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 41 ms.  <------- Oh dear

(1 row affected)
*/

UPDATE dbo.my_dest_table
SET field_1 = 10,
	field_2 = 3,
	field_3 = '2024-04-04'
WHERE
	key_1 = '2024-01-01' AND
	key_2 = 2 AND
	key_3 = 'key 3';
/*
SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 0 ms.

 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 6 ms.

(1 row affected)
*/

As demonstrated updating the PK columns (even with the same data) is eight times more expensive than just updating the other data columns. 

Asking for help

I asked for help on the Qlik Community forum.  I got a response back from the specialists that there is a setting in Qlik’s endpoints that will NOT update the Primary key.

The setting is:

$info.query_syntax.pk_segments_not_updateable

Set to TRUE

I made this change in the endpoint, restarted the task and captured a few more update sql samples. With the change in the settings; the updates now discarded the updates to the PK.

BUT…

This might be a helpful solution and worthy to note for the future; but in this case the solution did not really help us out.  After making the change we noticed no noticeable change in our TPS throughput; confirming increasingly it was a network issue.

I could see also other problems with this.

The source system that we are reading from is exceptionally large, have been entrenched in our organisation for 30 years and there is constant work done in the background as other core system are merged into it.  We as a downstream user; cannot guarantee the rules around updating Primary Keys.  Gut feeling in day-to-day operation that once the PKs are set, they are set in stone in the database.  But to base our data integrity for critical reporting on accuracy of data on a “Gut feeling” is not a good data integrity model. 

Who to say that the owners of the source system perform an update to a PK to fix an incident; or a new source system consolidation have updates in the steps?  Our change management in the organisation is not tight and detailed enough for us to be forewarned of changes like this.

If data integrity were not so critical; we could run a full refresh periodically to reset the data to a baseline to capture updates of PK.  But this will require significant rework of the solution to manage these periodical refreshes.

So; we did learn.  We did assess.  But this didn’t help us out.