Qlik Replicate

Qlik Replicate – The saga of replicating to AWS (Intro)

I am exhausted from the past couple of weeks (getting close to a month now) of working on a set of new Qlik Replicate tasks that we are having no luck with, performance-wise.

The upside is that I have learned many new concepts and techniques in Qlik Replicate, Postgres, MS-SQL and networking.
The downside is that (at the time of writing) we have not solved the problem, and a lot of effort hasn’t returned anything tangible for the business to use.

This post is an introduction to the problem; the next few posts will dive into the individual areas of QR quirks, diagnostic methods and performance tuning that I used.

Background

We have a big, expensive DB2 Z/OS database running on premise that hosts our core account data. A logical account is divided across two different tables: slowly changing data (like open date and close date) in one table and constantly changing data (like account balance) in another.

Downstream users want these two tables stitched together and “Change Data Capture” implemented on this combined table. The initial thought was to create triggers in DB2 so that when data is updated in the “Slowly Changing Data Table” (SCDT) or the “Constantly Changing Data Table” (CCDT), the triggers update a “Combined Table” (CT).

This solution was deemed too expensive by the business; the triggers would drag down the performance of an already stressed DB2 system, and purchasing more processing power was not viable.

The architects thought, “OK – why don’t we replicate the SCDT and CCDT to AWS RDS Postgres and build the triggers in there? We can then scale the Postgres database cheaply.”

Sound thinking. We did a quick POC to confirm that triggers on a downstream database would fire on data flowing into its tables from Qlik Replicate, and it worked as expected.

The project jumped into build mode and soon we had development done and testing on the way.

Then it was time to do the “Stress and Volume” testing. 

That is when things started to unravel…

High Hopes but low TPS

Since the Development DB2 Z/OS database shares the same hardware as Production; we had to conduct our Stress and Volume Testing (SVT) on a weekend, so our customers were not adversely affected.

Two test runs were conducted: one SVT run without the database triggers enabled and another with them enabled.

The results were not good. Our SVT SME wrote in their summary:

We ran two load tests on Saturday Dec 9th – the first test with the database trigger OFF, the second test with the trigger ON.

Each test ran for about 1 hour with total 700k+ transactions at throughput 170tps to 190tps.

  1. During the first test with the trigger OFF, Qlik process T007 had a throughput of 50tps, well below the other Qlik processes at 180tps. This suggests the performance bottleneck might not be the database trigger. During the second test with the trigger ON, Qlik processes T007/T008 had a slightly lower throughput of 40tps, which means the database trigger adds a little more latency.

    In the middle of the 2nd test, Qlik Enterprise Manager broke and did not display performance graphs properly, hence I can’t provide graphs until Jon fixes that.

  2. After the 1-hour test stopped, it took another 3.5 hours for Qlik process T007/T008 to finish its backlog. That means T007/T008 need 4.5 hours in total to process the DB2 data produced in each hour of the test. Maths tells us 40:180 = 1:4.5. In other words, the “producer” (DB2 Database) works 4.5 times faster than the “consumer” (Qlik T007/T008).
  3. I noticed the memory usage of the DB2-related Qlik processes kept going up during the test. Not sure if that is the expected behaviour.

Since the bare-minimum baseline we had decided on internally was 300tps, this was not looking good.

What followed was a journey of performance diagnosis with many trials, tribulations, dead ends and, finally, a way forward.



Qlik Replicate – Where’s my slots gone?

Qlik Replicate is growing rapidly in our organisation.

Starting with one “proof of concept” server in DTL, we are now up to seventeen: a mixture of development, test and performance-testing servers, both on-prem and in the AWS and GCS clouds, depending on where the source database is located.

In trying to keep things neat and organised, we are constantly moving QR tasks from one server to another.

Anyway, the juniors came to me with the following error message when they were shifting a QR task with an AWS RDS Postgres source:

2024-01-31T16:39:38 [SOURCE_CAPTURE  ]E:  Can't resume task after replication slot was dropped. [1020101]  (postgres_endpoint_capture.c:444)

From my understanding (I wasn’t babysitting their process), they exported the old task with its endpoints and then reimported it into the new server.  They re-entered the passwords for the endpoints and then tried to resume from the stream position of the old task.

I couldn’t fault that logic.

As a test, we cloned that task on the same server and changed the target endpoint to a NULL connector.

With this test task we could

  1. Start the task from scratch
  2. Resume it from the stream position that the old task got up to

This proved that the new server had no issues contacting the database and resuming from a position on the postgres database.

In the end I guessed that there was a problem with the replication slot (or lack thereof) for that task on the Postgres server; I wasn’t sure specifically, but the task was looking for something that didn’t exist.
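
Not being a Postgres replication expert, I later learned you can at least see which logical replication slots exist on the source by querying pg_replication_slots. A rough sketch with Python and psycopg2 (the connection details below are placeholders, not our real ones):

import psycopg2

# List the logical replication slots on the source Postgres instance
conn = psycopg2.connect(host="my-rds-host", dbname="mydb", user="qlik_user", password="...")
with conn, conn.cursor() as cur:
    cur.execute("SELECT slot_name, plugin, active, restart_lsn FROM pg_replication_slots;")
    for slot in cur.fetchall():
        print(slot)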

I worked out the following steps to resolve the problem:

  1. Change the task’s target endpoint to a NULL connector.
  2. Start the task with the “Reload Target” option.
  3. Wait until the task has done the “Full load” and has entered CDC mode.
  4. Stop the task and change the target endpoint back to the original end point.
  5. Start the task again with “Advanced Run Options” and resume from the “Source change position” of the original task.

The task should now be happy.

Qlik Replicate – The million row challenge

It seems with every new project, the project teams come up with new and innovative ways to make my life “interesting” while managing Qlik Replicate.

The latest idea is to request that one of our major Qlik Replicate tasks, which sucks data out of a DB2 database and delivers it to AWS S3, be duplicated to deliver pretty much the same data to GCS.

When I got wind of this proposal, it was already too far into the solution design to propose another approach, like copying the data from GCS to AWS S3.  So it was left up to me to go and tell the owners of the DB2 system that another massive feed would be built off their database – basically doing the same thing.

With some persuasion I was able to get internal funding to research and implement Log Streaming to consolidate our reads from the DB2 system.  Hardware-wise, this required adding more memory and disk space to the QR servers to handle the log streaming.

As for the Log stream component of the solution, I started looking into the settings.  Since we are drawing a lot of data from DB2 – I am concerned about disk space and what performance impact the compression setting has on the task.

To test the impact; I set up the following test scenario:

The "Million row" challenge

To test, I created a very wide table containing integers and varchars in a test MS-SQL database (one I had access to).

With a Python script, a million rows were inserted into the table in batches of 10,000.  In the end the million rows equated to about 1.3 GB on the database.
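
The load script itself was nothing fancy – something along the lines of this sketch (the connection string, table and columns are placeholders; the real table was much wider):

import random
import string
import pyodbc

# Hypothetical connection and table - adjust to your own test database
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=my-test-sql;DATABASE=testdb;Trusted_Connection=yes;"
)
cur = conn.cursor()
cur.fast_executemany = True  # send each batch in one round trip

BATCH, TOTAL = 10_000, 1_000_000
sql = "INSERT INTO dbo.wide_table (id, some_int, some_text) VALUES (?, ?, ?)"

for start in range(0, TOTAL, BATCH):
    rows = [
        (i, random.randint(0, 1_000_000), "".join(random.choices(string.ascii_letters, k=50)))
        for i in range(start, start + BATCH)
    ]
    cur.executemany(sql, rows)
    conn.commit()  # commit per batch so the changes arrive in steady chunks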

I did ten test runs, increasing the compression in the log stream endpoint with each individual run.

 

Using Qlik Replicate’s Analytics, I could then get the metrics for the Log Stream task and for a task that reads from the log stream and outputs the data to a NULL endpoint.

I also grabbed the file sizes that the log stream created on the server’s drive.

Here are the results:

Compression level | Log Stream File Size (MB) | LS – CPU % | LS – Memory (MB) | LS – Source latency (sec) | LS – Target latency (sec) | Target – CPU % | Target – Memory (MB) | Target – Source latency (sec) | Target – Target latency (sec)
No compression | 1,974 | 7 | 134 | 1 | 3 | 4 | 126 | 3 | 4
Level 2 | 718 | 15 | 169 | 2 | 4 | 4 | 105 | 3 | 3
Level 3 | 690 | 14 | 156 | 0 | 3 | 4 | 118 | 2 | 2
Level 4 | 683 | 16 | 155 | 1 | 2 | 6 | 127 | 2 | 2
Level 5 | 675 | 21 | 281 | 1 | 5 | 6 | 156 | 3 | 3
Level 6 | 665 | 24 | 224 | 1 | 3 | 5 | 176 | 4 | 4
Level 7 | 663 | 23 | 182 | 1 | 3 | 4 | 160 | 4 | 4
Level 8 | 657 | 38 | 394 | 1 | 9 | 4 | 181 | 9 | 9
Level 9 | 657 | 46 | 345 | 1 | 10 | 6 | 189 | 12 | 12

Some notes on the results

  • The runs were done on a shared server, so other processes running at the same time could impact the CPU and memory metrics
  • Analytics only takes a snapshot of the metrics every 5 minutes, and the million rows are inserted into the database in less than 10 minutes, so we can only get one snapshot of the metrics for each run.  Depending on where the run was up to, this could impact the metrics

Conclusions

Like most things in IT, there is no magical setting that will cover all scenarios.  Everything is a compromise, and the settings will need to be adjusted to suit one’s particular source endpoint, the nature of the data being replicated, and how much latency can be tolerated versus disk space usage.

Here is what I learnt.

  • If you require the absolute minimum latency possible – log streaming is not your answer, for a couple of reasons:
    1. There is an inherent one-second delay writing batches from the source system to the log stream.
    2. All the data is written to disk and then read by the target QR task.  This means that the changes do not reside in memory and need to be read from disk.
  • Level 2 significantly reduces disk space usage with little compromise to speed and resource usage.
  • Level 4 is the best compromise of disk space, speed, and resource usage.  Anything above level 4 starts to strain the speed and resource usage with little improvement to disk space.

Log streaming implementation plan

For my scenario, I think Level 2 is the way to go, and I will base my testing around that.

My implementation plan is to implement the log streaming task into production and let it run by itself for a couple of weeks.

During that time, I will monitor how the log stream task behaves with large batches and how much data it writes to disk, as we cannot generate that amount of data in our performance-testing environments.

Once the log stream task has settled, I will start migrating the existing Qlik Replicate tasks to read from the log stream instead of directly from the database.

Qlik Replicate Expressions – Wait. That’s Illegal

A new team will be taking over out-of-hours on-call for Qlik Replicate, and I was tasked with training them up to diagnose and resolve problems with QR.

I thought it was a good idea to create some practice QR tasks in our development environment, intentionally put some problems in the tasks, and let the new team diagnose them in safe, no-pressure scenarios.

I did the obvious ones, like putting an incorrect password in an endpoint and a typo in a filter criterion.

Looking for another scenario I thought – “Nothing better than a divide by zero error.”

Wait. That’s Illegal – Part 1

I created a simple table with fields CURR_BALANCE = 1000 and ASSET_VALUE = 0, and added a new field called LVR with the expression:

$CURR_BALANCE / $ASSET_VALUE

Ran a full load expecting lots of nice and nasty error messages for the new team to investigate.

Hmm – no errors.

That’s not what I expected.

All the rows were transferred across and there were no errors or warnings in the log file, no data errors or anything.

I bumped up all the logging to “Trace” and reran the full load.  Once again, no error messages.

Redirecting the output from the task to a Kafka topic, I noticed the records that had an intentional div/0 scenario were coming through with LVR as null.

{
  "magic": "atMSG",
  "type": "DT",
  "headers": null,
  "messageSchemaId": null,
  "messageSchema": null,
  "message": {
    "data": {
      "ACCOUNT_ID": 5,
      "ACCOUNT_NAME": "Elissa",
      "CURR_BALANCE": "10000.00",
      "ASSET_VALUE": "0.00",
      "LVR": null
    },
    "beforeData": null,
    "headers": {
      "operation": "REFRESH",
      "changeSequence": "",
      "timestamp": "",
      "streamPosition": "",
      "transactionId": "",
      "changeMask": null,
      "columnMask": null,
      "transactionEventCounter": null,
      "transactionLastEvent": null
    }
  }
}

Then I remembered – CDC expressions are built on SQLite syntax.  Looking up how SQLite handles div/0 scenarios: by default SQLite will return a NULL instead of failing the query on a div/0.

Quoting from SQLite – Difference Between Engines:

SQLite does not raise arithmetic exceptions (eg. divide by zero, 1/0). SQLite returns a NULL value for 1/0.
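
You can reproduce this with Python’s built-in sqlite3 module (a quick sketch – QR’s embedded SQLite build may of course differ slightly):

import sqlite3

con = sqlite3.connect(":memory:")

# SQLite quietly returns NULL for a divide by zero instead of raising an error
print(con.execute("SELECT 1 / 0").fetchone())            # (None,)
print(con.execute("SELECT 10000.00 / 0.00").fetchone())  # (None,)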

Wait. That’s Illegal – Part 2

I got past my div/0 conundrum after the explanation of what was happening and had a look at a “valid” record.

{
    "magic": "atMSG",
    "type": "DT",
    "headers": null,
    "messageSchemaId": null,
    "messageSchema": null,
    "message": {
        "data": {
            "ACCOUNT_ID": 1,
            "ACCOUNT_NAME": "Alice",
            "CURR_BALANCE": "12345.00",
            "ASSET_VALUE": "500000.00",
            "LVR": 0        <------------- Huh?
        },
        "beforeData": null,
        "headers": {
            "operation": "REFRESH",
            "changeSequence": "",
            "timestamp": "",
            "streamPosition": "",
            "transactionId": "",
            "changeMask": null,
            "columnMask": null,
            "transactionEventCounter": null,
            "transactionLastEvent": null
        }
    }
}

Again our LVR value was wrong.

12345.00 / 500000.00 should be 0.02469, not 0.

A couple of things popped to mind.

It looked like the classic integer division issue that always catches out new (and experienced) programmers.

But that didn’t make any sense since the source fields were both numeric(18,2).

I checked the destination field that I added.

It was set to numeric(18,2), which isn’t precise enough for the full LVR value, but we still should have got 0.02 instead of 0.

OK – QR must not like the “custom” precision of numeric(18,2) for the LVR field. 

I tried forcing the LVR field data type to REAL(4).

Still got the incorrect value for LVR

Maybe for some reason QR/SQLite treats the values coming in as integers, even though the source data type is numeric(18,2) on the database.

Let’s give this a go:

CAST($CURR_BALANCE AS NUMERIC) / $ASSET_VALUE

^^ Nope

CAST($CURR_BALANCE AS NUMERIC) / CAST($ASSET_VALUE AS NUMERIC) 

^^ Long shot but nope

CAST($CURR_BALANCE AS REAL) / $ASSET_VALUE 

^^ At last success

{
    "magic": "atMSG",
    "type": "DT",
    "headers": null,
    "messageSchemaId": null,
    "messageSchema": null,
    "message": {
        "data": {
            "ACCOUNT_ID": 1,
            "ACCOUNT_NAME": "Alice",
            "CURR_BALANCE": "12345.00",
            "ASSET_VALUE": "500000.00",
            "LVR": "0.0246"  <------------- Finally!
        },
        "beforeData": null,
        "headers": {
            "operation": "REFRESH",
            "changeSequence": "",
            "timestamp": "",
            "streamPosition": "",
            "transactionId": "",
            "changeMask": null,
            "columnMask": null,
            "transactionEventCounter": null,
            "transactionLastEvent": null
        }
    }
}

It works – but forcing a precise numeric value into a floating-point data type like REAL leaves me uneasy, especially in a finance world where precision is needed.

I haven’t researched or tested whether there is a risk of generating an imprecise number through this division; something I will need to put on my todo list.
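
Out of curiosity, the behaviour is reproducible in plain SQLite via Python if I assume the column values arrive in the expression engine as text (which is consistent with what I saw, though I haven’t confirmed how QR binds them internally):

import sqlite3

con = sqlite3.connect(":memory:")

# Text operands get NUMERIC affinity: '12345.00' and '500000.00' become the
# integers 12345 and 500000, so the division is integer division.
print(con.execute("SELECT '12345.00' / '500000.00'").fetchone())                                    # (0,)
print(con.execute("SELECT CAST('12345.00' AS NUMERIC) / CAST('500000.00' AS NUMERIC)").fetchone())  # (0,)

# Casting one side to REAL forces a floating point division.
print(con.execute("SELECT CAST('12345.00' AS REAL) / '500000.00'").fetchone())                      # (0.02469,)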

Conclusion

My design principle for Qlik Replicate tasks is to keep them as simple as possible.

Grab the data off the source and get it to the destination as quickly as possible.

As a design principle, I am against putting expressions in QR tasks:

  1. It is not as feature rich as other programming languages 
  2. It hides business rules, adding a black box and complexity to the pipeline

Point 2 was the big one for me.  I didn’t want downstream users messaging me every time there was an abnormality in the data.  With no expressions in the QR tasks – I could rightfully say, “QR just grabs what it finds on the source database – go and talk to them”

Now, with the discovery of these little “features” that can come from expressions – and, more importantly, no errors or warnings about them – I have even more incentive not to program in expressions.

QR in my eyes is not an “ETL” tool – it is just an “EL” tool.

If your designers want complex expressions and transformations programmed into their data pipelines – it should be the downstream developers’ task.

That’s why they get paid the big bucks.


Qlik Replicate – Simple trick for Source file connectors

Quick post on a trick I use in Qlik Replicate

Qlik Replicate has a Source connector to consume flat files from a drive. 

More details can be found on the Qlik website

For change data capture of these files; Qlik will only consume files that have a newer “Last Change Date” than when the task is started.

So if for some reason you have to stop and reload files that are already present; you have to change the “Last Modified Date”

A simple Powershell command that I got from itechtics.com can be used to reset all the dates in a directory of your choice.

Get-ChildItem -force PathToItem * | ForEach-Object{$_.LastWriteTime = ("yyyy-mm-dd hh24:mi:ss")}

So to change all the dates of the files in the directory C:\temp\load to the current time (as of writing):

Get-ChildItem -force c:\temp\load * | ForEach-Object{$_.LastWriteTime = ("2023-06-08 11:05:00")}

If a Qlik Replicate task is running and watching this directory; it will reload the file my_load.file.csv.
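
If PowerShell is not an option, a rough Python equivalent (the directory path is just an example) is to “touch” the files so their modified time becomes now:

import os
import pathlib

# Bump the "Last Modified Date" of every file in the directory to the current time
for path in pathlib.Path(r"C:\temp\load").glob("*"):
    if path.is_file():
        os.utime(path)  # no times argument = set access/modified time to now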

Qlik Enterprise Manager – Upgrading from 2021.05 to 2022.11

Because our developers wanted to take advantage of new features in Qlik Enterprise Manager (QEM), we had to upgrade our system from 2021.05 to 2022.11.

This was not a smooth process, and we encountered several issues along the way.

First Attempt – Straight Upgrade

Simple to upgrade?

Just download the latest installer from Qlik’s website and kick it off?

Unfortunately, I gave the developers the keys to our dev QEM box to do the upgrade before Christmas, as I was busy with other tasks.

They gave it a go without a backout plan.  

Couple of days later I tried logging into the QEM console and couldn’t get on.

The AttunityEnterpriseManager service kept restarting every minute, and the following errors were in C:\Program Files\Attunity\Enterprise Manager\data\logs\EnterpriseManager.log:

[START: 2023-03-01 09:40:18 Attunity EnterpriseManager Version: 2022.11.0.335]
1 2023-01-02 09:40:18 [Command ] [INFO ] Executing ServiceRunCommand command.
6 2023-01-02 09:40:19 [Host ] [INFO ] Setting up web host
6 2023-01-02 09:40:20 [Host ] [INFO ] The server will listen on the following location: http://xxxxxxx.xxx.xxx/attunityenterprisemanager
6 2023-01-02 09:40:20 [Host ] [INFO ] The server will listen on the following location: https://xxxxxx.xxx.xxx/attunityenterprisemanager
6 2023-01-02 09:40:20 [Host ] [ERROR] Object reference not set to an instance of an object.
6 2023-01-02 09:40:20 [Host ] [INFO ] Stopping service on error...

I tried the usual desperation tricks of restarting the service and then restarting the box.

In the end I had to uninstall QEM and reinstall 2021.05 as there was no backout plan.

The developers were subsequently banned from doing any more work on the QEM upgrade.

Second attempt – Two jump upgrade

Next attempt was upgrading to 2022.05 and then jumping to 2022.11.

This time I took a backup of C:\Program Files\Attunity\Enterprise Manager\data so I could easily back out the change if I had problems.

The upgrade from 2021.05 to 2022.05 went smoothly; but when jumping from 2022.05 to 2022.11 the same error occurred.

I then raised a case with Qlik and also posted on their forum asking if anyone else had come across this problem.

Final Success

With help from Qlik support we developed an upgrade workaround.  Qlik said that there is a patch in the works, as other users have had the same problem.

If you follow along – note that this method might differ for your system.  In the three upgrades I have done internally; they all differed slightly.

Make sure you TEST and HAVE A BACKOUT PLAN

Upgrade Prerequisites

This upgrade was done on a Windows 2019 server.

You will need:

  • Admin rights to QEM’s server
  • Licenses for QEM and Analytics
  • The username/password that QEM uses to connect to the analytics database
  • A privileged user account for the analytics database
  • Password for the user that you use to connect QR nodes to QEM
  • A SQLite database client
  • Handy to have – an editor that can validate json format

Method

  1. Log on to QEM console
  2. For each server under the server tab – Right click and go to “Stop Monitoring”
  3. Remote desktop to the QEM server
  4. Stop the AttunityEnterpriseManager service
  5. Back up C:\Program Files\Attunity\Enterprise Manager\data
  6. Run the upgrade to QEM 2022.05
  7. Once the upgrade is done; log onto QEM and ensure that version 2022.05 is working correctly
  8. Back up C:\Program Files\Attunity\Enterprise Manager\data again
  9. Stop the AttunityEnterpriseManager service again
  10. Run following command from the command line:
    cd "C:\Program Files\Attunity\Enterprise Manager\bin"
    aemctl repository export -f C:\temp\qem_repo_202205.json -w

  11. Script out the Postgres analytics database with the following commands:
    cd "C:\Program Files\PostgreSQL\10\bin"
    pg_dump -U <username> -s -C -c <analyticdbname> > C:\temp\analyticdbname.sql

  12. Edit C:\temp\analyticdbname.sql.  Find and replace "<analyticdbname>" with "cdcanalytics_2022_11_p".  Save the file
  13. Run the following command to create a new temporary database on the Postgres server:
    psql -d postgres -U <username> -f C:\temp\analyticdbname.sql

  14. This is where the problem lay in our instance.  Open up C:\temp\qem_repo_202205.json
    Find any code block with "type": "AttunityLicense" in it and remove the whole code block, then save the edited file as C:\temp\qem_repo_202205_Fix.json (it gets imported in step 20).  See the sketch after this list for a scripted way to do this.
    1. On our dev system it was the section with "name": "Replication Analytics", but in prod it was the section with "name": "Replication Management".
    2. You may need to experiment to find the correct section
  15. Go to Add/Remove and uninstall Enterprise manager
  16. Once removed, delete the folder C:\Program Files\Attunity\Enterprise Manager\data
  17. Install QEM 2022.11.  When prompted do not install Postgres
  18. Log into QEM and check that it is up and running correctly.
  19. Stop the AttunityEnterpriseManager service again
  20. Run the following command:
    aemctl repository import -f C:\temp\qem_repo_202205_Fix.json -w

  21. Start the AEM service and check that it is running correctly
  22. Re-add the QEM and Analytics licenses
  23. Go to Settings -> Repository connections.
    Enter the username and password, and set the database to cdcanalytics_2022_11_p
  24. Initialize the repository and start the connector
  25. Stop the AttunityEnterpriseManager service again
  26. Open up C:\Program Files\Attunity\Enterprise Manager\data\Java\GlobalRepository.sqlite in an SQLite editor
  27. Browse the table objects
    Look for name = "AnalyticsConfiguration" and modify the JSON value "cdcanalytics_2022_11_p" back to the original analytics database name
  28. Save SQLite database
  29. Start the AEM service and check that it is running correctly
  30. Go to the server tab.  Edit each server and re-add the password.  Click on the button “Monitor this server’s tasks and messages”.  Test the connection and click OK
  31. PVT QEM
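
For step 14, rather than hand-editing the export, a rough Python sketch like the one below can strip the license blocks and write the _Fix file.  It assumes the export is plain JSON with the license entries appearing as objects containing "type": "AttunityLicense"; as noted in step 14 you may only need to remove one of them, so treat this as a starting point and diff the output against the original.

import json

# Load the repository export produced by "aemctl repository export"
with open(r"C:\temp\qem_repo_202205.json") as f:
    repo = json.load(f)

def strip_licenses(node):
    """Recursively drop any JSON object whose "type" is "AttunityLicense"."""
    if isinstance(node, dict):
        return {key: strip_licenses(value) for key, value in node.items()}
    if isinstance(node, list):
        return [strip_licenses(item) for item in node
                if not (isinstance(item, dict) and item.get("type") == "AttunityLicense")]
    return node

# Write the cleaned export that gets imported back in step 20
with open(r"C:\temp\qem_repo_202205_Fix.json", "w") as f:
    json.dump(strip_licenses(repo), f, indent=4)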

If things go wrong

Backout process

  1. Uninstall QEM
  2. Delete the folder C:\Program Files\Attunity\Enterprise Manager\data
  3. Reinstall either QEM 2021.05 or QEM 2022.05
  4. Stop the AttunityEnterpriseManager service
  5. Restore the appropriate backed up data folder to C:\Program Files\Attunity\Enterprise Manager\data
  6. Restart the AEM service

Why can’t Qlik Replicate and Enterprise manager be friends?

A nice quiet evening with the family was interrupted with a critical alert coming through my phone.

Once again server patching had knocked off our Qlik Replicate node.

Logging in I could see that Enterprise manager could not reach one of our nodes with the following error message in the log:

2022-12-15 19:15:40 [ServerDto ] [ERROR] Test connection failed for server:MY_QR_CLUSTER. Message:'Unable to connect to the remote serverA connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond xxx.xxx.xxx.xxx:443'.

I have seen this problem before and it is usually resolved by:

  1. Failing over the QR windows cluster
  2. Restarting the new passive node

Our IT team is aware of the problem and has been researching a cause and a fix.  Their prevailing theory was that when the cluster is failed over during server patching, there are some residual connections to the previously active node.

But tonight, after multiple failovers and stopping and starting the QR roles, Enterprise Manager still couldn’t connect to that QR node.

I did the following checks:

  1. The repsrv.log log file had no error messages and the Web UI service was active
  2. From the Enterprise manager; I could ping the QR cluster address and the active node successfully 
  3. From a Chrome session on the Enterprise manager server; I could not get to the QR Web console
  4. From a Chrome session on the QR server; I could get to the QR Web console

A senior IT member joined the troubleshooting call and suggested that we reboot the Enterprise manager server.

So we did – and then we couldn’t access the Enterprise Manager console at all.

At this point I wanted to quit IT and become a nomad in Mongolia.

Then the senior IT member worked it out.

The Windows server was randomly turning on the Windows Firewall

This was blocking our inbound connections; making the console inaccessible from other locations – except when you were logged onto the server.

This also explains why, when this problem previously arose, restarting the server would eventually work: the server group policy would eventually get applied and turn the Windows Firewall back off.

Lessons learnt

If you come across this problem in your environment, try accessing the QR console from multiple locations:

  • From the Enterprise Manager server
  • From within the QR server with a local address like: https://localhost/attunityreplicate/login/

Good luck

Qlik Replicate – “Json doesn’t start with ‘{‘ [xxxxxxx] (at_cjson.c:1773)” error

Had a shocker of a week.

You know those weeks; where everything went wrong.

I was busy fixing other systems, so after operating system patching by our IT team I didn’t look closely at our Qlik Replicate nodes, which had been running smoothly over the past year.

After all, there were no alerts, and at a quick glance all our tasks were in a running status and none were suspended or in error.

The next day, one of our junior admins pointed out some Qlik tasks using a high amount of memory.

I looked in and my stomach dropped.

Although the task was “green” and running; no changes were getting through to the destinations (AWS S3 and GCS).  The log file was filled with errors like:

00002396: YYYY-MM-DDT15:21:14 [AT_GLOBAL ]E: Json doesn't start with '{' [xxxxxxx] (at_cjson.c:1773)
00002396: YYYY-MM-DDT15:21:14 [AT_GLOBAL ]E: Cannot parse json: [xxxxxxx(at_protobuf.c:1420)

And hundreds of thousands of transactions were waiting to be written out.

The problem only existed on one QR cluster, and only for jobs that were writing to AWS S3 and GCS; the Kafka one was fine.  The other QR clusters were running fine.

The usual “turn it off and on again” didn’t work – neither stopping and resuming the task, nor restarting the server.

In the end I contacted Qlik Support.

They hypothesised that the blanket patching caused the Qlik Replicate cluster to fail over and corrupt the captured changes stored up waiting to be written out in the next batch.  When QR tried to read the captured changes, the JSON was corrupted.

Their fix strategy was:

  1. Stop the task
  2. Using the log file, find the last successful time or stream position that the task processed.  This is usually found at the end of the log file.
  3. Using the Run -> Advanced Run Options, restart the task from that last written-out position.

If this didn’t work, they recommended rebuilding the whole task and then following the above steps.

Luckily their first steps worked.  After finding the correct timestamps we could restart the QR tasks from the correct position.

I am now looking into some alerting to catch this problem sooner if it happens again.

Qlik Replicate – “SYS-E-HTTPFAIL, SYS-E-UNRECREQ, Unrecognized request pattern” problem

After a weekend of patching in our dev environment, we came in to discover that half of our Qlik Replicate nodes were offline in Enterprise Manager, with the following error message:

4 2022-08-08 13:19:24 [ServerDto      ] [ERROR] Test connection failed for server:MY_QRNODE. Message:'SYS-E-HTTPFAIL, SYS-E-UNRECREQ, Unrecognized request pattern 'GET: /servers/local/adusers/devdomain/qlik_user?directory_lookup=false'..'.

We could still access the affected nodes through their individual web GUIs, and the tasks were still happily running in the background.  The other half of our QR nodes were still online in Enterprise Manager.

The usual troubleshooting of restarting the services and restarting the server didn’t fix the problem; and the patching team washed their hands of the problem.  (I can’t really list the patches they applied due to security issues)

A database administrator mentioned it might be a TLS problem, as the server team had been changing TLS settings in various places.  This led me to compare the connection settings between a QR node that was working and one that was not (we also checked other things like patch levels).

Some servers had the username in domain\username format, while others didn’t.  Coincidentally, the ones that had the domain\username format were the ones not working.

Removing the domain section of the username resolved the problem.

So I am not sure what in the patching caused this issue, but it is something we will have to check on our prod servers before the patches are applied to them.

Qlik Replicate – The Wide Text file problem

A wide problem

We had a new business requirement for our team to create a new source for a CSV through Qlik Replicate.

OK – simple enough; create the Source Connector and define the file’s schema in the table section of the connector.

But the file was 250 columns wide.

I have had nightmares before of defining file schemas in programs like SSIS, where one little mistake can lead to hours of debugging.

A stroke of Luck

Luckily we knew the source of the CSV file was an output from a view of a database; so the first part of the problem was solved – we had the schema.  Now to import the schema into a Qlik Replicate task.

  1. Create a new File Source connector and prefill as much known information as you can
  2. In the Tables definition of the File Source; add a dummy field
  3. Create a NULL Target connector
  4. Create a dummy task with your new file source and the Null target and save it
  5. Export task with End Points to save out the json definition of the task

Creating the table definition in JSON

Knowing the view definition, a simple Python script can convert it to JSON.

import csv
import json

if __name__ == '__main__':
    source_file = "FileDef.txt"
    field_array = []

    # FileDef.txt is tab delimited: column name, data type, nullable, column order
    with open(source_file, 'r') as f:
        reader = csv.reader(f, delimiter='\t')
        schema_items = list(reader)

    for item in schema_items:
        focus_item = {}
        focus_item["name"] = item[0]
        focus_item["nullable"] = True
        focus_string = item[1]

        if focus_string.find("NUMBER") != -1:
            # e.g. NUMBER(10,6) -> precision 10, scale 6
            focus_item["type"] = "kAR_DATA_TYPE_NUMERIC"
            tokens = focus_string[focus_string.find("(") + 1: -1].split(",")

            focus_item["precision"] = int(tokens[0])

            if len(tokens) == 2:
                focus_item["scale"] = int(tokens[1])

        elif focus_string.find("VARCHAR2") != -1:
            # e.g. VARCHAR2(200 BYTE) -> length 200
            focus_item["type"] = "kAR_DATA_TYPE_STR"
            length = focus_string[focus_string.find("(") + 1: focus_string.find("BYTE") - 1]
            focus_item["length"] = int(length)

        elif focus_string.find("DATE") != -1:
            focus_item["type"] = "kAR_DATA_TYPE_TIMESTAMP"

        field_array.append(focus_item)

    columns = {"columns": field_array}

    with open("out.json", "w") as f:
        f.write(json.dumps(columns, indent=4))

FileDef.txt

The following text file contains the schema of the view, with tab-delimited columns of:

  • Column Name
  • Data type
  • Nullable
  • Column order
UDF_VARCHAR1 VARCHAR2(200 BYTE) Yes 1
UDF_VARCHAR2 VARCHAR2(200 BYTE) Yes 2
UDF_VARCHAR3 VARCHAR2(200 BYTE) Yes 3
UDF_NUMERIC1 NUMBER(10,6) Yes 4
UDF_NUMERIC2 NUMBER(10,6) Yes 5
UDF_NUMERIC3 NUMBER(10,6) Yes 6
UDF_INT1 NUMBER(10,0) Yes 7
UDF_INT2 NUMBER(10,0) Yes 8
UDF_INT3 NUMBER(10,0) Yes 9
UDF_DATE1 DATE Yes 10
UDF_DATE2 DATE Yes 11
UDF_DATE3 DATE Yes 12

This particular view definition is from Oracle, so it uses Oracle data types; but it would be simple to change the code to use a definition from an MS SQL database or another source.

Sticking it all back together

  1. Open up the exported task JSON file
  2. Find the dummy field that you created in the table definition
  3. Overwrite the dummy field with the JSON created by the script
  4. Save and reimport the job.

If all goes well; the File source connector will be overwritten with the full table definition.
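
If you would rather script the splice as well, here is a rough sketch.  It assumes the dummy column was named DUMMY_FIELD, that it sits inside a "columns" list somewhere in the exported task JSON, and that the file names match the ones used above – adjust to whatever your export actually looks like.

import json

# Load the exported task definition and the column list generated by the script above
with open("exported_task.json") as f:
    task = json.load(f)
with open("out.json") as f:
    real_columns = json.load(f)["columns"]

def replace_dummy(node):
    """Find the 'columns' list that holds the dummy field and swap in the real schema."""
    if isinstance(node, dict):
        cols = node.get("columns")
        if isinstance(cols, list) and any(isinstance(c, dict) and c.get("name") == "DUMMY_FIELD" for c in cols):
            node["columns"] = real_columns
            return True
        return any(replace_dummy(value) for value in node.values())
    if isinstance(node, list):
        return any(replace_dummy(item) for item in node)
    return False

if replace_dummy(task):
    with open("exported_task_full.json", "w") as f:
        json.dump(task, f, indent=4)
else:
    print("Dummy field not found - check the task export structure")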