Had a shocker of a week.
You know those weeks, where everything goes wrong.
I was busy fixing other systems, so after our IT team finished operating system patching I didn’t look closely at our Qlik Replicate nodes, which had been running smoothly over the past year.
After all, there were no alerts, and at a quick glance all our tasks were in a running status; none were suspended or in error.
The next day one of our junior admins pointed out some Qlik tasks using a high amount of memory.
I looked in and my stomach dropped.
Although the task was “green” and running, no changes were getting through to the destinations (AWS S3 and GCS). The log file was filled with errors like:
00002396: YYYY-MM-DDT15:21:14 [AT_GLOBAL ]E: Json doesn't start with '{' [xxxxxxx] (at_cjson.c:1773)
00002396: YYYY-MM-DDT15:21:14 [AT_GLOBAL ]E: Cannot parse json: [xxxxxxx] (at_protobuf.c:1420)
And hundreds and thousands of transactions were waiting to be written out.
The problem only existed on one QR cluster, and only on jobs that were writing to AWS S3 and GCS; the Kafka one was fine. The other QR clusters were running fine.
The usual “turn it off and on again” didn’t work, whether that was stopping and resuming the task or restarting the server.
In the end I contacted Qlik Support.
They hypothesised that the blanket patching caused the Qlik Replicate cluster to fail over and corrupt the captured changes that were stored up waiting to be written out in the next batch. When QR tried to read the captured changes, the JSON was corrupted.
Their fix strategy was:
- Stop the task
- Using the log file, find the last time or stream position that the task successfully processed. This is usually found at the end of the log files.
- Using the Run -> Advanced Run Options menu item, restart the task from the time last written out.
If this didn’t work, they recommended rebuilding the whole task and then following the above steps.
Luckily their first steps worked. After finding the correct timestamps we could restart the QR tasks from the correct position.
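For anyone hunting for that restart point by hand, here is a rough Python sketch of how the log search could be scripted. It is not something Qlik Support supplied. It assumes every log line follows the same layout as the two error lines quoted above (thread id, ISO timestamp, module in square brackets, a one-letter level, then the message), and the function name `last_clean_timestamp` is made up for illustration. It only gives a candidate time; the real restart position should still be checked against what the log says was last written out.

```python
import re
from datetime import datetime

# Assumed layout, inferred from the error lines quoted in this post:
#   <thread id>: <timestamp> [<MODULE> ]<LEVEL>: <message>
LINE_RE = re.compile(
    r"^(?P<thread>\d+):\s+"
    r"(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\s+"
    r"\[(?P<module>[^\]]*)\](?P<level>[A-Z]):\s*(?P<msg>.*)$"
)


def last_clean_timestamp(lines, marker="Cannot parse json"):
    """Return the timestamp of the last non-error line seen before the
    first line whose message contains `marker`.

    If the marker never appears, the last non-error timestamp in the
    whole input is returned. Returns None if no line qualifies.
    """
    last_ok = None
    for line in lines:
        m = LINE_RE.match(line.rstrip("\n"))
        if m is None:
            continue  # skip lines that do not fit the assumed layout
        if marker in m.group("msg"):
            break  # corruption starts here, stop looking
        if m.group("level") != "E":
            last_ok = datetime.strptime(m.group("ts"), "%Y-%m-%dT%H:%M:%S")
    return last_ok
```

You would call it with an open log file, for example `last_clean_timestamp(open(path, encoding="utf-8"))`, where `path` points at the task log. Treat the result as a starting point for the Advanced Run Options timestamp, not the final answer.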
Now I’m looking into some alerting so this problem doesn’t go unnoticed again.
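As a first idea for that alerting, here is a minimal Python sketch that counts error-level lines in a chunk of task log, since the task status alone stayed “green” the whole time. It assumes error lines look like the ones quoted earlier (a module in square brackets followed by `E:`). The names `error_counts` and `should_alert` and the default threshold of 10 are my own placeholders, not anything from Qlik Replicate.

```python
import re
from collections import Counter

# Assumed layout of an error line, inferred from the samples in this post:
#   <thread id>: <timestamp> [<MODULE> ]E: <message>
ERROR_RE = re.compile(
    r"^\d+:\s+\S+\s+\[(?P<module>[^\]]*)\]E:\s*(?P<msg>.*)$"
)


def error_counts(lines):
    """Count error-level log lines per module name."""
    counts = Counter()
    for line in lines:
        m = ERROR_RE.match(line.rstrip("\n"))
        if m:
            counts[m.group("module").strip()] += 1
    return counts


def should_alert(lines, threshold=10):
    """Return True when the number of error lines reaches `threshold`."""
    return sum(error_counts(lines).values()) >= threshold
```

The idea would be to run this on a schedule against the most recent part of each task log and raise an alert when `should_alert` returns True. How the alert gets delivered, and what threshold makes sense, depends on the environment.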