Instead of getting a bit of a sleep in and then making my way out of bed to make my wife “Mother’s Day Breakfast in Bed” (i.e. An absurd of bacon on toast with an egg perched on top) – I get a call from our first responders saying that after Windows patching he could not get back onto Qlik Enterprise manager to restart our Qlik tasks.
As I grumbled out of bed, I was hoping it was something simple like a Windows Defender firewall getting turned back on.
Little did I know that I ended up working on the problem most of the day; meaning my wife missed her breakfast in bed, brunch in bed, lunch in bed and wine in bed.
Starting from the start
Our Midrange team washed their hands of the issue and it was over to us to get Qlik Enterprise Manager
Logging into my PC and going to the usual QEM address; I got greeted with the Chrome error of:
ERR_CONNECTION_CLOSED
Rightio.
To rule out the usual suspects of browser or VPN; I tried (sigh) Microsoft Edge and also Chrome from a virtual machine that was located in the internal network.
Again returned that ERR_CONNECTION_CLOSED error message
I next tried logging on to the QEM server itself. The server has Chrome installed on it – useful for testing for firewall or DNS problems.
I tried:
Using the DNS name
Using the IP address
Using localhost
This time I got a different error message:
ERR_CONNECTION_RESET
Off to Qlik for help
I raised a case with Qlik Support for help. They speculated it was a DNS issue – which didn’t make sense to me. If it was a DNS issue – I would have expected the IP address to work.
I still had to answer a lot of questions and run lots of nslookup commands. No – the DNS name hasn’t changed. No – the IP address hasn’t change
Nothing really pointed to where the issue was originated from.
SkyNet to the rescue
While waiting for Qlik to get back to me and since we have access to SkyNet Google Gemini Enterprise – curiously I asked the Gen AI.
In a browser on the server myqemaddress; I am trying access Qlik Enterprise manager on address https://myqemaddress/attunityenterprisemanager/ but getting an error ERR_CONNECTION_RESET.
What could be the reason I am getting this error message?
It returned the two usual suspects:
Service is down (Checked – no errors in the Enterprise manager logs)
Proxy issue (We don’t have a proxy between QEM and the clients)
But the third suggestion was interesting:
3. Verify Port Binding and SSL
Enterprise Manager enforces HTTPS (defaulting to port 443).
If the SSL certificate is not correctly bound to the port, the connection will reset.
Check if the port is listening: Open a command prompt as Administrator and run: netstat -ano | findstr :443 If you don’t see a listener, the service hasn’t claimed the port.
Verify SSL Binding: Run: netsh http show sslcert | findstr :443 If no certificate is listed, the HTTPS setup is incomplete.
netsh http show sslcert | findstr :443
IP:port : [::]:443
I ran the same command on our DEV Enterprise manager server that was working and got:
So prod was missing the wildcard address of 0.0.0.0:443
I added it back in using the guide that we have for updating certificates and it worked. Thank you SkyNet Google Gemini Enterprise! You saved the rest of Mother’s day
Why was the SSL binding deleted?
Well that is the question that I’d like to know.
I could see in the logs that the certificate was deleted when patching was happening:
I sent an incident ticket off to Midrange. So far I haven’t heard back from them.
But at least QEM is up and running and the issue is documented so that if you come to this page in desperation; you can get QEM up and running as well and enjoy Mother’s day.
It is a new day and a new target endpoint in Qlik Replicate for us.
What is different; it is a new target endpoint – Big Query; which we have not used before.
Usually for our Date Lake, the steps are Source System -> Qlik Replicate -> JSON files -> GCS -> DBT -> Big query.
But this use case was different, a more tactical solution to get some tables off a legacy Data Warehouse straight into Big Query and bypassing the GCS and DBT combination.
Sharing Credentials.
In our organisation credentials are stored in their own project and are separated from other GCP resources. This is so that the credentials are centrally located in one project and we are not duplicating credentials across multiple projects.
The Cloudies created a Service Account for us in project_x and pointed to a “scratchpad” dataset in project_y for Qlik Replicate to create the tables and copy across the data. We can then use the generated table schema to generate tables subsequent environments where Qlik doesn’t have create permissions.
But the problem was that the credentials were in project_x; but I could not force Qlik Replicate to write to project_y. Initially I thought I needed to prefix the schema with the project; like how it appears in big query:
SELECT * FROM `project_y.scratchpad.my_table`
This didn’t work and looking at the log file with verbose turned up; I could still see it was trying to write to project_x.
While I was waiting for replies on the forum and trying a couple of suggestions; I had a look at the credential json that I downloaded from Secrets manager:
We did an initial run in development as a POC and to shake out any permission issue. So far; no issues and the destination tables were created and written to in Big Query. Once we are a bit more stable; I will do a couple of rough benchmarks between writing to Big Query vs to GCS with json.
The Cloudies and Data Lake developers are now trying to work out what is the best way to manage the permissions by their infrastructure code in higher environments. In the meantime, I am modifying our QR task migration pipeline to have a variable destination table schema as the schema (which relates to dataset in Big Query) will change moving from environment to environment. It should be easy to do since we have python wrapper classes over the top of the Qlik Replicate json code. It will just be a new method to change table owners to a passed in variable.
Just as a safeguard I also did a google search and a forum search to see if anything else turned up.
Again nothing.
I pinged back to my manager.
“QR should capture deletes simply fine. Where did you hear that QR had problems with MongoDB deletes from?”
“Oh. The project manager heard from a Mongo guy that Qlik does not work.”
I bit my lip. To me that seemed the same as getting health advice off a random guy at the pub who saw something on Telegram.
But I realised now that even though I had official Qlik documentation backing me up; the burden of proof was back on me.
Back into the world of containers
Qlik Replicate in a docker container has been an extremely useful tool on my Linux development machine. It enables me to quickly test various aspects of QR without interrupting the official dev environment; or fighting with the cloud team to get a proof-of-concept source system set up.
First, I had to set up a MongoDB docker container. Specifically, the MongoDB had to have a “replica”; a standalone wouldn’t work. Since this was a POC; I didn’t need a fully fledge – multi replica – ultra secure cluster; just something simple and disposable.
The great thing is that we don’t have to mess around installing extra ODBC drivers that can make the Qlik Replicate docker image complex.
So with the MongoDB and the Qlik Replicate docker containers up and running; we can how create a new endpoint with the following settings:
Field
Variable
MongoDB Deployment
Standard
Hosts
host.docker.internal
Authentication Method
SCRAM-SHA-256
Username
root
Password
root
Authentication database name
admin
A test connection confirmed that the connection was working as expected.
Proving that deletes are working
Getting close to resolving the burden of proof. I got our GenAI to create a simple python script to add, update, and delete some data from the Mongo Database and set it running:
import pymongo
import random
import string
import time
def generate_random_string(length=10):
"""Generate a random string of fixed length."""
letters = string.ascii_lowercase
return ''.join(random.choice(letters) for i in range(length))
def add_random_data(collection):
"""Adds a new document with random data to the collection."""
new_document = {
"name": generate_random_string(),
"age": random.randint(18, 99),
"email": f"{generate_random_string(5)}@example.com"
}
result = collection.insert_one(new_document)
print(f"Added: {result.inserted_id}")
def update_random_data(collection):
"""Updates a random document in the collection."""
if collection.count_documents({}) > 0:
# Get a random document using the $sample aggregation operator
try:
random_doc_cursor = collection.aggregate([{"$sample": {"size": 1}}])
random_doc = next(random_doc_cursor)
new_age = random.randint(18, 99)
collection.update_one(
{"_id": random_doc["_id"]},
{"$set": {"age": new_age}}
)
print(f"Updated: {random_doc['_id']} with new age {new_age}")
except StopIteration:
print("Could not find a document to update.")
else:
print("No documents to update.")
def delete_random_data(collection):
"""Deletes a random document from the collection."""
if collection.count_documents({}) > 0:
# Get a random document using the $sample aggregation operator
try:
random_doc_cursor = collection.aggregate([{"$sample": {"size": 1}}])
random_doc = next(random_doc_cursor)
collection.delete_one({"_id": random_doc["_id"]})
print(f"Deleted: {random_doc['_id']}")
except StopIteration:
print("Could not find a document to delete.")
else:
print("No documents to delete.")
if __name__ == "__main__":
print("Starting script... Press Ctrl+C to stop.")
try:
client = pymongo.MongoClient("mongodb://root:root@localhost:27017/?authSource=admin")
db = client["random_data_db"]
collection = db["my_collection"]
# The ismaster command is cheap and does not require auth.
client.admin.command('ismaster')
print("MongoDB connection successful.")
except pymongo.errors.ConnectionFailure as e:
print(f"Could not connect to MongoDB: {e}")
exit()
while True:
try:
add_random_data(collection)
add_random_data(collection)
add_random_data(collection)
update_random_data(collection)
delete_random_data(collection)
# Wait for five seconds before the next cycle
time.sleep(5)
except KeyboardInterrupt:
print("\nScript stopped by user.")
break
except Exception as e:
print(f"An error occurred: {e}")
break
# Close the connection
client.close()
print("MongoDB connection closed.")
With the script running and adding, updating and deleting data to the MongoDB; a QR task can be created to read from the database “random_data_db” and collection “my_collection”
And voilà. Data is flowing through Qlik Replicate from the MongoDB.
And more importantly deletes are getting picked up as expected.
And voilà. Data is flowing through Qlik Replicate from the MongoDB.
And more importantly; deletes are getting picked up as expected.
If you have come to my humble website after googling “Can Qlik Replicate replicate deletes from a MongoDB?” well, the answer is “Yes – Yes it can.”
I can only speculate where that rumour came from. Maybe it was from an older version of QR or MongoDB that the person in question was referring to? Or some very abnormal setup of a MongoDB cluster?
Or some complete misunderstanding.
Anyway, I have meeting in the next hour or two with the project team; let us find out.
With the post festive system haze still on – I spent a full day trying to work out another encoding problem that came up in our exception table.
Let me take you on a journey.
It starts with MS-SQL.
We have Qlik Replicate replicating data from a DB2/zOS system into a MS-SQL system. I was investigating why there were inserts missing from the target table and exploring the dbo.attrep_apply_exceptions table I notice a bunch of INSERTS from another table in the exception table:
/*
CREATE TABLE dbo.destination
(
add_dte numeric(5, 0) NOT NULL,
chg_dte numeric(5, 0) NOT NULL,
chg_tme varchar(6) NOT NULL,
argmt varbinary(10) NOT NULL,
CONSTRAINT pk_destination PRIMARY KEY CLUSTERED
(
argmt
) WITH (DATA_COMPRESSION = ROW) ON [PRIMARY]
) ON [PRIMARY]
*/
INSERT INTO [dbo].[destination] ( [add_dte],[chg_dte],[chg_tme],[argmt]) VALUES (46021,46021,'151721', 0x35333538C3A302070000);
I thought it would be a good place to start by substituting the failed insert into a temp table and using the temp table to investigate downstream data flows.
So:
DROP TABLE IF EXISTS ##temp_destination;
SELECT *
INTO ##temp_destination
FROM [dbo].[destination]
WHERE 1 = 0;
INSERT INTO ##temp_destination ( [add_dte],[chg_dte],[chg_tme],[argmt]) VALUES (46021,46021,'151721', 0x35333538C3A302070000);
But as I was substituting the temp table into objects downstream – I was just getting logical errors. Comparing the temp table against other data that already existed dbo.destination; it looked not right.
Working Upstream
I thought it would be a good place to start by substituting the failed insert into a temp table and using the temp table to investigate downstream data flows.
Since we just merged a system into our DB2/zOS system; my initial thought it was a corrupted record coming from a merge process. I contacted the upstream system owners and asked for help.
They came back and said the record was fine and the issue must be in the QR system converting the data wrong. This did not make sense – if there was a issue in the QR system; it would have been picked up in testing. Also, there would be lots of exceptions in dbo.attrep_apply_exceptions. I had confidence that QR was handling it right.
Not trusting what the DB2/zOS team had told me; I created a temp Qlik task to dump out that table into a test database.
Using the other values in the record; I could look the record up in the table and compare the two.
For the field argmt it came back as:
0xF5F3F5F846022F00000F
This looked nothing like what was in the exception table.
EBCDICy epiphany
Of course – it took a while; but it finally struck what was going on.
The field argmt had sections encoded in EBCDIC. This was getting brought across as bytes to MS-SQL. There was a downstream trigger that was failing in the MS-SQL database. But when it was getting written out into the exception table; it was getting written as ASCII (or possibly Windows-Latin) encoding.
(Only showing start of bytes as this is what we use downstream)
With me just copying the insert statement and using it as is; I was treating ASCII bytes as EBCDIC; causing all sorts of logical errors downstream.
Where to from here?
I’m going to raise this abnormality with Qlik. I am not sure if it is a “bug” or a “feature”; but good to get clarification on that point.
I also will need to educate our users on how to interpret the values in dbo.attrep_apply_exceptions so not to stumble into mistakes that I did.
As for fixing the data itself – since we have the full dump of data from the table and the data is slow moving; I can substitute the values from the data dump into the problem INSERT statement.
To make our lives easier migrating Qlik Replicate tasks from one environment to another; I am looking how to migrate Qlik Replicate schedules.
So far by GitLab pipelines we can migrate tasks and endpoint; but schedules we must do manually. Quite often we forget to manually create the schedules as they are not as obvious component compared to tasks and endpoints. If you move to a new environment and your task is not there, well that is a no brainer. If your endpoint is not there; your task will not migrate.
But schedules are something that lurks in the background; inconsistently created when we rush to put them in when we realise the Qlik task did not run when it was meant to.
Plan
Using the python API, we can export all the items from a Qlik Replicate server with the export_all method. In the returned Json file will be the schedule details like this:
The idea is to modify the Json for the schedule to match the environment we are migrating to; put the schedule into a blank server Json template and use the import_all
method to upload the new schedules to the destination server.
The Cron does not look right…
Looking over the exported Json; the cron syntax did not look right. It had six fields instead of the expected five.
"schedule": "0 1 * * * *"
This confused me for a while as I haven’t come across a cron like this before.
After coming up with no results searching Qlik’s documentation; I created some test schedules to try and determine what the sixth field is used for.
Ahh. After a short while I found the answer.
If the schedule is a “Once off” run; the sixth field is used for “Year”.
I think as Qlik Replicate developers; we have all been here before.
The testers berating you through MS Teams saying, “We have made a change, and we cannot see it downstream! It is all Qlik Replicate’s fault!”
Opening the task and checking the monitoring screen – I can see the change against the table. What are they going on about? Qlik has picked up something; why can’t they see it? (Sixty percent of the time; something downstream has failed, twenty percent of the time something upstream has failed, nineteen percent of the time they are looking in the wrong spot and the remaining one percent of the time – well we won’t mention that one percent.)
But looking at the monitoring screen – what does those numbers mean? Also, if you look in the analytics database; what does those figures mean?
Hopefully, this article will help you understand the monitoring and analytics numbers with some simple examples.
Filters – why does it have to be filters?
We have two main types of Qlik Replicate tasks. One type grabs all changes from a particular table and sends it to our Data Lake in micro batches of fifteen minutes.
The other type are our speed tasks; only grabbing changes on specific columns on a table. To limit QR to only picking up specific changes; we have filters like this:
($AR_H_OPERATION != 'UPDATE' AND $AR_H_OPERATION != 'DELETE') OR
(
($AR_H_OPERATION == 'UPDATE') AND
(
( $BI__FIELD_1 != $FIELD_1 ) OR ( $BI__FIELD_1 IS NULL AND $FIELD_1 IS NOT NULL ) OR ( $BI__FIELD_1 IS NOT NULL AND $FIELD_1 IS NULL ) OR
( $BI__FIELD_2 != $FIELD_2 ) OR ( $BI__FIELD_2 IS NULL AND $FIELD_2 IS NOT NULL ) OR ( $BI__FIELD_2 IS NOT NULL AND $FIELD_2 IS NULL ) OR
( $BI__FIELD_3 != $FIELD_3 ) OR ( $BI__FIELD_3 IS NULL AND $FIELD_3 IS NOT NULL ) OR ( $BI__FIELD_3 IS NOT NULL AND $FIELD_3 IS NULL )
)
)
So, the question is – if we made an update to $FIELD_4 meaning that QR will send the change downstream; what would the monitoring tab on the task display?
Let is create an example to find out.
An example
Here is a simple table -pre-populated with six record:
CREATE TABLE dbo.CDC_ANALYTICS_EXAMPLE
(
RECORD_ID INT IDENTITY(1,1) PRIMARY KEY,
DATA_INT INT,
DATA_VARCHAR VARCHAR(100),
INCLUDE_RECORD CHAR(1)
);
GO
INSERT INTO dbo.CDC_ANALYTICS_EXAMPLE VALUES(1, 'Record 1', 'Y');
INSERT INTO dbo.CDC_ANALYTICS_EXAMPLE VALUES(2, 'Record 2', 'Y');
INSERT INTO dbo.CDC_ANALYTICS_EXAMPLE VALUES(3, 'Record 3', 'Y');
INSERT INTO dbo.CDC_ANALYTICS_EXAMPLE VALUES(4, 'Record 4', 'N');
INSERT INTO dbo.CDC_ANALYTICS_EXAMPLE VALUES(5, 'Record 5', 'N');
INSERT INTO dbo.CDC_ANALYTICS_EXAMPLE VALUES(6, 'Record 6', 'N');
SELECT *
FROM dbo.CDC_ANALYTICS_EXAMPLE WITH (NOLOCK);
And we create a simple QR task with the following filters:
The first filter is the Fullload Passthru Filter. When the full load initially runs; only three of the six records will be brought across.
The second filter Record Selection Condition; should filter any data changes on when $INCLUDE_RECORD has a value of ‘Y’.
Results after running the QR task
Full load
After running the full load; qlik shows the following on the monitoring screen:
Three records – that’s what we expected out of the six.
Interesting under the “Total Completion” section; the distinction of the filter is quite clear. Three records brought across and three remaining.
Change Processing
Let’s make some changes to the data in our test table:
-- This will be captured
BEGIN TRANSACTION
DECLARE @IDENTITY INT;
INSERT INTO dbo.CDC_ANALYTICS_EXAMPLE VALUES(null, 'Record x', 'Y');
SET @IDENTITY = @@IDENTITY;
UPDATE dbo.CDC_ANALYTICS_EXAMPLE
SET DATA_INT = @IDENTITY,
DATA_VARCHAR = 'Record ' + CAST(@IDENTITY AS varchar)
WHERE RECORD_ID = @IDENTITY;
DELETE dbo.CDC_ANALYTICS_EXAMPLE WHERE RECORD_ID = @IDENTITY;
COMMIT;
GO
-- This will be excluded
BEGIN TRANSACTION
DECLARE @IDENTITY INT;
INSERT INTO dbo.CDC_ANALYTICS_EXAMPLE VALUES(null, 'Record x', 'N');
SET @IDENTITY = @@IDENTITY;
UPDATE dbo.CDC_ANALYTICS_EXAMPLE
SET DATA_INT = @IDENTITY,
DATA_VARCHAR = 'Record ' + CAST(@IDENTITY AS varchar)
WHERE RECORD_ID = @IDENTITY;
DELETE dbo.CDC_ANALYTICS_EXAMPLE WHERE RECORD_ID = @IDENTITY;
COMMIT;
This is what the monitoring screen shows:
Looking at the Recent activity screen; hmm – this could lead to some confusion.
The metrics shows the two blocks of changes; even though one block was filtered out. So this is indicating that this gives the results when there is a change in the table; not matter if it was subsequently filtered out.
If those testers in the opening paragraph created a record that was filtered out in QR; we could give them false direction if we said “it should be downstream”
If we look at the Aggregates screen; it paints a true picture of what was detected and transferred to the target:
We can see the three records in the “Total Changes Applied” column.
Analytics database
From the analytic database:
SELECT
*
FROM public.aem_taskv t
WHERE
t.task_name IN ('TEST_ANALYTICS')
ORDER BY
t.server_name, t.task_name, t.retrieval_time
Field
Value
id
110876201
retrieval_time
2025-12-03 12:02
server_id
5
task_name_id
914
task_state_id
3
task_stop_reason_id
1
task_profile_id
1
cdc_evt_applied_insert_count
1
cdc_evt_applied_update_count
1
cdc_evt_applied_delete_count
1
cdc_evt_applied_ddl_count
0
full_load_tables_completed_count
1
full_load_tables_loading_count
0
full_load_tables_queued_count
0
full_load_tables_with_error_count
0
full_load_total_records_transferred
3
full_load_est_records_count_for_all_tables
6
full_load_completed
1
full_load_start
2025-12-03 10:57
full_load_finish
2025-12-03 10:57
full_load_thrput_src_thrput_records_count
0
full_load_thrput_src_thrput_volume
0
full_load_thrput_trg_thrput_records_count
0
full_load_thrput_trg_thrput_volume
0
cdc_thrput_src_thrput_records_count
0
cdc_thrput_src_thrput_volume
0
cdc_thrput_trg_thrput_records_count
0
cdc_thrput_trg_thrput_volume
0
cdc_trans_read_rollback_count
0
cdc_trans_read_records_rollback_count
0
cdc_trans_rollback_change_volume
0
cdc_trans_applied_transactions_in_progress_count
0
cdc_trans_applied_records_in_progress_count
0
cdc_trans_applied_comitted_transaction_count
2
cdc_trans_applied_records_comitted_count
6
cdc_trans_applied_volume_committed
0
cdc_trans_read_memory_events_count
0
cdc_trans_read_swapped_events_count
0
cdc_trans_applied_memory_events_count
0
cdc_trans_applied_swap_events_count
0
cdc_source_latency
0
cdc_apply_latency
0
memory_usage_kb
0
disk_usage_kb
0
cpu_percentage
0
data_error_count
0
task_option_full_load_enabled
1
task_option_apply_changes_enabled
0
task_option_store_changes_enabled
1
task_option_audit_changes_enabled
0
task_option_recovery_enabled
0
cdc_trans_read_in_progress
2
server_cpu_percentage
0
machine_cpu_percentage
16
tasks_cpu_percentage
0
server_name
xxxx
task_name
TEST_ANALYTICS
server_with_task_name
xxxx::::TEST_ANALYTICS
A couple of take aways from the results:
Unfortunately the records in the the analytic database is grouped up at a task level; not at an individual table level. This makes it harder to determine if a change from a particular table was applied
cdc_trans_applied_records_comitted_count is a count of all the records read; but not necessarily applied
If you want the number of records flowing down to downstream; add together the fields cdc_evt_applied_insert_count, cdc_evt_applied_update_count and cdc_evt_applied_delete_count together
Final thoughts
The monitoring and analytics screen is useful tools for diagnosing QR tasks – and general ASMR; watching the numbers bounce up as transactions flow through the system.
Understanding the figures of what they translate to – whether changes getting detected or applied downstream is fundamental in diagnosing problems.
Nothing is better than getting access to the source and target systems and checking yourself to confirm if things are working correctly; but the monitoring screen is a good place to start investigating.
We have been doing a bit of “Stress and Volume” testing in Qlik Replicate over the past few days; investigating how my latency is introduced to a MS SQL server task if we run it through a log stream.
If you’re not aware – you can get minute latecy from Qlik Replicate by turning the “Performance” logs up to “Trace” or higher:
This will result in messages getting produced in the task’s log file like:
I created a simple python script to parse a folder of log files and output it in a excel. Since the data ends up in a panda data frame; it would be easy to manipulate the data and output it in a specific way:
import os
import pandas as pd
from datetime import datetime
def strip_seconds(inString):
index = inString.find(" ")
returnString = float(inString[0:index])
return(returnString)
def format_timestamp(in_timestamp):
# Capture dates like "2025-08-01T10:42:36" and add a microsecond section
if len(in_timestamp) == 19:
in_timestamp = in_timestamp + ":000000"
date_format = "%Y-%m-%dT%H:%M:%S:%f"
# Converts date string to a date object
date_obj = datetime.strptime(in_timestamp, date_format)
return date_obj
def process_file(in_file_path):
return_array = []
with open(in_file_path, "r") as in_file:
for line in in_file:
upper_case = line.upper().strip()
if upper_case.find('[PERFORMANCE ]') >= 0:
timestamp_temp = upper_case[10:37]
timestamp = timestamp_temp.split(" ")[0]
split_string = upper_case.split("LATENCY ")
if len(split_string) == 4:
source_latency = strip_seconds(split_string[1])
target_latency = strip_seconds(split_string[2])
handling_latency = strip_seconds(split_string[3])
# Makes the date compatible with Excel
date_obj = format_timestamp(timestamp)
excel_datetime = date_obj.strftime("%Y-%m-%d %H:%M:%S")
# If you're outputting to standard out
#print(f"{in_file_path}\t{time_stamp}\t{source_latency}\t{target_latency}\t{handling_latency}\n")
return_array.append([in_file_path, timestamp, excel_datetime, source_latency, target_latency, handling_latency])
return return_array
if __name__ == '__main__':
log_folder = "/path/to/logfile/dir"
out_excel = "OutLatency.xlsx"
latency_data = []
# Loops through files in log_folder
for file_name in os.listdir(log_folder):
focus_file = os.path.join(log_folder, file_name )
if os.path.isfile(focus_file):
filename, file_extension = os.path.splitext(focus_file)
if file_extension.upper().endswith("LOG"):
print(f"Processing file: {focus_file}")
return_array = process_file(focus_file)
latency_data += return_array
df = pd.DataFrame(latency_data, columns=["File Name", "Time Stamp", "Excel Timestamp", "Source Latency", "Target Latency", "Handling Latency"])
df.info()
# Dump file to Excel; but you can dump to other formats like text etc
df.to_excel(out_excel)
And voilà – we have an output to Excel to quickly create statistics on latency for a given task(s).
What statistics to use?
Latency is something you can analyse in different ways, depending on what you’re trying to answer. It is also important to pair latency statistics with the change volume that is coming through.
Has the latency jumped at a specific time because the source database is processing a daily batch? Is there a spike of latency around Christmas time where there are more financial transactions compared to a benign day in February?
Generally, if the testers are processing the latency data they provide back:
Average target latency
90th percentile target latency
95th percentile target latency
Min target latency
Max target latency
This is what we use to compare two runs to each other when the source load is consistent and we’re changing a Qlik Replicate task
I am just writing a brief post about a conversation I was asked to join.
The question paraphrased:
For dates in an Microsoft SQL Server, when they get passed through Qlik Replicate to Kafka Avro – how are they interpreted? Epoch to utc? Or to the current time zone?
I didn’t know the answer so I created the following simple test:
For over fifteen years I have worked with a Microsoft SQL Data Warehouse that is over twenty-five years old. Although it is ancient in this world of huge cloud-based Data Vaults – it churns out an abundant of value to the organisation I work for.
The problem with this value – it is opened to anyone who wants to access to build their queries off. This means we get a range of SQL queries running on the database. We have queries ranging from precise and efficient built queries that uses every trick in optimisation – all the way to “WTF are you trying to do?!” queries.
The second category is the one we have the most problem with. With limited resources of an OnPrem SQL server; some uses write atrocious queries without a perceived care for other people using the database. There is nothing more frustrating getting a support call out because of a failed data load; only to find the table is locked by someone running a query over the past six hours.
To counter this problem; I created a python script that checks what is running on the database every fifteen minutes. If a user has a query running for over half an hour, I get an alert to show what they are running.
This allows our team to:
Work out if someone is running a query that is blocking or taking resources away from our critical processes – resulting in timeouts
With our experience help users optimise poorly running queries so they have a better outcome
To stop any stupid queries from running
Since the queries are logged into a DB – I have four years of history of 14,000 unique queries running. This gives me a lot of learning experiences of what users are running and what problems they face.
From this learning – this is the Top 10 most horrible queries I see running on our database.
8. The “Discovery” Query
This query come from new and experience users exploring the data of the database. They will run something like this:
SELECT *
FROM dbo.A_VERY_LARGE_TABLE;
Look at the data, grab what they need and then just leave the query running in the background in their Server Management Studio. When we contact them an hour later; they act surprise that the query is still running in the background – returning millions upon millions of rows to their client.
This is an easy fix with training. Using the TOP command to limit the number of rows returned, use SP_HELP to return a schema of an object, provide a sandbox for people to explore in are all relatively simple fixes for this problem
7. The “Application Preview” Query
Our SAS application have a nasty default setting that if a user previews a data in a table; it will try and return the whole table – with the user completely unaware that this is happening.
Even worse; if the application locks up returning the data; the user kills the application from task manager, open it back up and performs the same steps. Since the first query was not killed, it will still be running on the database. So quite often you will see half a dozen of the same queries running on the database, all starting at different times.
To prevent this – it is important to assess new DB client software to see what it is trying to do in the background of the database. What does “preview” mean? Return 10, 100 or all the rows? Is there a setting that restricts the number of rows returned; or a timeout function that prevents complex views running forever just returning a small set of data?
If these features are available – it is important to either set this as default in the application; or part of the initial user setup / training program.
6. The “I am going to copy this to Excel and Analyse the Data” Query
This is like exploratory queries that come up. The user is running a query like:
SELECT *
FROM dbo.A_VERY_LARGE_TABLE;
“Why are you running this query?” I politely enquire. “Oh – I am going to copy the data into excel and search/pivot/do fancy things with it” “Ummm – this table has 8 billion rows in it…”
I can sympathise with queries like these.
The user thinks, “Why run the same query over and over again to get the data that I want to explore and present in different ways. Let just get one large cut and then manipulate it on the client.”
What users don’t realise that dealing with a huge amount of data on the client side is difficult. Exporting data out of the query client, into an application like excel can be frustrating with memory and CPU limitations compared to a spec’d-out server. Plus, Excel is not going to handle 8 billion rows.
The requirements of queries like these needs to be analysed – do the users need that atomic level data? Can strategic aggregation tables be provided to the user, so they have more manageable data to return. Can different technology like OLAP cubes be built on top of large tables to answer the user’s needs?
5. The “I got duplicates so I am going to get rid of them with a DISTINCT” Query
This is a pet peeve of mine.
The user has duplicate in their results and don’t know why. Instead of investigating why there is duplication (usually result of a poor join), they just slap a DISTINCT at the top of the query and call it done. Here is a simplified example that I regularly come across:
DECLARE @ACCOUNTS TABLE
(
ORG INT,
ACCOUNT_ID INT,
BRANCH INT,
BALANCE NUMERIC(18,2),
PRIMARY KEY (ORG, ACCOUNT_ID)
);
INSERT INTO @ACCOUNTS VALUES(1, 18, 60, 50);
INSERT INTO @ACCOUNTS VALUES(1, 19, 60, 150);
DECLARE @BRANCHES TABLE
(
ORG INT,
BRANCH INT,
BRANCH_NAME VARCHAR(20),
BRANCH_STATE VARCHAR(3),
PRIMARY KEY (ORG, BRANCH)
);
INSERT INTO @BRANCHES VALUES(1, 60, 'Branch of ORG 1', 'VIC');
INSERT INTO @BRANCHES VALUES(2, 60, 'Branch of ORG 2', 'VIC');
SELECT
A.ORG,
A.ACCOUNT_ID,
A.BRANCH,
A.BALANCE,
B.BRANCH_STATE
FROM @ACCOUNTS A
JOIN @BRANCHES B
ON -- B.ORG = A.ORG -- User forgot this predicate in the join
B.BRANCH = A.BRANCH;
Results:
ORG
ACCOUNT_ID
BRANCH
BALANCE
BRANCH_STATE
1
18
60
50.00
VIC
1
18
60
50.00
VIC
1
19
60
150.00
VIC
1
18
60
150.00
VIC
The user looks – “Oh Dear – I’ve got duplicates! Let’s get rid of them.”
SELECT DISTINCT
A.ORG,
A.ACCOUNT_ID,
A.BRANCH,
A.BALANCE,
B.BRANCH_STATE
FROM @ACCOUNTS A
JOIN @BRANCHES B
ON -- B.ORG = A.ORG -- User forgot this predicate in the join
B.BRANCH = A.BRANCH;
This poses many problems:
The Database must work harder using incomplete joins on indexes to bring across the data; and then work harder supressing the duplicates
If the field list change; then the distinct might not work anymore.
For example, with the above query – if the user brings in the column “BRANCH_NAME” then the duplicates return:
SELECT DISTINCT
A.ORG,
A.ACCOUNT_ID,
A.BRANCH,
A.BALANCE,
--------- Add in new column ---------
B.BRANCH_NAME,
-------------------------------------
B.BRANCH_STATE
FROM @ACCOUNTS A
JOIN @BRANCHES B
ON -- B.ORG = A.ORG -- User forgot this predicate in the join
B.BRANCH = A.BRANCH;
ORG
ACCOUNT_ID
BRANCH
BALANCE
BRANCH_NAME
BRANCH_STATE
1
18
60
50.00
Branch of ORG 1
VIC
1
18
60
50.00
Branch of ORG 2
VIC
1
19
60
150.00
Branch of ORG 1
VIC
1
18
60
150.00
Branch of ORG 2
VIC
Preventing queries like this comes down to user experience and training. If they see duplicates in their data – their initial thoughts should be “Why do I have duplicates?”
Duplicates might be legitimate and a DISINCT might be OK – but quite often it is because of a bad join. For an inexperience user – it might be overwhelming to break down a large table to find where the duplicates are coming from, and they might need help from a more experienced user. This is where internal data forums are useful – where people can help each other with their problems.
4. The “Ever lengthening” Query
This is a query I see scheduled daily for users getting trends over time for a BI tool:
SELECT *
FROM dbo.A_TABLE A
JOIN dbo.B_TABLE B
ON B.KEY_1 = A.KEY_1
WHERE
--------------- Start Date ---------------
A.BUSINESS_DATE >= '2025-01-01' AND
------------------------------------------
A.PREDICATE_1 = 'X' AND
B.PREDICATE_2 = 'Y';
Initially it starts off OK – running fast and the user is happy. But as time goes by; the query gets slower and slower; returning more and more data. Eventually it gets to the point the query goes off into the nether and never returns. Since it is always starting the from the same point; it is constantly re-querying the same data over an over again
There are a couple of options you can handle queries like these:
Really examine the requirements of the user’s needs. How much data is really relevant for their trending report? In 2027; will data from 2025 be useful for the observer of the report?
Create a table specific for this report and append a timeframe of data (eg daily) onto it. The report than can just query this specific table and not having to regenerate complex joins over and over again for previous timeframes.
3. The “Linked Database” Query
With the growth of our Data Warehouse, other database on different servers started building process flows off our database; sometimes without out knowledge that they are doing this until we see a linked server connection coming up.
What pattern they use our database is where the problems lie. One downstream database runs this query every day:
SELECT *
FROM REMOTE_SERVER.SOME_DATABASE.dbo.A_BIG_HISTORICAL_TABLE;
So, they are truncating a landing table on their database and grabbing the whole table. Hundreds of millions of rows transferring across a slow linked server; taking hours and hours. And with time; this will get slower and slower as the source table grows its history.
This is a tricky one to tackle with the downstream users. Odds are if they design a cumbersome process like this their appetite for change to a faster (yet more complex solution) might not be high; especially if their process ‘works’.
If you can get around the political hurdle and convince them the process must change; there are many options to improve the performance.
Use a ETL tool to transfer the data across. Even a simple python script to copy batch by bath files across will be quicker than using a linked server.
Assess the downstream use cases for the data. Do they need the whole table with the complete history to satisfy their requirements? Do they need all the columns; or can some be trimmed off to reduce the amount of data getting transferred?
If the source table a historical table; only bring across the period deltas. With the downstream process I am having trouble with now; they are bringing across twenty-one million rows daily. If they only bring across the deltas from the night loads it brings across a whopping five thousand rows. That is 0.02% of the total row count. Add in a row count check between the two systems to have confidence that the two tables are in sync and the process will be astronomically quicker.
2. The “I don’t know where that query comes from” Query.
We have a continuing problem from a Microsoft Power BI report that runs a query that looks at data from the beginning of the month to the current date. It was a poorly written query and therefore as the month went along its performance got worse and worse.
Since the username associated with the query was from the Power BI server’s account – we had no idea who the query belonged to as a couple of simple fixes could drastically improve the performance.
We contacted the team that manages Power BI server and asked them to find the query and update the sql code.
They said it was completed, and the report sql was updated.
But soon the sql was back – running for hours and hours.
We contacted the team again –
“Hey that query is back.”
They try updating the report again; but soon the query was back again.
So, either someone was constantly uploading an original copy – or another copy of the report was buried somewhere on the server that the admins could not find.
Since we do not have access to their system, it hard to determine what is happening.
The barbaric solution would be to block the Power BI user from our system; but goodness knows how many critical business processes that will disrupt.
The best solution at the minute we are doing is just constantly killing the running SQL code on the database with the hope that someone will identify the constantly failing report. This can be tricky as well if the Power BI server automatically tries rerunning failed reports.
1. The “I don’t listen to your advice” Query.
Disclaimer – this is a frustration rant that I have to get off my chest.
We have a user that constantly runs a long running crook query. As in when I collated all the long running queries in research for this post – his was at the top by a significant margin.
The problem with the query itself is a join using an uncommonly used business key in a table that is not indexed. The fix itself is quite simple fix; use the primary keys in join.
But he has been running the same code for months – locked our nightly loads several times and caused incidents.
We tried the carrot approach. “Hey here is optimised code. It will make your query run quicker and be less burden on our database.” Got indifferent replies.
More locked loads we included his manager into correspondence, but she did not seem to want to be involved.
With my carrot approach communication, I gained the impression that he had a self-righteous personality and thinks his role and report is above question.
Enough was enough – it was the stick approach time.
We raised an operational risk against his activity.
The Ops risk manager asks, “Do you have evidence that you attempted to reason and help?”
“Yep,” I replied, “Here is a zipped-up folder of dozens of emails and chat messages that I sent him.”
Senior managers were involved; with my senior manager commenting to my manager in the background “Is that guy all there?”
Anyway, my manager wrote a very terse email saying to correct their query and if they crash the loads again; they will get their access removed from the database.
No committal reply. No apology. And to this day they are still running the same query; but just under the radar that it is not locking our loads.
I am watching him like a hawk – waiting for the day that I can cut his access from the database.
It is a pity that our Database does not have a charge back capability of processes used. I bet if his manager got a bill of $$$ from one person running one query; she would be more proactive in this problem.
In retrospect, I would have campaigned to have his access cut right away and make him justify why he should have it back. When there is hundreds of other people doing the right thing on the database; it is not fair that their deliverables are getting impacted by the indifference attitude of one user.
My sister in-law runs a dog training business in Queensland. Part of her business is to track records of her canine clients – especially notes, vaccinations when they’re due, medical records and certificates.
In a previous job she had experience with Animal Shelter Manager (ASM3). She’s familiar with the features and interface to know it will cover her needs.
Her business is not big enough to justify the price of the SaaS version of ASM3; so being the tech savvy (debatable) one the family – it was my task getting it up and running for her.
This lead to several nights of struggling, annoyance and failure.
Fitting the pieces together
To start off I wanted to get a demo version running so I can see what I need to do to deploy it for her.
Apart from a domain name that was sure to set off all the “appropriate content” filters at work; with a few modifications I could get it to work. Looking at the instructions from the author Ræn; it is substantially different to the old Dockerfile and the instructions on the ASM3 home page.
Let’s build it
With my base working version – I cobbled up some Dockerfiles for ASM3 and postres and a docker compose file to tie them together:
(Note that this is not a production version and have to obscure passwords etc in the final version)
The containers build just fine and fire up with no problem.
Vising the website
http://localhost/
ASM3 redirects and builds the database – but then goes to a login page. I enter the username and password; but it loops back to the login page.
I think the problem lies with the base_url and service_url in asm3.conf; possibly with http-asm3.conf settings.
Anyway – I logged a issue with ASM3 see if it is something simple that I missed; or maybe I have to start pulling apart of source code to find what it is trying to do.
Websites store cookies to enhance functionality and personalise your experience. You can manage your preferences, but blocking some cookies may impact site performance and services.
Essential cookies enable basic functions and are necessary for the proper function of the website.
Name
Description
Duration
Cookie Preferences
This cookie is used to store the user's cookie consent preferences.
30 days
These cookies are needed for adding comments on this website.
Name
Description
Duration
comment_author
Used to track the user across multiple sessions.
Session
comment_author_email
Used to track the user across multiple sessions.
Session
comment_author_url
Used to track the user across multiple sessions.
Session
Statistics cookies collect information anonymously. This information helps us understand how visitors use our website.
Google Analytics is a powerful tool that tracks and analyzes website traffic for informed marketing decisions.
Contains information related to marketing campaigns of the user. These are shared with Google AdWords / Google Ads when the Google Ads and Google Analytics accounts are linked together.
90 days
__utma
ID used to identify users and sessions
2 years after last activity
__utmt
Used to monitor number of Google Analytics server requests
10 minutes
__utmb
Used to distinguish new sessions and visits. This cookie is set when the GA.js javascript library is loaded and there is no existing __utmb cookie. The cookie is updated every time data is sent to the Google Analytics server.
30 minutes after last activity
__utmc
Used only with old Urchin versions of Google Analytics and not with GA.js. Was used to distinguish between new sessions and visits at the end of a session.
End of session (browser)
__utmz
Contains information about the traffic source or campaign that directed user to the website. The cookie is set when the GA.js javascript is loaded and updated when data is sent to the Google Anaytics server
6 months after last activity
__utmv
Contains custom information set by the web developer via the _setCustomVar method in Google Analytics. This cookie is updated every time new data is sent to the Google Analytics server.
2 years after last activity
__utmx
Used to determine whether a user is included in an A / B or Multivariate test.
18 months
_ga
ID used to identify users
2 years
_gali
Used by Google Analytics to determine which links on a page are being clicked
30 seconds
_ga_
ID used to identify users
2 years
_gid
ID used to identify users for 24 hours after last activity
24 hours
_gat
Used to monitor number of Google Analytics server requests when using Google Tag Manager