Continuing the story
After concluding that the low TPS was not caused by poor query performance, our attention turned to the network latency between our on-prem Qlik server and the AWS RDS database.
First, I asked the network team whether there were any suspect networking components between our on-premises Qlik server and the AWS database: anything like an IPS, QoS shaping, or bandwidth-limiting components that could explain the slowdown.
I also asked the cloud team if they could find anything on their side.
I didn't hold out much hope that they would find anything, but since they are the SMEs in that area, it was worth asking the question.
As expected, they did not find anything.
But the Network team did come back with a couple of pieces of information:
- The network bandwidth to AWS was wide enough; we were not reaching its capacity.
- The round trip from our data centre to the AWS data centre is 16–20 ms.
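As a quick sanity check on a round-trip figure like that, you can time a bare TCP connect from each server. This is a sketch, not the network team's method: the local listener below stands in for the real RDS endpoint, which you would substitute in practice.

```python
import socket
import time

def measure_connect_time(host: str, port: int) -> float:
    """Time one TCP connect (SYN, SYN/ACK, ACK) -- a rough proxy for round-trip time."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=5):
        pass
    return time.perf_counter() - start

# Demo against a local listener; in practice point host/port at the
# RDS endpoint instead (hypothetical -- not from the original capture).
listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(1)
port = listener.getsockname()[1]

rtt = measure_connect_time("127.0.0.1", port)
print(f"TCP connect time: {rtt * 1000:.2f} ms")
listener.close()
```

Against localhost this prints a near-zero time; against a remote endpoint it approximates one network round trip plus handshake overhead.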
Location… Location…
Physically, the AWS data centre we use is about 700 km away.
AWS has since opened a closer data centre, only about 130 km away, but unfortunately we are not yet set up to use that region.
The network team gave me permission to install Wireshark on our on-prem Qlik server and our AWS EC2 Qlik server.
From both servers I connected to the AWS RDS database with psql and updated one row, capturing the traffic with Wireshark.
I then lined up the two captures to see if anything obvious stood out.
Wireshark results
Display filter used to isolate the conversation: `(ip.src == ip.of.qlik.server and ip.dst == ip.of.aws.rds) or (ip.src == ip.of.aws.rds and ip.dst == ip.of.qlik.server)`
SEQ | Source | Destination | Protocol | Length | Info | On-prem gap (sec) | EC2 gap (sec) | Difference (sec) | % of total difference |
---|---|---|---|---|---|---|---|---|---|
1 | Qlik server | RDS DB | TCP | 66 | 58313 > 5432 [SYN, ECE, CWR] Seq=0 Win=64240 Len=0 MSS=1460 WS=256 SACK_PERM | 0 | 0 | 0.000 | 0% |
2 | RDS DB | Qlik server | TCP | 66 | 5432 > 58313 [SYN, ACK] Seq=0 Ack=1 Win=26883 Len=0 MSS=1460 SACK_PERM WS=8 | 0.019 | 0.001 | 0.018 | 10% |
3 | Qlik server | RDS DB | TCP | 54 | 58313 > 5432 [ACK] Seq=1 Ack=1 Win=262656 Len=0 | 0.000 | 0.000 | 0.000 | 0% |
4 | Qlik server | RDS DB | PGSQL | 62 | >? | 0.000 | 0.005 | -0.005 | -3% |
5 | RDS DB | Qlik server | TCP | 60 | 5432 > 58313 [ACK] Seq=1 Ack=9 Win=26888 Len=0 | 0.018 | 0.000 | 0.018 | 10% |
6 | RDS DB | Qlik server | PGSQL | 60 | < | 0.001 | 0.001 | 0.000 | 0% |
7 | Qlik server | RDS DB | TLSv1.3 | 343 | Client Hello | 0.004 | 0.004 | 0.001 | 0% |
8 | RDS DB | Qlik server | TLSv1.3 | 220 | Hello Retry Request | 0.021 | 0.001 | 0.021 | 12% |
9 | Qlik server | RDS DB | TLSv1.3 | 455 | Change Cipher Spec, Client Hello | 0.003 | 0.001 | 0.002 | 1% |
10 | RDS DB | Qlik server | TLSv1.3 | 566 | Server Hello, Change Cipher Spec | 0.023 | 0.005 | 0.019 | 11% |
11 | RDS DB | Qlik server | TCP | 1514 | 5432 > 58313 [ACK] Seq=680 Ack=699 Win=29032 Len=1460 [TCP segment of a reassembled PDU] | 0.000 | 0.000 | 0.000 | 0% |
12 | RDS DB | Qlik server | TCP | 1514 | 5432 > 58313 [ACK] Seq=2140 Ack=699 Win=29032 Len=1460 [TCP segment of a reassembled PDU] | 0.000 | 0.000 | 0.000 | 0% |
13 | RDS DB | Qlik server | TCP | 1514 | 5432 > 58313 [ACK] Seq=3600 Ack=699 Win=29032 Len=1460 [TCP segment of a reassembled PDU] | 0.000 | 0.000 | 0.000 | 0% |
14 | RDS DB | Qlik server | TLSv1.3 | 394 | Application Data | 0.000 | 0.000 | 0.000 | 0% |
15 | Qlik server | RDS DB | TCP | 54 | 58313 > 5432 [ACK] Seq=699 Ack=5400 Win=262656 Len=0 | 0.000 | 0.000 | 0.000 | 0% |
16 | Qlik server | RDS DB | TLSv1.3 | 112 | Application Data | 0.003 | 0.002 | 0.001 | 1% |
17 | Qlik server | RDS DB | TLSv1.3 | 133 | Application Data | 0.000 | 0.000 | 0.000 | 0% |
18 | RDS DB | Qlik server | TCP | 60 | 5432 > 58313 [ACK] Seq=5400 Ack=836 Win=29032 Len=0 | 0.018 | 0.000 | 0.018 | 10% |
19 | RDS DB | Qlik server | TLSv1.3 | 142 | Application Data | 0.001 | 0.008 | -0.007 | -4% |
20 | RDS DB | Qlik server | TLSv1.3 | 135 | Application Data | 0.006 | 0.003 | 0.003 | 2% |
21 | Qlik server | RDS DB | TCP | 54 | 58313 > 5432 [ACK] Seq=836 Ack=5569 Win=262400 Len=0 | 0.000 | 0.001 | -0.001 | 0% |
22 | Qlik server | RDS DB | TLSv1.3 | 157 | Application Data | 0.005 | 0.007 | -0.002 | -1% |
23 | RDS DB | Qlik server | TLSv1.3 | 179 | Application Data | 0.018 | 0.001 | 0.018 | 10% |
24 | Qlik server | RDS DB | TLSv1.3 | 251 | Application Data | 0.011 | 0.000 | 0.011 | 6% |
25 | RDS DB | Qlik server | TLSv1.3 | 147 | Application Data | 0.018 | 0.000 | 0.018 | 11% |
26 | RDS DB | Qlik server | TLSv1.3 | 433 | Application Data, Application Data | 0.000 | 0.000 | 0.000 | 0% |
27 | RDS DB | Qlik server | TLSv1.3 | 98 | Application Data | 0.000 | 0.000 | 0.000 | 0% |
28 | Qlik server | RDS DB | TCP | 54 | 58313 > 5432 [ACK] Seq=1136 Ack=6210 Win=261888 Len=0 | 0.000 | 0.000 | 0.000 | 0% |
29 | Qlik server | RDS DB | TLSv1.3 | 93 | Application Data | 0.001 | 0.001 | 0.001 | 0% |
30 | RDS DB | Qlik server | TLSv1.3 | 148 | Application Data | 0.020 | 0.001 | 0.018 | 11% |
31 | RDS DB | Qlik server | TLSv1.3 | 98 | Application Data | 0.000 | 0.000 | 0.000 | 0% |
32 | Qlik server | RDS DB | TCP | 54 | 58313 > 5432 [ACK] Seq=1175 Ack=6348 Win=261632 Len=0 | 0.000 | 0.000 | 0.000 | 0% |
33 | Qlik server | RDS DB | TLSv1.3 | 81 | Application Data | 0.000 | 0.000 | 0.000 | 0% |
34 | Qlik server | RDS DB | TLSv1.3 | 78 | Application Data | 0.000 | 0.000 | 0.000 | 0% |
35 | Qlik server | RDS DB | TCP | 54 | 58313 > 5432 [FIN, ACK] Seq=1226 Ack=6348 Win=261632 Len=0 | 0.000 | 0.000 | 0.000 | 0% |
36 | RDS DB | Qlik server | TCP | 60 | 5432 > 58313 [ACK] Seq=6348 Ack=1226 Win=30104 Len=0 | 0.019 | 0.000 | 0.018 | 11% |
37 | RDS DB | Qlik server | TCP | 60 | 5432 > 58313 [FIN, ACK] Seq=6348 Ack=1227 Win=30104 Len=0 | 0.000 | 0.000 | 0.000 | 0% |
38 | Qlik server | RDS DB | TCP | 54 | 58313 > 5432 [ACK] Seq=1227 Ack=6349 Win=261632 Len=0 | 0.000 | 0.000 | 0.000 | 0% |
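The gap columns above are just the time since the previous packet in each capture, lined up packet-by-packet. Computing them from exported timestamps can be sketched like this (the timestamp values below are hypothetical, for illustration only):

```python
def packet_deltas(timestamps):
    """Gap between each packet and the one before it (first gap is 0)."""
    return [0.0] + [b - a for a, b in zip(timestamps, timestamps[1:])]

# Hypothetical capture timestamps (seconds), aligned packet-by-packet.
onprem = [0.000, 0.019, 0.019, 0.019, 0.037]
ec2    = [0.000, 0.001, 0.001, 0.006, 0.006]

for seq, (d1, d2) in enumerate(zip(packet_deltas(onprem), packet_deltas(ec2)), start=1):
    print(f"{seq}: on-prem {d1:.3f}s  ec2 {d2:.3f}s  diff {d1 - d2:+.3f}s")
```

In practice the timestamp lists would come from a CSV export of each capture (e.g. via Wireshark's File → Export Packet Dissections).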
The two captures showed a couple of things:
Firstly, both captures contained the same number of packets. That suggests no networking component between source and destination is dropping traffic or doing anything unexpected to the packet requests.
I cannot say for sure what is happening on the return trip, or whether anything on the AWS side is timing out.
Also, when taking the difference between the on-prem and EC2 captures, the same ~18 ms gap keeps popping up. I believe this is the round trip of the connection. Since it occurs many times per transaction, the latency compounds into quite a significant value.
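The compounding can be put into rough numbers. Assuming about eight 18 ms network waits per transaction (roughly what the on-prem column above shows; the exact count is an estimate), the latency alone caps the throughput of a single connection:

```python
rtt_s = 0.018        # observed on-prem round trip (seconds)
waits_per_txn = 8    # assumed: approximate count of 18 ms stalls per transaction

latency_per_txn = waits_per_txn * rtt_s
max_tps_single_conn = 1 / latency_per_txn

print(f"network latency per transaction: {latency_per_txn * 1000:.0f} ms")
print(f"ceiling on TPS for one connection: {max_tps_single_conn:.1f}")
```

Under those assumptions, one connection can never exceed roughly 7 TPS, no matter how fast the queries themselves run.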
What’s next?
I am not a network engineer, so I do not have the knowledge to dive deeper into the Wireshark packets.
It would be interesting to try the closer AWS data centre to see whether the shorter physical distance helps the latency. But that would require effort from the cloud team, and the project budget wouldn't stretch to this piece of work.
Our other option is to reduce the number of round trips between our on-prem server and the AWS data centre as much as possible.
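One common way to cut round trips is batching: folding many single-row statements into one, so a whole batch costs a single network exchange. A minimal sketch (the table and column names are hypothetical, and real code should use bind parameters rather than inlined values):

```python
def batch_update_sql(table, column, pairs):
    """Fold many single-row UPDATEs into one PostgreSQL statement using
    UPDATE ... FROM (VALUES ...), so the batch costs one round trip.
    Illustrative only: inline literals shown for clarity; use bind
    parameters in production to avoid SQL injection."""
    values = ", ".join(f"({pk}, '{val}')" for pk, val in pairs)
    return (
        f"UPDATE {table} AS t SET {column} = v.val "
        f"FROM (VALUES {values}) AS v(id, val) "
        f"WHERE t.id = v.id"
    )

# Two updates, one statement, one round trip.
print(batch_update_sql("accounts", "status", [(1, "open"), (2, "closed")]))
```

With an 18 ms round trip, collapsing 100 single-row updates into one such statement saves on the order of 1.8 seconds of pure network wait.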