When it comes to problem solving, people tend to go in the direction that is most familiar. If a client is having an issue, the developer will look in their code and the network engineer may look at port statistics on their switch. As one who wears the engineer hat most of the time, I tend to take the OSI model route.
Over the last few days a good portion of my time has been taken up by a VIP client that was having what seemed to be random call quality issues. The customer has multiple offices around Europe and is using AWS to host their PBX. We use VoIPMonitor to trace all of our calls. In addition, I built tooling that runs mtr traces once a minute to all IPs so that, should there be a network issue, we have a record of it. As I have said many times in the past, I don’t want to have to reproduce the issue; I want to be able to go back in time and see what the problem was so we can fix it right away. As soon as the complaints came in we went straight to VoIPMonitor and the traces. All the call captures showed packet loss and jitter. The mtr traces did show some jitter, but it did not match what we were seeing in the call captures. The call captures were showing a much grimmer (and what ended up being more accurate) picture. We asked the usual questions and got the standard response: they watch their servers and everything was fine. They insisted the issue was with our ISP.
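For anyone who wants something similar, here is a rough sketch of what a once-a-minute trace job could look like. This is not the tooling I actually run; the function names, the log directory and the target address are made up for illustration, and it assumes mtr is installed and on the PATH. It only uses mtr's report mode (`--report`, `--report-cycles`, `--no-dns`) and writes each report to a timestamped file so you can go back in time later.

```python
import datetime
import subprocess
import time
from pathlib import Path


def build_mtr_command(target, cycles=10):
    """Return an mtr invocation in report mode with DNS lookups disabled."""
    return ["mtr", "--report", "--report-cycles", str(cycles), "--no-dns", target]


def trace_once(target, log_dir):
    """Run one trace and save the report under a UTC timestamped file name."""
    stamp = datetime.datetime.now(datetime.timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    try:
        result = subprocess.run(
            build_mtr_command(target), capture_output=True, text=True, timeout=55
        )
        report = result.stdout
    except subprocess.TimeoutExpired:
        report = "mtr timed out\n"
    out_file = Path(log_dir) / f"{target}_{stamp}.txt"
    out_file.write_text(report)
    return out_file


def run_forever(targets, log_dir, interval=60):
    """Trace every target, then sleep out the rest of the interval."""
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    while True:
        started = time.monotonic()
        for target in targets:
            trace_once(target, log_dir)
        time.sleep(max(0.0, interval - (time.monotonic() - started)))


# Hypothetical usage (192.0.2.10 is a documentation address):
# run_forever(["192.0.2.10"], "/tmp/mtr-logs")
```

The targets are traced one after another here, so with more than a handful of IPs a single pass will take longer than a minute. In that case you would want to run the traces in parallel or hand the scheduling to cron.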
At this point I suggested that perhaps the issue was not between us and them but, as I suspected, between their AWS instance and their other offices. They didn’t think so but were willing to entertain the idea. I asked the client to set up traces from their AWS site to their offices so we could see where the issue was, and they assured me they would. Sure enough, the next day there were issues. When I asked for the traces they said they didn’t have them: the traffic was tunneled between sites, so a trace would not help. I explained that even if it was P2P, if there was some sort of internet issue we would see it in the form of jitter and/or packet loss from the remote device. I went so far as to launch an AWS instance in the same region as the client with an IP in the same /15. I then set up scripts to run traces to all of their IPs to see if I could replicate the issue.
Fast forward to this morning and the issues started again. I looked at the traces from my AWS instance and the traffic was clean. We did a screen share to the client’s server, where I ran several traces to the endpoints. I had the following running:
- MTR to the remote’s public IP
- MTR to the remote’s internal IP (where the traffic was going over IPsec)
- MTR to our site (for comparison)
- Ping to the remote’s external IP
- Ping to the remote’s internal IP (where the traffic was going over IPsec)
On my AWS instance I was doing both a ping and mtr trace.
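Pings like these give you a stream of round-trip times, and it helps to boil them down to a few numbers you can compare between hosts. Below is a minimal sketch, assuming the times have already been parsed into a list of milliseconds. The function name and the sample values are made up, and the "jitter" here is simply the average change between consecutive pings, not the smoothed interarrival jitter that RTP tools report.

```python
import statistics


def summarize_rtts(rtts_ms):
    """Reduce a list of round-trip times (ms) to min, max, mean, spread and jitter."""
    # Jitter here is the mean absolute change between consecutive samples.
    deltas = [abs(later - earlier) for earlier, later in zip(rtts_ms, rtts_ms[1:])]
    return {
        "min": min(rtts_ms),
        "max": max(rtts_ms),
        "mean": statistics.mean(rtts_ms),
        "spread": max(rtts_ms) - min(rtts_ms),
        "jitter": statistics.mean(deltas) if deltas else 0.0,
    }


# Made-up sample values, for illustration only.
sample = [65.0, 90.0, 70.0, 85.0, 66.0]
for name, value in summarize_rtts(sample).items():
    print(f"{name}: {value:.2f} ms")
```

Running the same summary against the suspect box and a known-good box side by side makes the difference obvious much faster than eyeballing two scrolling terminals.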
Right away both the mtr traces and the pings were showing jitter. On average there was a 20ms spread (ping times generally ranged from 65ms to 90ms). MTR made things look a bit worse than they were, since it would show a max ping time of up to 300ms. As a rule, unless the ping times are all over the place or they are always high, I don’t pay much attention if one in a few thousand packets has a delay. We watched the screens for a while and there was clearly a networking issue. The pings and traces from my AWS box were coming back clean. There was almost no jitter and the ping times were for the most part consistent. At this point I was thinking that perhaps the issue was with the host behind the instance, maybe it was overloaded. Perhaps there was another instance on the host box that was under attack. I started to poke around a bit more. As mentioned above, the client said the box was clean, so I didn’t look at the load right away. At this point I was out of ideas. I ran top and sure enough the load average was at 18. The box had 8 cores, which means there were more tasks waiting for the CPU than it was capable of handling. If you have a script that is doing compression and the CPU is overloaded, an extra moment won’t matter. On the other hand, with voice it absolutely matters. If a packet is delayed by even 20ms it can hurt the call.
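The load check I should have done first is easy to script. Here is a small sketch that compares the load average to the number of cores; the function names and the threshold of 1.0 per core are my own choices for illustration, not a hard rule. `os.getloadavg()` and `os.cpu_count()` are standard library calls, and the former only works on Unix-like systems.

```python
import os


def load_per_core(load_avg, cores):
    """Load average divided by core count; above 1.0 means tasks are queuing."""
    return load_avg / cores


def is_overloaded(load_avg, cores, threshold=1.0):
    """True when the per-core load exceeds the (assumed) threshold."""
    return load_per_core(load_avg, cores) > threshold


def check_local_host():
    """Report the 1-minute load of the machine this runs on."""
    one_minute, _, _ = os.getloadavg()
    cores = os.cpu_count() or 1
    return one_minute, cores, is_overloaded(one_minute, cores)


# The figures from this incident: a load of 18 on an 8-core box.
print(load_per_core(18, 8), is_overloaded(18, 8))
```

Calling `check_local_host()` on the box itself returns the current 1-minute load, the core count and whether the load is over the threshold, which is enough to hang an alert on.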
Next I checked the memory in use, whether any of it was in swap, and whether there was any wait on the disks, as these can be a cause of high load averages. They all came back clean. Next I ran top with the delay parameter set to 10 seconds. The reason for this is that, should there be a jump for a small amount of time, I would want enough time to see the processes that were causing the spike. By default top refreshes every 3 seconds. I then did “SHIFT + p” over and over till I saw which processes were causing an increase in CPU usage. There were a few scripts running that would randomly spike, with a total usage of nearly 800% of the CPU (which is all 8 cores on the box). This in turn caused the load to go over the ideal limit of 8, which seems to have been causing the issue. As the scripts were not essential, we killed them. Within a few minutes the load average went down to about 2 and the complaints stopped.
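I did all of this by hand in top, but the same idea can be scripted so the evidence is captured while nobody is watching. What follows is only a sketch under a few assumptions: it is Linux only, it reads CPU time from `/proc/<pid>/stat`, and every function name is made up. It takes two snapshots 10 seconds apart and reports CPU usage per process over that window, where 100% means one full core, the same convention top uses.

```python
import os
import time


def read_cpu_ticks():
    """Return {pid: (command, utime + stime in clock ticks)} from /proc."""
    snapshot = {}
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/stat") as handle:
                raw = handle.read()
        except OSError:
            continue  # the process exited while we were scanning
        # The command sits in parentheses and may itself contain spaces.
        head, _, tail = raw.rpartition(")")
        command = head.split("(", 1)[1]
        fields = tail.split()
        # Counting from the state field: utime is index 11, stime is index 12.
        snapshot[int(entry)] = (command, int(fields[11]) + int(fields[12]))
    return snapshot


def cpu_percent(before, after, interval_s, ticks_per_s):
    """Per-process CPU percentage over the interval, busiest first."""
    usage = []
    for pid, (command, ticks_after) in after.items():
        if pid not in before:
            continue  # started during the interval, no baseline to compare
        delta = ticks_after - before[pid][1]
        usage.append((100.0 * delta / ticks_per_s / interval_s, pid, command))
    return sorted(usage, reverse=True)


def busiest(interval_s=10, count=5):
    """Sample twice, interval_s apart, and return the top CPU consumers."""
    ticks_per_s = os.sysconf("SC_CLK_TCK")
    before = read_cpu_ticks()
    time.sleep(interval_s)
    after = read_cpu_ticks()
    return cpu_percent(before, after, interval_s, ticks_per_s)[:count]


# Hypothetical usage on a Linux box:
# for percent, pid, command in busiest():
#     print(f"{percent:6.1f}%  {pid:>7}  {command}")
```

One caveat is that processes that start and finish inside the window are missed entirely, so scripts that spawn lots of short-lived children will be under-reported.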
Whatever hat we are wearing, we can’t have tunnel vision. You need to always look at the whole picture and never assume. As an engineer I let the OSI model take over my thinking, and I didn’t start with the possibility that the box was overloaded, which caused it to start dropping packets.