RouterPing - Gathering ICMPv4 statistics about a router
Monday, September 24. 2018
When talking about the wonderful world of computers, networking and all the crap we enjoy tinkering with, sometime somewhere things don't go as planned. When thinking this bit closer, I think that's the default. Okok. The environment is distributed and in a distributed environment, there are tons of really nice places that can be misconfigured or flaky.
My fiber connection (see the details from an old post) started acting weird. Sometimes it failed to transmit anything in or out. Sometimes it failed to get a DHCP-lease. The problem wasn't that frequent, once or twice a month for 10-15 minutes. As the incident happened on any random moment, I didn't much pay attention to it. Then my neighbor asked if my fiber was flaky, because his was. Anyway, he called ISP helpdesk and a guy came to evaluate if everything was ok. Cable-guy said everything was "ok", but he "cleaned up" something. Mightly obscure description of the problem and the fix, right?
Since the trust was gone, I started thinking: Why isn't there a piece of software you can run and it would actually gather statistics if your connection works and how well ISP's router responds. Project is at https://github.com/HQJaTu/RouterPing. Go see it!
How to use the precious router_ping.py
Oh, that's easy!
First you need to establish a point in network you want to measure. My strong suggestion is to measure your ISP's router. That will give you an idea if your connection works. You can also measure a point away from you, like Google or similar. That woul indicate if your connection to a distant point would work. Sometimes there has been a flaw in networking preventing access to a specific critical resource.
On Linux-prompt, you can do a ip route show
. It will display your default gateway IP-address. It will say something like:
default via 62.0.0.1 dev enp1s0
That's the network gateway your ISP suggest you would want to use to make sure your Internet traffic would be routed to reach distant worlds. And that's precisely the point you want to measure. If you cannot reach the nearest point in the cold Internet-world, you cannot reach anything.
When you know what to measure, just run the tool.
Running the precious router_ping.py
Run the tool (as root):
./router_ping.py
<remote IP>
To get help about available options --help is a good option:
optional arguments:
-h, --help show this help message and exit
-i INTERVAL, --interval INTERVAL
Interval of pinging
-f LOGFILE, --log-file LOGFILE
Log file to log into
-d, --daemon Fork into background as a daemon
-p PIDFILE, --pid-file PIDFILE
Pidfile of the process
-t MAILTO, --mail-to MAILTO
On midnight rotation, send the old log to this email
address.
A smart suggestion is to store the precious measurements into a logfile. Also running the thing as a daemon would be smart, there is an option for it. Also, the default interval to send ICMP-requests is 10 seconds. Go knock yourself out to use whatever value (in seconds) is more appropriate for your measurements.
If you're creating a logfile, it will be auto-rotated at midnight of your computer's time. That way you'll get a bunch of files instead a huge one. The out-rotated file will be affixed with a date-extension describing that particular day the log is for. You can also auto-email the out-rotated logfile to any interested recipient, that's what the --mail-to
option is for.
Analysing the logs of the precious router_ping.py
Outputted logfile will look like this:
2018-09-24 09:09:25,105 EEST - INFO - 62.0.0.1 is up, RTT 0.000695
2018-09-24 09:09:35,105 EEST - INFO - 62.0.0.1 is up, RTT 0.000767
2018-09-24 09:09:45,105 EEST - INFO - 62.0.0.1 is up, RTT 0.000931
Above log excerpt should be pretty self-explanatory. There is a point in time, when a ICMP-request to ISP-router was made and response time (in seconds) was measured.
When something goes south, the log will indicate following:
2018-09-17 15:01:42,264 EEST - ERROR - 62.0.0.1 is down
The log-entry type is ERROR instead of INFO and the log will clearly indicate the failure. Obviously in ICMP, there is nothing you can measure. The failure is defined as "we waited a while and nobody replied". In this case, my definition of "a while" is one (1) second. Given the typical measurement being around 1 ms, thousand times that is plenty.
Finally
Unfortunately ... or fortunately depending on how you view the case, the cable guy fixed the problem. My system hasn't caught a single failure since. When it happens, I'll have proof of it.