Failure number one
Here is one failure, the regularity of which is oddly satisfying, yet also troubling:
Yes, every other packet is dropped, in a completely regular pattern. Not to mention that the remaining half have pretty appalling latency. I don't know what caused this one. Fortunately I didn't have to stick around long enough to find out.
Failure number two
Here is the second, with which I have become considerably more familiar:
In this variety of broken router, there are periodic dropouts every 20–40 seconds, each of which in turn lasts about 20–40 seconds. Overall packet loss ranges around 30–70%. This is worse than 30–70% of packets being lost randomly, mind you, because the 30-second-long dropouts, during which every single packet is lost, convince most software that the internet is down, so it stops retrying entirely. That makes stuff fail to work, rather than dealing with it as just a bad-quality network with high packet loss. The normal mechanisms in place to deal with random packet loss (like those baked into TCP) don't deal with these kinds of 30-second-long dropouts.
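To make the difference between random loss and long dropouts concrete, here is a minimal sketch that summarizes a probe trace. The function name `loss_stats` and the trace format (one character per probe, `1` for a reply and `0` for a loss) are assumptions made up for illustration, not something from my actual measurements:

```shell
#!/bin/sh
# loss_stats TRACE: summarize a hypothetical probe trace.
# TRACE is a string with one character per probe: 1 = reply, 0 = lost.
# Prints the overall loss percentage and the longest run of consecutive losses.
loss_stats() {
    trace=$1
    total=${#trace}
    if [ "$total" -eq 0 ]; then
        echo "empty trace" >&2
        return 1
    fi
    lost=0
    run=0
    longest=0
    while [ -n "$trace" ]; do
        rest=${trace#?}            # everything after the first character
        first=${trace%"$rest"}     # the first character itself
        if [ "$first" = 0 ]; then
            lost=$((lost + 1))
            run=$((run + 1))
            if [ "$run" -gt "$longest" ]; then
                longest=$run
            fi
        else
            run=0
        fi
        trace=$rest
    done
    echo "loss=$((100 * lost / total))% longest_outage=$longest"
}
```

Calling `loss_stats 1010101010` and `loss_stats 1111100000` reports the same 50% loss for both traces, but a longest outage of 1 probe for the first and 5 probes for the second. It is the second shape, stretched out to 30 seconds, that makes software give up.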
I've run into this failure twice now, both of them in buildings using the same kind of interface, a powerline-networking unit from devolo. These units are fairly common in the UK, as many buildings hastily converted to B&B type use have high-speed internet enter at the ground floor, and then distribute it through the building with powerline networking. Initially I thought this powerline networking aspect was at fault. When I first encountered the idea of transmitting data by slightly modulating the electric wiring in a house, it seemed pretty fantastical, and early units were indeed not of great quality, especially prone to interference from almost any kind of appliance plugged into the circuit.
But today's units really are quite reliable. I was able to figure out where the dropouts were occurring by downloading devolo's software. Yes, I know, I rarely expect much from this kind of software. And indeed, it was not great. But it was able to retrieve a map of the network, with the IP addresses of its various components. Given that information, simply pinging them all made it obvious that the dropouts were not over the powerline portion. That portion worked perfectly reliably. But the router on the far end of the powerline segment, which was supposed to route the data on to the internet, would periodically stop forwarding my packets.
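The "ping them all" step can be sketched roughly as follows. The addresses in `HOPS` are placeholders I made up for illustration (substitute whatever the network map reports), and the function names `probe` and `check_hops` are likewise hypothetical. Note that `-t 1` is the macOS ping timeout; on Linux that flag sets the TTL instead, so you would want `-W 1` there:

```shell
#!/bin/sh
# Placeholder addresses (assumptions): e.g. two powerline adapters, then the router.
HOPS="${HOPS:-192.168.0.10 192.168.0.11 192.168.0.1}"

# probe HOST: succeed if HOST answers a single ping within about one second.
# Uses macOS ping flags; -t is the overall timeout in seconds.
probe() {
    ping -c 1 -t 1 "$1" > /dev/null 2>&1
}

# check_hops: print one "up" or "DOWN" line per address in HOPS.
check_hops() {
    for hop in $HOPS; do
        if probe "$hop"; then
            echo "$hop up"
        else
            echo "$hop DOWN"
        fi
    done
}

# To watch continuously, something like:
#   while :; do date; check_hops; sleep 1; done
```

If the powerline adapters keep reporting "up" while an outside host goes "DOWN" for 30 seconds at a time, the dropouts are happening at or beyond the router rather than on the powerline segment.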
The first time I ran into this failure mode, I noticed that my Android phone didn't have these kinds of dropouts; only my Mac laptop did. (Yes, I disabled mobile data to make sure the phone wasn't transparently falling back to that.) This led to a suspicion that OSX was doing something weird that distressed the router, whether it was really OSX's fault or not. Alas, I was not able to find anything particular that did it. I played around with things like MTUs, based on some suggestions found on the internet, but this was all a dead end. I did come up with a very suboptimal workaround, though, based on the twin observations that my Mac laptop was not able to reliably get to the internet through this router, but my Android phone was: I tethered the laptop to the phone via bluetooth, using the phone as a wifi modem. This worked, although not all that well, since Android seems to forward bluetooth-tethered packets with quite some latency (frequently >1000ms latency).
The second time I ran into this exact same problem, though, on the same kind of device, I figured there must be a better way than bluetooth-tethering to my phone. Some experimentation revealed that cycling the wifi (i.e. going to "Turn Wi-Fi Off" then immediately "Turn Wi-Fi On" in OSX) would put it back into the "good" part of the 30-second cycle. After doing this manually for a bit to verify that it worked, I came up with the following shell script to do it automatically. This sends bursts of two pings per second, and uses OSX's networksetup utility to cycle the wifi if, in any given second, both pings fail to go through:
#!/bin/sh
while :
do
    if ! ping -i 0.5 -c 2 -t 1 mjn.anadrome.org > /dev/null
    then
        networksetup -setairportpower en0 off; networksetup -setairportpower en0 on
        echo "Dropout:" `date`
    fi
done
You can of course tweak the threshold for resetting here. Sending two pings per second and resetting the interface if both fail is an arbitrary choice.
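As a rough sketch of what such tweaking might look like, here is a parameterized variant. The variable names (`TARGET`, `COUNT`, `INTERVAL`, `TIMEOUT`, `MAX_FAILS`) and function names are my own inventions for illustration, and the interface is assumed to be `en0` as in the script above. With the defaults it behaves like the original; raising `MAX_FAILS` requires several failed bursts in a row before the wifi is cycled:

```shell
#!/bin/sh
# Tunable knobs (all names are illustrative assumptions).
TARGET="${TARGET:-mjn.anadrome.org}"   # host to ping
COUNT="${COUNT:-2}"                    # pings per burst
INTERVAL="${INTERVAL:-0.5}"            # seconds between pings within a burst
TIMEOUT="${TIMEOUT:-1}"                # seconds before ping gives up (macOS -t)
MAX_FAILS="${MAX_FAILS:-1}"            # consecutive failed bursts before a reset

fails=0

# burst_ok: succeed if at least one ping in the burst gets a reply.
burst_ok() {
    ping -i "$INTERVAL" -c "$COUNT" -t "$TIMEOUT" "$TARGET" > /dev/null 2>&1
}

# reset_link: cycle the wifi interface (assumed to be en0).
reset_link() {
    networksetup -setairportpower en0 off
    networksetup -setairportpower en0 on
}

# watch_once: run one burst; reset once MAX_FAILS bursts in a row have failed.
watch_once() {
    if burst_ok; then
        fails=0
    else
        fails=$((fails + 1))
        if [ "$fails" -ge "$MAX_FAILS" ]; then
            reset_link
            echo "Dropout: $(date)"
            fails=0
        fi
    fi
}

# Main loop (uncomment to run):
#   while :; do watch_once; done
```

A higher `MAX_FAILS` trades slower recovery for fewer needless resets when a single burst happens to get lost. Keep `TIMEOUT` at least as large as `COUNT` times `INTERVAL`, or the burst gets cut short before all pings are sent.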
Some further experimentation reveals that fully cycling the wifi isn't necessary. What resets the connection to a working state is simply requesting a new DHCP lease. This presumably identifies us as a new connection and resets a problematic buffer somewhere in the router. Unfortunately, the Mac OSX networksetup utility doesn't provide a way to request a new DHCP lease without cycling wifi entirely, even though the GUI in System Preferences does. Some friendly folks on the internet have however figured out how System Preferences does it, which allows us to produce the following shell script that runs the same commands that it runs behind the scenes. This version of the script must be run as root, since no user-level command-line tools expose the required functionality:
#!/bin/sh
while :
do
    if ! ping -i 0.5 -c 2 -t 1 mjn.anadrome.org > /dev/null
    then
        echo "add State:/Network/Interface/en0/RefreshConfiguration temporary" | scutil
        echo "Dropout:" `date`
    fi
done
With this script running, dropouts are recovered within 1-2 seconds. I now have only about 3% packet loss, versus 30-70% previously, which is low enough that most stuff works perfectly fine, treating the occasional dropped packet as routine. Problem solved, sort of!