We've never had an internet connection so reliable that we could just
ignore outages. No matter the technology, the link stops working often
enough that we have to plan for a reasonable backup. For the past
several years, since 4G services became usable in our area, a mobile
hotspot has been a viable fallback option for general traffic.
I remember using a Nokia phone as a GPRS modem, which required hackery
to make CDC-ACM work, and was barely fast enough to read email at the
best of times. 3G made things a little better, and Android phones with wifi
hotspots were easier to use, but it was only years later with 4G that
phone connectivity became fast enough and cheap enough for anything
other than work. (5G has not made it to our valley, but it works as
little as 5km away, across one big hill.)
Switching uplinks manually
At home, when our primary connection stopped working, the easiest thing
to do was to connect through the mobile hotspot on one or more of our
phones. We have SIMs from the "big four" telecom providers, and chances
are that one of them works well enough at any given time. Since we use
wireguard anyway, things would keep working.
But disconnecting from the house wifi meant that our laptops could
no longer access things like the NAS and printer. For any but the
briefest of interruptions, that was entirely impractical. “No problem”,
I thought, “I'll just make the home router connect to the hotspot.”
That was where the real problems began. I would log in to
the router whenever I realised that the connection wasn't working, and
switch uplinks manually. But sometimes it took me a while to notice; and
sometimes it wasn't as simple as “one link bad, other link good”, and I
had to investigate before I could restore connectivity. Any such delay,
and chances were that someone would get annoyed and end up connecting to
their hotspot anyway.
A search for sophisticated automation
I felt sure that I could find something to switch routes
automatically, rather than having to cobble together a script to run
ping and ip route commands.
Years passed. After much research, I learned that the “right way”
would be to use network bonding to aggregate both uplinks into a single
interface with an active-backup configuration. For various reasons, I
never got this to work well.
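For reference, setting up such an active-backup bond with iproute2 looks roughly like the commands below. This is a sketch, not my working configuration: the interface names (eth0, wlan0, bond0) and the 100ms link-monitoring interval are assumptions, and everything needs root.

```sh
# Create a bond that keeps one link active and fails over to the other.
# "miimon 100" makes the driver check link state every 100 ms.
ip link add bond0 type bond mode active-backup miimon 100

# Interfaces must be down before they can be enslaved.
ip link set eth0 down
ip link set eth0 master bond0
ip link set wlan0 down
ip link set wlan0 master bond0
ip link set bond0 up

# Mark the fibre link as the preferred (primary) slave.
echo eth0 > /sys/class/net/bond0/bonding/primary
```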
There were two main classes of problems. One of the links was wlan0,
subject to the vagaries of wpa_supplicant and dhclient. I wanted to be
able to connect to a different hotspot easily, and still have everything
work properly. In theory, all the pieces to put this together were
available, but I wasn't able to put them together into something that
worked reliably. Too many times, I would end up with wlan0 disconnected
altogether, or connected to a different network, but without (ever)
triggering a new DHCPREQUEST.
(I also tried iwd, which has an integrated DHCP client. It looks
promising, but it behaved strangely with multiple wifi interfaces, and
it would sometimes fail to reconnect after a disconnection. Maybe
someday it'll be the right answer to all of these problems, but not
today.)
The other problem was that the kernel triggers failover on a bonded
interface only when a link actually goes down, as when an Ethernet
cable is unplugged.
(This is also true for other things that do route failover, like routing
daemons.) In our case, though, an upstream router would drop our
packets, but the link would never physically go down. So I would need
something to monitor the link and force a failover by running a command.
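With the bonding driver, forcing that failover from a monitoring script is a single write to sysfs (bond0 and wlan0 are assumed names, carried over from the sketch above's conventions):

```sh
# Tell the bonding driver to switch the active slave to wlan0,
# even though the current link never physically went down.
echo wlan0 > /sys/class/net/bond0/bonding/active_slave
```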
Crude automation to the rescue
I wrote the simplest, crudest failover script I could.
#!/bin/bash

# We keep a static route through the main router, so that
# we can always tell if the fibre optic link is up.
APE="p.q.r.s"
ip route replace "$APE" via 192.168.11.1

switch_gw() {
    ip route replace default via "$1"
}

# If the current route doesn't provide outside connectivity
# for more than 15 seconds, switch to another one and see if
# it works any better. Rinse and repeat, forever.
while true; do
    # Which is our current default gateway?
    GW=$(
        ip -j route list |
            jq -r '.[] | select(.dst == "default") | .gateway' |
            head -1
    )

    # Hardcoded potential gateway options,
    # one via ethernet and one via wifi.
    if [[ $GW == "192.168.43.1" ]]; then
        OTHER_GW=192.168.11.1
    else
        OTHER_GW=192.168.43.1
    fi

    # The IP address of a host to ping to check connectivity.
    PING_DST="a.b.c.d"

    # If the preferred gateway is not explicitly set,
    # default to the faster link.
    PREFERRED_GW=$(cat /etc/default/preferred-gateway 2>/dev/null)
    if [[ $GW != "${PREFERRED_GW:-192.168.11.1}" ]]; then
        if /bin/ping -n -w1 -c1 "$APE" &>/dev/null; then
            echo "Route through $OTHER_GW is working"
            sleep 10
            if /bin/ping -n -w1 -c1 "$APE" &>/dev/null; then
                echo "Switching route back from $GW to $OTHER_GW"
                switch_gw "$OTHER_GW"
                sleep 10
                continue
            fi
        fi
    fi

    if ! /bin/ping -n -w1 -c1 "$PING_DST" &>/dev/null; then
        if ! /bin/ping -n -w2 -c1 "$PING_DST" &>/dev/null; then
            echo "Route through $GW is not working"
            sleep 10
            if ! /bin/ping -n -w3 -c1 "$PING_DST" &>/dev/null; then
                echo "Switching route from $GW to $OTHER_GW"
                switch_gw "$OTHER_GW"
            fi
        fi
    fi

    sleep 10
done
Don't miss the ladder of ping commands with increasing waits to
detect when a link is or isn't working. Is it crude? Yes. Does it work?
Yes. I start this script as a systemd service, and it runs forever.
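A minimal unit file for such a service might look like this (the unit name and script path are placeholders, not the actual setup):

```ini
# /etc/systemd/system/failover.service (hypothetical name and path)
[Unit]
Description=Crude uplink failover
After=network-online.target
Wants=network-online.target

[Service]
ExecStart=/usr/local/sbin/failover.sh
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
```

Restart=always means systemd brings the script back even if it crashes, which suits a fire-and-forget monitor like this one.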
This code has two features that I specifically wanted: it switches
back to the main connection when it starts working again, and I can echo
an IP address into /etc/default/preferred-gateway to force it to prefer
another gateway. (That's how I can avoid an uplink that has persistently
high packet loss.)
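The override relies on nothing more than bash's default-value expansion. Here is a self-contained sketch of that logic, using a throwaway temp file in place of /etc/default/preferred-gateway:

```shell
# Return the gateway from an override file if it exists and is
# non-empty, otherwise fall back to the default (192.168.11.1).
preferred_gw() {
    local override
    override=$(cat "$1" 2>/dev/null)
    echo "${override:-192.168.11.1}"
}

# Demo against a temporary path instead of the real file.
tmp=$(mktemp -u)            # a path that does not exist yet
preferred_gw "$tmp"         # no file: prints the default
echo 192.168.43.1 > "$tmp"
preferred_gw "$tmp"         # file present: prints the override
rm -f "$tmp"
```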
This code works for me, but it has hardcoded IP addresses and other
cruft, and I have no interest in making it more widely usable or
supporting its use.
Is there no hope for sophistication?
The weakest part of the approach above is the dependence on running
ping intermittently against a fixed list of targets in order to estimate
link quality.
Ideally, I would like a better measure of connectivity. For example,
checking in /proc/net/tcp to see if established TCP connections start
having retransmits, or by looking for an increase in the number of
failing DNS queries on the system. (I don't know offhand how to find an
answer to that, other than sniffing traffic. That would work for us at
home, but I'd probably try to find another way first.)
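As a rough sketch of the /proc/net/tcp idea: the seventh column of that file is the per-socket retransmission counter, so counting established connections (state 01) with a non-zero value gives a crude signal. This is illustrative only; a real check would compare successive readings rather than look at a single snapshot.

```shell
# Count established TCP connections (st == 01) whose retrnsmt
# column (field 7 of /proc/net/tcp) is non-zero. NR > 1 skips
# the header line.
count_retransmitting() {
    awk 'NR > 1 && $4 == "01" && $7 != "00000000" { n++ }
         END { print n + 0 }' "$1"
}

# Against the live system, you would run:
#   count_retransmitting /proc/net/tcp
```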
(One problem with this approach: it could check only the current
default route. It would need to do something else to figure out if other
links are working, in order to switch back to a preferred link after a
failure.)
But crude works so well that I don't plan to work on any of this.