Automatically switching to a working network

By Abhijit Menon-Sen

We've never had an internet connection so reliable that we could just ignore outages. No matter the technology, the link stops working often enough that we have to plan for a reasonable backup. For the past several years, since 4G services became usable in our area, a mobile hotspot has been a viable fallback for general traffic.

I remember using a Nokia phone as a GPRS modem, which required hackery to make CDC-ACM work, and was barely fast enough to read email at the best of times. 3G made things a little better, and Android phones with wifi hotspots were easier to use, but it was only years later with 4G that phone connectivity became fast enough and cheap enough for anything other than work. (5G has not made it to our valley, but it works as little as 5km away, across one big hill.)

Switching uplinks manually

At home, when our primary connection stopped working, the easiest thing to do was to connect through the mobile hotspot on one or more of our phones. We have SIMs from the "big four" telecom providers, and chances are that one of them works well enough at any given time. Since we use wireguard anyway, things would keep working.

But disconnecting from the house wifi meant that our laptops could no longer access things like the NAS and printer. For any but the briefest of interruptions, that was entirely impractical. “No problem”, I thought, “I'll just make the home router connect to the hotspot.”

That was where the real problems began. I would log in to the router whenever I realised that the connection wasn't working, and switch uplinks manually. But sometimes it took me a while to notice; and sometimes it wasn't as simple as “one link bad, other link good”, and I had to investigate before I could restore connectivity. Any such delay, and chances were that someone would get annoyed and end up connecting to their hotspot anyway.

A search for sophisticated automation

I felt sure that I could find something to switch routes automatically, rather than having to cobble together a script to run ping and ip route commands.

Years passed. After much research, I learned that the “right way” would be to use network bonding to aggregate both uplinks into a single interface with an active-backup configuration. For various reasons, I never got this to work well.

There were two main classes of problems. One of the links was wlan0, subject to the vagaries of wpa_supplicant and dhclient. I wanted to be able to connect to a different hotspot easily, and still have everything work properly. In theory, all the necessary pieces were available, but I wasn't able to combine them into something that worked reliably. Too many times, I would end up with wlan0 disconnected altogether, or connected to a different network, but without (ever) triggering a new DHCPREQUEST.

(I also tried iwd, which has an integrated DHCP client. It looks promising, but it behaved strangely with multiple wifi interfaces, and it would sometimes fail to reconnect after a disconnection. Maybe someday it'll be the right answer to all of these problems, but not today.)

The other problem is that the kernel triggers failover on a bonded interface when a link goes down—imagine unplugging an Ethernet cable. (This is also true for other things that do route failover, like routing daemons.) In our case, though, an upstream router would drop our packets, but the link would never physically go down. So I would need something to monitor the link and force a failover by running a command.

Crude automation to the rescue

I wrote the simplest, crudest failover script I could.

#!/bin/bash

# We keep a static route through the main router, so that
# we can always tell if the fibre optic link is up.

APE="p.q.r.s"
ip route replace "$APE" via 192.168.11.1

switch_gw() {
    ip route replace default via "$1"
}

# If the current route doesn't provide outside connectivity
# for more than 15 seconds, switch to another one and see if
# it works any better. Rinse and repeat, forever.

while /bin/true;
do
    # Which is our current default gateway?
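    # (Each line ends with a pipe so that the pipeline continues
    # onto the next line inside the command substitution.)
    GW=$(
        ip -j route list |
        jq -r '.[] | select(.dst == "default") | .gateway' |
        head -1
    )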

    # Hardcoded potential gateway options,
    # one via ethernet and one via wifi.

    if [[ $GW == "192.168.43.1" ]];
    then
        OTHER_GW=192.168.11.1
    else
        OTHER_GW=192.168.43.1
    fi

    # The IP address of a host to ping to check connectivity.
    PING_DST="a.b.c.d"

    # If the preferred gateway is not explicitly set,
    # default to the faster link.
    PREFERRED_GW=$(cat /etc/default/preferred-gateway 2>/dev/null)

    if [[ $GW != "${PREFERRED_GW:-192.168.11.1}" ]];
    then
        # $APE always goes via the static route set up above, so this
        # checks the main link even when the default route is elsewhere.
        if /bin/ping -n -w1 -c1 "$APE" &>/dev/null;
        then
            echo "Route through $OTHER_GW is working"
            sleep 10;
            if /bin/ping -n -w1 -c1 "$APE" &>/dev/null;
            then
                echo "Switching route back from $GW to $OTHER_GW"
                switch_gw "$OTHER_GW"
                sleep 10
                continue
            fi
        fi
    fi

    if ! /bin/ping -n -w1 -c1 "$PING_DST" &>/dev/null;
    then
        if ! /bin/ping -n -w2 -c1 "$PING_DST" &>/dev/null;
        then
            echo "Route through $GW is not working"
            sleep 10;
            if ! /bin/ping -n -w3 -c1 "$PING_DST" &>/dev/null;
            then
                echo "Switching route from $GW to $OTHER_GW"
                switch_gw "$OTHER_GW"
            fi
        fi
    fi

    sleep 10
done

Don't miss the ladder of ping commands with increasing waits to detect when a link is or isn't working. Is it crude? Yes. Does it work? Yes. I start this script as a systemd service, and it runs forever.
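
For readers who want the ladder of ping commands in isolation, here is a minimal sketch of the same idea as a function. The names probe, link_is_down, and PAUSE are hypothetical and do not appear in the script above. The deadlines (1, 2, and 3 seconds) and the 10-second pause are copied from it.

```shell
# probe HOST SECONDS: succeed if one ping to HOST is answered
# within SECONDS. (Hypothetical helper, not part of the script above.)
probe() {
    ping -n -w "$2" -c1 "$1" >/dev/null 2>&1
}

# link_is_down HOST: succeed only if three probes in a row fail,
# with deadlines of 1, 2, and 3 seconds, and a pause before the
# last one. Any successful probe means the link is still usable.
link_is_down() {
    probe "$1" 1 && return 1
    probe "$1" 2 && return 1
    sleep "${PAUSE:-10}"
    probe "$1" 3 && return 1
    return 0
}
```

With this, the last block of the loop would reduce to something like: if link_is_down "$PING_DST"; then switch_gw "$OTHER_GW"; fi. A single lost packet never triggers a switch, because every rung of the ladder has to fail.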

This code has two features that I specifically wanted: it switches back to the main connection when it starts working again, and I can echo an IP address into /etc/default/preferred-gateway to force it to prefer another gateway. (That's how I can avoid an uplink that has persistently high packet loss.)
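
As a rough illustration of the override, assuming the file path and the hotspot gateway address used in the script (the helper names prefer_gateway and clear_preference, and the PREF_FILE variable, are made up for this sketch):

```shell
# File that the failover script reads on every iteration.
# PREF_FILE is a hypothetical knob, so the helpers can be tried
# against a scratch file instead of the real one.
PREF_FILE="${PREF_FILE:-/etc/default/preferred-gateway}"

# Prefer the given gateway until the preference is cleared.
prefer_gateway() {
    printf '%s\n' "$1" > "$PREF_FILE"
}

# Go back to the script's built-in default (the faster link).
clear_preference() {
    rm -f "$PREF_FILE"
}
```

Running prefer_gateway 192.168.43.1 as root keeps traffic on the hotspot while the fibre link is flaky; clear_preference restores the usual behaviour. The script picks up the change on its next pass through the loop, with no restart needed.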

This code works for me, but it has hardcoded IP addresses and other cruft, and I have no interest in making it more widely usable or supporting its use.

Is there no hope for sophistication?

The weakest part of the approach above is the dependence on running ping intermittently against a fixed list of targets in order to estimate link quality.

Ideally, I would like a better measure of connectivity: for example, checking /proc/net/tcp to see whether established TCP connections start having retransmits, or looking for an increase in the number of failing DNS queries on the system. (I don't know offhand how to measure the latter, other than by sniffing traffic. That would work for us at home, but I'd probably try to find another way first.)
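
The /proc/net/tcp idea could start from something like the sketch below. It is untested as a failover signal, and the function name count_retransmitting is invented here. It relies on the usual Linux layout of that file, where the fourth column is the connection state (01 means ESTABLISHED) and the seventh is the retrnsmt counter. IPv6 connections are listed separately in /proc/net/tcp6.

```shell
# count_retransmitting [FILE]: print the number of ESTABLISHED TCP
# connections whose retransmit counter is currently non-zero.
# FILE defaults to /proc/net/tcp; pass /proc/net/tcp6 for IPv6.
count_retransmitting() {
    awk '
        NR > 1 && $4 == "01" && $7 != "00000000" { n++ }
        END { print n + 0 }
    ' "${1:-/proc/net/tcp}"
}
```

A monitor could sample this every few seconds and treat a sustained non-zero count as a hint that the current route is losing packets. What threshold makes sense would have to be found by experiment.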

(One problem with this approach: it could check only the current default route. It would need to do something else to figure out if other links are working, in order to switch back to a preferred link after a failure.)

But crude works so well that I don't plan to work on any of this.