I have a Service in which I run a Runnable repeatedly on a Handler backed by a HandlerThread's Looper (obtained via getLooper()). The Runnable is posted once with post() when the Service starts, and thereafter it reschedules itself with postDelayed(). I use this pattern successfully to run tasks indefinitely in several other Services in this app, but in one particular Service it occasionally stops working after hundreds or thousands of invocations: I no longer get the "checking" log output or see the effects of changed settings. Here's the bare-bones relevant code:
public class SettingsMonitor extends Service {

    private final HandlerThread mHandlerThread = new HandlerThread("MonitorHandler");
    private Handler mMonitorHandler;

    @Override
    public int onStartCommand(Intent intent, int flags, int startId) {
        mHandlerThread.start();
        mMonitorHandler = new Handler(mHandlerThread.getLooper());
        mMonitorHandler.post(checkSettings);  // first run; it reschedules itself thereafter
        return START_STICKY;
    }

    final Runnable checkSettings = new Runnable() {
        @Override
        public void run() {
            mLastTryMillis = SystemClock.elapsedRealtime();
            Log.d(TAG, "checking device settings");
            if (newSettings()) {
                applySettings();
            }
            // clear anything still pending before scheduling the next run
            mMonitorHandler.removeCallbacksAndMessages(null);
            mMonitorHandler.postDelayed(checkSettings, MONITOR_INTERVAL_MILLIS);
        }
    };
}
When I realized this was sometimes failing, I added the following check-and-restart method, called periodically from the main activity, as part of trying to understand what was going on. The keep-alive code itself runs fine -- I start seeing the "restarting" log output as soon as the Runnable stalls -- but the Runnable never actually runs again (I never see the "checking" log output again):
public void checkAndRestartIfNeeded() {
    long maxMissedMillis = MAX_MISSED_INTERVALS * MONITOR_INTERVAL_MILLIS;
    if (millisSince(mLastTryMillis) > maxMissedMillis) {
        Log.d(TAG, "stalled -- restarting");
        mMonitorHandler.removeCallbacksAndMessages(null);
        mMonitorHandler.post(checkSettings);
    }
}
Any ideas why the self-rescheduling would fail in the first place? And any ideas (presumably related?) why the keep-alive hack fails to get it going again? Is something crashing the Runnable, leaving it unable to run, without bringing down the rest of the Service or app? I have an app-level uncaught-exception handler to log unexpected things like this, but it never fires.
In case it's not clear, the log output from a run might look like this:
checking device settings
checking device settings
checking device settings
... hundreds or thousands of repetitions, then ...
stalled -- restarting
stalled -- restarting
stalled -- restarting
... forever ...
I was eventually able to establish that no exceptions were being thrown and no threads or processes were dying, so there was nothing to "keep alive". Everything was, in a sense, working fine; a call was simply blocking for hours to days before returning, when it "should not have been" able to. Details follow:
The newSettings() method in our code was making two HTTP requests: one to an API on our own server to check for changes to the app settings (our devices are fixed deployments managed remotely, without hands-on intervention), and one to api.ipify.org to get the device's current public IP address. Both calls used a URLConnection configured with a connect timeout of 10 seconds and a read timeout of 10 seconds, both set before the connection was opened.
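For context, here is a minimal sketch of the kind of call involved -- the class and method names are mine, the endpoint is just a parameter, and the Scanner-based read matches what I describe replacing further down; this is not the app's actual code:

import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.util.Scanner;

public class SettingsFetch {
    // Fetch the body of a URL with 10-second connect and read timeouts,
    // both set before the connection is opened.
    static String fetch(String endpoint) throws Exception {
        URLConnection conn = new URL(endpoint).openConnection();
        conn.setConnectTimeout(10_000); // 10 s to establish the connection
        conn.setReadTimeout(10_000);    // 10 s of read inactivity once connected
        try (InputStream in = conn.getInputStream();
             Scanner scanner = new Scanner(in).useDelimiter("\\A")) {
            return scanner.hasNext() ? scanner.next() : "";
        }
    }
}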
Most of the time, the connections either succeeded or the timeouts worked as expected. Sometimes, however -- for reasons I do not understand -- the attempt to connect to the 3rd-party API and retrieve content from it would take hours to days to complete. If I waited long enough, the call would eventually return, and the self-rescheduling task in the Service resumed normal behavior, as though there had never been a problem.
I do not understand why the timeouts seemingly had no effect on these occasions. The only possible clue I have found is that for domain names that resolve to multiple IP addresses, as api.ipify.org does from this site and others, URLConnection will attempt to connect to each address in turn until one succeeds, and each of those attempts gets the specified connect and read timeouts, so the total time taken can be several multiples of the individual timeouts (a quick way to see this is sketched below). That does not explain how the call could block for hours to days, but it's the only hint I've come across so far.
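For example, a quick standalone check like this shows how many addresses the name resolves to, and hence how far a worst-case connect can stretch beyond a single 10-second timeout (the class and output are just for illustration):

import java.net.InetAddress;

public class ResolveCheck {
    public static void main(String[] args) throws Exception {
        InetAddress[] addresses = InetAddress.getAllByName("api.ipify.org");
        for (InetAddress address : addresses) {
            System.out.println(address.getHostAddress());
        }
        // URLConnection may try each address in turn, so a plain connect can
        // take up to addresses.length * connectTimeout before giving up.
        System.out.println("worst-case connect wait ~ " + (addresses.length * 10) + " s at a 10 s timeout");
    }
}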
In the process of figuring this out, I realized that there is no need to call the 3rd-party API to get the device's IP address at all, because the address is already visible to our own API when the device checks for its settings. So I have eliminated the second call and modified our settings API to include the caller's IP address in its response (the idea is sketched below); I have also eliminated the keep-alive mechanism because it is unnecessary. These changes reduced the frequency of the problem I originally posted about, but it was still happening about once every 15,000 calls -- still not acceptable for our use case.
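On the server side this is trivial. Purely for illustration (our actual server stack isn't part of this post, and the "clientIp" field name is hypothetical), a servlet-style handler could echo the caller's address back alongside the usual settings payload:

import java.io.IOException;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class SettingsEndpoint {
    void handle(HttpServletRequest request, HttpServletResponse response) throws IOException {
        String callerIp = request.getRemoteAddr(); // the device's public IP as the server sees it
        response.setContentType("application/json");
        // In practice this would be merged into the normal settings response.
        response.getWriter().write("{\"clientIp\":\"" + callerIp + "\"}");
    }
}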
Somewhere along the way I read something (I have lost track of where) suggesting that the Scanner I was using in newSettings() to consume the connection's getInputStream() was a less robust/reliable approach than a BufferedReader wrapped around an InputStreamReader. Since changing newSettings() to use that approach, there have been zero occurrences of the problem in 30,000+ calls, so I'm hopeful the problem is fully resolved.
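Roughly, the read now looks like this -- the same illustrative sketch as above, with the Scanner replaced by a BufferedReader on an InputStreamReader (again, names and structure are mine, not the exact app code):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

public class SettingsFetch {
    static String fetch(String endpoint) throws Exception {
        URLConnection conn = new URL(endpoint).openConnection();
        conn.setConnectTimeout(10_000);
        conn.setReadTimeout(10_000);
        StringBuilder body = new StringBuilder();
        // Read the response line by line via BufferedReader instead of a Scanner.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                body.append(line).append('\n');
            }
        }
        return body.toString();
    }
}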