What parts of analytical code to write unit tests for?

For the last while I've been writing analytical Python code that gets run on demand, through queue-based batch processing, when users interact with a front-end tool.

Typically the users set some values in the front end tool that get passed as parameters to the analytical code and they either supply a dataset or choose a subset of data from an overall data source that their company provides.

Typically each analytical model sits in a larger repo amongst other analytical models, so each model usually sits in its own module, and that module exports one function which is the entry point into that model. The models range from simple ones that run on the order of minutes to very complex statistical or machine-learning-based models that run on the order of hours, and they might use combinations of NumPy, pandas, Numba or Dask dataframes.

Now to my question: I've been going back and forth on where I should concentrate my testing efforts for this type of code. Earlier in my career I naively thought that every function should have a unit test, so that my code would have a comprehensive set of tests. I quickly realised that this was counter-productive, as even a small performance refactor could mean ripping apart, and possibly even throwing away, a lot of the unit tests. So it felt like I should only be writing tests for the main public function of each model. However, this usually causes the opposite problem: for some of the more complex models, edge cases deep in the control flow became hard to test.

My question then is: how should I be aiming to properly test these analytical models? Some people would probably say "Only test public-facing functions; if you can't test edge cases through the public-facing functions then they should technically not be reachable, so they don't need to be there". But I've found that in reality this doesn't quite work.

To provide a simple example, say the particular model calculates a frequency matrix for dropoff/pickup points from a taxi dataset.

import pandas as pd


def _cat(col1, col2):
    cat_col = col1.astype(str).str.cat(col2.astype(str), sep=', ')
    return cat_col


def _make_points_df(taxi_df):
    pickup_points = _cat(taxi_df["pickup_longitude"], taxi_df["pickup_latitude"])
    dropoff_points = _cat(taxi_df["dropoff_longitude"], taxi_df["dropoff_latitude"])
    points_df = pd.DataFrame({"pickup": pickup_points, "dropoff": dropoff_points})
    return points_df


def _points_df_to_freq_mat(points_df):
    mat_df = points_df.groupby(['pickup', 'dropoff']).size().unstack(fill_value=0)
    return mat_df


def _validate_taxi_df(taxi_df):
    if not isinstance(taxi_df, pd.DataFrame):
        raise TypeError(f"taxi_df param must be a pandas dataframe, got: {type(taxi_df)}")
    expected_cols = {
        "pickup_longitude",
        "pickup_latitude",
        "dropoff_longitude",
        "dropoff_latitude",
    }
    if set(taxi_df) != expected_cols:
        raise RuntimeError(
            f"Expected the following columns for taxi_df param: {expected_cols}. "
            f"Got: {set(taxi_df)}"
        )


def calculate_frequency_matrix(taxi_df, long_round=1, lat_round=1):
    """Calculate a dropoff/pickup frequency matrix which tells you the number of trips
    between each pair of discrete pickup and dropoff points. The resolution of these
    points is controlled by the long_round and lat_round params.

    Parameters
    ----------
    taxi_df : pandas.DataFrame
        Dataframe specifying dropoff and pickup long/lat coordinates
    long_round : int
        Number of decimal places to round the dropoff and pickup longitude values to
    lat_round : int
        Number of decimal places to round the dropoff and pickup latitude values to

    Returns
    -------
    pandas.DataFrame
        Dataframe in matrix format of frequency of dropoff/pickup points

    Raises
    ------
    TypeError : If taxi_df is not a pandas DataFrame
    RuntimeError : If taxi_df does not contain correct columns
    """
    _validate_taxi_df(taxi_df)
    taxi_df = taxi_df.copy()
    taxi_df["pickup_longitude"] = taxi_df["pickup_longitude"].round(long_round)
    taxi_df["dropoff_longitude"] = taxi_df["dropoff_longitude"].round(long_round)
    taxi_df["pickup_latitude"] = taxi_df["pickup_latitude"].round(lat_round)
    taxi_df["dropoff_latitude"] = taxi_df["dropoff_latitude"].round(lat_round)

    points_df = _make_points_df(taxi_df)
    mat_df = _points_df_to_freq_mat(points_df)
    return mat_df

The function takes in a dataframe like

    pickup_longitude  pickup_latitude  dropoff_longitude  dropoff_latitude
0         -73.988129        40.732029         -73.990173         40.756680
1         -73.964203        40.679993         -73.959808         40.655403
2         -73.997437        40.737583         -73.986160         40.729523
3         -73.956070        40.771900         -73.986427         40.730469
4         -73.970215        40.761475         -73.961510         40.755890
5         -73.991302        40.749798         -73.980515         40.786549
6         -73.978310        40.741550         -73.952072         40.717003
7         -74.012711        40.701527         -73.986481         40.719509

Say in terms of a folder structure this code would sit at analytics/models/taxi_freq/taxi_freq.py and the analytics/models/taxi_freq/__init__.py file would look like

from .taxi_freq import calculate_frequency_matrix

And obviously the private functions in the above code could be split across multiple utility files in analytics/models/taxi_freq/.

Would the consensus be to only test the calculate_frequency_matrix function, or should the "private" helper methods and other utility files/functions within the taxi_freq module also be tested?

Solution

As in software development in general, in testing you always have to look for solutions that represent the (ideally optimal) tradeoff between competing goals. One of the primary goals of testing in general, and of unit-testing in particular, is to find bugs (see Myers, Badgett, Sandler: The Art of Software Testing, or Beizer: Software Testing Techniques, among many others).

In your project you may have a more relaxed position on this, but there are many software projects where it would have serious consequences if implementation-level bugs escaped to later development phases or even to the field. Some say your goal should rather be to increase confidence in your code. This is also true, but confidence can only be a consequence of doing testing right. If you don't test to find bugs, then I will simply not have confidence in your code after you have finished testing.

When finding bugs is a primary goal of unit-testing, attempts to keep unit-test suites completely independent of implementation details are likely to result in ineffective test suites, that is, test suites that are not suited to finding all the bugs that could be found. Different implementations have different potential bugs. If you don't use unit-testing to find these bugs, then any other test level (integration, subsystem, system) is definitely less suited to finding them systematically.

For example, think about the different ways to implement a Fibonacci function: as an iterative or recursive function, as a closed-form expression (de Moivre/Binet), or as a lookup table. The interface is always the same, but the possible bugs differ significantly, and so do the unit-testing strategies. There will be a useful set of implementation-independent test cases, but these alone will not be sufficient to find all the bugs that are likely for the specific implementation.
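To make the Fibonacci point concrete, here is a minimal sketch in Python. The function names (fib_iterative, fib_closed_form, fib_lookup), the size of the table and the range of 50 checked for the closed form are illustrative assumptions, not taken from any real code base.

```python
import math


def fib_iterative(n):
    """Exact for any n >= 0, because Python integers do not overflow."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def fib_closed_form(n):
    """de Moivre/Binet formula; limited by floating-point precision."""
    sqrt5 = math.sqrt(5)
    phi = (1 + sqrt5) / 2
    return round(phi ** n / sqrt5)


_TABLE = (0, 1, 1, 2, 3, 5, 8, 13, 21, 34)  # hypothetical precomputed range


def fib_lookup(n):
    """Lookup table; only defined for the precomputed range."""
    if not 0 <= n < len(_TABLE):
        raise ValueError(f"n must be in [0, {len(_TABLE)}), got {n}")
    return _TABLE[n]


def check_shared_cases(fib):
    """Implementation-agnostic: holds for every variant."""
    assert [fib(n) for n in range(10)] == [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]


def check_closed_form_precision():
    """Implementation-specific: the float-based formula must stay exact
    over the range we intend to support (assumed here to be n < 50)."""
    for n in range(50):
        assert fib_closed_form(n) == fib_iterative(n)


def check_lookup_bounds():
    """Implementation-specific: the last table entry and one past it."""
    assert fib_lookup(len(_TABLE) - 1) == 34
    try:
        fib_lookup(len(_TABLE))
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError past the end of the table")


for fib in (fib_iterative, fib_closed_form, fib_lookup):
    check_shared_cases(fib)
check_closed_form_precision()
check_lookup_bounds()
```

The shared check passes for all three variants, so on its own it says nothing about the precision limit of the closed form or the bounds of the table. Those risks only exist because of how each variant is implemented, which is why the extra checks are tied to the implementation.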

The goal of having an effective test suite is therefore in competition with another goal, namely having a maintenance-friendly test suite. This goal, however, comes in different forms with different consequences. You could demand that the unit-test suite shall not be affected when implementation details change. This is quite tough and IMO puts the secondary goal of maintenance-friendly test code above the primary goal of finding bugs.

Meszaros has a more balanced formulation, namely "The effort for changes to the code base shall be commensurate with the effort to maintain the test suite." (see Meszaros: Principles of Test Automation: Ensure Commensurate Effort). That is, small changes to the SUT (system under test) shall only require small changes to the test suite, while for larger changes to the SUT it is acceptable that the test suite requires equally large changes. (However, for me personally the formulation "the effort for test code maintenance shall be low" is sufficient.)

Conclusion:

For me, since I see finding bugs as the primary goal and test suite maintainability as a secondary goal, the consequence is this: I accept that I also have to test implementation details in order to find more bugs. Nevertheless, I try to keep the maintenance effort low, mostly by applying the following two mechanisms, which aim to make it simpler to adjust the test suite when the SUT changes:

First, if the goal of a specific test case can be reached by either an implementation-agnostic test case or an implementation-dependent test case, prefer the implementation-agnostic one. In other words, don't make individual test cases unnecessarily implementation-dependent.
Second, hide implementation details behind helper functions. There can be helper functions for specific setups, teardowns, assertions, etc. This is an extremely powerful mechanism for limiting the effect of implementation details within the test suite.
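The two mechanisms above can be sketched as follows. To keep the example self-contained, it uses a plain-Python stand-in for the taxi model rather than the pandas code from the question; calculate_frequencies, make_trips and assert_frequency are hypothetical names introduced only for this illustration.

```python
from collections import Counter


def calculate_frequencies(trips, ndigits=1):
    """Hypothetical stand-in for the model under test: counts trips per
    (pickup, dropoff) pair after rounding the coordinates."""
    counts = Counter()
    for pickup_lon, pickup_lat, dropoff_lon, dropoff_lat in trips:
        pickup = (round(pickup_lon, ndigits), round(pickup_lat, ndigits))
        dropoff = (round(dropoff_lon, ndigits), round(dropoff_lat, ndigits))
        counts[(pickup, dropoff)] += 1
    return counts


# Test helpers: the only places that know how input and output are laid out.
def make_trips(*pairs):
    """Build input rows from ((lon, lat), (lon, lat)) pickup/dropoff pairs."""
    return [(*pickup, *dropoff) for pickup, dropoff in pairs]


def assert_frequency(result, pickup, dropoff, expected):
    """Check one cell of the result without exposing its container type."""
    actual = result.get((pickup, dropoff), 0)
    assert actual == expected, (
        f"{pickup} -> {dropoff}: expected {expected}, got {actual}"
    )


# Test cases: written against the public function and the helpers only.
def test_trips_in_same_cell_are_counted_together():
    trips = make_trips(
        ((-73.98, 40.73), (-73.99, 40.76)),
        ((-73.96, 40.74), (-74.01, 40.78)),
        ((-73.96, 40.68), (-73.96, 40.66)),
    )
    result = calculate_frequencies(trips, ndigits=1)
    assert_frequency(result, (-74.0, 40.7), (-74.0, 40.8), 2)
    assert_frequency(result, (-74.0, 40.7), (-74.0, 40.7), 1)


def test_unseen_pair_counts_as_zero():
    trips = make_trips(((-73.98, 40.73), (-73.99, 40.76)))
    result = calculate_frequencies(trips)
    assert_frequency(result, (-74.0, 40.8), (-74.0, 40.7), 0)


test_trips_in_same_cell_are_counted_together()
test_unseen_pair_counts_as_zero()
```

If the model later changed its input or output layout, for instance by returning a DataFrame instead of a Counter, only make_trips and assert_frequency would need to change, while the test cases themselves could stay as they are.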