I have a Python script that consolidates reports by using Pandas all along in a sequence of DataFrame operations (drop, groupby, sum, etc). Let's say I start with a simple function that cleans all columns that has no values, it has a DataFrame as input and output:
# cei.py
def clean_table_cols(source_df: pd.DataFrame) -> pd.DataFrame:
# IMPLEMENTATION
# eg. return source_df.dropna(axis="columns", how="all")
I wanted to verify in my tests that this function actually removes all columns that all values are empty. So I arranged a test input and output, and test with assert_frame_equal
function from pandas.testing:
# test_cei.py
import pandas as pd
def test_clean_table_cols() -> None:
df = pd.DataFrame(
{
"full_valued": [1, 2, 3],
"all_missing1": [None, None, None],
"some_missing": [None, 2, 3],
"all_missing2": [None, None, None],
}
)
expected = pd.DataFrame({"full_valued": [1, 2, 3], "some_missing": [None, 2, 3]})
result = cei.clean_table_cols(df)
pd.testing.assert_frame_equal(result, expected)
My question is if it is conceptually a unit test or an e2e/integration test, since I am not mocking pandas implementation. But if I mock DataFrame, I won't be testing the functionality of the code. What is the recommended way to test this following TDD best practices?
Note: using Pandas in this project is a design decision, so there on purpose no intention to abstract Pandas interfaces to maybe replace it with other library in the future.
Yes, this code is effectively an integration test, which may not be a bad thing.
Even if using pandas is a fixed design decision, there are still many good reasons to abstract from external libraries Testing is one of those. Abstracting from external libraries allows for testing of the business logic independently of the libraries. In this case, abstracting from pandas would make the above a unit test. It would test the interactions with the library.
To apply this pattern, I recommend taking a look at the ports and adapters architecture pattern
However, it does indeed mean that you're no longer testing the functionality provided by pandas. If this is still your specific intent, an integration test is not a bad solution.