python json pytest tweets

Pytest with line-delimited JSON


I'm relatively new to Python and really new to pytest. I'm trying to write some tests for parsing tweets that are stored as line-delimited JSON. Here's a simplified example, test_cases.jsonl:

{"contributors":null,"coordinates":null,"created_at":"Sat Aug 20 01:00:12 +0000 2016","entities":{"hashtags":[{"indices":[97,116],"text":"StandWithLouisiana"}]}}
{"contributors":null,"coordinates":null,"created_at":"Sat Aug 20 01:01:35 +0000 2016","entities":{"hashtags":[]}}

What I would like to do is test a function like the following:

def hashtags(t):
    return ' '.join([h['text'] for h in t['entities']['hashtags']])

I can test a single line of the JSON as follows:

import json
import pytest


@pytest.fixture
def tweet(file='test_cases.jsonl'):
    with open(file, encoding='utf-8') as lines:
        for line in lines:
            # returns on the first iteration, so only the first line is parsed
            return json.loads(line)


def test_hashtag(tweet):
    assert hashtags(tweet) == 'StandWithLouisiana'

(For this example, I'm just passing the file name as a default argument of the fixture function.)

This works in the sense that the test passes, because the first line satisfies the assertion. What I'm really trying to do is something like the following, although I don't expect it to work as written:

def test_hashtag(tweet):
    assert hashtags(tweet) == 'StandWithLouisiana' # first tweet
    assert hashtags(tweet) == ''    # second tweet

This fails because the second assertion checks whether the first tweet (the first line of the JSON file) is empty, not the second. I assume that's because of the return in the fixture, but if I try to yield instead of return, I get a "yield_fixture function has more than one 'yield'" error (and the second assertion still fails).
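
The early exit can be reproduced outside pytest. The following is only a minimal sketch: it assumes an in-memory file built with io.StringIO in place of test_cases.jsonl, trimmed-down tweets, and the hypothetical helper names first_tweet and all_tweets.

```python
import io
import json

# Two trimmed tweets standing in for test_cases.jsonl (assumed sample data).
JSONL = (
    '{"entities": {"hashtags": [{"text": "StandWithLouisiana"}]}}\n'
    '{"entities": {"hashtags": []}}\n'
)


def hashtags(t):
    return ' '.join(h['text'] for h in t['entities']['hashtags'])


def first_tweet(stream):
    # Same shape as the fixture body: `return` leaves the function on the
    # first iteration, so later lines are never read.
    for line in stream:
        return json.loads(line)


def all_tweets(stream):
    # Hypothetical alternative: parse every line instead of stopping early.
    return [json.loads(line) for line in stream]


only = first_tweet(io.StringIO(JSONL))
every = all_tweets(io.StringIO(JSONL))

print(hashtags(only))                # StandWithLouisiana
print([hashtags(t) for t in every])  # ['StandWithLouisiana', '']
```

Within a single test the fixture value is computed once, so both assertions in the test see that same first tweet.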

What I'm doing now to get around this issue is to make each line a separate JSON file and then create a separate fixture for each of them. (For shorter examples, I'm using StringIO to write the JSON inline.) This does work but feels inelegant. I have a feeling that I should use @pytest.mark.parametrize for this, but I can't get my head around it. I think I also tried pytest_generate_tests, but it wound up testing every key. Is it possible to do what I'm thinking of, or is it better to create separate fixtures when I have different values for the assertions?


Solution

  • I think the most fitting approach for you would be parametrizing the fixture:

    import json
    import pathlib
    import pytest
    
    
    lines = pathlib.Path('data.json').read_text(encoding='utf-8').splitlines()
    
    @pytest.fixture(params=lines)
    def tweet(request):
        line = request.param
        return json.loads(line)
    
    
    def hashtags(t):
        return ' '.join([h['text'] for h in t['entities']['hashtags']])
    
    
    def test_hashtag(tweet):
        assert hashtags(tweet) == 'StandWithLouisiana'
    

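    This will invoke test_hashtag once per line of the file, each time with the corresponding parsed tweet. Since the expected value is still hard-coded, only the first invocation passes: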

    $ pytest -v
    ...
    test_spam.py::test_hashtag[{"contributors":null,"coordinates":null,"created_at":"Sat Aug 20 01:00:12 +0000 2016","entities":{"hashtags":[{"indices":[97,116],"text":"StandWithLouisiana"}]}}]
    test_spam.py::test_hashtag[{"contributors":null,"coordinates":null,"created_at":"Sat Aug 20 01:01:35 +0000 2016","entities":{"hashtags":[]}}]
    ...
    

    Edit: extending the fixture to provide the expected value

    You can include the expected value in the tweet fixture parameters; it is then passed through to the test unchanged. In the example below, the expected tags are zipped with the file lines to build pairs of the form (line, tag). The tweet fixture parses the line into a dictionary and passes the tag through, so the tweet argument in the test becomes a pair of values.

    import json
    import pathlib
    import pytest
    
    
    lines = pathlib.Path('data.json').read_text(encoding='utf-8').splitlines()
    expected_tags = ['StandWithLouisiana', '']
    
    @pytest.fixture(params=zip(lines, expected_tags),
                    ids=tuple(repr(tag) for tag in expected_tags))
    def tweet(request):
        line, tag = request.param
        return (json.loads(line), tag)
    
    
    def hashtags(t):
        return ' '.join([h['text'] for h in t['entities']['hashtags']])
    
    
    def test_hashtag(tweet):
        data, tag = tweet
        assert hashtags(data) == tag
    

    The test run yields two tests as before:

    test_spam.py::test_hashtag['StandWithLouisiana'] PASSED
    test_spam.py::test_hashtag[''] PASSED
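
One thing to watch with this pairing: zip stops at the shorter input, so a missing entry in expected_tags would silently drop test cases rather than fail. Below is a minimal sketch of a guard, assuming the same two trimmed tweets inline instead of data.json; build_cases is a hypothetical helper, not part of pytest.

```python
import json

# Inline stand-ins for the lines of data.json (assumed, trimmed tweets).
lines = [
    '{"entities": {"hashtags": [{"text": "StandWithLouisiana"}]}}',
    '{"entities": {"hashtags": []}}',
]
expected_tags = ['StandWithLouisiana', '']


def build_cases(lines, expected_tags):
    """Pair each line with its expected tag, refusing mismatched lengths."""
    if len(lines) != len(expected_tags):
        raise ValueError(
            f'{len(lines)} lines but {len(expected_tags)} expected tags'
        )
    return list(zip(lines, expected_tags))


cases = build_cases(lines, expected_tags)
print(len(cases))  # 2
```

The resulting list could be passed as params= to the fixture in place of the bare zip(...) call.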
    

    Edit 2: using indirect parametrization

    Another, probably cleaner, approach is to let the tweet fixture handle only the parsing of the tweet from the raw string and to move the parametrization to the test itself. Indirect parametrization passes the raw line to the tweet fixture, while tag reaches the test directly:

    import json
    import pathlib
    import pytest
    
    
    lines = pathlib.Path('data.json').read_text(encoding='utf-8').splitlines()
    expected_tags = ['StandWithLouisiana', '']
    
    @pytest.fixture
    def tweet(request):
        line = request.param
        return json.loads(line)
    
    
    def hashtags(t):
        return ' '.join([h['text'] for h in t['entities']['hashtags']])
    
    
    @pytest.mark.parametrize('tweet, tag', 
                             zip(lines, expected_tags),
                             ids=tuple(repr(tag) for tag in expected_tags),
                             indirect=('tweet',))
    def test_hashtag(tweet, tag):
        assert hashtags(tweet) == tag
    

    The test run now also yields two tests:

    test_spam.py::test_hashtag['StandWithLouisiana'] PASSED
    test_spam.py::test_hashtag[''] PASSED
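
The examples above split the file into raw lines at import time. An empty line, for instance one left at the end of the file, is not valid JSON, so json.loads would raise an error if such a line reached the fixture. Below is a minimal sketch of a loader that skips blank lines, assuming in-memory text in place of data.json; nonblank_lines is a hypothetical helper name.

```python
import json


def nonblank_lines(text):
    """Return the lines of `text` that contain something other than whitespace."""
    return [line for line in text.splitlines() if line.strip()]


# Assumed sample content: two trimmed tweets, a blank line between them,
# and a trailing newline.
text = (
    '{"entities": {"hashtags": [{"text": "StandWithLouisiana"}]}}\n'
    '\n'
    '{"entities": {"hashtags": []}}\n'
)

lines = nonblank_lines(text)
tweets = [json.loads(line) for line in lines]
print(len(tweets))  # 2
```

With a real file, the text would come from pathlib.Path('data.json').read_text() as in the examples above.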