python dataset huggingface-datasets

How can I take the unique rows of a Huggingface Dataset?


Hugging Face Datasets have a unique method, which returns a list of the unique values in a particular column. This method is very fast.

I'd like to do something similar, with two differences:

  1. I need not just the first column (id) but also another column (answer). For any given id, the answer is always the same, so it doesn't matter which row the answer is taken from.

  2. I'd like the result to be a Dataset, not a list, because there are a lot of values and I'd rather not load them all into Python memory.

How can I do this?


Solution

  • As far as I can tell from the current documentation, without converting to pandas there is no way to do this short of iterating over the dataset twice and keeping an intermediate variable. Other developers have reported the same problem, so deduplication appears to be less straightforward than one would expect.

    At the time of writing, Hugging Face Datasets alone offers no direct way to achieve what you want, unless you are willing to convert to pandas and back.

    If you still want to avoid pandas, one potential solution is the following (note that it does need an intermediate list):

        import numpy as np

        # Intermediate list of ids; answer is constant per id, so id alone identifies a unique row
        initial_list = dataset['id']
        _, unique_indices = np.unique(initial_list, return_index=True)  # index of first occurrence of each id
        filtered_dataset = dataset.select(unique_indices.tolist())
    

    PS: I understand that this is probably not what you were hoping for, but unfortunately there really is no built-in, off-the-shelf solution in pure Hugging Face Datasets.
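
As a variation on the index-selection idea above, here is a minimal sketch that computes the first-occurrence indices in plain Python and keeps rows in their original order. The helper name first_occurrence_indices is hypothetical (it is not part of Hugging Face Datasets), and the short list of ids is only a stand-in for the id column of a real dataset.

```python
from typing import Hashable, Iterable, List


def first_occurrence_indices(keys: Iterable[Hashable]) -> List[int]:
    """Return the position of the first occurrence of each distinct key."""
    seen = set()   # keys encountered so far (holds unique values only)
    indices = []   # positions of rows to keep, in original order
    for i, key in enumerate(keys):
        if key not in seen:
            seen.add(key)
            indices.append(i)
    return indices


# Stand-in for the id column; with a real Dataset you would pass dataset["id"].
ids = [3, 1, 3, 2, 1]
print(first_occurrence_indices(ids))  # [0, 1, 3]
```

With a real dataset, passing the result to dataset.select(...) should give a Dataset with one row per id. Only the set of distinct ids and the list of kept indices are held in Python memory, although the id column still has to be iterated once.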