Search code examples
pythonnested-listsunique-values

Python: eliminate duplicate nested lists


I am building a tagging model for natural language processing. Initially, the words of a sentence are tagged as a part of speech (like NN for noun), then the rules are applied that divide them into trees which are represented as nested lists. This process iterates many times until you get one node at the top level. I have a master list of all potential trees and I need to eliminate duplicate trees or the whole thing blows up in memory. Here is a small sample of what a list consists of. I need to make sure that each list in the list is unique as each iteration creates many branches.

[[('NP', [('PRP', 'I')]), ('VBD', 'ate'), ('DT', 'a'), ('NN', 'steak'), ('IN', 'with'), ('DT', 'a'), ('NN', 'knife'), ('.', '.')]
[('PRP', 'I'), ('VP', [('VBD', 'ate')]), ('DT', 'a'), ('NN', 'steak'), ('IN', 'with'), ('DT', 'a'), ('NN', 'knife'), ('.', '.')]
[('PRP', 'I'), ('VBD', 'ate'), ('NP', [('DT', 'a')]), ('NN', 'steak'), ('IN', 'with'), ('DT', 'a'), ('NN', 'knife'), ('.', '.')]
...]

I thought of using a set but lists aren't hashable. I tried using recursion and it runs out of memory. I thought about converting the lists to strings, using the string as a dictionary key and the list as the value, then iterating over and turning it back into a list again (or keep it as a dictionary?). Does anyone have a less hackish solution? I'm relatively new to Python so please provide an explanation of how your solution works.

I should clarify: the nested lists can be indefinitely deep. The tree structure is not known ahead of time but is built on the fly. Trying to build something like this -http://jos.oxfordjournals.org/content/25/4/345/F23.large.jpg but in the form of a nested list.


Solution

  • Thanks to sdasdadas for pointing out the duplicate. I was able to (more or less) solve it by creating this function:

     def list_to_tuple(a_list):
        temp_list = []
        for item in a_list:
            if isinstance(item, list) or isinstance(item, tuple):
                temp_list.append(list_to_tuple(item))
            else:
                temp_list.append(item)
        return tuple(temp_list)
    

    It takes an indefinitely deep nested list or tuples and gives back the equivalent thing in tuples. In another function, I pass this through a set to ensure unique values then it can go back to a list or whatever.

    Thanks for your answers.