I have a Python list. I need to compare every item against the others and replace each shorter string with the longest string that contains it.
EDIT: I have a list of people's names that I get using the spaCy module and its entity extraction. I get back a list where sometimes an entry is the full name and sometimes only part of the name. I want to normalize this list so every entry is the full name (or the longest version of the name in the article). This will help me determine who the most prominent/mentioned person in the article is.
small_example = ['David', 'David Stevens', 'Steve Martin']
small_example_outcome = ['David Stevens', 'David Stevens', 'Steve Martin']
Full Example:
person_list = ['Omarosa Manigault Newman', 'Manigault Newman', 'Trump', 'Apprentice', 'Mark Burnett', 'Manigault Newman', 'TAPES', 'Omarosa', 'Donald J. Trump', 'Omarosa', 'Donald J. Trump', 'Jacques Derrida', 'Derrida', 'Sigmund Freud', 'Mark Burnett', 'Manigault Newman', 'Manigault Newman', 'Trump', 'Mark Burnett']
Ideally what I'd have in the end is:
corrected_list = ['Omarosa Manigault Newman', 'Omarosa Manigault Newman', 'Donald J. Trump', 'Apprentice', 'Mark Burnett', 'Omarosa Manigault Newman', 'TAPES', 'Omarosa Manigault Newman', 'Donald J. Trump', 'Omarosa Manigault Newman', 'Donald J. Trump', 'Jacques Derrida', 'Jacques Derrida', 'Sigmund Freud', 'Mark Burnett', 'Omarosa Manigault Newman', 'Omarosa Manigault Newman', 'Donald J. Trump', 'Mark Burnett']
But a list like this would work too:
normalized_list = ['Omarosa Manigault Newman', 'Apprentice', 'Mark Burnett', 'TAPES', 'Jacques Derrida', 'Donald J. Trump', 'Sigmund Freud']
I think what you're looking for is whether each string is a substring of another string in the list?
If the list is pretty short, like this one, we can do that with a stupid quadratic search:
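corrected_list = []
for person in person_list:
    # Every entry that contains this name as a substring (including itself).
    matches = (other for other in person_list if person in other)
    # Keep the longest of those entries as the normalized name.
    longest = max(matches, key=len)
    corrected_list.append(longest)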
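To tie this back to the goal in the question, here is a minimal sketch that wraps the same quadratic search in a function and then derives both the de-duplicated list and the most-mentioned name. The function name `normalize_names` and the variables `unique_names`, `top_name`, and `top_count` are my own choices for illustration, not anything from spaCy.

```python
from collections import Counter


def normalize_names(names):
    """Map each name to the longest entry in `names` that contains it."""
    return [
        max((other for other in names if name in other), key=len)
        for name in names
    ]


small_example = ['David', 'David Stevens', 'Steve Martin']
normalized = normalize_names(small_example)
print(normalized)  # ['David Stevens', 'David Stevens', 'Steve Martin']

# dict.fromkeys drops duplicates while keeping first-seen order.
unique_names = list(dict.fromkeys(normalized))

# most_common(1) returns a one-element list of (name, count).
top_name, top_count = Counter(normalized).most_common(1)[0]
print(top_name, top_count)  # David Stevens 2
```

One caveat: `in` is a plain substring test, so a bare 'Steve' would also match 'David Stevens'. If that matters for your data, you'd want to compare whole words instead.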
If your list were huge, this would be too slow, and we'd need to do something cleverer, like building prefix and suffix tries. But for something this small, I think that's overkill.
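If you did need something faster, one simpler alternative to tries (a different technique, and only a sketch) is an inverted index from each word to the names containing it, so each name is only compared against names that share all of its words. The function name `normalize_names_indexed` is hypothetical, and note two behavioral differences from the loop above: matching happens on whole words, and ties in length are broken alphabetically rather than by list position.

```python
from collections import defaultdict


def normalize_names_indexed(names):
    """Like the quadratic search, but only compares names that share words."""
    distinct = set(names)

    # Inverted index: word -> set of distinct names containing that word.
    index = defaultdict(set)
    for name in distinct:
        for word in name.split():
            index[word].add(name)

    longest_for = {}
    for name in distinct:
        words = name.split()
        if not words:
            # Empty or whitespace-only entries are left unchanged.
            longest_for[name] = name
            continue
        # Only names that contain every word of `name` can contain `name`.
        candidates = set.intersection(*(index[word] for word in words))
        # The substring check confirms the words appear together and in order.
        matches = [c for c in candidates if name in c]
        # Break length ties alphabetically so the result is deterministic.
        longest_for[name] = max(matches, key=lambda c: (len(c), c))

    return [longest_for[name] for name in names]


print(normalize_names_indexed(['Trump', 'Donald J. Trump', 'Derrida', 'Jacques Derrida']))
```

This is still quadratic in the worst case (for example, if every name shares a word), so treat it as a way to skip obviously unrelated comparisons rather than a guaranteed speedup.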