I have to filter the texts I process by checking whether people's names appear in them (texts). If a name does appear, the text is appended as a nested dictionary to the existing list of dictionaries containing people's names (people). However, since more than one person's name can appear in a text, the same child document gets appended more than once. As a result, the repeated child documents share an ID, and a unique ID is very important even when the texts themselves are repeated.
Is there a smarter way of assigning a unique ID even if the texts are repeated?
My code:
import uuid

people = [{'id': 1,
           'name': 'Bob',
           'type': 'person',
           '_childDocuments_': [{'text': 'text_replace'}]},
          {'id': 2,
           'name': 'Kate',
           'type': 'person',
           '_childDocuments_': [{'text': 'text_replace'}]},
          {'id': 3,
           'name': 'Joe',
           'type': 'person',
           '_childDocuments_': [{'text': 'text_replace'}]}]

texts = ['this text has the name Bob and Kate',
         'this text has the name Kate only ']

for text in texts:
    childDoc = {'id': str(uuid.uuid1()),  # the id will duplicate when files are repeated
                'text': text}
    for person in people:
        if person['name'] in childDoc['text']:
            person['_childDocuments_'].append(childDoc)
Current output:
[{'id': 1,
  'name': 'Bob',
  'type': 'person',
  '_childDocuments_': [{'text': 'text_replace'},
                       {'id': '7752597f-410f-11eb-9341-9cb6d0897972',  # duplicate ID here
                        'text': 'this text has the name Bob and Kate'}]},
 {'id': 2,
  'name': 'Kate',
  'type': 'person',
  '_childDocuments_': [{'text': 'text_replace'},
                       {'id': '7752597f-410f-11eb-9341-9cb6d0897972',  # duplicate ID here
                        'text': 'this text has the name Bob and Kate'},
                       {'id': '77525980-410f-11eb-b667-9cb6d0897972',
                        'text': 'this text has the name Kate only '}]},
 {'id': 3,
  'name': 'Joe',
  'type': 'person',
  '_childDocuments_': [{'text': 'text_replace'}]}]
As you can see in the current output, the text 'this text has the name Bob and Kate' carries the same identifier, '7752597f-410f-11eb-9341-9cb6d0897972', under both Bob and Kate, because the same child document is appended twice. I would like each appended entry to have a different identifier.
Desired output:
Same as the current output, except that every appended text gets a different ID, even when the texts themselves are the same/duplicates.
Move the UUID generation inside the inner loop, so that a new child document (and a new id) is created for each matching person:
for text in texts:
    for person in people:
        if person['name'] in text:
            childDoc = {'id': str(uuid.uuid1()),
                        'text': text}
            person['_childDocuments_'].append(childDoc)
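With this change a fresh childDoc dictionary (and therefore a fresh UUID) is built for every matching person, so the same text appended under two people gets two different ids. A quick sanity check, as a minimal sketch to run after the loop above:
appended_ids = [doc['id']
                for person in people
                for doc in person['_childDocuments_']
                if 'id' in doc]  # the original 'text_replace' stubs have no id
assert len(appended_ids) == len(set(appended_ids))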
This does not actually ensure that the UUIDs are unique. For that you need a set of used UUIDs: when generating a new one, check whether it has already been used, and if it has, generate another; repeat until you either find an unused UUID or exhaust the UUID space.
There is roughly a 1 in 2**61 chance that duplicates are generated. I can't accept collisions, as they result in data loss, so when I use UUIDs I put a loop around the generator that looks like this:
import uuid

used = set()  # in the real program this set is stored persistently
while True:
    identifier = str(uuid.uuid1())
    if identifier not in used:
        used.add(identifier)
        break
The used set is actually stored persistently. I have a program that uses this code, but I don't like it, because it can end up in an infinite loop when it can't find an unused UUID.
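One way to avoid the infinite loop is to cap the number of attempts. A minimal sketch (the generate_unique_id helper and the max_attempts limit are illustrative, not part of my actual program):
import uuid

def generate_unique_id(used, max_attempts=100):
    """Return a UUID string not in `used`, raising instead of looping forever."""
    for _ in range(max_attempts):
        identifier = str(uuid.uuid1())
        if identifier not in used:
            used.add(identifier)
            return identifier
    raise RuntimeError('no unused UUID found after %d attempts' % max_attempts)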
Some document databases provide automatic UUID assignment; they do this internally to ensure that a given database instance never ends up with two documents sharing the same UUID.
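As a rough illustration only (a hypothetical in-memory DocumentStore class, not any real database's API), the idea looks something like this:
import uuid

class DocumentStore:
    """Toy store that assigns every inserted document its own unique id."""
    def __init__(self):
        self._docs = {}

    def insert(self, doc):
        doc_id = str(uuid.uuid1())
        # Regenerate in the (vanishingly unlikely) event of a collision.
        while doc_id in self._docs:
            doc_id = str(uuid.uuid1())
        self._docs[doc_id] = {**doc, 'id': doc_id}
        return doc_id

store = DocumentStore()
new_id = store.insert({'text': 'this text has the name Kate only '})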