Insert into mongo by skipping duplicates along two indices


I have a MongoDB collection that contains the following data (inserted using insert_many):

[
    {"attr_name": "a", "values": [
        {"value": "1", "embedding": [1,2,3]},
        {"value": "2", "embedding": [2,3,4]},
        {"value": "3", "embedding": [3,4,5]},
    ]},
    {"attr_name": "b", "values": [
        {"value": "1", "embedding": [1,2,3]},
        {"value": "4", "embedding": [4,5,6]},
    ]},
    {"attr_name": "c", "values": [
        {"value": "6", "embedding": [6,7,8]},
        {"value": "7", "embedding": [7,8,9]},
    ]},
]

I want to avoid duplicates on the combination of attr_name and value. This is enforced by:

collection.create_index(["attr_name", "value"], unique=True)

When new data is inserted and a document with the same attr_name already exists, I want the new values appended to that document's values array. Currently, if the attr_name matches, the entire entry is skipped.

For example, I have this:

[
    {"attr_name": "a", "values": [
        {"value": "1", "embedding": [1,2,3]},
        {"value": "2", "embedding": [2,3,4]},
    ]},
    {"attr_name": "b", "values": [
        {"value": "1", "embedding": [1,2,3]},
        {"value": "4", "embedding": [4,5,6]},
    ]},
]

I'm inserting this:

[
    {"attr_name": "a", "values": [
        {"value": "1", "embedding": [1,2,3]},
        {"value": "5", "embedding": [5,6,7]},
        {"value": "6", "embedding": [6,7,8]},
    ]},
    {"attr_name": "c", "values": [
        {"value": "6", "embedding": [6,7,8]},
        {"value": "7", "embedding": [7,8,9]},
    ]},
]

I want this to be the final state:

[
    {"attr_name": "a", "values": [
        {"value": "1", "embedding": [1,2,3]},
        {"value": "2", "embedding": [2,3,4]},
        {"value": "5", "embedding": [5,6,7]},    # <---- appended
        {"value": "6", "embedding": [6,7,8]},    # <---- appended
    ]},
    {"attr_name": "b", "values": [
        {"value": "1", "embedding": [1,2,3]},
        {"value": "4", "embedding": [4,5,6]},
    ]},
    {"attr_name": "c", "values": [
        {"value": "6", "embedding": [6,7,8]},
        {"value": "7", "embedding": [7,8,9]},
    ]},
]

Solution

  • I think you may need to issue individual update_one commands with upsert=True.

    Perhaps something like:

    my_updates = [
        {"attr_name": "a", "values": [
            {"value": "1", "embedding": [1,2,3]},
            {"value": "5", "embedding": [5,6,7]},
            {"value": "6", "embedding": [6,7,8]},
        ]},
        {"attr_name": "c", "values": [
            {"value": "6", "embedding": [6,7,8]},
            {"value": "7", "embedding": [7,8,9]},
        ]},
    ]
    
    for update in my_updates:
        update_result = collection.update_one(
            filter={"attr_name": update["attr_name"]},
            update={"$addToSet": {"values": {"$each": update["values"]}}},
            upsert=True
        )
    

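    N.B.: $addToSet treats an array element as a duplicate only when the whole subdocument matches (same fields, same values, same field order), so an entry with an existing value but a different embedding would still be appended. You may also want to inspect the properties of update_result (for example matched_count, modified_count and upserted_id) to see what each call did.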
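
The result inspection mentioned in the N.B. could look roughly like the sketch below. It is only an illustration: upsert_values and FakeCollection are hypothetical names, not part of pymongo. FakeCollection is a tiny in-memory stand-in that models just this one update shape, so the example runs without a server. With a real database you would pass the pymongo collection instead, and upserted_id would be an ObjectId rather than the attr_name string used here.

```python
from types import SimpleNamespace


class FakeCollection:
    """Hypothetical in-memory stand-in for a pymongo collection.

    Models only update_one with an {"attr_name": ...} filter and
    $addToSet/$each on "values". Not part of pymongo.
    """

    def __init__(self, docs=()):
        self.docs = {d["attr_name"]: list(d["values"]) for d in docs}

    def update_one(self, filter, update, upsert=False):
        name = filter["attr_name"]
        new_values = update["$addToSet"]["values"]["$each"]
        matched = name in self.docs
        if not matched and not upsert:
            return SimpleNamespace(matched_count=0, modified_count=0, upserted_id=None)
        values = self.docs.setdefault(name, [])
        before = len(values)
        for v in new_values:
            if v not in values:  # whole-element comparison, as $addToSet does
                values.append(v)
        return SimpleNamespace(
            matched_count=int(matched),
            modified_count=int(matched and len(values) > before),
            upserted_id=None if matched else name,  # stand-in for an ObjectId
        )


def upsert_values(collection, updates):
    """Run one upsert per entry and tally what each call did."""
    summary = {"inserted": 0, "modified": 0, "unchanged": 0}
    for update in updates:
        result = collection.update_one(
            {"attr_name": update["attr_name"]},
            {"$addToSet": {"values": {"$each": update["values"]}}},
            upsert=True,
        )
        if result.upserted_id is not None:
            summary["inserted"] += 1   # no document matched, a new one was created
        elif result.modified_count:
            summary["modified"] += 1   # at least one new value was appended
        else:
            summary["unchanged"] += 1  # every value was already present
    return summary


existing = [
    {"attr_name": "a", "values": [
        {"value": "1", "embedding": [1, 2, 3]},
        {"value": "2", "embedding": [2, 3, 4]},
    ]},
    {"attr_name": "b", "values": [
        {"value": "1", "embedding": [1, 2, 3]},
        {"value": "4", "embedding": [4, 5, 6]},
    ]},
]
my_updates = [
    {"attr_name": "a", "values": [
        {"value": "1", "embedding": [1, 2, 3]},
        {"value": "5", "embedding": [5, 6, 7]},
        {"value": "6", "embedding": [6, 7, 8]},
    ]},
    {"attr_name": "c", "values": [
        {"value": "6", "embedding": [6, 7, 8]},
        {"value": "7", "embedding": [7, 8, 9]},
    ]},
]

collection = FakeCollection(existing)
summary = upsert_values(collection, my_updates)
print(summary)  # {'inserted': 1, 'modified': 1, 'unchanged': 0}
```

Running the same updates a second time should report both entries as unchanged, which is a quick way to confirm that the upsert is idempotent for identical subdocuments.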