How to remove text between two delimiters in Python


I am trying to remove all text between the [] brackets that follow the key "segmentation": in a JSON file. Please see the snippet from the file below for context.

 "annotations": [
        {
            "id": 1,
            "image_id": 1,
            "segmentation": [
                [
                    621.63,
                    1085.67,
                    621.63,
                    1344.71,
                    841.66,
                    1344.71,
                    841.66,
                    1085.67
                ]
            ],
            "iscrowd": 0,
            "bbox": [
                621.63,
                1085.67,
                220.02999999999997,
                259.03999999999996
            ],
            "area": 56996,
            "category_id": 1124044
        },
        {
            "id": 2,
            "image_id": 1,
            "segmentation": [
                [
                    887.62,
                    1355.7,
                    887.62,
                    1615.54,
                    1114.64,
                    1615.54,
                    1114.64,
                    1355.7
                ]
            ],
            "iscrowd": 0,
            "bbox": [
                887.62,
                1355.7,
                227.0200000000001,
                259.8399999999999
            ],
            "area": 58988,
            "category_id": 1124044
        },
        {
            "id": 3,
            "image_id": 1,
            "segmentation": [
                [
                    1157.61,
                    1411.84,
                    1157.61,
                    1661.63,
                    1404.89,
                    1661.63,
                    1404.89,
                    1411.84
                ]
            ],
            "iscrowd": 0,
            "bbox": [
                1157.61,
                1411.84,
                247.2800000000002,
                249.7900000000002
            ],
            "area": 61768,
            "category_id": 1124044
        },
        ........... and so on.....

Ultimately, I just want to delete all the text between the square brackets that follow the word segmentation. In other words, I want the output to look like this (for the first instance):

"annotations": [
            {
                "id": 1,
                "image_id": 1,
                "segmentation": [],
                "iscrowd": 0,
                "bbox": [
                    621.63,
                    1085.67,
                    220.02999999999997,
                    259.03999999999996
                ],
                "area": 56996,
                "category_id": 1124044
            },

I've tried the code below, but without any luck so far. Am I getting something wrong because of the newlines?

import re
f = open('samplfile.json')
text = f.read()
f.close()

clean = re.sub('"segmentation":(.*)\]', '', text)

print(clean)

f = open('cleanedfile.json', 'w')
f.write(clean)
f.close()

I appreciate that the exact positioning of the [s in the clean line may not be quite right, but this code isn't removing anything at the moment.


Solution

  • Python has a built-in json module for parsing and modifying JSON. A regular expression is likely to be fragile here and more trouble than it is worth.

    You can do the following:

    import json

    with open('samplfile.json') as input_file, open('output.json', 'w') as output_file:
        data = json.load(input_file)
        # Replace every annotation's segmentation list with an empty list.
        for annotation in data['annotations']:
            annotation['segmentation'] = []

        json.dump(data, output_file, indent=4)
    

    Then, output.json contains:

    {
        "annotations": [
            {
                "id": 1,
                "image_id": 1,
                "segmentation": [],
                "iscrowd": 0,
                "bbox": [
                    621.63,
                    1085.67,
                    220.02999999999997,
                    259.03999999999996
                ],
                "area": 56996,
                "category_id": 1124044
            },
            {
                "id": 2,
                "image_id": 1,
                "segmentation": [],
                "iscrowd": 0,
                "bbox": [
                    887.62,
                    1355.7,
                    227.0200000000001,
                    259.8399999999999
                ],
                "area": 58988,
                "category_id": 1124044
            },
            {
                "id": 3,
                "image_id": 1,
                "segmentation": [],
                "iscrowd": 0,
                "bbox": [
                    1157.61,
                    1411.84,
                    247.2800000000002,
                    249.7900000000002
                ],
                "area": 61768,
                "category_id": 1124044
            }
        ]
    }
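
On the question about newlines: yes, they are part of the problem. By default "." does not match a newline, and the line containing "segmentation": has no "]" on it, so the original pattern never matches. The sketch below reproduces that on a shortened, made-up copy of the file (the `text` string is an assumption, not the real data), shows why adding re.DOTALL alone removes far too much, and then tries a narrower pattern. That last pattern is only an illustration: it assumes "segmentation" always holds a list of flat lists, and it will not match any other shape. The json approach above remains the safer choice.

```python
import json
import re

# Hypothetical miniature of the file from the question (two annotations).
text = """{
    "annotations": [
        {
            "id": 1,
            "segmentation": [
                [
                    621.63,
                    1085.67
                ]
            ],
            "bbox": [
                621.63,
                1085.67
            ]
        },
        {
            "id": 2,
            "segmentation": [
                [
                    887.62,
                    1355.7
                ]
            ],
            "bbox": [
                887.62,
                1355.7
            ]
        }
    ]
}"""

# 1. The pattern from the question. "." stops at newlines, and the line
#    holding "segmentation": contains no "]", so nothing is replaced.
original = re.sub(r'"segmentation":(.*)\]', '', text)
unchanged = original == text

# 2. re.DOTALL lets "." cross newlines, but the greedy ".*" then runs from
#    the first "segmentation" to the last "]" in the whole text.
greedy = re.sub(r'"segmentation":(.*)\]', '', text, flags=re.DOTALL)

# 3. A narrower pattern: keep the key (group 1), then match "[", any number
#    of flat inner lists separated by commas/whitespace, and the closing "]".
pattern = re.compile(r'("segmentation":\s*)\[\s*(?:\[[^\[\]]*\][\s,]*)*\]')
clean = pattern.sub(r'\1[]', text)

# Parse the result to confirm it is still valid JSON.
data = json.loads(clean)

print(unchanged)                                          # True
print('"id": 2' in greedy)                                # False
print([a["segmentation"] for a in data["annotations"]])   # [[], []]
print([a["bbox"] for a in data["annotations"]])
```

Parsing the result with json.loads at the end is a cheap check that the substitution did not damage the file; the greedy variant in step 2 would fail that check.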