python regex filenames glob

Python regular expressions (regex), convert JSON to text file for parsing


I annotated some video frames with the VGG annotator, which exports the annotations as JSON, and I want to parse that file to extract the values I need (the x, y coordinates).
I have looked at other posts on this site, but nothing seems to match my case because the length of the filename changes: the frame number goes from 0 to 9, then 10 to 99, 100 to 999, and 1000 to 9999, gaining one digit each time.

I have tried importing glob and using wildcard ranges, single-character wildcards, and asterisks.

My code now:

#Edited 
while count < 1200:
    x = data[key]['regions']['0']['shape_attributes']['cx']
    y = data[key]['regions']['0']['shape_attributes']['cy']
    pts = (x, y)
    xy.append(pts)
    count += 1

# write() needs a string, so write one "x y" pair per line
with open("coordinates.txt", "w") as f:
    f.write("\n".join(f"{x} {y}" for x, y in xy))

JSON looks like:

        "shape_attributes": {
          "name": "point",
          "cx": 400,
          "cy": 121
        },
        "region_attributes": {}
      }
    }
  },
  "frame48.jpg78647": {
    "fileref": "",
    "size": 78647,
    "filename": "frame48.jpg",
    "base64_img_data": "",
    "file_attributes": {},
    "regions": {
      "0": {
        "shape_attributes": {
          "name": "point",
          "cx": 404,
          "cy": 114
        },
        "region_attributes": {}
      }
    }

Edit: Since I have no idea how to do this directly, I am going to convert the JSON to a .txt file and parse that to get my values.

I tried converting the JSON to a string and parsing the string as shown below. This did the job: it picks out only the x, y coordinates (3-digit ints) and appends them to a list. I am going to convert that into a list of (x, y) tuples and write it to a text file, to use later as labels for a neural network that tracks the coordinates of a tennis ball in tennis matches on TV.

xy.append(re.findall(r'\b\d\d\d\b', datatxt))
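
For what it's worth, here is a rough sketch of how that flat list of matches could be paired up into (x, y) tuples. The `datatxt` string here is a made-up stand-in for the real dump, and the pairing assumes every coordinate has exactly three digits and that no other standalone 3-digit number appears in the text. A coordinate such as 95 would be missed by `\b\d\d\d\b`, so the second pattern, which anchors on the "cx" and "cy" keys, is probably the safer of the two.

```python
import re

# Hypothetical text standing in for the JSON dumped to a string.
datatxt = '"cx": 400, "cy": 121 ... "size": 78647 ... "cx": 404, "cy": 114'

# Approach from the question: every standalone 3-digit number, in order.
flat = [int(n) for n in re.findall(r'\b\d\d\d\b', datatxt)]

# Pair consecutive values: even positions are x, odd positions are y.
pairs = list(zip(flat[0::2], flat[1::2]))

# Alternative that does not depend on the number of digits:
# capture the values that follow the "cx" and "cy" keys directly.
keyed = [(int(x), int(y))
         for x, y in re.findall(r'"cx":\s*(\d+),\s*"cy":\s*(\d+)', datatxt)]
```

Note that `xy.append(re.findall(...))` adds the whole result as a single nested list; `xy.extend(...)` would keep the list flat.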

Solution

  • You can't wildcard keys in a dictionary. Do you actually care about the keys at all? Are there entries you want to ignore, or are you happy to have any or all of them?

    If the keys are unimportant, take data.values(), which gives you the dictionaries themselves (in Python 3 it is a view rather than a list, so wrap it in list() before slicing), and go through the first 1,200 entries of that.

    If some keys are not in the format you give, loop through the keys and check that each one matches first:

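    import re

    for key in data.keys():
        # Raw string so \d reaches the regex engine; \. matches a literal dot
        m = re.match(r'frame(\d+)\.jpg(\d+)$', key)
        if not m: continue
        f1, f2 = map(int, m.groups())  # frame number, file size
        # Skip frames outside 0..1199 and sizes that are not five digits
        if f1<0 or f1>1199 or f2<10000 or f2>99999: continue
        x = data[key]['regions']['0']['shape_attributes']['cx']
        y = data[key]['regions']['0']['shape_attributes']['cy']
        ...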
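
If the keys really don't matter, the data.values() route could look something like the sketch below. It is only an illustration: the sample JSON is invented to mirror the structure in the question, and `frame_number` and `extract_points` are hypothetical helper names. It reads the frame number from each entry's "filename" field, so the changing number of digits stops being a problem, and it sorts numerically so frame 7 comes before frame 48.

```python
import json
import re
from itertools import islice

# Hypothetical sample mirroring the structure shown in the question.
raw = '''{
  "frame48.jpg78647": {"filename": "frame48.jpg", "size": 78647,
    "regions": {"0": {"shape_attributes": {"name": "point", "cx": 404, "cy": 114}}}},
  "frame7.jpg70001": {"filename": "frame7.jpg", "size": 70001,
    "regions": {"0": {"shape_attributes": {"name": "point", "cx": 398, "cy": 125}}}}
}'''
data = json.loads(raw)

def frame_number(entry):
    """Return the numeric frame index from the 'filename' field, or -1 if it does not match."""
    m = re.match(r'frame(\d+)\.jpg$', entry['filename'])
    return int(m.group(1)) if m else -1

def extract_points(data, limit=1200):
    """Collect (cx, cy) from region '0' of up to `limit` entries, in frame order."""
    entries = sorted(data.values(), key=frame_number)
    points = []
    for entry in islice(entries, limit):
        region = entry.get('regions', {}).get('0')
        if region is None:
            continue  # frame without an annotated point
        attrs = region['shape_attributes']
        points.append((attrs['cx'], attrs['cy']))
    return points

xy = extract_points(data)
```

Skipping frames that have no region "0" is a design choice made here for the sketch; if every label must line up with a frame index, you may prefer to record a placeholder instead.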