Change mixpanel Json data with different delimiters to dataframe

I'm having trouble processing the data I scraped from Mixpanel regarding the "ProductDetails" event. The data is structured as shown below:


[{'event': 'ProductDetails',
  'properties': {'time': 1682992328,
   'distinct_id': '7328f204-92b8-cef5eacc8f6f',
   '$app_build_number': '57',
   '$app_version_string': '1.11.1',
    '$screen_width': 1080,
   'eventName': 'ProductDetails',
   'mp_country_code': 'US',
   'mp_processing_time_ms': 1683017566066,
   'timestamp': '1683017528230',
   '{ProductName=FL 100 TABS., ProductId=9a6cfdde-e6bf34b669ec, source=list}': ''}},
 {'event': 'ProductDetails',
  'properties': {'time': 1683723803,
   'distinct_id': 'bcbf4f42--ee1567f4a14e',
   '$app_build_number': '57',
   '$app_release': '57',
   '$screen_height': 2316,
   'eventName': 'ProductDetails',
   'mp_country_code': 'US',
   'mp_processing_time_ms': 1683749007744,
   'timestamp': '1683749003851',
   '{ProductName=ADVANCE 24 TABLETS, ProductId=a6b60bcd-73ed2321b8af, source=list}': ''}},
{'event': 'ProductDetails',
  'properties': {'time': 1683814991,
   'distinct_id': '66bfd1d5--9fb70ccc153a',
   '$app_build_number': '57',
   '$app_release': '57',
      '$screen_dpi': 480,
   '$screen_height': 2158,
   '$screen_width': 1080,
   '$wifi': True,
   'eventName': 'ProductDetails',
   'mp_country_code': 'CA',
   'mp_lib': 'android',
   'mp_processing_time_ms': 1683840191858,
   'timestamp': '1683843722541',
   '{ProductName= Tav 30 CAP., ProductId=a830804e-80e8a03d3dae, source=list}': ''}}]

I have attempted to convert this data into a CSV file using the provided code, but the headers are incorrect and some data is missing. I suspect the issue lies with the last part of each object:

'{ProductName= Tav 30 CAP., ProductId=a830804e-80e8a03d3dae, source=list}': ''

Here's the modified code that attempts to convert the data to CSV:

import csv
# Flatten the data
flattened_data = []
for item in data:
    properties = item['properties']
    flattened_item = {}
    for key, value in properties.items():
        if isinstance(value, dict):
            flattened_item.update(value)
        else:
            flattened_item[key] = value
    flattened_data.append(flattened_item)

# Extract column headers
column_names = set()
for item in flattened_data:
    column_names.update(item.keys())
column_names.discard('')

# Write data to CSV
with open('output.csv', 'w', newline='') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=column_names)
    writer.writeheader()
    writer.writerows(flattened_data)

I need assistance in properly converting the data to a CSV file with accurate headers and complete data.

Solution

While technically nested, it looks like you only have one sub-object one level deep to deal with. For data this simple (looking), I recommend a very direct and explicit approach.

For an object like:

[
    {
        "event": "Foo",
        "properties": {
            "time": 1,
            "distinct_id": "a",
            "$screen_width": 1000,
            "{ProductName=X, ProductId=xx, source=list}": ""
        }
    },
    {
        "event": "Foo",
        "properties": {
            "time": 2,
            "distinct_id": "b",
            "$screen_height": 2000,
            "{ProductName=Y, ProductId=yy, source=list}": ""
        }
    },
    {
        "event": "Bar",
        "properties": {
            "time": 3,
            "distinct_id": "c",
            "$screen_dpi": 100,
            "{ProductName= Z, ProductId=zz, source=list}": ""
        }
    }
]

"lift" the key-value pairs in the properties object up the main object:

with open('input.json', encoding='utf-8') as f:
    data = json.load(f)

# "Lift" properties up to level of event
for x in data:
    x.update(**x["properties"])
    del x["properties"]

x.update(**x["properties"]) means "take all the key-value pairs under properties and add them to the main (event) dict".

A an event like:

{
    "event": "Foo",
    "properties": {
        "time": 1,
        "distinct_id": "a",
        "$screen_width": 1000,
        "{ProductName=X, ProductId=xx, source=list}": ""
    }
}

becomes:

{
    "event": "Foo",
    "properties": {
        "time": 1,
        "distinct_id": "a",
        "$screen_width": 1000,
        "{ProductName=X, ProductId=xx, source=list}": ""
    }
    "time": 1,
    "distinct_id": "a",
    "$screen_width": 1000,
    "{ProductName=X, ProductId=xx, source=list}": ""
}

then delete the original properties key-value pair (del x["properties"]) and we have:

{
    "event": "Foo",
    "time": 1,
    "distinct_id": "a",
    "$screen_width": 1000,
    "{ProductName=X, ProductId=xx, source=list}": ""
}

Then move on to processing the weird "{ProductName=X, ProductId=xx, ...}": "" key-value pairs. I've put a few checks in to make sure the string basically conforms to the parser's expectations:

# Parse "{ProductName=X, ProductId=xx, ...}" key into constituent key-value pairs
for i, x in enumerate(data, start=1):
    for key in list(x.keys()):
        if "ProductName" not in key:
            continue

        key_ = key  # copy to preserve original key for deleting at end

        if key_[0] != "{" or key_[-1] != "}":
            print(f'error: event {i}, expected "{key_}" to be bracketed with {{...}}')
            continue

        key_ = key_[1:-1]

        if key_.count("=") - key_.count(",") != 1:
            print(f'error: event {i}, expected "{key_}" to contain one more equal sign than comma')
            continue

        for kv_pair in key_.split(","):
            k, v = kv_pair.split("=")
            x[k.strip()] = v.strip()

        del x[key]

All that yields the list of dicts:

[
    {'event': 'Foo', 'time': 1, 'distinct_id': 'a', '$screen_width': 1000, 'ProductName': 'X', 'ProductId': 'xx', 'source': 'list'},
    {'event': 'Foo', 'time': 2, 'distinct_id': 'b', '$screen_height': 2000, 'ProductName': 'Y', 'ProductId': 'yy', 'source': 'list'},
    {'event': 'Bar', 'time': 3, 'distinct_id': 'c', '$screen_dpi': 100, 'ProductName': 'Z', 'ProductId': 'zz', 'source': 'list'},
]

Different dicts in that list have different keys, so we need one more loop to collect the final set of all keys:

final_keys = {}
for x in data:
    final_keys.update(**x)

print(list(final_keys.keys()))

prints:

['event', 'time', 'distinct_id', '$screen_width', 'ProductName', 'ProductId', 'source', '$screen_height', '$screen_dpi']

I used a dict to preserve the key ordering as best as possible. If you need a specific ordering, the easiest way would be to just hard-code a list of fieldnames like:

final_keys = ['event', 'time', 'distinct_id', '$screen_width', '$screen_height', ...]

We can use the csv module's DictWriter with its restval keyword to instruct it how to deal with any dicts that don't have all the keys (in my example, based on yours, no single dict has every key) and give it the entire dict of final_keys (no need to call its .keys() method):

with open("output.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=final_keys, restval="NULL")
    writer.writeheader()
    writer.writerows(data)

and that produces the following CSV:

| event | time | distinct_id | $screen_width | ProductName | ProductId | source | $screen_height | $screen_dpi |
|-------|------|-------------|---------------|-------------|-----------|--------|----------------|-------------|
| Foo   | 1    | a           | 1000          | X           | xx        | list   | NULL           | NULL        |
| Foo   | 2    | b           | NULL          | Y           | yy        | list   | 2000           | NULL        |
| Bar   | 3    | c           | NULL          | Z           | zz        | list   | NULL           | 100         |