amazon-web-services, etl, aws-glue

AWS Glue Crawler defines one schema per file


I have the following data

{
  "0": "x",
  "1": [
    [
      "x",
      {
        "app_instance_id": "x",
        "app_instance_time": "x",
        "page": {
          "url": "x"
        },
        "user_agent": "x",
        "timestamp": "x",
        "session_id": "x",
        "permanent_id": "x",
        "event_category": "x",
        "customer": "x",
        "referrer": {
          "url": "x"
        },
        "ip_address": "x"
      }
    ],
    [
      "x",
      {
        "app_instance_id": "x",
        "app_instance_time": "x",
        "page": {
          "url": "x"
        },
        "user_agent": "x",
        "timestamp": "x",
        "session_id": "x",
        "permanent_id": "x",
        "event_category": "x",
        "customer": "x",
        "referrer": {
          "url": "x"
        },
        "ip_address": "x"
      }
    ]
  ],
  "time": 1627978464738
}{
  "event": "x",
  "userId": "x",
  "badgeId": null,
  "levelId": null,
  "projectId": "x",
  "ua": "x",
  "key": "x",
  "requestMethod": "x",
  "endpoint": "x",
  "customerId": "x",
  "durationMs": 0,
  "responseCode": 200,
  "time": 1627978465804
}{
  "event": "x",
  "userId": "x",
  "badgeId": null,
  "levelId": null,
  "projectId": "x",
  "ua": "x",
  "key": "x",
  "requestMethod": "GET",
  "endpoint": "x",
  "customerId": "x",
  "durationMs": 0,
  "responseCode": 200,
  "time": 1627978465798
}{
  "event": null,
  "ua": "x",
  "browser.name": "Firefox",
  "browser.version": "87.0",
  "browser.major": "87",
  "engine.name": "Gecko",
  "engine.version": "87.0",
  "os.name": "Mac OS",
  "os.version": "10.15",
  "lineCount": 3,
  "data": 20,
  "carrier": "x",
  "spendingNow": 200,
  "client": "x",
  "time": 1619185462317
}{
  "event": null,
  "ua": "x",
  "browser.name": "Chrome",
  "browser.version": "90.0.4430.66",
  "browser.major": "90",
  "engine.name": "Blink",
  "engine.version": "90.0.4430.66",
  "os.name": "Android",
  "os.version": "10",
  "device.vendor": "Samsung",
  "device.model": "SM-G965F",
  "device.type": "mobile",
  "lineCount": 1,
  "data": 25,
  "carrier": "x",
  "spendingNow": 10,
  "client": "x",
  "time": 1619201845480
}

As you can see, the file contains JSON objects with different schemas. However, when I use the Glue crawler to define tables for my data, it creates a single table for the whole file, containing every column from all of the JSON objects (0, 1, time, event, userId, badgeId, etc.), as shown in the screenshot below.

What I want to do is tell the crawler to create a separate table for each schema, as it does for separate files. What can I do?


Solution

  • I don't think you can. A schema is meant to describe the structure of a table, which usually corresponds to a directory of one or more files. Assigning several schemas to a single file would make it impossible to browse the data in that file, so it wouldn't make sense.

    Your best option is to clean your data, or to split it into separate files (in separate paths), each with a consistent schema, if you really want the crawler to detect different schemas.
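
    As a rough illustration of the "separate files in separate paths" suggestion, here is a minimal Python sketch that reads a file of concatenated JSON objects like the one in the question, groups the records by their set of top-level keys, and writes each group as JSON Lines into its own folder. The function names, the `schema_N` folder names, and the choice of "exact top-level key set" as the definition of a schema are all assumptions made for this example; none of this is a Glue API. With that definition, records that differ only by optional fields (such as the `device.*` keys in the last sample record) land in different groups, so you may want a looser grouping rule for real data.

```python
import json
from collections import defaultdict
from pathlib import Path


def iter_json_objects(text):
    """Yield each top-level JSON value from a string of concatenated values."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it by hand.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


def group_by_schema(text):
    """Group records by their sorted top-level keys (an assumed notion of schema)."""
    groups = defaultdict(list)
    for obj in iter_json_objects(text):
        groups[tuple(sorted(obj))].append(obj)
    return dict(groups)


def write_groups(groups, out_dir):
    """Write each group as JSON Lines under out_dir/schema_<n>/records.json."""
    out_dir = Path(out_dir)
    paths = []
    for index, keys in enumerate(sorted(groups)):
        folder = out_dir / f"schema_{index}"  # hypothetical folder naming
        folder.mkdir(parents=True, exist_ok=True)
        path = folder / "records.json"
        with path.open("w", encoding="utf-8") as handle:
            for record in groups[keys]:
                handle.write(json.dumps(record) + "\n")
        paths.append(path)
    return paths
```

    You would then upload each `schema_N` folder to its own S3 prefix and point the crawler at those prefixes, on the assumption that it treats each one as a separate table, which is the behaviour the question already reports for separate files.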