python · apache-spark · kedro

Define column names when reading a Spark dataset in Kedro


With Kedro, how can I define the column names when reading a `spark.SparkDataSet`? Below is my catalog.yaml.

user-playlists: 
  type: spark.SparkDataSet
  file_format: csv
  filepath: data/01_raw/lastfm-dataset-1K/userid-timestamp-artid-artname-traid-traname.tsv
  load_args:
    sep: "\t"
    header: False
#    schema:
#      filepath: conf/base/playlists-schema.json
  save_args:
    index: False

I have been trying to use the following schema, but it doesn't seem to be accepted (the dataset fails with a `Please provide a valid JSON-serialised 'pyspark.sql.types.StructType'.` error):

{
  "fields": [
    {"name": "userid", "type": "string", "nullable": true},
    {"name": "timestamp", "type": "string", "nullable": true},
    {"name": "artid", "type": "string", "nullable": true},
    {"name": "artname", "type": "string", "nullable": true},
    {"name": "traid", "type": "string", "nullable": true},
    {"name": "traname", "type": "string", "nullable": true}
  ],
  "type": "struct"
}

Solution

  • This works; besides the renamed artist columns, the visible difference from the schema above is that every field carries a `metadata` object, which PySpark's `StructType.fromJson` expects to be present:

    {"fields":[
      {"metadata":{},"name":"userid","nullable":true,"type":"string"},
      {"metadata":{},"name":"timestamp","nullable":true,"type":"string"},
      {"metadata":{},"name":"artistid","nullable":true,"type":"string"},
      {"metadata":{},"name":"artistname","nullable":true,"type":"string"},
      {"metadata":{},"name":"traid","nullable":true,"type":"string"},
      {"metadata":{},"name":"traname","nullable":true,"type":"string"}
    ],"type":"struct"}