python pandas dataframe schema avro

Python - generate an Avro schema for a CSV/XLS file


I have an XLS/CSV file that I'm reading into a pandas DataFrame. I want to generate an Avro schema from this DataFrame.

I'm new to Python as well as pandas. Kindly help.

import pandas as pd
data_frame = pd.read_excel(INPUT_PATH)

I want to generate an Avro schema from this DataFrame on the fly. Please help.


Solution

  • I found the solution. I extracted the data type of each field in the pandas DataFrame and saved it against the field name.

    I then mapped the pandas data types to Avro-compatible data types (for example, 'object' in pandas -> 'string' in Avro).

    Finally, I created a template of an Avro schema, substituted the field names and data types into the 'fields: []' part, and posted it to the registry.

    For instance:

        # schemaName: the record name; myDict: field name -> Avro type
        schema = {
            "type": "record",
            "name": schemaName,
            "fields": [
                {"name": key, "type": value} for (key, value) in myDict.items()
            ],
        }
    

    The fastavro library can then be used to parse this schema.
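
As a rough sketch of the steps above (not necessarily the exact code I used), the mapping and the schema template could look like the following. The `PANDAS_TO_AVRO` table, the `build_avro_schema` helper, the example column names, and the fallback to 'string' for unmapped dtypes are all assumptions made for illustration. The helper takes plain dtype names so that it runs without pandas installed; with a real DataFrame the names would come from `data_frame.dtypes`.

```python
import json

# Assumed pandas dtype name -> Avro primitive type mapping; extend as needed.
PANDAS_TO_AVRO = {
    "object": "string",
    "int64": "long",
    "int32": "int",
    "float64": "double",
    "float32": "float",
    "bool": "boolean",
}


def build_avro_schema(dtype_names, schema_name):
    """Build an Avro record schema (as a dict) from {column: dtype name}.

    With pandas, dtype_names could be obtained as:
        {col: str(dtype) for col, dtype in data_frame.dtypes.items()}
    """
    fields = [
        # Unmapped dtypes fall back to "string" (an assumption of this sketch).
        {"name": col, "type": PANDAS_TO_AVRO.get(dtype, "string")}
        for col, dtype in dtype_names.items()
    ]
    return {"type": "record", "name": schema_name, "fields": fields}


# Hypothetical columns standing in for data_frame.dtypes
example_dtypes = {"id": "int64", "name": "object", "price": "float64"}
schema = build_avro_schema(example_dtypes, "ExampleRecord")
print(json.dumps(schema, indent=2))
```

The resulting dict can be passed to `fastavro.parse_schema(schema)` to validate it. Columns that contain missing values would likely need a union type such as `["null", "string"]` instead of a bare primitive type.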