Search code examples
pythonpython-3.xpandasdataframeexternal-tables

How to read csv file and return external table query in python?


I am trying to read csv file and create a external table query by the dataframe. Please help me how can achieve my goal?

Example:

Sppose I have df like this-

df = pd.DataFrame({'A': [1,2,3], 'B': [True, False, False], 'C': ['a', 'b', 'c']})
print(df.dtypes)

A     int64
B      bool
C    object
dtype: object

I have to create external table based on the information given by dataframe-

CREATE EXTERNAL TABLE schema_name.table_name
(
A INT,
B VARCHAR(100),
C VARCHAR(100)
) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES 
(
'separatorChar' = ','
)
LOCATION 'location'
TABLE PROPERTIES ('skip.header.line.count'='1') ;

The conversion should be like this -

int64  - INT,
float64 - FLOAT,
object - VARCHAR(100),
bool - VARCHAR(10),
date - TIMESTAMP

Please help me to how can I create external table?


Solution

  • SQL query is normal string so you can format it as any other string.

    You could use for-loop with df.dtypes.items() to get name and dtype and convert to expecteds lines A INT and B VARCHAR(100) and C VARCHAR(100).

    And you can use dictonary to convert it

    convert = {
        "int64": "INT",
        "float64": "FLOAT",
        "object": "VARCHAR(100)",
        "bool": "VARCHAR(10)",
        "date": "TIMESTAMP",    
    }    
    

    Later you can use join with ",\n" to add comma in all lines except one line.

    And finally you can put it in string CREATE ... using f-string or .format()

    import pandas as pd
    
    df = pd.DataFrame({'A': [1,2,3], 'B': [True, False, False], 'C': ['a', 'b', 'c']})
    #print(df.dtypes)
    
    convert = {
        "int64": "INT",
        "float64": "FLOAT",
        "object": "VARCHAR(100)",
        "bool": "VARCHAR(10)",
        "date": "TIMESTAMP",    
    }    
        
    all_lines = []    
    for name, dtype in df.dtypes.items():
        dtype = str(dtype)
        line  = f'{name} {convert[dtype]}'
        #print(dtype, "=>", line)
        all_lines.append( line )
        
    text = ",\n".join(all_lines)
    
    print(f"""CREATE EXTERNAL TABLE schema_name.table_name
    (
    {text}
    ) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
    WITH SERDEPROPERTIES 
    (
    'separatorChar' = ','
    )
    LOCATION 'location'
    TABLE PROPERTIES ('skip.header.line.count'='1') ;""")
    

    Result:

    CREATE EXTERNAL TABLE schema_name.table_name
    (
    A INT,
    B VARCHAR(10),
    C VARCHAR(100)
    ) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
    WITH SERDEPROPERTIES 
    (
    'separatorChar' = ','
    )
    LOCATION 'location'
    TABLE PROPERTIES ('skip.header.line.count'='1') ;