Read CSV into DataFusion DataFrame with Python


How can I read a CSV into a DataFusion DataFrame with datafusion-python?

Here's what I have so far:

import datafusion

ctx = datafusion.SessionContext()

I couldn't find any instructions in the docs.

I am using DataFusion v0.6.0.


Solution

  • There is some documentation here: https://github.com/apache/arrow-datafusion/blob/master/docs/source/python/index.rst

    Here is one of the examples. It registers the CSV file as a table, and ctx.sql then returns a DataFrame over it:

    import datafusion
    from datafusion import functions as f
    from datafusion import col
    import pyarrow
    
    # create a context
    ctx = datafusion.SessionContext()
    
    # register a CSV as a table named 'example'
    # (the asserts below imply columns a = 1, 2, 3 and b = 4, 5, 6)
    ctx.register_csv('example', 'example.csv')
    
    # create a new statement via SQL
    df = ctx.sql("SELECT a+b, a-b FROM example")
    
    # execute and collect the first (and only) batch
    result = df.collect()[0]
    
    assert result.column(0) == pyarrow.array([5, 7, 9])
    assert result.column(1) == pyarrow.array([-3, -3, -3])
    

    Work is under way to move this documentation to the datafusion-python repo (see https://github.com/apache/arrow-datafusion/issues/2866).
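
The example above reads an example.csv file that is not shown. Below is a minimal sketch for creating one with only the standard library. The file contents are an assumption inferred from the asserts (a + b = 5, 7, 9 and a - b = -3, -3, -3 give a = 1, 2, 3 and b = 4, 5, 6). It also assumes the CSV has a header row naming the columns a and b. The names rows, csv_path, sums and diffs are illustrative only.

```python
import csv
from pathlib import Path

# Assumed sample data, inferred from the asserts in the example above:
# a + b == [5, 7, 9] and a - b == [-3, -3, -3]
rows = [{"a": 1, "b": 4}, {"a": 2, "b": 5}, {"a": 3, "b": 6}]

# Write the file with a header row so the columns are named a and b
csv_path = Path("example.csv")
with csv_path.open("w", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=["a", "b"])
    writer.writeheader()
    writer.writerows(rows)

# Read the file back and recompute both expressions in plain Python,
# as a sanity check that the data matches what the SQL query expects
with csv_path.open(newline="") as fh:
    data = [(int(r["a"]), int(r["b"])) for r in csv.DictReader(fh)]

sums = [a + b for a, b in data]
diffs = [a - b for a, b in data]
```

With this file in the working directory, the register_csv call in the example should have matching data to query.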