
Polars looping through the rows in a dataset


I am trying to loop through the rows of a Polars DataFrame using the following code:

import polars as pl

df = pl.DataFrame({
    "start_date": ["2020-01-02", "2020-01-03", "2020-01-04"],
    "Name": ["John", "Joe", "James"]
})

for row in df.rows():
    print(row)

('2020-01-02', 'John')
('2020-01-03', 'Joe')
('2020-01-04', 'James')

Is there a way to reference 'Name' by its column name rather than by positional index? In Pandas this would look something like:

import pandas as pd

df = pd.DataFrame({
    "start_date": ["2020-01-02", "2020-01-03", "2020-01-04"],
    "Name": ["John", "Joe", "James"]
})

for index, row in df.iterrows():
    print(row["Name"])

John
Joe
James

Solution

  • You can ask for named rows by passing named=True:

    for row in df.rows(named=True):
        print(row)
    

    Each row is then returned as a dict:

    {'start_date': '2020-01-02', 'Name': 'John'}
    {'start_date': '2020-01-03', 'Name': 'Joe'}
    {'start_date': '2020-01-04', 'Name': 'James'}
    

    You can then access the value with row['Name'].

    Note that:

    • earlier versions of Polars returned a namedtuple instead of a dict.
    • iter_rows is less memory intensive, because it returns an iterator instead of building the full list of rows up front.
    • overall, iterating over the data row by row is not recommended.

    Row iteration is not optimal as the underlying data is stored in columnar form; where possible, prefer export via one of the dedicated export/output methods.