python pandas indexing record-linkage

How to write an index block using two columns in the Python recordlinkage library?


I want to build candidate index pairs on the condition that the values of two columns are equal in both of the compared databases. Can this be done with the Index class of recordlinkage?

    # dfg and dfm are databases that both contain the columns 'N_name' and 'N_cp'
    import recordlinkage as rl

    indexer_try = rl.Index()
    indexer_try.block('N_name','N_name','N_cp','N_cp')
    candidate_links = indexer_try.index(dfg, dfm)

I expected the class to create a MultiIndex containing the index pairs that match these criteria.

Instead I got: __init__() takes from 1 to 3 positional arguments but 5 were given


Solution

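  • Pass the column names as lists. block() accepts at most two positional arguments: the blocking column(s) of the first DataFrame, then the blocking column(s) of the second. Passing four separate column names is what raises the error. To block on both columns from the question at once, use indexer.block(['N_name', 'N_cp'], ['N_name', 'N_cp']).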

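    # Indexation step
    import recordlinkage as rl

    indexer = rl.Index()
    # Each block() call adds one blocking rule; index() returns the union
    # of the pairs produced by all rules.
    indexer.block(['N_name'], ['N_name'])  # 25k
    # Listing two columns in one call requires both of them to match.
    indexer.block(['N_address', 'N_cp'], ['N_address', 'N_cp'])  # 211k
    indexer.block('latlng', 'latlng')  # 320k
    candidate_links = indexer.index(dfg, dfm)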
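
To check what blocking on two columns should return, it can help to reproduce the idea in plain pandas. The following is only a sketch of the concept, not how recordlinkage implements it: an inner merge on both columns keeps a pair only when N_name and N_cp both agree. The toy rows and the index names dfg_id and dfm_id are made up for illustration; only the column names come from the question.

```python
import pandas as pd

# Hypothetical toy data; only the column names come from the question.
dfg = pd.DataFrame(
    {"N_name": ["ana", "bob", "cy"], "N_cp": ["08001", "28001", "41001"]},
    index=pd.Index(["g0", "g1", "g2"], name="dfg_id"),
)
dfm = pd.DataFrame(
    {"N_name": ["ana", "bob", "cy"], "N_cp": ["08001", "99999", "41001"]},
    index=pd.Index(["m0", "m1", "m2"], name="dfm_id"),
)

keys = ["N_name", "N_cp"]

# Inner merge on both key columns: a pair survives only if the name AND
# the postal code are equal in the two frames.
pairs = dfg.reset_index()[["dfg_id"] + keys].merge(
    dfm.reset_index()[["dfm_id"] + keys], on=keys, how="inner"
)

# Express the result as a MultiIndex of (dfg index, dfm index) pairs,
# the same shape of object that the question expects from the indexer.
candidate_links = pd.MultiIndex.from_frame(pairs[["dfg_id", "dfm_id"]])
print(candidate_links.tolist())
```

In this toy data only the "ana" and "cy" rows pair up; "bob" is dropped because his N_cp differs between the two frames. On real data, indexer.block(['N_name', 'N_cp'], ['N_name', 'N_cp']) is the call to use, and a merge like this one is a cheap way to sanity-check the number of pairs it returns.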