Search code examples
pythonpandascsvlinked-tables

Retrieving rows from a table using pairs of IDs from a separate table


Given two files,

  • sentences_detailed.csv contains 3 columns (ID, LANG and TEXT)
  • links.csv contains 2 columns (ID1 and ID2), the ID* is linked to the ID column in sentences_detailed.csv

Sample of sentences_detailed.csv:

ID LANG TEXT
123 eng I want you to meet my parents.
456 eng I'm on your side.
789 eng I did not want to alarm you. 
567 kor 부모님을 만나길 원해.
2352 jpn 私はあなたの側にいます。

And sample of links.csv:

ID1 ID2
123 567
2352 456

How do I create a new DataFrame such that I get the respective LANG and TEXT column from sentences_detailed.csv using the pairs of IDs in links.csv?

The desired output DataFrame should look something like:

ID1 ID2 LANG1 LANG2 TEXT1 TEXT2
123 567 eng kor I want you to meet my parents. 부모님을 만나길 원해.
2352 456 jpn eng 私はあなたの側にいます。I'm on your side.

I've tried this:

sent_details_csv = """ID    LANG    TEXT
123 eng I want you to meet my parents.
456 eng I'm on your side.
789 eng I did not want to alarm you. 
567 kor 부모님을 만나길 원해.
2352    jpn 私はあなたの側にいます。"""

links_csv = """ID1  ID2
123 567
2352    456
"""

from io import StringIO

sent_details = pd.read_csv(StringIO(sent_details_csv), sep='\t')
links = pd.read_csv(StringIO(links_csv), sep='\t')

for idx, row in links.iterrows():
    src_idx, trg_idx = row['ID1'], row['ID2']

    try:
        src = sent_details[sent_details['ID'] == src_idx][['TEXT', 'LANG']].iloc[0]
        trg = sent_details[sent_details['ID'] == trg_idx][['TEXT', 'LANG']].iloc[0]
    except: 
        continue

    print('\t'.join(map(str, [src_idx, trg_idx, src['LANG'], trg['LANG'], src['TEXT'], trg['TEXT']])))

The code above works for a small dataset but the actual sentences_detailed.csv is 6,000,000+ rows and links.csv is ~13,000,000 rows.

Trying to filter out sent_details given a source and target idx for each links row gets becomes costly.

There must be a better more "pandas-ic" way to do this.


The full dataset is on https://www.kaggle.com/alvations/how-to-get-parallel-sentences-from-tatoeba

The answers from Dark, Zero and COLDSPEED are good but when there're duplicates in the sentences_detailed.csv and links.csv , they throw some errors.


Solution

  • One fast way of doing to is by divide and merge i.e

    one = df[df['ID'].isin(links['ID1'])].copy()
    two = df[df['ID'].isin(links['ID2'])].copy()
    
    two['NEW_ID'] = two['ID'].map(links.set_index('ID2')['ID1'])
    
    one.merge(two,left_on='ID',right_on='NEW_ID',suffixes=('1', '2'))
    
         ID1 LANG1                         TEXT1       ID2  LANG2       TEXT2      NEW_ID  
    0   123    eng     I want you to meet my parents.  567    kor  부모님을 만나길 원해.     123  
    1  2352    jpn             私はあなたの側にいます。    456    eng  I'm on your side.    2352 
    

    Mapping didn't work due to the presence of duplicate IDs, So you can use dual merge i.e

    one.merge(two.merge(links,left_on='ID',right_on='ID2'),left_on='ID',right_on='ID1',suffixes=('1', '2')))
    
     ID1 LANG1                           TEXT1  ID2 LANG2              TEXT2  \
    0   123   eng  I want you to meet my parents.  567   kor       부모님을 만나길 원해.   
    1  2352   jpn                    私はあなたの側にいます。  456   eng  I'm on your side.   
    
        ID1  ID2  
    0   123  567  
    1  2352  456  
    

    Based on the actual data you can simply cross merge same dataframe since you have the same column names i.e

    sec = sent_details.merge(links)
    
    sec.merge(sec, left_on=['Sentence id','Translation id'],right_on=['Translation id','Sentence id'], suffixes=(1,2))