Given two files: sentences_detailed.csv contains 3 columns (ID, LANG and TEXT), and links.csv contains 2 columns (ID1 and ID2); the ID1/ID2 values are linked to the ID column in sentences_detailed.csv.
Sample of sentences_detailed.csv:
ID LANG TEXT
123 eng I want you to meet my parents.
456 eng I'm on your side.
789 eng I did not want to alarm you.
567 kor 부모님을 만나길 원해.
2352 jpn 私はあなたの側にいます。
And a sample of links.csv:
ID1 ID2
123 567
2352 456
How do I create a new DataFrame such that I get the respective LANG and TEXT columns from sentences_detailed.csv using the pairs of IDs in links.csv?
The desired output DataFrame should look something like:
ID1 ID2 LANG1 LANG2 TEXT1 TEXT2
123 567 eng kor I want you to meet my parents. 부모님을 만나길 원해.
2352 456 jpn eng 私はあなたの側にいます。I'm on your side.
I've tried this:
sent_details_csv = """ID LANG TEXT
123 eng I want you to meet my parents.
456 eng I'm on your side.
789 eng I did not want to alarm you.
567 kor 부모님을 만나길 원해.
2352 jpn 私はあなたの側にいます。"""
links_csv = """ID1 ID2
123 567
2352 456
"""
import pandas as pd
from io import StringIO

sent_details = pd.read_csv(StringIO(sent_details_csv), sep='\t')
links = pd.read_csv(StringIO(links_csv), sep='\t')
for idx, row in links.iterrows():
    src_idx, trg_idx = row['ID1'], row['ID2']
    try:
        src = sent_details[sent_details['ID'] == src_idx][['TEXT', 'LANG']].iloc[0]
        trg = sent_details[sent_details['ID'] == trg_idx][['TEXT', 'LANG']].iloc[0]
    except IndexError:
        continue
    print('\t'.join(map(str, [src_idx, trg_idx, src['LANG'], trg['LANG'], src['TEXT'], trg['TEXT']])))
The code above works for a small dataset, but the actual sentences_detailed.csv has 6,000,000+ rows and links.csv has ~13,000,000 rows. Filtering sent_details for the source and target IDs of every row in links becomes costly. There must be a better, more "pandas-ic" way to do this.
The full dataset is on https://www.kaggle.com/alvations/how-to-get-parallel-sentences-from-tatoeba
The answers from Dark, Zero and COLDSPEED are good, but when there are duplicates in sentences_detailed.csv and links.csv, they throw errors.
One fast way of doing this is to divide and merge, i.e.
one = df[df['ID'].isin(links['ID1'])].copy()
two = df[df['ID'].isin(links['ID2'])].copy()
two['NEW_ID'] = two['ID'].map(links.set_index('ID2')['ID1'])
one.merge(two,left_on='ID',right_on='NEW_ID',suffixes=('1', '2'))
ID1 LANG1 TEXT1 ID2 LANG2 TEXT2 NEW_ID
0 123 eng I want you to meet my parents. 567 kor 부모님을 만나길 원해. 123
1 2352 jpn 私はあなたの側にいます。 456 eng I'm on your side. 2352
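For reference, here is a self-contained version of the divide-and-merge snippet above, built on the sample data from the question (df in the snippet is assumed to be the sentences DataFrame):

```python
import pandas as pd
from io import StringIO

# Sample data from the question (tab-separated).
sent_details = pd.read_csv(StringIO(
    "ID\tLANG\tTEXT\n"
    "123\teng\tI want you to meet my parents.\n"
    "456\teng\tI'm on your side.\n"
    "789\teng\tI did not want to alarm you.\n"
    "567\tkor\t부모님을 만나길 원해.\n"
    "2352\tjpn\t私はあなたの側にいます。\n"), sep='\t')
links = pd.read_csv(StringIO("ID1\tID2\n123\t567\n2352\t456\n"), sep='\t')

# Split the sentences into source rows (IDs on the ID1 side) and
# target rows (IDs on the ID2 side).
one = sent_details[sent_details['ID'].isin(links['ID1'])].copy()
two = sent_details[sent_details['ID'].isin(links['ID2'])].copy()

# Attach each target's source ID, then merge the two halves back together.
two['NEW_ID'] = two['ID'].map(links.set_index('ID2')['ID1'])
out = one.merge(two, left_on='ID', right_on='NEW_ID', suffixes=('1', '2'))
print(out[['ID1', 'ID2', 'LANG1', 'LANG2', 'TEXT1', 'TEXT2']])
```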
If mapping doesn't work due to the presence of duplicate IDs, you can use a dual merge instead, i.e.
one.merge(two.merge(links, left_on='ID', right_on='ID2'), left_on='ID', right_on='ID1', suffixes=('1', '2'))
ID1 LANG1 TEXT1 ID2 LANG2 TEXT2 \
0 123 eng I want you to meet my parents. 567 kor 부모님을 만나길 원해.
1 2352 jpn 私はあなたの側にいます。 456 eng I'm on your side.
ID1 ID2
0 123 567
1 2352 456
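The two merges can also be run directly against links by renaming the sentence columns first, which avoids the intermediate one/two frames entirely. A sketch, assuming the sample column names ID/LANG/TEXT from the question:

```python
import pandas as pd

sent_details = pd.DataFrame({
    'ID': [123, 456, 789, 567, 2352],
    'LANG': ['eng', 'eng', 'eng', 'kor', 'jpn'],
    'TEXT': ["I want you to meet my parents.", "I'm on your side.",
             "I did not want to alarm you.", "부모님을 만나길 원해.",
             "私はあなたの側にいます。"]})
links = pd.DataFrame({'ID1': [123, 2352], 'ID2': [567, 456]})

# Merge the sentence table onto links twice, once per ID column,
# renaming ID/LANG/TEXT each time so the merged columns don't collide.
out = (links
       .merge(sent_details.rename(columns={'ID': 'ID1', 'LANG': 'LANG1', 'TEXT': 'TEXT1'}), on='ID1')
       .merge(sent_details.rename(columns={'ID': 'ID2', 'LANG': 'LANG2', 'TEXT': 'TEXT2'}), on='ID2'))
print(out)
```

Duplicate IDs on either side are handled by the usual inner-join semantics: each matching pair simply produces one output row.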
Based on the actual data, you can simply self-merge the merged DataFrame with the key columns swapped, since the column names are the same, i.e.
sec = sent_details.merge(links)
sec.merge(sec, left_on=['Sentence id', 'Translation id'], right_on=['Translation id', 'Sentence id'], suffixes=('1', '2'))
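Note the self-merge only pairs rows when links contains both directions of each pair, as the full Tatoeba dump does. A sketch with the question's sample column names (Sentence id/Translation id in the actual data correspond to ID1/ID2 here; making the links symmetric is this sketch's own step):

```python
import pandas as pd

sent_details = pd.DataFrame({
    'ID': [123, 456, 567, 2352],
    'LANG': ['eng', 'eng', 'kor', 'jpn'],
    'TEXT': ["I want you to meet my parents.", "I'm on your side.",
             "부모님을 만나길 원해.", "私はあなたの側にいます。"]})
links = pd.DataFrame({'ID1': [123, 2352], 'ID2': [567, 456]})

# Make the links symmetric, as in the full Tatoeba dump.
links_bidir = pd.concat(
    [links, links.rename(columns={'ID1': 'ID2', 'ID2': 'ID1'})],
    ignore_index=True)

# Attach each sentence to its link rows, then self-merge with the
# key columns swapped to pick up the translation side of each pair.
sec = sent_details.rename(columns={'ID': 'ID1'}).merge(links_bidir, on='ID1')
out = sec.merge(sec, left_on=['ID1', 'ID2'], right_on=['ID2', 'ID1'],
                suffixes=('1', '2'))
print(out)
```

Each translation pair appears twice in the result (once per direction), which matches how the full links.csv stores them.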