Tags: python, pandas, nlp

How to separate part-of-speech tags from sentences into two columns, one with the raw sentence and one with only the POS tags


I have a Bangla part-of-speech dataset that looks like this:

রপ্তানি\JJ.n.n দ্রব্য\NC.0.0.n.n -\PU তাজা\JJ.n.n ও\CCD.n শুকনা\JJ.n.n ফল\NC.0.0.n.n ,\PU আফিম\NC.0.0.n.n ,\PU পশুচর্ম\NC.0.0.n.n ও\CCD.n পশম\NC.0.0.n.n এবং\CCD.n কার্পেট\NC.0.0.n.n ৷\PU
রাজা\NP.0.0.n.n মহানন্দ\NP.0.0.n.n রাজধানীতে\NC.0.loc.n.n তৈরি\NC.0.0.n.n করেছিল\VM.3.pst.pft.dcl.fin.n.n.n শিব\NP.0.0.n.n মন্দির\NC.0.0.n.n ও\CCD.n বৈষ্ণবদের\NC.0.gen.n.n মন্দির\NC.0.0.n.n ৷\PU
প্রতিটি\JQ.y.n.nnm বৌদ্ধ\JJ.n.n -\PU সন্ন্যাসী\NC.0.0.n.n ,\PU সন্ন্যাসিনী\NC.0.0.n.n বা\CCD.n গৃহস্থ\NC.0.0.n.n -\PU যেই\PRL.sg.0.n.n.y.n হোক\VM.3.prs.sim.sbj.fin.n.n.n না\CX.y কেন\CX.n ,\PU প্রাতে\NC.0.loc.n.n ,\PU দ্বিপ্রহরে\NC.0.loc.n.n ,\PU অপরাহ্নে\NC.0.loc.n.n ,\PU ও\CCD.n সন্ধ্যায়\NC.0.loc.n.n এই\DAB.0.n পবিত্র\JJ.n.n ত্রয়ীকে\NC.0.acc.n.n প্রণাম\NC.0.0.n.n ও\CCD.n ধ্যান\NC.0.0.n.n করে\VM.0.0.0.0.nfn.n.n.n ,\PU তাকে\PPR.sg.3.acc.n.n.n.n জপ\NC.0.0.n.n করে\VM.0.0.0.0.nfn.n.n.n এই\PPR.sg.3.0.n.n.n.n ব'লে\VM.0.0.0.0.nfn.n.n.n -\PU "\PU আমি\PPR.sg.1.0.n.n.n.n বুদ্ধের\NP.0.gen.n.n শরণাগত\JJ.n.n হলাম\VM.3.pst.sim.dcl.fin.n.n.n ৷\PU
বদাওনী\NP.0.0.n.n যে\CX.n খুব\JQ.n.n.nnm খুশি\JJ.n.n মনে\NC.0.loc.n.n অনুবাদের\NC.0.gen.n.n কাজে\NC.0.loc.n.n আত্মনিয়োগ\NC.0.0.n.n করেছিলেন\VM.3.pst.pft.dcl.fin.n.n.y তা\PPR.sg.3.0.n.n.n.n নয়\VM.3.prs.sim.dcl.fin.n.y.n ,\PU কারণ\CSB.n মহাভারতের\NP.0.gen.n.n ওই\DAB.sg.y অংশের\NC.0.gen.n.n বিষয়বস্তুর\NC.0.gen.n.n সঙ্গে\PP.0.n তাঁর\PPR.sg.3.gen.n.n.n.y গোঁড়া\JJ.n.n ধর্মবিশ্বাসের\NC.0.gen.n.n আদপে\CX.n কোন\JQ.n.n.nnm মিল\NC.0.0.n.n না\CX.y থাকায়\NV.loc.n.n তাঁর\PPR.sg.3.0.n.n.n.y কোনরকম\JQ.n.n.nnm মানসিক\JJ.n.n তৃপ্তি\NC.0.0.n.n হত\VM.3.pst.sim.hab.fin.n.n.n না\CX.y ,\PU সমস্ত\JQ.n.n.nnm পরিশ্রম\NC.0.0.n.n অর্থহীন\JJ.n.n মনে\NC.0.loc.n.n হত\VM.3.pst.sim.hab.fin.n.n.n ৷\PU

I have read the file into a DataFrame using pandas:

import pandas as pd

df = pd.read_csv('base_dataset.txt', sep='delimiter', encoding ='utf-8', header=None)

df

OUTPUT: 

0   রপ্তানি\JJ.n.n দ্রব্য\NC.0.0.n.n -\PU তাজা\JJ....
1   রাজা\NP.0.0.n.n মহানন্দ\NP.0.0.n.n রাজধানীতে\N...
2   প্রতিটি\JQ.y.n.nnm বৌদ্ধ\JJ.n.n -\PU সন্ন্যাসী...
3   বদাওনী\NP.0.0.n.n যে\CX.n খুব\JQ.n.n.nnm খুশি\...
4   কয়েক\JQ.n.n.nnm বিঘা\CCL.n ধানী\JJ.n.n জমিও\NC...
5   মাটি\NC.0.0.n.n থেকে\PP.0.n বড়জোর\JQ.n.n.nnm চ...
6   তাদের\PPR.pl.3.gen.n.n.n.n চা\NC.0.0.n.n -\PU ...
7   নকল\JJ.n.n ওষুধের\NC.0.gen.n.n কেরামতি\NC.0.0....

My question: I want to separate the part-of-speech tags from the sentences into two columns. Column 1 would hold the Bangla sentences and Column 2 the corresponding POS tags, so that I can feed them to a bidirectional LSTM for training.

Here is what the output should look like if I print the first row of both columns:
Column 1 Row 1:
রপ্তানি দ্রব্য - তাজা ও শুকনা ফল, আফিম, পশুচর্ম ও পশম এবং কার্পেট ৷

Column 2 Row 1:
JJ.n.n NC.0.0.n.n PU JJ.n.n CCD.n JJ.n.n NC.0.0.n.n PU NC.0.0.n.n PU NC.0.0.n.n CCD.n NC.0.0.n.n CCD.n NC.0.0.n.n PU

Update: If Bangla is hard to follow, could you show me the procedure for English instead? For example, consider a file containing thousands of English sentences like these:

People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN 

Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NN

What I basically want is to convert the raw dataset into a dataset with two columns: Column 1 containing just the plain sentences without the POS tags, and Column 2 containing the labels, that is, the POS tags corresponding to the sentences in Column 1.

I would like to do this for every sentence in the dataset, which I have attached here: POS Bangla Data-set

Please note that I want to keep punctuation such as commas, which carry the tag PU, since punctuation plays a role in determining the structure of the sentence.

Any help would be highly appreciated.


Solution

  • The following solves the English example. The forward slashes are replaced with backslashes to match the Bangla text provided.

    Sample data for this code:

    People\NNS continue\VBP to\TO inquire\VB the\DT reason\NN for\IN the\DT race\NN for\IN outer\JJ space\NN 
    
    Secretariat\NNP is\VBZ expected\VBN to\TO race\VB tomorrow\NN
    
    import pandas as pd
    import re
    
    # Import data: 'delimiter' is a separator string assumed never to occur
    # in the file, so every line lands in the single 'sentences' column
    df = pd.read_csv('base_dataset.txt', sep='delimiter', encoding='utf-8', header=None, names=['sentences'], engine='python')
    
    # Match token\TAG pairs directly, so the first token and the last tag are
    # captured without depending on surrounding whitespace. \S+ is greedy, so
    # each item is split at its last backslash.
    PAIR = re.compile(r"(\S+)\\(\S+)")
    
    # Build function to extract the POS tags
    def extractPos(s):
        return " ".join(tag for _, tag in PAIR.findall(str(s)))
    
    # Build function to extract the tokens
    def extractToken(s):
        return " ".join(token for token, _ in PAIR.findall(str(s)))
    
    # Add new columns to the existing dataframe
    df['posTag'] = df['sentences'].apply(extractPos)
    df['words'] = df['sentences'].apply(extractToken)
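
As a cross-check on the approach above, here is a minimal regex-free sketch that splits each whitespace-separated item at its last backslash and keeps tokens and tags aligned one-to-one. The names `split_tagged`, `raw`, and `rows` are hypothetical and only serve the illustration; the sample lines are held in memory rather than read from `base_dataset.txt`, and items without a tag are assumed safe to skip.

```python
def split_tagged(sentence, sep="\\"):
    """Split 'token<sep>TAG token<sep>TAG ...' into (tokens, tags) lists."""
    tokens, tags = [], []
    for item in str(sentence).split():
        # rpartition splits at the LAST separator, so a token that itself
        # contains the separator is kept intact
        token, found, tag = item.rpartition(sep)
        if not found:
            # Assumption: an item without a tag is skipped rather than guessed
            continue
        tokens.append(token)
        tags.append(tag)
    return tokens, tags

# The two English sample lines from above, with backslash separators
raw = [
    "People\\NNS continue\\VBP to\\TO inquire\\VB the\\DT reason\\NN "
    "for\\IN the\\DT race\\NN for\\IN outer\\JJ space\\NN ",
    "Secretariat\\NNP is\\VBZ expected\\VBN to\\TO race\\VB tomorrow\\NN",
]

# One (words, posTag) pair of space-joined strings per input line
rows = []
for line in raw:
    tokens, tags = split_tagged(line)
    rows.append((" ".join(tokens), " ".join(tags)))

for words, pos in rows:
    print(words)
    print(pos)
```

With a DataFrame, the same function can be applied row by row via `df['sentences'].apply(split_tagged)`. Keeping the list form (rather than the joined strings) may be more convenient for sequence models, since each token then lines up with exactly one tag, punctuation tagged `PU` included.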