python nlp text-classification

How do I convert a list of sentence tokens (from sentence tokenization of a paragraph) into a numbered list of sentences or a DataFrame?


I read a PDF file using PDFMiner and extracted its text for NLP analysis. Since I will be dealing with research articles, I did some light cleaning by converting the paragraphs into a list of sentence tokens. My goal is to select the sentences that contain in-text citations for further analysis.

For instance, the data is in the following format:

['this is my new project' , 'I am very excited about this  (Abbasi, 2015)'] 

Expected output:

1.This is my new project
2.I am very excited about this (Abbasi, 2015)

Is it possible to convert this into a DataFrame so that I can add a label to each sentence?

Or would it be wiser to extract only the sentences with in-text citations?


Solution

  • To tell whether a sentence contains an in-text citation, you can use a regular expression, as follows:

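    import re

    # Example pattern for author-year citations such as "(Abbasi, 2015)"; adapt it to your citation style.
    pattern = r"\([^()]*\d{4}[a-z]?\)"
    indices = []
    for idx, sentence in enumerate(sentences):
        # re.search scans the whole sentence; re.match only checks the start
        if re.search(pattern, sentence):
            print("label (1)")
            indices.append(idx)
        else:
            print("label (0)")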
    

    When the pattern matches, append the index of that sentence to a list. Finally, load the sentences and their labels into a pandas DataFrame, which you can then export to CSV.

    NB: Since articles use different citation styles, check the regular expression syntax rules and customize the pattern to match your own data.
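
The steps above can be sketched end to end as follows. This is a minimal sketch, not a definitive implementation: the pattern is an assumption that only covers author-year citations such as "(Abbasi, 2015)", and the helper names `label_sentences`, `numbered_list`, and `to_dataframe` are hypothetical names introduced here for illustration.

```python
import re

# Assumed pattern for author-year citations such as "(Abbasi, 2015)".
# Other styles (e.g. numeric "[12]") need a different pattern.
CITATION_PATTERN = re.compile(r"\([^()]*\b\d{4}[a-z]?\)")


def label_sentences(sentences, pattern=CITATION_PATTERN):
    """Return one row per sentence: a 1-based number, the text, and a 0/1 label."""
    rows = []
    for number, sentence in enumerate(sentences, start=1):
        text = " ".join(sentence.split())  # collapse repeated whitespace
        has_citation = int(bool(pattern.search(text)))
        rows.append({"number": number, "sentence": text, "has_citation": has_citation})
    return rows


def numbered_list(rows):
    """Format the rows as '1. first sentence' lines."""
    return "\n".join(f"{row['number']}. {row['sentence']}" for row in rows)


def to_dataframe(rows):
    """Build a pandas DataFrame from the rows (requires pandas to be installed)."""
    import pandas as pd  # imported lazily so the helpers above work without pandas

    return pd.DataFrame(rows, columns=["number", "sentence", "has_citation"])


sentences = ["this is my new project", "I am very excited about this  (Abbasi, 2015)"]
rows = label_sentences(sentences)
print(numbered_list(rows))
```

With pandas installed, `df = to_dataframe(rows)` gives one row per sentence, `df[df["has_citation"] == 1]` keeps only the sentences with in-text citations, and `df.to_csv("sentences.csv", index=False)` writes the labelled table to a file (the file name is only an example). Keeping every sentence with a 0/1 label, rather than extracting only the cited ones up front, leaves both options open for later analysis.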