python nltk data-extraction information-extraction

Natural language processing - extracting data


I need help processing unstructured day-trading, swing-trading, and investment recommendations. The unstructured data is stored in a CSV file.

Here are three sample paragraphs from which data needs to be extracted:

Chandan Taparia of Anand Rathi has a buy call on Coal India Ltd. with an intra-day target price of Rs 338 . The current market price of Coal India Ltd. is 325.15 . Chandan Taparia recommended to keep stop loss at Rs 318 .

Kotak Securities Limited has a buy call on Engineers India Ltd. with a target price of Rs 335 .The current market price of Engineers India Ltd. is Rs 266.05 The analyst gave a year for Engineers India Ltd. price to reach the defined target. Engineers India enjoys a healthy market share in the Hydrocarbon consultancy segment. It enjoys a prolific relationship with few of the major oil & gas companies like HPCL, BPCL, ONGC and IOC. The company is well poised to benefit from a recovery in the infrastructure spending in the hydrocarbon sector.

Independent analyst Kunal Bothra has a sell call on Ceat Ltd. with a target price of Rs 1150 .The current market price of Ceat Ltd. is Rs 1199.6 The time period given by the analyst is 1-3 days when Ceat Ltd. price can reach the defined target. Kunal Bothra maintained stop loss at Rs 1240.

It's been a challenge to extract four pieces of information from these paragraphs. Each recommendation is framed differently, but essentially contains:

  1. Target Price
  2. Stop Loss Price
  3. Current Price
  4. Duration

Not every recommendation contains all four fields, but every recommendation has at least a Target Price.

I tried using regular expressions, but without much success. Can anyone guide me on how to extract this information, perhaps using nltk?

Here is the code I have so far for cleaning the data:

import pandas as pd
import re

# etanalysis_final.csv has 4 columns:
# - 0th column (DATE): date and time
# - 1st column (C1): a short hint such as
#   'Sell  Ceat Ltd.  target Rs  1150  :   Kunal Bothra,Sell  Ceat Ltd.  at a price target of Rs  1150  and a stoploss at Rs  1240  from entry point'.
#   Not all hints look the same, but I can rely on it for the recommender, Buy or Sell, and the stock.
# - 4th column (C4): the detailed recommendation

df = pd.read_csv('etanalysis_final.csv',encoding='ISO-8859-1')
df.DATE = pd.to_datetime(df.DATE)
df.dropna(inplace=True)
df['RECBY'] = df['C1'].apply(lambda x: re.split(r':|\x96', x)[-1].strip())
df['ACT'] = df['C1'].apply(lambda x: x.split()[0].strip())
df['STK'] = df['C1'].apply(lambda x: re.split(r'\.|,|:| target| has| and|Buy|Sell| with', x)[1])
# Getting the target price - not always correct
df['TGT'] = df['C4'].apply(lambda x: re.findall(r'\d+.', x)[0])
# Getting the stop loss price - not always correct
df['STL'] = df['C4'].apply(lambda x: re.findall(r'\d+.\d+', x)[-1])

Solution

  • I found a solution:

    The code below covers only the extraction part of the question. It should be possible to improve this solution considerably with the fuzzywuzzy library.

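    # word_tokenize relies on NLTK's Punkt tokenizer data, fetched separately with nltk.download()
    from nltk import word_tokenize
    periods = ['year', "year's", 'day', 'days', "day's", 'month', "month's", 'week', "week's", 'intra-day', 'intraday']
    stop = ['target', 'current', 'stop', 'period', 'stoploss']

    def isfloat(token):
        # Helper used by extractinfo: True if the token parses as a number.
        try:
            float(token)
            return True
        except ValueError:
            return False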
    
    def extractinfo(row):
        # Lowercase everything so keyword lookups are case-insensitive,
        # and normalise the "intra day" spelling.
        row = row.lower().replace('intra day', 'intra-day')
        words = word_tokenize(row)
        # Keep only the keywords and the numbers, in their original order.
        tks = [w for w in words if w in stop or isfloat(w)]

        def number_after(keyword):
            # Number that directly follows the first occurrence of keyword, else ''.
            if keyword in tks:
                i = tks.index(keyword) + 1
                if i < len(tks) and isfloat(tks[i]):
                    return tks[i]
            return ''

        tgt = number_after('target')
        crt = number_after('current')
        stp = number_after('stop') or number_after('stoploss')

        # Duration: first period word after 'period' if that keyword occurs,
        # otherwise the first period word anywhere in the text.
        prd = ''
        start = words.index('period') + 1 if 'period' in words else 0
        for i in range(start, len(words)):
            if words[i] in periods:
                # Keep a leading number when there is one, e.g. "3 days".
                # A range such as "1-3" is not a float, so only the unit is returned.
                if i > 0 and isfloat(words[i - 1]):
                    prd = words[i - 1] + ' ' + words[i]
                else:
                    prd = words[i]
                break
        return (crt, tgt, stp, prd)
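
The keyword lookup above only finds exact tokens, so a misspelling such as "targt" is silently dropped. As a rough illustration of the fuzzy-matching idea mentioned earlier, here is a minimal sketch that uses the standard library's difflib as a stand-in for fuzzywuzzy. The helper name `normalise_keyword` and the 0.8 cutoff are assumptions made for this example, not part of the original solution.

```python
from difflib import get_close_matches

# Same keywords as the `stop` list in the solution.
KEYWORDS = ['target', 'current', 'stop', 'period', 'stoploss']

def normalise_keyword(token, cutoff=0.8):
    """Map a possibly misspelled token to the closest keyword, or None.

    `cutoff` is an assumed similarity threshold (0..1); tune it on real data.
    """
    matches = get_close_matches(token.lower(), KEYWORDS, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(normalise_keyword('targt'))   # close enough to 'target'
print(normalise_keyword('price'))   # no keyword is similar enough
```

Tokens could be passed through such a helper before filtering, so that near-miss spellings are still treated as keywords.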
    

    The solution is fairly self-explanatory; otherwise, please let me know.
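
Since the question started from regular expressions, here is a rough regex-only sketch for comparison. It assumes the phrasing seen in the three sample paragraphs ("target price of Rs N", "current market price of X is [Rs] N", "stop loss at Rs N") and will miss recommendations worded differently. The names `PRICE_PATTERNS`, `DURATION` and `extract_fields` are made up for this illustration.

```python
import re

NUM = r'(\d+(?:\.\d+)?)'  # integer or decimal number

# One pattern per price field; each captures the number in group 1.
PRICE_PATTERNS = {
    'target': re.compile(r'target(?:\s+price)?(?:\s+of)?\s+rs\.?\s*' + NUM, re.I),
    'current': re.compile(r'current\s+market\s+price\b.*?\bis\s+(?:rs\.?\s*)?' + NUM, re.I),
    'stop_loss': re.compile(r'stop[\s-]*loss\s+(?:at\s+)?(?:rs\.?\s*)?' + NUM, re.I),
}

# Duration: "intra-day", or a count/range/article followed by a time unit.
DURATION = re.compile(
    r'\bintra[\s-]?day\b'
    r'|\b(?:\d+(?:\s*-\s*\d+)?|an?)\s+(?:day|week|month|year)s?\b',
    re.I,
)

def extract_fields(text):
    """Return a dict with target, current, stop_loss and duration.

    Fields that cannot be found are set to None.
    """
    result = {}
    for name, pattern in PRICE_PATTERNS.items():
        m = pattern.search(text)
        result[name] = float(m.group(1)) if m else None
    m = DURATION.search(text)
    result['duration'] = m.group(0).lower() if m else None
    return result

sample = ("Independent analyst Kunal Bothra has a sell call on Ceat Ltd. with a "
          "target price of Rs 1150 .The current market price of Ceat Ltd. is Rs 1199.6 "
          "The time period given by the analyst is 1-3 days when Ceat Ltd. price can "
          "reach the defined target. Kunal Bothra maintained stop loss at Rs 1240.")
print(extract_fields(sample))
# {'target': 1150.0, 'current': 1199.6, 'stop_loss': 1240.0, 'duration': '1-3 days'}
```

Unlike the token-based version, this keeps ranges such as "1-3 days" intact, at the cost of being tied more closely to the exact wording.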