python nlp stock

How would I detect a pattern in text?


I am creating a Python script that can understand human-entered text describing actions to take on the stock market. For example, all of these mean the same thing:

  • $EXAMPLE 12 Aug 21 $22.5 Call Average Price = $0.8
  • $EXAMPLE $22.5 CALLS EXP. 8/12 @ $0.8
  • $EXAMPLE $22.5c 8/12 @ .8 
  • $EXAMPLE 22.5c 8/12 @ .8
  • Lorem ipsum dolor sit amet, consectetur $EXAMPLE 22.5c 8/12 @ .8 adipiscing elit, sed do eiusmod

The components that I need to extract are the ticker ($EXAMPLE), the call's strike price ($22.5), the expiration (8/12 or August 12), and the average price ($0.8). The issue is that the formatting differs, as the examples above show. Sometimes the calls are denoted as 22.5c, and the average price may be written as Average Price = 0.8 or as @ 0.8. Also note that the string may have unrelated text before or after the parts I want to extract (as in the last example).

How should I approach this? Will machine learning be useful in this case since the formatting isn't the same every time?


Solution

  • This is not a trivial problem, and I assume there will be cases where two humans would give different answers for the same text, especially if you don't want to specify the allowed formats.

    I would recommend an approach similar to what PW1990 suggested, since it is the most predictable and maintainable solution. However, instead of having only one regexp per semantic structure, you could have a method that tries a set of regexps in turn and returns the first successful match for each property.
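
    A minimal sketch of that idea in Python, assuming only the formats shown in the question. The field names, the `PATTERNS` table, and the helpers `extract_field` and `parse_order` are hypothetical names chosen for illustration, and the patterns would need extending for real data. The expiration is returned as raw text because the year and the day/month order are ambiguous in these examples.

```python
import re

MONTHS = r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)"

# Each field has an ordered list of patterns; the first one that matches wins.
# These patterns are illustrative and cover only the question's examples.
PATTERNS = {
    "ticker": [
        re.compile(r"\$([A-Z]{1,10})\b"),                       # $EXAMPLE
    ],
    "strike": [
        # "$22.5 Call", "$22.5 CALLS", "22.5c"
        re.compile(r"\$?(\d+(?:\.\d+)?)\s*(?:calls?\b|c\b)", re.IGNORECASE),
    ],
    "expiry": [
        re.compile(r"\b(\d{1,2}/\d{1,2}(?:/\d{2,4})?)\b"),      # 8/12
        re.compile(                                             # 12 Aug 21
            r"\b(\d{1,2}\s+" + MONTHS + r"[a-z]*(?:\s+\d{2,4})?)\b",
            re.IGNORECASE,
        ),
    ],
    "avg_price": [
        re.compile(r"@\s*\$?(\d*\.?\d+)"),                      # @ $0.8, @ .8
        re.compile(r"average\s+price\s*=\s*\$?(\d*\.?\d+)", re.IGNORECASE),
    ],
}


def extract_field(text, field):
    """Try each pattern for `field` in order; return the first capture or None."""
    for pattern in PATTERNS[field]:
        match = pattern.search(text)
        if match:
            return match.group(1)
    return None


def parse_order(text):
    """Extract all fields from free text; missing fields come back as None."""
    raw = {field: extract_field(text, field) for field in PATTERNS}
    return {
        "ticker": raw["ticker"],
        "strike": float(raw["strike"]) if raw["strike"] else None,
        "expiry": raw["expiry"],  # left as text; date parsing is a separate step
        "avg_price": float(raw["avg_price"]) if raw["avg_price"] else None,
    }


print(parse_order("Lorem ipsum $EXAMPLE 22.5c 8/12 @ .8 adipiscing elit"))
```

    Supporting a new format then means appending one pattern to the relevant list, and a `None` field is a signal to log the text for review.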

    If you have enough data, you could also use machine learning. For that, you need a dataset that maps each "text" to its {action}. In my experience, I would start with 100k to 1M data points (mappings). You could try a bidirectional RNN that outputs a class for each character, assigning it to one of [ticker, price, avg, date, type, other]. Then you could parse and validate each of the resulting substrings manually, which is straightforward if you know what you are parsing. Of course, you could also try an end-to-end approach where an RNN returns a reformatted string that you feed into a regexp. However, that significantly increases the complexity of the model and the amount of data you need.
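
    To make the per-character labelling concrete, here is a small sketch of the post-processing step only: turning one label per character into substrings to validate. No model is involved; the `labels` list is written by hand to stand in for what a tagger might output, and `spans_from_labels` is a hypothetical helper name.

```python
from itertools import groupby

LABELS = ["ticker", "price", "avg", "date", "type", "other"]


def spans_from_labels(text, labels):
    """Group consecutive characters that share a label into (label, substring) pairs.

    Characters labelled "other" are skipped.
    """
    if len(text) != len(labels):
        raise ValueError("need exactly one label per character")
    spans = []
    pos = 0
    for label, run in groupby(labels):
        length = len(list(run))
        if label != "other":
            spans.append((label, text[pos:pos + length]))
        pos += length
    return spans


text = "$EXAMPLE 22.5c 8/12 @ .8"
# Hand-written labels standing in for a model's per-character predictions.
labels = (
    ["ticker"] * 8 + ["other"]
    + ["price"] * 4 + ["type"] + ["other"]
    + ["date"] * 4 + ["other"] * 3
    + ["avg"] * 2
)

print(spans_from_labels(text, labels))
# [('ticker', '$EXAMPLE'), ('price', '22.5'), ('type', 'c'), ('date', '8/12'), ('avg', '.8')]
```

    Each extracted substring can then go through an ordinary check, such as `float()` for prices, before it is trusted.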

    There is a similar thread on Cross Validated (the Stack Exchange statistics site): https://stats.stackexchange.com/questions/35249/machine-learning-techniques-for-parsing-strings

    My personal recommendation is to avoid machine learning unless it is absolutely needed. ML is not a silver bullet, and it takes a lot of work to get right, so it would be a project of its own. Regular expressions are a much more straightforward solution for this problem, although I understand that it is tedious to hardcode all possible regexps. Just keep in mind that you are most likely aiming to handle 99% of the cases, not every conceivable format.