
Python regex for mixed alphanumeric data and special characters


I am trying to write a single one-line regex that handles the following cases.

Examples:

Table 1-2: this is a sample text 2 and some hyphen - (abbreviation)

Table 1: this is a sample text 2 and some hyphen - (abbreviation)

Table 1 this is a sample text 2 and some hyphen - (abbreviation)

Table 1-2-1: this is a sample text 2 and some hyphen - (abbreviation)

Similarly:

Figure 1-2: this is a sample text 2 and some hyphen - (abbreviation)

Figure 1: this is a sample text 2 and some hyphen - (abbreviation)

Figure 1 this is a sample text 2 and some hyphen - (abbreviation)

Figure 1-2-1: this is a sample text 2 and some hyphen - (abbreviation)

I tried the following approach:

import re
re.sub(r'^Table ()|([0-9]+[-][0-9]+|[0-9]+|[0-9 ]+)', " ", text_to_search)
re.sub(r'^Figure ()|([0-9]+[-][0-9]+|[0-9]+|[0-9 ]+)', " ", text_to_search)

This is not a very good approach, and I would also like to remove the dependency on the literal words Table and Figure. Any suggestions are welcome. Thanks in advance for your time.

Expected Output:

['Table', '1-2:', 'this is a sample text 2 and some hyphen - (abbreviation)']
['Table', '1:', 'this is a sample text 2 and some hyphen - (abbreviation)']
['Table', '1', 'this is a sample text 2 and some hyphen - (abbreviation)']
['Table', '1-2-1:', 'this is a sample text 2 and some hyphen - (abbreviation)']

['Figure', '1-2:', 'this is a sample text 2 and some hyphen - (abbreviation)']
['Figure', '1:', 'this is a sample text 2 and some hyphen - (abbreviation)']
['Figure', '1', 'this is a sample text 2 and some hyphen - (abbreviation)']
['Figure', '1-2-1:', 'this is a sample text 2 and some hyphen - (abbreviation)']

I am looking for the value available at list[2]


Solution

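  • This will match everything listed in your "Expected Output":

    import re

    # re.MULTILINE makes ^ and $ match at every line, so text_to_search may hold several captions
    pattern = re.compile(r'^(\w+)\s([-0-9]+:?)\s(.*\))$', re.MULTILINE)
    matches = pattern.findall(text_to_search)  # list of (label, number, text) tuples
    print(matches)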
    

    However, if what you really want is only ['Table', '1', 'this is a sample text 2 and some hyphen - (abbreviation)'] or ['Figure', '1', 'this is a sample text 2 and some hyphen - (abbreviation)']

    (I'm guessing this is what "I am looking for the value available at list[2]" means)

    then this pattern should work:

    pattern = re.compile(r'^(\w+)\s(\d+)\s(.*\))$')
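To see how the two patterns differ, here is a minimal sketch that runs both against a few of the sample lines from the question. The names SAMPLE, any_label, plain_number, all_rows, plain_rows and descriptions are made up for this illustration, and the sketch assumes the captions sit one per line in a single string, which is why it passes re.MULTILINE.

```python
import re

# Sample lines copied from the question; SAMPLE is a hypothetical name for this sketch.
SAMPLE = """\
Table 1-2: this is a sample text 2 and some hyphen - (abbreviation)
Table 1: this is a sample text 2 and some hyphen - (abbreviation)
Table 1 this is a sample text 2 and some hyphen - (abbreviation)
Figure 1-2-1: this is a sample text 2 and some hyphen - (abbreviation)
"""

# re.MULTILINE lets ^ and $ anchor at every line, not only at the ends of the string.
any_label = re.compile(r'^(\w+)\s([-0-9]+:?)\s(.*\))$', re.MULTILINE)
plain_number = re.compile(r'^(\w+)\s(\d+)\s(.*\))$', re.MULTILINE)

# findall returns one (label, number, text) tuple per matching line.
all_rows = any_label.findall(SAMPLE)
# The second pattern only accepts a bare number followed by whitespace,
# so lines such as "Table 1-2:" or "Table 1:" are skipped.
plain_rows = plain_number.findall(SAMPLE)

# Index 2 of each tuple is the caption text after the label and number.
descriptions = [row[2] for row in all_rows]

print(all_rows)
print(plain_rows)
print(descriptions)
```

Because the label is captured with \w+ rather than a literal word, neither pattern depends on "Table" or "Figure" specifically. Note that findall returns tuples rather than lists; wrap each row in list(...) if the exact shape from the expected output matters.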