regex regex-group python-re regexp-replace

How to match all tab characters after the first letter or number?

I would like to use a regex to match all the tab characters that appear after the first letter or number on each line. I have a hierarchical text file in which each category level is marked with a tab (\t) character, as the sample text below shows.

After some research, I found a regex that almost does what I want:

The regular expression: \b[\t]{1,}\b

The problem:

This regex does not select the tabs that appear after a string that ends with a dot (1., 2., 3., 4. ...).

Does anyone know how to extend the regex to cover this case as well?

Here is part of my example text:

    BBHH    Balanço Patrimonial
            1.      ATIVO                                           Assets
                1.1     CIRCULANTE
                1.2     NÃO CIRCULANTE
            2.      PASSIVO                                         Liabilities and Equity
            3.      RECEITAS
            4.      CUSTOS E DESPESAS
                4.1     CUSTOS DE PRODUTOS VENDIDOS E SERVIÇOS
                        4.1.1       CUSTOS DE PRODUTOS VENDIDOS
                            4.1.1.1         CUSTOS DE PRODUTOS VENDIDOS

Solution

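Your pattern misses those tabs because \b requires a word character on one side of the boundary, and a dot is not a word character. Instead, you may use negative lookbehinds to make sure the matched tabs are not at the beginning of the line.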

Try the following pattern:

(?<!^)(?<!\t)\t+

Details:

(?<!^) - Not at the beginning of the line (with the multiline flag, ^ matches at the start of every line).
(?<!\t) - Not preceded by a tab character, so a match cannot start partway through the leading indentation.
\t+ - Match one or more tab characters (same as \t{1,}).

Python example:

import re

text = ("\tBBHH\tBalanço Patrimonial\n"
    "\t\t\t1.\t\tATIVO\t\t\t\t\t\t\t\t\t\t\tAssets\n"
    "\t\t\t\t1.1\t\tCIRCULANTE\n"
    "\t\t\t\t1.2\t\tNÃO CIRCULANTE\n"
    "\t\t\t2.\t\tPASSIVO\t\t\t\t\t\t\t\t\t\t\tLiabilities and Equity\n"
    "\t\t\t3.\t\tRECEITAS\n"
    "\t\t\t4.\t\tCUSTOS E DESPESAS\n"
    "\t\t\t\t4.1\t\tCUSTOS DE PRODUTOS VENDIDOS E SERVIÇOS\n"
    "\t\t\t\t\t\t4.1.1\t\tCUSTOS DE PRODUTOS VENDIDOS\n"
    "\t\t\t\t\t\t\t4.1.1.1\t\t\tCUSTOS DE PRODUTOS VENDIDOS\n")

# re.MULTILINE lets ^ match at the start of every line, not only the string.
matches = re.finditer(r"(?<!^)(?<!\t)\t+", text, re.MULTILINE)

for match_num, match in enumerate(matches, start=1):
    print(f"Match {match_num} was found at pos:{match.start()}.")

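If the goal is to replace the inner tabs rather than just locate them, the same pattern can be passed to re.sub. The following is a minimal sketch; the two-line sample string and the "|" marker are made up for illustration and are not part of the original file.

```python
import re

# Hypothetical two-line sample; real input would come from the text file.
sample = "\tBBHH\tBalanço Patrimonial\n\t\t\t1.\t\tATIVO\t\t\tAssets\n"

# Same pattern as above, compiled once with the multiline flag.
inner_tabs = re.compile(r"(?<!^)(?<!\t)\t+", re.MULTILINE)

# Replace each run of inner tabs with a visible marker.
# The leading indentation tabs are left untouched.
result = inner_tabs.sub("|", sample)
print(repr(result))
```

Each run of tabs is replaced as a unit, so the two tabs after "1." become a single marker.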

Addendum:

The pattern above will work as long as the indentation uses tabs only. If your text file might have a mix of tab and space characters used for indentation, you may use the following pattern instead:

\S+(\t+)

In that case, you can extract the matched tabs from group 1. For substitution, you may use (\S+)\t+ and replace with \1 to remove the tabs (or with \1x to replace the tabs with x).
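Both uses of the addendum pattern might look like this in Python. This is only a sketch: the sample string, with its second line indented by a mix of spaces and a tab, and the "|" marker are assumptions chosen for illustration.

```python
import re

# Hypothetical sample: the second line is indented with spaces and a tab.
sample = "\tBBHH\tBalanço Patrimonial\n  \t 1.\t\tATIVO\t\t\tAssets\n"

# Extraction: group 1 holds each run of tabs that follows non-whitespace text.
tab_runs = [m.group(1) for m in re.finditer(r"\S+(\t+)", sample)]

# Substitution: keep the preceding text (\1) and swap the tabs for a marker.
replaced = re.sub(r"(\S+)\t+", r"\1|", sample)

print(tab_runs)
print(repr(replaced))
```

Note that this pattern only finds tabs that directly follow a non-whitespace character, so a tab that comes right after a space is skipped. That is what keeps mixed indentation out of the matches.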