regex, regex-group, python-re, regexp-replace

How to match all tab characters after first letter or number?


I would like to use a regex to match all the tab characters that appear after the first letter or number on each line. I have a hierarchical text file in which each category level is indented with tab (\t) characters.

After some research, I found a regex that almost does what I want:

The regular expression: \b[\t]{1,}\b

The problem:

This regex does not match the tabs that follow a string ending with a dot (1., 2., 3., 4., ...).


Does anyone know how to extend the regex so that it covers this case as well?

Here is part of my sample text:

    BBHH    Balanço Patrimonial
            1.      ATIVO                                           Assets
                1.1     CIRCULANTE
                1.2     NÃO CIRCULANTE
            2.      PASSIVO                                         Liabilities and Equity
            3.      RECEITAS
            4.      CUSTOS E DESPESAS
                4.1     CUSTOS DE PRODUTOS VENDIDOS E SERVIÇOS
                        4.1.1       CUSTOS DE PRODUTOS VENDIDOS
                            4.1.1.1         CUSTOS DE PRODUTOS VENDIDOS

Solution

  • You may use negative lookbehinds to make sure the matched tabs are not at the beginning of the line.

    Try the following pattern:

    (?<!^)(?<!\t)\t+
    


    Details:

    • (?<!^) - Not at the beginning of the line.
    • (?<!\t) - Not preceded by a tab character (this stops the match from starting partway through the leading run of tabs, which the first lookbehind alone would allow).
    • \t+ - Match one or more tab characters (same as \t{1,}).
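
As a side note on why the original pattern misses the tabs after "1.": \b only matches between a word character and a non-word character, and both "." and the tab are non-word characters, so there is no boundary between them. The sketch below compares the two patterns on two shortened sample lines (the variable names are only illustrative):

```python
import re

# Two shortened sample lines: one code ends in a digit, the other in a dot.
line_digit = "\t\t1.1\t\tCIRCULANTE"
line_dot = "\t1.\t\tATIVO"

word_boundary = re.compile(r"\b\t+\b")
lookbehind = re.compile(r"(?<!^)(?<!\t)\t+", re.MULTILINE)

# \b works after "1.1" (digit then tab) but not after "1." (dot then tab).
digit_wb = [m.span() for m in word_boundary.finditer(line_digit)]
dot_wb = [m.span() for m in word_boundary.finditer(line_dot)]

# The lookbehind pattern only cares that the tabs are not leading tabs.
digit_lb = [m.span() for m in lookbehind.finditer(line_digit)]
dot_lb = [m.span() for m in lookbehind.finditer(line_dot)]

print(digit_wb, dot_wb)  # the second list is empty
print(digit_lb, dot_lb)  # both lists contain one span
```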

    Python example:

    import re
    
    text = ("\tBBHH\tBalanço Patrimonial\n"
        "\t\t\t1.\t\tATIVO\t\t\t\t\t\t\t\t\t\t\tAssets\n"
        "\t\t\t\t1.1\t\tCIRCULANTE\n"
        "\t\t\t\t1.2\t\tNÃO CIRCULANTE\n"
        "\t\t\t2.\t\tPASSIVO\t\t\t\t\t\t\t\t\t\t\tLiabilities and Equity\n"
        "\t\t\t3.\t\tRECEITAS\n"
        "\t\t\t4.\t\tCUSTOS E DESPESAS\n"
        "\t\t\t\t4.1\t\tCUSTOS DE PRODUTOS VENDIDOS E SERVIÇOS\n"
        "\t\t\t\t\t\t4.1.1\t\tCUSTOS DE PRODUTOS VENDIDOS\n"
        "\t\t\t\t\t\t\t4.1.1.1\t\t\tCUSTOS DE PRODUTOS VENDIDOS\n")
    
    matches = re.finditer(r"(?<!^)(?<!\t)\t+", text, re.MULTILINE)
    
    for match_num, match in enumerate(matches, start=1):
        print(f"Match {match_num} was found at pos:{match.start()}.")
    



    Addendum:

    The pattern above will work as long as the indentation uses tabs only. If your text file might have a mix of tab and space characters used for indentation, you may use the following pattern instead:

    \S+(\t+)
    

    In that case, you can extract the matched tabs from group 1. For substitution, you may use (\S+)\t+ and replace with \1 to remove the tabs (or with \1x to replace the tabs with x).
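
A minimal sketch of both uses in Python, assuming a made-up sample whose indentation mixes spaces and tabs. The sample string and the ";" separator are only for illustration, and \g<1> is used as the unambiguous spelling of \1 in the replacement:

```python
import re

# Hypothetical sample: the indentation mixes spaces and tabs.
mixed = " \t1.\t\tATIVO\n\t  \t1.1\tCIRCULANTE\n"

# Extraction: group 1 holds each run of tabs that follows non-whitespace text.
tab_runs = [m.group(1) for m in re.finditer(r"\S+(\t+)", mixed)]

# Substitution: keep the preceding text (group 1) and replace the tabs with ";".
replaced = re.sub(r"(\S+)\t+", r"\g<1>;", mixed)

print(tab_runs)
print(repr(replaced))
```

Note that this pattern only finds tabs that directly follow a non-whitespace character, so a tab preceded by a space in the middle of a line would not be matched.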