I would like to use a regex to match all the tab characters that appear after the first letter or number on each line. I have a hierarchical text file in which each category level is marked with a TAB (\t) character.
After some research I found a regex that almost does what I want:
\b[\t]{1,}\b
The problem:
This regex does not select the tabs that appear after a string that ends with a dot (1., 2., 3., 4. ...).
Does anyone know how to extend the regex to cover this pattern as well?
Here is part of my example text:
BBHH Balanço Patrimonial
1. ATIVO Assets
1.1 CIRCULANTE
1.2 NÃO CIRCULANTE
2. PASSIVO Liabilities and Equity
3. RECEITAS
4. CUSTOS E DESPESAS
4.1 CUSTOS DE PRODUTOS VENDIDOS E SERVIÇOS
4.1.1 CUSTOS DE PRODUTOS VENDIDOS
4.1.1.1 CUSTOS DE PRODUTOS VENDIDOS
You may use negative lookbehinds to make sure the tabs are not at the beginning of the line.
Try the following pattern:
(?<!^)(?<!\t)\t+
Details:
(?<!^) - Not at the beginning of the line.
(?<!\t) - Not preceded by a tab character (avoid matching tabs following the one above).
\t+ - Match one or more tab characters (same as \t{1,}).
Python example:
import re

text = ("\tBBHH\tBalanço Patrimonial\n"
        "\t\t\t1.\t\tATIVO\t\t\t\t\t\t\t\t\t\t\tAssets\n"
        "\t\t\t\t1.1\t\tCIRCULANTE\n"
        "\t\t\t\t1.2\t\tNÃO CIRCULANTE\n"
        "\t\t\t2.\t\tPASSIVO\t\t\t\t\t\t\t\t\t\t\tLiabilities and Equity\n"
        "\t\t\t3.\t\tRECEITAS\n"
        "\t\t\t4.\t\tCUSTOS E DESPESAS\n"
        "\t\t\t\t4.1\t\tCUSTOS DE PRODUTOS VENDIDOS E SERVIÇOS\n"
        "\t\t\t\t\t\t4.1.1\t\tCUSTOS DE PRODUTOS VENDIDOS\n"
        "\t\t\t\t\t\t\t4.1.1.1\t\t\tCUSTOS DE PRODUTOS VENDIDOS\n")

# re.MULTILINE makes ^ match at the start of every line, not only the string.
matches = re.finditer(r"(?<!^)(?<!\t)\t+", text, re.MULTILINE)

for match_num, match in enumerate(matches, start=1):
    print("Match {match_num} was found at pos:{start}.".format(
        match_num=match_num, start=match.start()))
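To see why the pattern from the question misses the tabs after "1.", it may help to run both patterns on a single line. \b only matches between a word character and a non-word character, and both the dot and the tab are non-word characters, so there is no boundary between them. The sketch below is only an illustration; the sample line and the variable names are made up for this comparison.

```python
import re

# One sample line: indentation tabs, "1.", two tabs, "ATIVO", three tabs, "Assets".
line = "\t\t\t1.\t\tATIVO\t\t\tAssets"

# Pattern from the question: requires a word boundary on both sides of the tabs.
word_boundary = re.compile(r"\b[\t]{1,}\b")
# Pattern from this answer: any tab run that is not at the start of the line.
lookbehind = re.compile(r"(?<!^)(?<!\t)\t+", re.MULTILINE)

wb_spans = [m.span() for m in word_boundary.finditer(line)]
lb_spans = [m.span() for m in lookbehind.finditer(line)]

print(wb_spans)  # only the run between "ATIVO" and "Assets"
print(lb_spans)  # also includes the run right after "1."
```

The lookbehind version does not care what kind of character precedes the tabs, only that it is neither the start of the line nor another tab.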
Addendum:
The pattern above will work as long as the indentation uses tabs only. If your text file might have a mix of tab and space characters used for indentation, you may use the following pattern instead:
\S+(\t+)
And in that case, you can extract the matched tabs from group 1. Or for substitution, you may use (\S+)\t+ and replace with \1 to remove the tabs (or with \1x to replace the tabs with x).
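As a rough sketch of how the addendum patterns could be used in Python, the snippet below extracts the tab runs from group 1 and then performs both substitutions. The sample text, with its mix of spaces and tabs in the indentation, and the variable names are assumptions made for illustration.

```python
import re

# Indentation mixes spaces and tabs; the tabs after the text are the targets.
text = "  \t1.\t\tATIVO\t\t\tAssets\n \t 1.1\t\tCIRCULANTE\n"

# Group 1 holds each tab run that directly follows non-whitespace text.
tab_runs = [m.group(1) for m in re.finditer(r"\S+(\t+)", text)]

# Keep the preceding text (group 1 here) and drop the tabs after it.
without_tabs = re.sub(r"(\S+)\t+", r"\1", text)

# Same idea, but replace each tab run with the literal character "x".
with_marker = re.sub(r"(\S+)\t+", r"\1x", text)

print(tab_runs)
print(repr(without_tabs))
print(repr(with_marker))
```

Because \S+ must match at least one non-whitespace character before the tabs, the tabs inside the leading indentation are left untouched even when spaces surround them.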