I'm trying to extract tokens, or parts of tokens, that are numeric or alphanumeric and at least 8 characters long from a text.
Example:
text = 'https://stackoverflow.com/questions/59800512/ 510557XXXXXX2302 Normal words 1601371803 NhLw6NlR0EksRWkLddEo7NiEvrg https://www.google.com/search?q=some+google+search&oq=some+google+search&aqs=chrome..69i57j0i22i30l8j0i390.4672j0j7&sourceid=chrome&ie=UTF-8'
The expected output would be :
59800512 510557XXXXXX2302 1601371803 NhLw6NlR0EksRWkLddEo7NiEvrg 69i57j0i22i30l8j0i390 4672j0j7
I tried the regular expression ((\d+)|([A-Za-z]+\d)[\dA-Za-z]*), based on the answer to "Python Alphanumeric Regex", and got the following results:
[match for match in re.findall(r"((\d+)|([A-Za-z]+\d)[\dA-Za-z]*)",text)]
Output :
[('59800512', '59800512', ''),
('510557', '510557', ''),
('XXXXXX2302', '', 'XXXXXX2'),
('1601371803', '1601371803', ''),
('NhLw6NlR0EksRWkLddEo7NiEvrg', '', 'NhLw6'),
('69', '69', ''),
('i57j0i22i30l8j0i390', '', 'i5'),
('4672', '4672', ''),
('j0j7', '', 'j0'),
('8', '8', '')]
I'm getting a tuple of matching groups for each matching token.
I could filter these tuples afterwards, but I'm trying to keep the code as efficient and Pythonic as possible.
Could anyone suggest a solution? It need not be based on regular expressions.
Thanks in advance
Edit : I expect alphanumeric values of length equal to or greater than 8
I came up with:
\b[A-Za-z]{,7}\d[A-Za-z\d]{7,}\b
See an online demo
\b - Word boundary.
[A-Za-z]{,7} - 0 to 7 letters.
\d - A single digit.
[A-Za-z\d]{7,} - 7 or more alphanumeric characters.
\b - Word boundary.
Some sample code:
import re
s = "https://stackoverflow.com/questions/59800512/ 510557XXXXXX2302 Normal words 1601371803 NhLw6NlR0EksRWkLddEo7NiEvrg https://www.google.com/search?q=some+google+search&oq=some+google+search&aqs=chrome..69i57j0i22i30l8j0i390.4672j0j7&sourceid=chrome&ie=UTF-8"
result = re.findall(r'\b[A-Za-z]{,7}\d[A-Za-z\d]{7,}\b', s)
print(result)
Prints:
['59800512', '510557XXXXXX2302', '1601371803', 'NhLw6NlR0EksRWkLddEo7NiEvrg', '69i57j0i22i30l8j0i390', '4672j0j7']
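One caveat with the pattern above: it requires at least 7 alphanumeric characters after the first digit, so a token such as abcdefg1 (8 characters, digit at the end) is not matched. If such tokens should count too, a lookahead variant may be worth considering. The sketch below is not part of the original approach; it assumes a qualifying token is a run of 8 or more ASCII letters and digits that contains at least one digit, and the name find_tokens is just for illustration.

```python
import re

# Assumption: a token qualifies if it is 8+ ASCII letters/digits
# and contains at least one digit anywhere in it.
# (?=[A-Za-z]*\d) checks that a digit follows an optional run of letters,
# [A-Za-z\d]{8,} then consumes the whole token.
PATTERN = re.compile(r'\b(?=[A-Za-z]*\d)[A-Za-z\d]{8,}\b')


def find_tokens(text):
    """Return all alphanumeric tokens of length >= 8 that contain a digit."""
    return PATTERN.findall(text)


s = "https://stackoverflow.com/questions/59800512/ 510557XXXXXX2302 Normal words 1601371803 NhLw6NlR0EksRWkLddEo7NiEvrg https://www.google.com/search?q=some+google+search&oq=some+google+search&aqs=chrome..69i57j0i22i30l8j0i390.4672j0j7&sourceid=chrome&ie=UTF-8"
print(find_tokens(s))           # same six tokens as above
print(find_tokens("abcdefg1"))  # ['abcdefg1']
print(re.findall(r'\b[A-Za-z]{,7}\d[A-Za-z\d]{7,}\b', "abcdefg1"))  # []
```

On the sample text both patterns give the same result; they only differ for tokens whose first digit appears late.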
You could opt to match case-insensitively with:
(?i)\b[a-z]{,7}\d[a-z\d]{7,}\b
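Since the question says the solution need not use regular expressions, here is a rough sketch without them. It groups consecutive characters by str.isalnum and keeps runs that are long enough and contain a digit. The function name alnum_tokens and the min_len parameter are made up for this example. Note the assumptions: str.isalnum also accepts non-ASCII letters and digits, and an underscore splits a token here, whereas \b treats an underscore as a word character.

```python
from itertools import groupby


def alnum_tokens(text, min_len=8):
    """Return alphanumeric runs of at least min_len chars containing a digit."""
    tokens = []
    # groupby yields consecutive runs of alphanumeric / non-alphanumeric chars.
    for is_alnum, chars in groupby(text, key=str.isalnum):
        if not is_alnum:
            continue  # skip separators such as '/', '.', '=', spaces
        token = ''.join(chars)
        if len(token) >= min_len and any(c.isdigit() for c in token):
            tokens.append(token)
    return tokens


s = "https://stackoverflow.com/questions/59800512/ 510557XXXXXX2302 Normal words 1601371803 NhLw6NlR0EksRWkLddEo7NiEvrg https://www.google.com/search?q=some+google+search&oq=some+google+search&aqs=chrome..69i57j0i22i30l8j0i390.4672j0j7&sourceid=chrome&ie=UTF-8"
print(alnum_tokens(s))
```

For the sample text this should return the same six tokens as the regex. Whether it is faster than the compiled pattern would need to be measured on real data.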