I held off posting a question here because I wanted to try to solve this myself first. Unfortunately, after several nights of searching and reading regex articles and docs, I couldn't find an answer.
I have written a script that dumps the comments (annotations) from PDFs that were converted from AutoCAD. The problem is that each drawing uses a different pattern for the asset tag information.
Our convention is 99XX9999 (two digits, two letters, four digits).
Some drawings preserve that pattern, others don't. We can find things like 99XX(space)9999 or 99(space)XX(space)9999, etc.
I have solved that part of the problem, but there is another variant that I can't wrap my head around, shown below:
'''
{'/Border': [0, 0, 0], '/Contents': '0811', '/F': 64, '/NM': 'b4499c47-d2d2-4c03-b13c-ec3b7b332ec3', '/P': IndirectObject(52, 0), '/Rect': [714, 304, 698, 314], '/Subtype': '/Square', '/T': 'AutoCAD SHX Text'}-------------------------------------------------------------------------------- {'/Border': [0, 0, 0], '/Contents': '29', '/F': 64, '/NM': 'b518c663-42eb-4861-a717-a00118d61fd2', '/P': IndirectObject(52, 0), '/Rect': [206, 369, 195, 378], '/Subtype': '/Square', '/T': 'AutoCAD SHX Text'}-------------------------------------------------------------------------------- {'/Border': [0, 0, 0], '/Contents': 'HV', '/F': 64, '/NM': '1db0a6b1-6aee-4a2d-bacf-2a996680cdcb', '/P': IndirectObject(52, 0), '/Rect': [212, 369, 201, 378], '/Subtype': '/Square', '/T': 'AutoCAD SHX Text'}-------------------------------------------------------------------------------- {'/Border': [0, 0, 0], '/Contents': '0832', '/F': 64, '/NM': '0250033f-5bc0-4d46-879a-a1ae0147352d', '/P': IndirectObject(52, 0), '/Rect': [212, 365, 195, 374], '/Subtype': '/Square', '/T': 'AutoCAD SHX Text'}-------------------------------------------------------------------------------- {'/Border': [0, 0, 0], '/Contents': '29', '/F': 64, '/NM': '7b372206-1b18-4c9f-813d-4d73ca52be40', '/P': IndirectObject(52, 0), '/Rect': [140, 392, 129, 401], '/Subtype': '/Square', '/T': 'AutoCAD SHX Text'}-------------------------------------------------------------------------------- {'/Border': [0, 0, 0], '/Contents': 'HV', '/F': 64, '/NM': 'bdc97ccd-ee7c-406a-a06c-1649a5f5f712', '/P': IndirectObject(52, 0), '/Rect': [146, 392, 135, 401], '/Subtype': '/Square', '/T': 'AutoCAD SHX Text'}-------------------------------------------------------------------------------- {'/Border': [0, 0, 0], '/Contents': '0824', '/F': 64, '/NM': '9f434537-57a3-4c40-bb8b-a0df6ea087aa', '/P': IndirectObject(52, 0), '/Rect': [146, 388, 129, 397], '/Subtype': '/Square', '/T': 'AutoCAD SHX 
Text'}-------------------------------------------------------------------------------- {'/Border': [0, 0, 0], '/Contents': '%%C25', '/F': 64, '/NM': 'a2ace5cb-21be-4df7-b541-ce87a8ee81bc', '/P': IndirectObject(52, 0), '/Rect': [145, 379, 132, 388], '/Subtype': '/Square', '/T': 'AutoCAD SHX Text'}-------------------------------------------------------------------------------- {'/Border': [0, 0, 0], '/Contents': 'APOUTO RD', '/F': 64, '/NM': '948a363f-989d-4d00-afa1-a8bc45ad9729', '/P': IndirectObject(52, 0), '/Rect': [1162, 355, 1136, 364], '/Subtype': '/Square', '/T': 'AutoCAD SHX Text'}-------------------------------------------------------------------------------- {'/Border': [0, 0, 0], '/Contents': 'HORTICULTURAL WATER', '/F': 64, '/NM': '5a6cedb6-cdf0-4784-817f-c17f1de58f5a', '/P': IndirectObject(52, 0), '/Rect': [1171, 350, 1126, 358], '/Subtype': '/Square', '/T': 'AutoCAD SHX Text'}-------------------------------------------------------------------------------- {'/Border': [0, 0, 0], '/Contents': '%%C101', '/F': 64, '/NM': '2ee5947a-9062-415f-a242-f4b7b2a6441c', '/P': IndirectObject(52, 0), '/Rect': [767, 428, 758, 443], '/Subtype': '/Square', '/T': 'AutoCAD SHX Text'}--------------------------------------------------------------------------------
'''
Here the tag is split across three consecutive annotations: '/Contents': '29', then '/Contents': 'HV', then '/Contents': '0832'.
I have tried multiple variations of this approach, (?<='/Contents':\s')([0-9]{2})(?='), on regex101.com to no avail: I could only capture the first two digits.
By my logic, I need some way to chain multiple lookarounds, but I couldn't achieve that.
Eventually, the regex should cover the case above plus the ones I already handle with ([A-Z]?[0-9]{2,3}\s?[A-Z]{2,3}\s?[0-9]{3,4}).
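One way to read the "multiple lookarounds" idea is a single pattern with three capture groups that steps over the boundary between annotations instead of looking around it. The sketch below is only an illustration under assumptions: the `sample` string is a trimmed, hypothetical version of the dump, and `NEXT`, `SPLIT_TAG` and `find_split_tags` are names made up here. It also assumes the three pieces always sit in directly consecutive annotations.

```python
import re

# Hypothetical sample: three consecutive annotations, trimmed to a few keys.
sample = (
    "{'/Contents': '29', '/F': 64, '/T': 'AutoCAD SHX Text'}"
    "-------- "
    "{'/Contents': 'HV', '/F': 64, '/T': 'AutoCAD SHX Text'}"
    "-------- "
    "{'/Contents': '0832', '/F': 64, '/T': 'AutoCAD SHX Text'}"
)

# Bridge from the closing quote of one '/Contents' value to the opening
# quote of the next one, crossing exactly one "}----- {" boundary.
NEXT = r"'[^{}]*\}-*\s*\{[^{}]*'/Contents': '"

# Two digits, two uppercase letters, four digits, each in its own annotation.
SPLIT_TAG = re.compile(
    r"'/Contents': '(\d{2})" + NEXT + r"([A-Z]{2})" + NEXT + r"(\d{4})'"
)

def find_split_tags(text):
    """Return tags whose three parts are spread over consecutive annotations."""
    return ["".join(m.groups()) for m in SPLIT_TAG.finditer(text)]

print(find_split_tags(sample))
```

Because `[^{}]` also matches newlines, a line break inside the dump should not stop the match as long as no braces appear between the pieces.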
According to regex101.com, the line break occurs after the three parts of the pattern (see picture).
Just to be clear, I am talking about hundreds of drawings, each with hundreds of lines like that.
Any contribution is appreciated. If not a regex itself, at least a pointer in the right direction would be nice.
Try this one.
s="{'/Border': [0, 0, 0], '/Contents': '0811', '/F': 64, '/NM': 'b4499c47-d2d2-4c03-b13c-ec3b7b332ec3', '/P': IndirectObject(52, 0), '/Rect': [714, 304, 698, 314], '/Subtype': '/Square', '/T': 'AutoCAD SHX Text'}-------------------------------------------------------------------------------- {'/Border': [0, 0, 0], '/Contents': '29', '/F': 64, '/NM': 'b518c663-42eb-4861-a717-a00118d61fd2', '/P': IndirectObject(52, 0), '/Rect': [206, 369, 195, 378], '/Subtype': '/Square', '/T': 'AutoCAD SHX Text'}-------------------------------------------------------------------------------- {'/Border': [0, 0, 0], '/Contents': 'HV', '/F': 64, '/NM': '1db0a6b1-6aee-4a2d-bacf-2a996680cdcb', '/P': IndirectObject(52, 0), '/Rect': [212, 369, 201, 378], '/Subtype': '/Square', '/T': 'AutoCAD SHX Text'}-------------------------------------------------------------------------------- {'/Border': [0, 0, 0], '/Contents': '0832', '/F': 64, '/NM': '0250033f-5bc0-4d46-879a-a1ae0147352d', '/P': IndirectObject(52, 0), '/Rect': [212, 365, 195, 374], '/Subtype': '/Square', '/T': 'AutoCAD SHX Text'}-------------------------------------------------------------------------------- {'/Border': [0, 0, 0], '/Contents': '29', '/F': 64, '/NM': '7b372206-1b18-4c9f-813d-4d73ca52be40', '/P': IndirectObject(52, 0), '/Rect': [140, 392, 129, 401], '/Subtype': '/Square', '/T': 'AutoCAD SHX Text'}-------------------------------------------------------------------------------- {'/Border': [0, 0, 0], '/Contents': 'HV', '/F': 64, '/NM': 'bdc97ccd-ee7c-406a-a06c-1649a5f5f712', '/P': IndirectObject(52, 0), '/Rect': [146, 392, 135, 401], '/Subtype': '/Square', '/T': 'AutoCAD SHX Text'}-------------------------------------------------------------------------------- {'/Border': [0, 0, 0], '/Contents': '0824', '/F': 64, '/NM': '9f434537-57a3-4c40-bb8b-a0df6ea087aa', '/P': IndirectObject(52, 0), '/Rect': [146, 388, 129, 397], '/Subtype': '/Square', '/T': 'AutoCAD SHX 
Text'}-------------------------------------------------------------------------------- {'/Border': [0, 0, 0], '/Contents': '%%C25', '/F': 64, '/NM': 'a2ace5cb-21be-4df7-b541-ce87a8ee81bc', '/P': IndirectObject(52, 0), '/Rect': [145, 379, 132, 388], '/Subtype': '/Square', '/T': 'AutoCAD SHX Text'}-------------------------------------------------------------------------------- {'/Border': [0, 0, 0], '/Contents': 'APOUTO RD', '/F': 64, '/NM': '948a363f-989d-4d00-afa1-a8bc45ad9729', '/P': IndirectObject(52, 0), '/Rect': [1162, 355, 1136, 364], '/Subtype': '/Square', '/T': 'AutoCAD SHX Text'}-------------------------------------------------------------------------------- {'/Border': [0, 0, 0], '/Contents': 'HORTICULTURAL WATER', '/F': 64, '/NM': '5a6cedb6-cdf0-4784-817f-c17f1de58f5a', '/P': IndirectObject(52, 0), '/Rect': [1171, 350, 1126, 358], '/Subtype': '/Square', '/T': 'AutoCAD SHX Text'}-------------------------------------------------------------------------------- {'/Border': [0, 0, 0], '/Contents': '%%C101', '/F': 64, '/NM': '2ee5947a-9062-415f-a242-f4b7b2a6441c', '/P': IndirectObject(52, 0), '/Rect': [767, 428, 758, 443], '/Subtype': '/Square', '/T': 'AutoCAD SHX Text'}--------------------------------------------------------------------------------}"
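# A, B, C hold the three parts of a tag: two digits, two letters, four digits.
A = B = C = ''
# Every chunk after the first starts with a '/Contents' value.
for i in s.split("/Contents': '"):
    # Take the text before the first comma and drop the closing quote.
    e = i[0:i.index(",") - 1]
    if e.isdigit() and len(e) == 2:
        A = e
    if e.isalpha() and len(e) == 2:
        B = e
    if e.isdigit() and len(e) == 4:
        C = e
        # A four-digit value closes a tag: print what was collected and reset.
        # A lone four-digit value (like 0811) is printed on its own.
        print(A + B + C)
        A = B = C = ''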
[Output]:
0811
29HV0832
29HV0824
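A regex-based variant of the same idea, as a rough sketch: pull out every '/Contents' value first, then walk the list and accept either a whole tag in one value or three consecutive values shaped like the pieces. The names `CONTENTS`, `WHOLE_TAG`, `extract_tags` and the shortened `sample` are assumptions for illustration, and the patterns follow the strict 99XX9999 convention rather than the broader pattern from the question.

```python
import re

# Captures the text of every '/Contents' entry in the dump.
CONTENTS = re.compile(r"'/Contents': '([^']*)'")
# A whole tag inside one annotation, with optional single spaces.
WHOLE_TAG = re.compile(r"(\d{2})\s?([A-Z]{2})\s?(\d{4})")
# Shapes of the three pieces when the tag is split up.
PARTS = (re.compile(r"\d{2}"), re.compile(r"[A-Z]{2}"), re.compile(r"\d{4}"))

def extract_tags(text):
    """Return normalized tags (no spaces) found in an annotation dump."""
    values = CONTENTS.findall(text)
    tags = []
    i = 0
    while i < len(values):
        whole = WHOLE_TAG.fullmatch(values[i])
        if whole:
            tags.append("".join(whole.groups()))
            i += 1
            continue
        window = values[i:i + 3]
        if len(window) == 3 and all(
            p.fullmatch(v) for p, v in zip(PARTS, window)
        ):
            tags.append("".join(window))
            i += 3
            continue
        i += 1
    return tags

# Hypothetical, shortened dump mixing split, whole and unrelated values.
sample = (
    "{'/Contents': '0811', '/F': 64}---- {'/Contents': '29', '/F': 64}---- "
    "{'/Contents': 'HV', '/F': 64}---- {'/Contents': '0832', '/F': 64}---- "
    "{'/Contents': '29 HV 0824', '/F': 64}---- {'/Contents': 'APOUTO RD', '/F': 64}"
)
print(extract_tags(sample))
```

With this window check, a four-digit value that is not preceded by the two-digit and two-letter pieces (such as the 0811 above) is skipped rather than reported as a partial tag.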