I wrote a PDF extraction in Python that reads each document into a Python string. I am trying to extract data from different PDFs, and the structure of the addresses in each document is slightly different. Here is an example:
Alamat :Menara Bank Mega, Lantai 24, Jl. Kapten P Tendean
Kav. 12-14A
Alamat :JL USMAN NO. 42, RT 8/4, KEL. KELAPA DUA WETAN,
KEC. PASAR REBO, JAKARTA TIMUR
Alamat :JL. HR. RASUNA SAID KAV 1-2, GRAHA IRAMA
LANTAI 6 KUNINGAN TIMUR- SETIABUDI JAKARTA
SELATAN
Alamat :GD. GRAHA PRATAMA LT.10, JL. MT. HARYONO
KAV.15, TEBET
AHUAlamat :GEDUNG BERITASATU PLAZA LT. 8, JL. JEND. GATOT
SUBROTO KAV. 35-36
I want to extract everything after the ":". Is there a regular expression that finds all of the addresses above?
Using re.search() is one possible approach. The pattern breaks down as follows:
(?:Alamat|AHUAlamat) : a non-capturing group that matches either "Alamat" or "AHUAlamat".
\s* : matches any number of whitespace characters (including none).
: : matches the colon character.
(.*) : a capturing group that matches any series of characters except newlines.
import re

# Each address has already been joined onto a single line
data_str = """Alamat :Menara Bank Mega, Lantai 24, Jl. Kapten P Tendean Kav. 12-14A
Alamat :JL USMAN NO. 42, RT 8/4, KEL. KELAPA DUA WETAN, KEC. PASAR REBO, JAKARTA TIMUR
Alamat :JL. HR. RASUNA SAID KAV 1-2, GRAHA IRAMA LANTAI 6 KUNINGAN TIMUR- SETIABUDI Jakarta SELATAN
Alamat :GD. GRAHA PRATAMA LT.10, JL. MT. HARYONO KAV.15, TEBET
AHUAlamat :GEDUNG BERITASATU PLAZA LT. 8, JL. JEND. GATOT SUBROTO KAV. 35-36
"""

# Label, optional whitespace, colon, then capture the rest of the line
pattern = r'(?:Alamat|AHUAlamat)\s*:(.*)'

addresses = data_str.splitlines()
for address in addresses:
    match = re.search(pattern, address)
    if match:
        # group(1) is the text after the colon
        print(match.group(1).strip())
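The snippet above assumes each address already sits on one line. In the raw text shown in the question, the addresses wrap onto a second or third line, and (.*) stops at the first newline. One possible way to handle that is sketched below: capture lazily until the next label at the start of a line (or the end of the text), then collapse the line breaks. The names raw_text, multiline_pattern and extract_addresses are made up for this illustration, and the sketch assumes the string contains nothing but address blocks; any other text following an address would be captured along with it.

```python
import re

# Raw text as in the question, with addresses wrapped across lines
raw_text = """Alamat :Menara Bank Mega, Lantai 24, Jl. Kapten P Tendean
Kav. 12-14A
Alamat :JL USMAN NO. 42, RT 8/4, KEL. KELAPA DUA WETAN,
KEC. PASAR REBO, JAKARTA TIMUR
AHUAlamat :GEDUNG BERITASATU PLAZA LT. 8, JL. JEND. GATOT
SUBROTO KAV. 35-36
"""

# Lazily capture everything up to the next label at the start of a line,
# or up to the end of the text. DOTALL lets "." cross newlines and
# MULTILINE lets "^" match at the start of every line.
multiline_pattern = re.compile(
    r'(?:AHU)?Alamat\s*:(.*?)(?=^(?:AHU)?Alamat\s*:|\Z)',
    flags=re.DOTALL | re.MULTILINE,
)


def extract_addresses(text):
    """Return each address as one line with whitespace runs collapsed."""
    return [" ".join(chunk.split()) for chunk in multiline_pattern.findall(text)]


for address in extract_addresses(raw_text):
    print(address)
```

The " ".join(chunk.split()) step replaces every run of spaces and newlines with a single space, which rejoins the wrapped lines.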
Note: If every line of the string has the same structure, with a ":" separating the label from the address, then split() alone can do the job:
lst_data = data_str.splitlines()
addresses = [address.split(':')[-1] for address in lst_data]
print(*addresses, sep='\n')
Output:
Menara Bank Mega, Lantai 24, Jl. Kapten P Tendean Kav. 12-14A
JL USMAN NO. 42, RT 8/4, KEL. KELAPA DUA WETAN, KEC. PASAR REBO, JAKARTA TIMUR
JL. HR. RASUNA SAID KAV 1-2, GRAHA IRAMA LANTAI 6 KUNINGAN TIMUR- SETIABUDI Jakarta SELATAN
GD. GRAHA PRATAMA LT.10, JL. MT. HARYONO KAV.15, TEBET
GEDUNG BERITASATU PLAZA LT. 8, JL. JEND. GATOT SUBROTO KAV. 35-36
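One caveat with split(':')[-1] is that it returns only the text after the last colon, so an address that itself contains a ":" would be cut short. A possible alternative, sketched here, is str.partition(), which splits at the first colon only. The helper name after_first_colon and the sample line are hypothetical and not taken from the PDFs in the question.

```python
def after_first_colon(line):
    """Return the text after the first ':' in line, stripped of whitespace."""
    # partition() returns (before, separator, after) for the first match only;
    # if there is no ':' at all, the third item is an empty string
    _label, _sep, value = line.partition(":")
    return value.strip()


# Hypothetical line with a second colon inside the address
sample = "Alamat :Gedung A, Lantai 2: Ruang 5"

print(sample.split(":")[-1].strip())  # keeps only the part after the last colon
print(after_first_colon(sample))      # keeps everything after the first colon
```

For the data in the question both approaches give the same result, since none of the addresses contain a colon.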