
Python regular expression to extract string from python dataframe


I wrote a Python script that extracts text from PDFs and reads it into a Python string. I am trying to extract data from several different PDFs, and the layout of the address differs slightly from document to document. Here are some examples:

Alamat :Menara Bank Mega, Lantai 24, Jl. Kapten P Tendean
Kav. 12-14A

Alamat :JL USMAN NO. 42, RT 8/4, KEL. KELAPA DUA WETAN,
KEC. PASAR REBO, JAKARTA TIMUR

Alamat :JL. HR. RASUNA SAID KAV 1-2, GRAHA IRAMA
LANTAI 6 KUNINGAN TIMUR- SETIABUDI JAKARTA
SELATAN

Alamat :GD. GRAHA PRATAMA LT.10, JL. MT. HARYONO
KAV.15, TEBET

AHUAlamat :GEDUNG BERITASATU PLAZA LT. 8, JL. JEND. GATOT
SUBROTO KAV. 35-36

I want to extract everything after the ":". Is there a regular expression that can capture all of the addresses above?


Solution

  • Using re.search() is one possible approach:

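    (?:Alamat|AHUAlamat) is a non-capturing group which matches either "Alamat" or "AHUAlamat". Because re.search() scans the whole line, "Alamat" on its own would also find the "AHUAlamat" lines; the alternation just makes the intent explicit.
    \s* matches zero or more whitespace characters.
    : matches the colon character.
    (.*) is a capturing group which matches the rest of the line (any series of characters except newlines).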

    import re
    
    data_str = """Alamat :Menara Bank Mega, Lantai 24, Jl. Kapten P Tendean Kav. 12-14A
    Alamat :JL USMAN NO. 42, RT 8/4, KEL. KELAPA DUA WETAN, KEC. PASAR REBO, JAKARTA TIMUR
    Alamat :JL. HR. RASUNA SAID KAV 1-2, GRAHA IRAMA LANTAI 6 KUNINGAN TIMUR- SETIABUDI Jakarta SELATAN
    Alamat :GD. GRAHA PRATAMA LT.10, JL. MT. HARYONO KAV.15, TEBET
    AHUAlamat :GEDUNG BERITASATU PLAZA LT. 8, JL. JEND. GATOT SUBROTO KAV. 35-36
    """
    
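    # Label (with or without the "AHU" prefix), optional whitespace, a colon,
    # then capture the rest of the line.
    pattern = r'(?:Alamat|AHUAlamat)\s*:(.*)'
    addresses = data_str.splitlines()
    
    for address in addresses:
        match = re.search(pattern, address)
        if match:  # skip lines that carry no "Alamat :" label
            print(match.group(1).strip())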
    

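    Note: If every line of the string has the same structure, with the label before the first ":", then split() alone can do the job:

    lst_data = data_str.splitlines()
    # maxsplit=1 splits only at the first ":", so a later ":" inside an address is kept
    addresses = [address.split(':', 1)[-1].strip() for address in lst_data]
    print(*addresses, sep='\n')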
    

    Output:

    Menara Bank Mega, Lantai 24, Jl. Kapten P Tendean Kav. 12-14A
    JL USMAN NO. 42, RT 8/4, KEL. KELAPA DUA WETAN, KEC. PASAR REBO, JAKARTA TIMUR
    JL. HR. RASUNA SAID KAV 1-2, GRAHA IRAMA LANTAI 6 KUNINGAN TIMUR- SETIABUDI Jakarta SELATAN
    GD. GRAHA PRATAMA LT.10, JL. MT. HARYONO KAV.15, TEBET
    GEDUNG BERITASATU PLAZA LT. 8, JL. JEND. GATOT SUBROTO KAV. 35-36
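
The snippets above assume one address per line, whereas the sample in the question wraps each address over two or three lines. If the extracted PDF text still has those line breaks, one option is to match across newlines and stop at the next label. The following is a minimal sketch rather than a definitive solution: the names raw_text, wrapped_pattern and extract_wrapped_addresses are made up for illustration, and it assumes that nothing other than the address sits between one "Alamat :" label and the next (or the end of the text).

```python
import re

# Sample text copied from the question, with addresses wrapped across lines.
# Assumption: only address text appears between consecutive "Alamat :" labels.
raw_text = """Alamat :Menara Bank Mega, Lantai 24, Jl. Kapten P Tendean
Kav. 12-14A

Alamat :JL USMAN NO. 42, RT 8/4, KEL. KELAPA DUA WETAN,
KEC. PASAR REBO, JAKARTA TIMUR

AHUAlamat :GEDUNG BERITASATU PLAZA LT. 8, JL. JEND. GATOT
SUBROTO KAV. 35-36
"""

# re.DOTALL lets "." match newlines; the lazy group stops at the next
# "Alamat :" / "AHUAlamat :" label, or at the end of the text.
wrapped_pattern = re.compile(
    r'(?:AHU)?Alamat\s*:(.*?)(?=(?:AHU)?Alamat\s*:|\Z)',
    re.DOTALL,
)


def extract_wrapped_addresses(text):
    """Return each address as one line, with runs of whitespace collapsed."""
    return [' '.join(chunk.split()) for chunk in wrapped_pattern.findall(text)]


addresses = extract_wrapped_addresses(raw_text)
print(*addresses, sep='\n')
```

If the real documents have other fields after the address, the lookahead would need an extra alternative for whatever label follows, otherwise that text ends up inside the captured address.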