Search code examples
sqlsplitdata-munging

Split text using any suggested method


I have a plain text like this:

Cart ID: A3N42M / Copy: A3N42P PO: 5000021337 Invoice: 3110021337
Cart ID: A3N3ZW / Copy: A3N3ZX/ PO: 5000021335 Invoice: 3110021335 
Cart ID: A3N3ZL / Copy: A3N3ZM PO: 5000021336 Invoice: 3110021336 
Original: A3N444 / Copy: A3N445 PO: 5000021340 Invoice: 3130021340
Original: A3N44C / Copy: A3N44D PO: 5000021341 Invoice: 3130021341
Original: A3N44G / Copy: A3N44H PO: 5000021342 Invoice: 3130021342
Cart ID: A3N3ZZ / Copy: A3N428 PO: A3N3ZZ01
Cart ID: A3N3ZQ / Copy: A3N3ZV PO: A3N3ZQ01
Cart ID: A3N336 / Copy: A3N337 PO: A3N33601
E3M49D / E3M49Q - PO: 4620028049
E3M49N / E3M49X
E3M49P / E3M49Y

And I need to split the cart IDs from that text, and end up in something like this:

A3N42M
A3N42P
A3N3ZW
...
E3M49N
...

These IDs are always 6-digit length, and always starts with A, E or P (AXXXXX, EXXXXX, PXXXXX, etc...).

Is there any way (using any script or any program language) that I can achieve this?

This plain text is currently on a spreadsheet table, and I need to separate these IDs to use in a SQL query for later, thanks!


Solution

  • Regex Data-Munging


    You can perform this via regex substitutions (or find and replace). I happened to do this in python, but you could use the same regex patterns ('^[^/]*?: ', '/.*') in any language or text editor that supports them.

    Regex Explanation

    1. '^[^/]*?: ' - start and beginning of string (first ^), match multiple non-\ characters in a non greedy way ([^/]*?]), a :, then a .
    2. '/.*' - match all /, then multiple any character (.)

    Processing Example (in Python)

    import re
    
    text = '''
    Cart ID: A3N42M / Copy: A3N42P PO: 5000021337 Invoice: 3110021337
    Cart ID: A3N3ZW / Copy: A3N3ZX/ PO: 5000021335 Invoice: 3110021335 
    Cart ID: A3N3ZL / Copy: A3N3ZM PO: 5000021336 Invoice: 3110021336 
    Original: A3N444 / Copy: A3N445 PO: 5000021340 Invoice: 3130021340
    Original: A3N44C / Copy: A3N44D PO: 5000021341 Invoice: 3130021341
    Original: A3N44G / Copy: A3N44H PO: 5000021342 Invoice: 3130021342
    Cart ID: A3N3ZZ / Copy: A3N428 PO: A3N3ZZ01
    Cart ID: A3N3ZQ / Copy: A3N3ZV PO: A3N3ZQ01
    Cart ID: A3N336 / Copy: A3N337 PO: A3N33601
    E3M49D / E3M49Q - PO: 4620028049
    E3M49N / E3M49X
    E3M49P / E3M49Y
    '''
    
    text = re.sub('^[^/]*?: ([]*?)', '', text, flags=re.MULTILINE)
    text = re.sub('/.*', '', text)
    
    print text
    

    A3N42M 
    A3N3ZW 
    A3N3ZL 
    A3N444 
    A3N44C 
    A3N44G 
    A3N3ZZ 
    A3N3ZQ 
    A3N336 
    E3M49D 
    E3M49N 
    E3M49P
    

    EDIT


    Updated Regex Explanation

    Updated regex per author's request.

    1. match any character (.*?) (non-greedy), followed by a capture group of ((...)) of a character class ([AEP]) followed by at least one character class ([0-9]+) followed by four word characters (\w{4}), followed by another capture group of any character ((.*))
    2. All of this is matched and replaced with the capture group variables with a newline in between (\1\n\2), essentially splitting rows where IDs occur twice
    3. Repeat steps with only the first capture group to handle the newlines containing the second ID

    Updated Processing

    text = re.sub(r'.*?([AEP][0-9]+\w{4})(.*)', r'\1\n\2', text, flags=re.MULTILINE)
    text = re.sub(r'.*?([AEP][0-9]+\w{4}).*', r'\1', text, flags=re.MULTILINE)
    
    print text
    

    A3N42M
    A3N42P
    A3N3ZW
    A3N3ZX
    A3N3ZL
    A3N3ZM
    A3N444
    A3N445
    A3N44C
    A3N44D
    A3N44G
    A3N44H
    A3N3ZZ
    A3N428
    A3N3ZQ
    A3N3ZV
    A3N336
    A3N337
    E3M49D
    E3M49Q
    E3M49N
    E3M49X
    E3M49P
    E3M49Y