Tags: python, regex, python-re

Find the first/last n words of a string with a maximum of 20 characters using regex


I'm trying to match as many whole words as possible at the beginning or end of a string, up to a maximum of 20 characters.

This is what I have right now:

import re

s1 = "Hello,    World! This is a reallly long string"
match = re.search(r"^(\b.{0,20}\b)", s1)
print(f"'{match.group(0)}'")  # 'Hello,    World! '

My problem is the extra space at the end of the match. I believe this is because \b matches at either the beginning or the end of a word, but I'm not sure what to do about it.

I run into the same issue if I try to do the same with the end of the string but with a leading space instead:

s1 = "Hello,    World! This is a reallly long string"
match = re.search(r"(\b.{0,20}\b)$", s1)
print(f"'{match.group(0)}'") # ' reallly long string'

I know I can just use rstrip and lstrip to get rid of the leading/trailing whitespace but I was just wondering if there's a way to do it with regex.


Solution

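  • You can use r"^(.{0,19}\S\b|)". The \S forces the last matched character to be a non-space character, and the \b after it makes the match end at a word boundary. Because \S consumes one character itself, the preceding quantifier is reduced to {0,19}. The empty alternative after | lets the pattern match zero characters when nothing fits: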

    import re
    s1 = "Hello,    World! This is a reallly long string"
    match = re.search(r"^(.{0,19}\S\b|)", s1)
    print(f"'{match.group(0)}'", len(match.group(0)))
    

    Output:

    'Hello,    World' 15
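
To see why the bare \b in the question's pattern leaves a trailing space, it can help to list where \b actually matches. The following is a minimal sketch; the names boundaries and candidates are illustrative and not part of the original answer.

```python
import re

s1 = "Hello,    World! This is a reallly long string"

# Offsets where \b matches: every transition between a word character
# and a non-word character (or a string edge next to a word character).
boundaries = [m.start() for m in re.finditer(r"\b", s1)]

# Possible end points for ^(\b.{0,20}\b): boundaries at offset 20 or less.
candidates = [b for b in boundaries if b <= 20]
print(candidates)  # [0, 5, 10, 15, 17]

# The greedy .{0,20} backtracks to the largest candidate, which is the
# start of "This", so the preceding space is included in the match.
print(repr(s1[:max(candidates)]))  # 'Hello,    World! '
```

Requiring \S immediately before the final \b rules out offset 17, because the character before it is a space.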
    

    For the end of the string, use r"(|\b\S.{0,19})$":

    import re
    s1 = "Hello,    World! This is a reallly long string"
    match = re.search(r"(|\b\S.{0,19})$", s1)
    print(f"'{match.group(0)}'", len(match.group(0)))
    

    Output:

    'reallly long string' 19
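
Note that the start-of-string result above is 'Hello,    World' without the "!", because \b only exists between a word character and a non-word character. If trailing or leading punctuation should be kept, one possible variant (an assumption on my part, not the pattern from this answer) replaces \b with whitespace lookarounds:

```python
import re

s1 = "Hello,    World! This is a reallly long string"

# Variant: end on a non-space character that is NOT followed by another
# non-space character, so punctuation attached to a word is kept.
head = re.search(r"^(.{0,19}\S(?!\S)|)", s1).group(0)
print(f"'{head}'", len(head))  # 'Hello,    World!' 16

# Mirror image for the end of the string: start on a non-space character
# that is NOT preceded by another non-space character.
tail = re.search(r"(|(?<!\S)\S.{0,19})$", s1).group(0)
print(f"'{tail}'", len(tail))  # 'reallly long string' 19
```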
    
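    Why (...|)?

    The empty alternative allows a zero-length match. Without it, that is with ^(.{0,19}\S\b), the example below finds no match at all, so re.search returns None and match.group(0) raises an AttributeError: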

    import re
    s1 = "X"*21
    match = re.search(r"^(.{0,19}\S\b|)", s1)
    print(f"'{match.group(0)}'", len(match.group(0)))
    

    Output:

    '' 0
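
If the 20-character limit needs to vary, both patterns can be built from a parameter. This is a minimal sketch; the helper names first_words and last_words and the limit argument are hypothetical and only wrap the two patterns shown above.

```python
import re

def first_words(text, limit=20):
    """Longest prefix of at most `limit` characters that ends on a word boundary."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    # \S consumes one character, so the quantifier allows limit - 1 more.
    pattern = rf"^(.{{0,{limit - 1}}}\S\b|)"
    return re.search(pattern, text).group(0)

def last_words(text, limit=20):
    """Longest suffix of at most `limit` characters that starts on a word boundary."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    pattern = rf"(|\b\S.{{0,{limit - 1}}})$"
    return re.search(pattern, text).group(0)

s1 = "Hello,    World! This is a reallly long string"
print(repr(first_words(s1)))        # 'Hello,    World'
print(repr(last_words(s1)))         # 'reallly long string'
print(repr(first_words(s1, 10)))    # 'Hello'
print(repr(first_words("X" * 21)))  # ''
```

Thanks to the empty alternative, re.search always finds a match here, so calling .group(0) directly is safe.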