
Truncate or trim a subdirectory path string to an arbitrary number of characters with regex


I have a text file of several thousand URLs that I need to truncate or trim with regex. I am using BBEdit as a text editor as it has a great regex find/replace function.

This is an example of one of the URLs:

https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhUk2LfEXvKMZ48tpWUR607L5y_TRn-lXyajH_tJBOeWPqNFmfU1UV7pKginB78MHnuGS-luzq-RCIj1Z6rJ2y8VE3P93gIGeN_ZMjFii1Vnb2wZMnbyLTH241UTuu8kcvMZHFii1Vnb2wZMnbyLTH241gaZGDlgWTfx4EVdAlNFncc2XZJNz0fE0-JK1iDP7WgLEJWNg/w640-h196/Oscar.png

I need to truncate the longest subdirectory in the path, which is this:

/AVvXsEhUk2LfEXvKMZ48tpWUR607L5y_TRn-lXyajH_tJBOeWPqNFmfU1UV7pKginB78MHnuGS-luzq-RCIj1Z6rJ2y8VE3P93gIGeN_ZMjFii1Vnb2wZMnbyLTH241UTuu8kcvMZHFii1Vnb2wZMnbyLTH241gaZGDlgWTfx4EVdAlNFncc2XZJNz0fE0-JK1iDP7WgLEJWNg/

What I need to do is truncate that one subdirectory so that it keeps the leading /AVvXsE plus the next 20 characters to the right.

i.e., this is what I need as a result:

/AVvXsEhUk2LfEXvKMZ48tpWUR6/

so the resulting full URL path is this:

https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhUk2LfEXvKMZ48tpWUR6/w640-h196/Oscar.png

The first six characters of that subdirectory, AVvXsE, are the same in all the URLs I need to truncate. I need the next 20 characters to the right of /AVvXsE to keep the paths unique, because the other subdirectories for the image files, e.g. w640-h196, are shared by many other images.

How can I do this with Regex? Or is Regex not the best way to do this? What about sed?

Regex Fiddle: https://regex101.com/r/W2t82Z/1


Solution

  • You can use a pattern built around (\/AVvXsE\S{20})[^\/]*, such as:

    (?i)(https?:\/\/blogger.googleusercontent.com\/img\/.*)(\/AVvXsE\S{20})[^\/]*
    

    This assumes that you only want to change https://blogger.googleusercontent.com/img/ URLs.

    Example

    import re
    
    s = """https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEEXvKMZ48tpWUR607L5y_TRn-lXyajH_tJBOeWPqNFmfU1UV7pKginB78MHnuGS-luzq-RCIj1Z6rJ2y8VE3P93gIGeN_ZMjFii1Vnb2wZMnbyLTH241UTuu8kcvMZHFii1Vnb2wZMnbyLTH241gaZGDlgWTfx4EVdAlNFncc2XZJNz0fE0-JK1iDP7WgLEJWNg/w640-h196/Oscar.png"""
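    # Group 1 keeps the URL up to the long segment; group 2 keeps /AVvXsE plus 20 characters.
    p = r'(?i)(https?:\/\/blogger.googleusercontent.com\/img\/.*)(\/AVvXsE\S{20})[^\/]*'
    # Writing back only the two groups drops the rest of the long segment.
    print(re.sub(p, r'\1\2', s))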
    
    

    Prints

    https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEEXvKMZ48tpWUR607L5y_/w640-h196/Oscar.png
    
    

    Details:

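    • (?i): case-insensitive flag (matches HTTPS://, https://, and any other mix of cases). It applies to the whole pattern, so AVvXsE is matched case-insensitively as well.
    • (https?:\/\/blogger.googleusercontent.com\/img\/.*): this capture group limits the pattern to specific URLs and captures everything before the long segment. The dots in the host name are unescaped, so each one matches any character.
    • (\/AVvXsE\S{20}): this capture group is the part you want to keep, i.e. /AVvXsE plus the next 20 non-whitespace characters.
    • [^\/]*: this is the part you want to get rid of, i.e. the rest of the segment up to the next /.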
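
As a hedged variant of the pattern above, the sketch below swaps \S{20} for [^/]{20}, so the 20 kept characters cannot run past the end of the segment, and drops the host-name group so that it matches on the segment alone. The names SEGMENT and truncate_segment are made up for illustration, and the sample URL is the one from the question. The same find pattern with \1 as the replacement should also work in BBEdit's Grep search, though that is not tested here.

```python
import re

# Hypothetical names; the /AVvXsE prefix and the length 20 come from the question.
SEGMENT = re.compile(r'(/AVvXsE[^/]{20})[^/]*')


def truncate_segment(url: str) -> str:
    """Keep /AVvXsE plus the next 20 characters of that path segment."""
    # count=1 rewrites only the first matching segment in the URL.
    return SEGMENT.sub(r'\1', url, count=1)


# Sample URL from the question, split across lines for readability.
sample = (
    "https://blogger.googleusercontent.com/img/b/R29vZ2xl/"
    "AVvXsEhUk2LfEXvKMZ48tpWUR607L5y_TRn-lXyajH_tJBOeWPqNFmfU1UV7pKginB78MHnuGS"
    "-luzq-RCIj1Z6rJ2y8VE3P93gIGeN_ZMjFii1Vnb2wZMnbyLTH241UTuu8kcvMZHFii1Vnb2wZ"
    "MnbyLTH241gaZGDlgWTfx4EVdAlNFncc2XZJNz0fE0-JK1iDP7WgLEJWNg"
    "/w640-h196/Oscar.png"
)

print(truncate_segment(sample))
# https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhUk2LfEXvKMZ48tpWUR6/w640-h196/Oscar.png
```

A segment with fewer than 20 characters after /AVvXsE does not match and is left unchanged, which seems safer than letting the match spill into the next path segment.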