Tags: python, regex, python-re

How to use regex to extract a substring from a given starting string up to a specific symbol in Python?


Question

Assume that I have a string like this:

example_text = 'b\'\\x08\\x13"\\\\https://www.example.com/link_1.html\\xd2\\x01`https://www.example.com/link_2.html\''

Expectation

I want to extract only the first URL, which is

output = "https://www.example.com/link_1.html"

I think a regex that matches the URL starting at "https" and ending at the '\' would be a good solution.

If so, how can I write the regex pattern?

I tried something like this:

re.findall("https://([^\\\\)]+)", example_text)

output = ['www.example.com/link_1.html', "www.example.com/link_2.html'"]

But then I need to add "https://" back and pick the first item of the returned list.

Is there any other solution?


Solution

  • You need to tweak your regex a bit.

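    What you were doing before: https://([^\\\\)]+) matches your link, but re.findall returns only the captured group, and your group starts after https://.

    Updated regex: (https://[^\\\\)]+) matches the link and captures the whole of it. The : and / characters are not special in Python regexes, so they need no escaping; writing \: or \/ in a normal (non-raw) string literal only triggers invalid-escape warnings in recent Python versions. As a raw string, the same pattern is written r"(https://[^\\)]+)".

    In Code:

    import re
    example_text = 'b\'\\x08\\x13"\\\\https://www.example.com/link_1.html\\xd2\\x01`https://www.example.com/link_2.html\''
    # Raw string: \\ inside the character class is one escaped backslash for the regex engine.
    output = re.findall(r"(https://[^\\)]+)", example_text)
    print(output)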
    

    Output:

    ['https://www.example.com/link_1.html', "https://www.example.com/link_2.html'"]
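To make the capturing-group behaviour concrete, here is a minimal sketch comparing the two cases on the string from the question. The variable names group_only and whole_match are only illustrative.

```python
import re

# Same string as in the question (the repr of a bytes object, stored as text).
example_text = 'b\'\\x08\\x13"\\\\https://www.example.com/link_1.html\\xd2\\x01`https://www.example.com/link_2.html\''

# One capturing group: findall returns only the text of that group.
group_only = re.findall(r"https://([^\\)]+)", example_text)

# No capturing group: findall returns each whole match.
whole_match = re.findall(r"https://[^\\)]+", example_text)

print(group_only)
print(whole_match)
```

Dropping the parentheses altogether therefore gives the same result as wrapping the entire pattern in one group.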
    

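    You could also use r"(https://([^\\)]+)\.html)" to get, for each link, a tuple of the full URL and the part between https:// and .html. Because every match must end in .html, this also drops the trailing ' that the previous pattern picks up on the last link. Note the escaped dot (\.), which matches a literal period rather than any character.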
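A short sketch of the two-group variant, assuming every link of interest ends in .html. The names pairs, full_url and inner are hypothetical.

```python
import re

example_text = 'b\'\\x08\\x13"\\\\https://www.example.com/link_1.html\\xd2\\x01`https://www.example.com/link_2.html\''

# Outer group: the full URL. Inner group: the text between "https://" and ".html".
pairs = re.findall(r"(https://([^\\)]+)\.html)", example_text)

for full_url, inner in pairs:
    print(full_url, inner)
```

With two groups in the pattern, re.findall returns a list of tuples, one tuple per match.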

    If you want only the first link, take the first element of the returned list, e.g. output[0].
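As an alternative to indexing the list, re.search stops at the first match, so the remaining links are never scanned. The sketch below is not part of the original answer. It also widens the character class to stop at a single quote, a backtick or whitespace, which is an assumption about what can follow a URL in this kind of data.

```python
import re

example_text = 'b\'\\x08\\x13"\\\\https://www.example.com/link_1.html\\xd2\\x01`https://www.example.com/link_2.html\''

# Assumption: a URL ends at a backslash, a single quote, a backtick, or whitespace.
match = re.search(r"https://[^\\'`\s]+", example_text)

# re.search returns None when nothing matches, so guard before calling .group().
first_url = match.group(0) if match else None
print(first_url)
```

The guard matters because re.search returns None on a string without any URL, whereas re.findall would return an empty list and output[0] would raise an IndexError.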