How to use regex to extract a substring from a given start string up to a specific symbol in Python?

Question

Assume that I have a string like this:

example_text = 'b\'\\x08\\x13"\\\\https://www.example.com/link_1.html\\xd2\\x01`https://www.example.com/link_2.html\''

Expectation

I want to extract only the first URL, which is

output = "https://www.example.com/link_1.html"

I think a regex that finds the URL starting at "https" and ending at the next '\' would be a good solution.

If so, how can I write the regex pattern?

I tried something like this:

re.findall("https://([^\\\\)]+)", example_text)

output = ['www.example.com/link_1.html', "www.example.com/link_2.html'"]

But then I need to add "https://" back and pick the first item of the returned list.

Is there any other solution?

Solution

You need to tweak your regex a bit.

What you had before, https://([^\\\\)]+), matches your link but captures only the part after https://, because the capturing group starts after it.

Updated regex: (https\:\/\/[^\\\\)]+) matches the link and captures all of it, because the group now wraps https:// as well. (Escaping : and / is optional in Python's re module; neither is a special character.)

In Code:

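import re
example_text = 'b\'\\x08\\x13"\\\\https://www.example.com/link_1.html\\xd2\\x01`https://www.example.com/link_2.html\''
# Raw string, so the backslash in the class is written once-escaped; ':' and '/' need no escaping.
print(re.findall(r"(https://[^\\)]+)", example_text))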

Output:

['https://www.example.com/link_1.html', "https://www.example.com/link_2.html'"]
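The difference between the two patterns comes from how re.findall treats groups: with one capturing group it returns only the group's text, and with no group it returns each whole match. A minimal sketch on a shortened, made-up string (the name sample and its contents are assumptions for illustration, not the question's data):

```python
import re

# Hypothetical shortened sample: two URLs, each followed by a backslash.
sample = r"\\https://www.example.com/link_1.html\xd2`https://www.example.com/link_2.html\x01"

# One capturing group after the scheme: findall returns only the group's text.
tails = re.findall(r"https://([^\\)]+)", sample)

# No group at all: findall returns each whole match, scheme included.
full = re.findall(r"https://[^\\)]+", sample)

print(tails)
print(full)
```

So the outer parentheses in the updated pattern are not strictly required; dropping the group entirely should give the same list.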

You could also use (https\:\/\/([^\\\\)]+\.html)) to get each link both with and without https:// as a tuple. Because the match has to end in .html, this also drops the trailing ' that the last link otherwise picks up.
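A sketch of the two-group idea, written as a raw string. As an assumption on my part, the dot is escaped and .html sits inside the inner group, so that both items of each tuple are complete links; it also assumes every link of interest ends in .html:

```python
import re

example_text = 'b\'\\x08\\x13"\\\\https://www.example.com/link_1.html\\xd2\\x01`https://www.example.com/link_2.html\''

# Outer group: the whole link. Inner group: the link without "https://".
# With two groups, findall returns a list of (outer, inner) tuples.
pairs = re.findall(r"(https://([^\\)]+\.html))", example_text)

for with_scheme, without_scheme in pairs:
    print(with_scheme, without_scheme)
```

The trailing quote disappears because the engine backtracks from the ' to the last .html it can find.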

If you want only the first link, take element 0 of the list that re.findall returns.
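When only the first link is needed, re.search is a possible alternative to indexing the findall result: it stops at the first match and returns None if nothing matches. A minimal sketch (the names match and first_url are illustrative):

```python
import re

example_text = 'b\'\\x08\\x13"\\\\https://www.example.com/link_1.html\\xd2\\x01`https://www.example.com/link_2.html\''

# search() stops at the first match; group(0) is the whole matched text.
match = re.search(r"https://[^\\)]+", example_text)
first_url = match.group(0) if match else None

print(first_url)
```

This avoids both the added-back "https://" and the list indexing, and it does not raise an IndexError when the text contains no link.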