I am having problems figuring out the best way to strip URLs from a .txt file. I realize that regex is probably the best way to go about it but it's been a while since I did anything in Python. Not a homework question, just a personal project.
Here is a sample of the file:
738 \loch\af4\dbch\af31505\hich\f4 \u8232\'5f}{\field{*\fldinst {\rtlch\fcs1 \af4 \ltrch\fcs0 \f4\cf1\insrsid10228738 \hich\af4\dbch\af31505\loch\f4 HYPERLINK "https://archive.org/randomURL1?fref=grp_mmbr_list"}{ \rtlch\fcs1 \af4 \ltrch\fcs0 \f4\cf1\insrsid10228738 {*\datafield 00d0c9ea79f9bace118c8200aa004ba90b0200000003000000e0c9ea79f9bace118c8200aa004ba90b31505\hich\f4 \u8232\'5f}{\field{*\fldinst {\rtlch\fcs1 \af4 \ltrch\fcs0 \f4\cf1\insrsid10228738 \hich\af4\dbch\af31505\loch\f4 HYPERLINK "https://archive.org/randomURL2?fref=grp_mmbr_list"}{\rtlch\fcs1 \af4 \ltrch\fcs0 \f4\cf1\insrsid10228738 {*
As you can see, it's a mess. At least there always seems to be a 'HYPERLINK "' before each URL and a 'fref' after it, so I thought I could use the regex start-of-line and end-of-line operators.
I was thinking this:
grep ^HYPERLINK $fref testsample.txt | echo output.txt
But it's not working for me. The desired output would look like this in a new file:
link1
link2
linkn...
Update: I found out how to pull URLs and put them in a new file with this command:
grep 'https://www\.[[:alpha:]]\+\.[[:alpha:]]\+' testsample.txt > testfile2.txt
But my output looks like this:
\f4\cf1\insrsid10228738 \loch\af4\dbch\af31505\hich\f4 \u8232\'5f}{\field{*\fldinst {\rtlch\fcs1 \af4 \ltrch\fcs0 \f4\cf1\insrsid10228738 \hich\af4\dbch\af31505\loch\f4 HYPERLINK "httjps://archive.org/randomURL1?fref=grp_mmbr_list"}{ \loch\af4\dbch\af31505\hich\f4 \u8232\'5f}{\field{*\fldinst {\rtlch\fcs1 \af4 \ltrch\fcs0 \f4\cf1\insrsid10228738 \hich\af4\dbch\af31505\loch\f4 HYPERLINK "httjps://archive.org/randomURL1?fref=grp_mmbr_list"}{\rtlch\fcs1 \af4 \ltrch\fcs0 \f4\cf1\insrsid10228738 \loch\af4\dbch\af31505\hich\f4 \u8232\'5f}{\field{*\fldinst {\rtlch\fcs1 \af4 \ltrch\fcs0 \f4\cf1\insrsid10228738 \hich\af4\dbch\af31505\loch\f4 HYPERLINK "httjps://archive.org/randomURL2?fref=grp_mmbr_list"}{
It seems like it's pulling the whole line and not just the URL. Any help with configuring the 'end of line' parameter would be very much appreciated.
Solved
grep -Eo '\"https?:\/\/[^"]+\"' testsample.txt > testfile2.txt
To extract all links into a new file with the grep command:
grep -Po '\"\Khttps?:\/\/[^"]+(?=\")' testsample.txt > testfile2.txt
Now the testfile2.txt file should contain the following:
https://archive.org/randomURL1?fref=grp_mmbr_list
https://archive.org/randomURL2?fref=grp_mmbr_list
Note: if the -P option isn't supported on your side, use -E to enable extended regular expressions (the matches will then include the surrounding double quotes):
grep -Eo '\"https?:\/\/[^"]+\"' testsample.txt > testfile2.txt
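Since the question mentions Python, here is a minimal sketch of the same extraction with re.findall. The names URL_IN_QUOTES, extract_links and sample are made up for illustration, and the pattern assumes, as in the sample above, that every URL sits between double quotes.

```python
import re

# Same idea as the grep pattern: a double quote, then http:// or https://,
# then everything up to (but not including) the next double quote.
URL_IN_QUOTES = re.compile(r'"(https?://[^"]+)"')


def extract_links(text):
    """Return every quoted http(s) URL found in text, without the quotes."""
    # With exactly one capture group, findall returns only the group contents.
    return URL_IN_QUOTES.findall(text)


# Shortened stand-in for the file contents (hypothetical); the raw string
# keeps the RTF backslashes literal.
sample = r'\loch\f4 HYPERLINK "https://archive.org/randomURL1?fref=grp_mmbr_list"}{ \rtlch HYPERLINK "https://archive.org/randomURL2?fref=grp_mmbr_list"}{'

for link in extract_links(sample):
    print(link)
```

To work on the real file, read it with open('testsample.txt').read(), pass the text to extract_links, and write '\n'.join(...) of the result to testfile2.txt.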
To remove all links from the initial file (in place), use the sed command with the -ri options:
sed -ri 's/\"https?:\/\/[^"]+\"//g' /tmp/testsample.txt
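For a Python counterpart to the sed command, something along these lines should work. It is only a sketch: strip_links is a made-up helper name, the file is assumed to decode as UTF-8 (true for plain ASCII RTF), and the demo runs on a throwaway temporary file rather than on testsample.txt.

```python
import re
import tempfile
from pathlib import Path

# A double-quoted http(s) URL, quotes included (same pattern as the sed call).
QUOTED_URL = re.compile(r'"https?://[^"]+"')


def strip_links(path):
    """Remove every quoted http(s) URL from the file at path, in place.

    Returns the number of URLs removed.
    """
    path = Path(path)
    text = path.read_text(encoding="utf-8")
    # subn returns the new text together with the number of substitutions.
    cleaned, count = QUOTED_URL.subn("", text)
    path.write_text(cleaned, encoding="utf-8")
    return count


# Demo on a temporary file with hypothetical content modelled on the sample.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.txt"
    demo.write_text(
        r'\f4 HYPERLINK "https://archive.org/randomURL1?fref=grp_mmbr_list"}{' + "\n",
        encoding="utf-8",
    )
    removed = strip_links(demo)
    remaining = demo.read_text(encoding="utf-8")
    print(removed, repr(remaining))
```

Unlike sed -i, this reads the whole file into memory, which is fine for a document-sized file but worth reconsidering for very large inputs.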
Alternative solution in Python using the re.sub() function (applied to a test string instead of a file):
import re
# Raw string (r'''...''') so Python keeps the RTF backslashes literally instead of
# interpreting sequences such as \a, \f, \r and \u8232 as escapes.
s = r'''
738 \loch\af4\dbch\af31505\hich\f4 \u8232\'5f}{\field{*\fldinst {\rtlch\fcs1 \af4 \ltrch\fcs0 \f4\cf1\insrsid10228738 \hich\af4\dbch\af31505\loch\f4 HYPERLINK "https://archive.org/randomURL1?fref=grp_mmbr_list"}{ \rtlch\fcs1 \af4 \ltrch\fcs0 \f4\cf1\insrsid10228738 {*\datafield 00d0c9ea79f9bace118c8200aa004ba90b0200000003000000e0c9ea79f9bace118c8200aa004ba90b31505\hich\f4 \u8232\'5f}{\field{*\fldinst {\rtlch\fcs1 \af4 \ltrch\fcs0 \f4\cf1\insrsid10228738 \hich\af4\dbch\af31505\loch\f4 HYPERLINK "https://archive.org/randomURL2?fref=grp_mmbr_list"}{\rtlch\fcs1 \af4 \ltrch\fcs0 \f4\cf1\insrsid10228738 {*
'''
# Delete each double-quoted http(s) URL, quotes included.
result = re.sub(r'"https?://[^"]+"', '', s)
print(repr(result))
The output:
"\n738 \\loch\\af4\\dbch\\af31505\\hich\\f4 \\u8232\\'5f}{\\field{*\\fldinst {\\rtlch\\fcs1 \\af4 \\ltrch\\fcs0 \\f4\\cf1\\insrsid10228738 \\hich\\af4\\dbch\\af31505\\loch\\f4 HYPERLINK }{ \\rtlch\\fcs1 \\af4 \\ltrch\\fcs0 \\f4\\cf1\\insrsid10228738 {*\\datafield 00d0c9ea79f9bace118c8200aa004ba90b0200000003000000e0c9ea79f9bace118c8200aa004ba90b31505\\hich\\f4 \\u8232\\'5f}{\\field{*\\fldinst {\\rtlch\\fcs1 \\af4 \\ltrch\\fcs0 \\f4\\cf1\\insrsid10228738 \\hich\\af4\\dbch\\af31505\\loch\\f4 HYPERLINK }{\\rtlch\\fcs1 \\af4 \\ltrch\\fcs0 \\f4\\cf1\\insrsid10228738 {*\n"