python regex grep sanitization

Using Grep & Regex to strip URL strings from .txt


I am having problems figuring out the best way to strip URLs from a .txt file. I realize that regex is probably the best way to go about it but it's been a while since I did anything in Python. Not a homework question, just a personal project.

Here is a sample of the file:

738 \loch\af4\dbch\af31505\hich\f4 \u8232\'5f}{\field{*\fldinst {\rtlch\fcs1 \af4 \ltrch\fcs0 \f4\cf1\insrsid10228738 \hich\af4\dbch\af31505\loch\f4 HYPERLINK "https://archive.org/randomURL1?fref=grp_mmbr_list"}{ \rtlch\fcs1 \af4 \ltrch\fcs0 \f4\cf1\insrsid10228738 {*\datafield 00d0c9ea79f9bace118c8200aa004ba90b0200000003000000e0c9ea79f9bace118c8200aa004ba90b31505\hich\f4 \u8232\'5f}{\field{*\fldinst {\rtlch\fcs1 \af4 \ltrch\fcs0 \f4\cf1\insrsid10228738 \hich\af4\dbch\af31505\loch\f4 HYPERLINK "https://archive.org/randomURL2?fref=grp_mmbr_list"}{\rtlch\fcs1 \af4 \ltrch\fcs0 \f4\cf1\insrsid10228738 {*

As you can see, it's a mess. At least it seems that there is always a 'HYPERLINK "' before each URL and a 'fref' after so I could use the regex start of line and end of line operators.

I was thinking this:

grep ^HYPERLINK $fref testsample.txt | echo output.txt

But it's not working for me. The desired output would look like this in a new file:

link1
link2
linkn...

Update: I found out how to pull URLs and put them in a new file with this command:

grep 'https://www\.[[:alpha:]]\+\.[[:alpha:]]\+' testsample.txt > testfile2.txt

But my output looks like this:

\f4\cf1\insrsid10228738 \loch\af4\dbch\af31505\hich\f4 \u8232\'5f}{\field{*\fldinst {\rtlch\fcs1 \af4 \ltrch\fcs0 \f4\cf1\insrsid10228738 \hich\af4\dbch\af31505\loch\f4 HYPERLINK "https://archive.org/randomURL1?fref=grp_mmbr_list"}{ \loch\af4\dbch\af31505\hich\f4 \u8232\'5f}{\field{*\fldinst {\rtlch\fcs1 \af4 \ltrch\fcs0 \f4\cf1\insrsid10228738 \hich\af4\dbch\af31505\loch\f4 HYPERLINK "https://archive.org/randomURL1?fref=grp_mmbr_list"}{\rtlch\fcs1 \af4 \ltrch\fcs0 \f4\cf1\insrsid10228738 \loch\af4\dbch\af31505\hich\f4 \u8232\'5f}{\field{*\fldinst {\rtlch\fcs1 \af4 \ltrch\fcs0 \f4\cf1\insrsid10228738 \hich\af4\dbch\af31505\loch\f4 HYPERLINK "https://archive.org/randomURL2?fref=grp_mmbr_list"}{

It seems like it's pulling the whole line and not just the URL. Any help with configuring the 'end of line' parameter would be very much appreciated.

Solved

grep -Eo '\"https?:\/\/[^"]+\"' testsample.txt > testfile2.txt 

Solution

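  • To extract all links into a new file with grep, use -P (Perl-compatible regular expressions, GNU grep) together with -o, which prints only the matched text instead of the whole line. \K drops the opening quote from the reported match, and the lookahead (?=\") requires the closing quote without including it: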

    grep -Po '\"\Khttps?:\/\/[^"]+(?=\")' testsample.txt > testfile2.txt
    

    testfile2.txt should now contain the following:

    https://archive.org/randomURL1?fref=grp_mmbr_list
    https://archive.org/randomURL2?fref=grp_mmbr_list
    

    Note: if your grep does not support the -P option, use -E (extended regular expressions) instead. Without \K and the lookahead, each extracted URL keeps its surrounding double quotes:

    grep -Eo '\"https?:\/\/[^"]+\"' testsample.txt > testfile2.txt 
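Since the question mentions Python, here is a minimal sketch of the same extraction with the re module. It is an illustration, not part of the original answer: the helper names extract_urls and extract_to_file are hypothetical, the file is assumed to be small enough to read in one go, and a capture group stands in for \K and the lookahead.

```python
import re

# The quotes are matched but only the URL between them is captured,
# which plays the role of \K and (?=") in the grep -P command.
URL_IN_QUOTES = re.compile(r'"(https?://[^"]+)"')


def extract_urls(text):
    """Return every double-quoted http(s) URL in text, in order of appearance."""
    return URL_IN_QUOTES.findall(text)


def extract_to_file(src_path, dst_path):
    """Write one URL per line to dst_path and return the list of URLs.

    errors="replace" is an assumption: it tolerates bytes that are not
    valid UTF-8 instead of raising while reading the source file.
    """
    with open(src_path, encoding="utf-8", errors="replace") as src:
        urls = extract_urls(src.read())
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.writelines(url + "\n" for url in urls)
    return urls
```

Calling extract_to_file("testsample.txt", "testfile2.txt") would mirror the grep command, using the file names from the question.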
    

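    To remove all links from the original file in place, use sed with -r (extended regular expressions) and -i (edit the file in place). The options are spelled here as GNU sed accepts them, and each quoted URL is deleted together with its quotes:

    sed -ri 's/\"https?:\/\/[^"]+\"//g' testsample.txt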
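For readers who would rather stay in Python, the in-place edit can be sketched roughly as follows. The function names strip_urls and strip_urls_in_file are hypothetical, and working on bytes is a design assumption: it avoids guessing the file's encoding, so everything outside the removed URLs is written back unchanged.

```python
import re

# A double-quoted http(s) URL, quotes included, as a bytes pattern.
QUOTED_URL = re.compile(rb'"https?://[^"]+"')


def strip_urls(data):
    """Return data (bytes) with every double-quoted http(s) URL removed."""
    return QUOTED_URL.sub(b"", data)


def strip_urls_in_file(path):
    """Rewrite the file at path without its quoted URLs; return how many were removed."""
    with open(path, "rb") as f:
        data = f.read()
    cleaned, count = QUOTED_URL.subn(b"", data)
    with open(path, "wb") as f:
        f.write(cleaned)
    return count
```

Text read from a file never goes through string-literal escape processing, so the backslashes in the RTF-style markup are preserved. Unlike sed -i, this sketch overwrites the file directly, so keeping a backup copy is advisable.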
    

    Alternative solution using Python's re.sub() function (with a test string instead of a file):

    import re
    
    # Raw string (r'''...''') so that backslash sequences in the sample,
    # such as \a, \f and \u8232, stay literal instead of becoming escapes.
    s = r'''
    738 \loch\af4\dbch\af31505\hich\f4 \u8232\'5f}{\field{*\fldinst {\rtlch\fcs1 \af4 \ltrch\fcs0 \f4\cf1\insrsid10228738 \hich\af4\dbch\af31505\loch\f4 HYPERLINK "https://archive.org/randomURL1?fref=grp_mmbr_list"}{ \rtlch\fcs1 \af4 \ltrch\fcs0 \f4\cf1\insrsid10228738 {*\datafield 00d0c9ea79f9bace118c8200aa004ba90b0200000003000000e0c9ea79f9bace118c8200aa004ba90b31505\hich\f4 \u8232\'5f}{\field{*\fldinst {\rtlch\fcs1 \af4 \ltrch\fcs0 \f4\cf1\insrsid10228738 \hich\af4\dbch\af31505\loch\f4 HYPERLINK "https://archive.org/randomURL2?fref=grp_mmbr_list"}{\rtlch\fcs1 \af4 \ltrch\fcs0 \f4\cf1\insrsid10228738 {*
    '''
    
    # Remove every double-quoted http(s) URL, quotes included.
    result = re.sub(r'\"https?:\/\/[^"]+\"', '', s)
    print(repr(result))
    

    The output:

    "\n738 \\loch\\af4\\dbch\\af31505\\hich\\f4 \\u8232\\'5f}{\\field{*\\fldinst {\\rtlch\\fcs1 \\af4 \\ltrch\\fcs0 \\f4\\cf1\\insrsid10228738 \\hich\\af4\\dbch\\af31505\\loch\\f4 HYPERLINK }{ \\rtlch\\fcs1 \\af4 \\ltrch\\fcs0 \\f4\\cf1\\insrsid10228738 {*\\datafield 00d0c9ea79f9bace118c8200aa004ba90b0200000003000000e0c9ea79f9bace118c8200aa004ba90b31505\\hich\\f4 \\u8232\\'5f}{\\field{*\\fldinst {\\rtlch\\fcs1 \\af4 \\ltrch\\fcs0 \\f4\\cf1\\insrsid10228738 \\hich\\af4\\dbch\\af31505\\loch\\f4 HYPERLINK }{\\rtlch\\fcs1 \\af4 \\ltrch\\fcs0 \\f4\\cf1\\insrsid10228738 {*\n"