Search code examples
urltextcopynotepad++line

How can I extract URLs from a huge one line text file?


I have a text file that I want to extract links from.

The problem is that the text file is only one line with a lot of links!

Or that when I open it in Notepad it shows it in a lot files but not organized.

Sample text:

[{"participants": ["minanageh379", "xcsadc"], "conversation": [{"sender": "minanageh379", "created_at": "2019-04-12T12:51:56.560361+00:00", "media": "https://scontent-lax3-1.cdninstagram.com/vp/edddf95178aca7bf75930ab8698ee45b/5D45203B/t51.2885-15/fr/e15/s1080x1080/55823673_114448266206459_7321604432125975069_n.jpg?_nc_ht=scontent-lax3-1.cdninstagram.com&ig_cache_key=Mjg2ODYwNDMxNzU3OTI1MTE1NTAxNjQ1NTk5MDkwOTMzNzY%3D.2"}, {"sender": "minanageh379", "created_at": "2019-04-12T12:51:51.923138+00:00", "text": "sd"}, {"sender": "minanageh379", "created_at": "2019-04-12T12:51:41.689524+00:00", "text": "sdsa"}, {"sender": "xcsadc", "created_at": "2019-04-12T12:50:57.283147+00:00", "text": "👩‍❤️‍💋‍👩"}, {"sender": "xcsadc", "created_at": "2019-04-12T12:39:35.248517+00:00", "text": "czx"}, {"sender": "xcsadc", "created_at": "2019-04-12T12:39:34.352752+00:00", "text": "dsad"}, {"sender": "xcsadc", "created_at": "2019-04-12T12:39:30.889023+00:00", "media": "https://scontent-lax3-1.cdninstagram.com/vp/e985406d6eac06bb11c2d6052c1821a2/5D508106/t51.2885-15/e15/s640x640/56218099_577906226037731_8663356006073884002_n.jpg?_nc_ht=scontent-lax3-1.cdninstagram.com&ig_cache_key=Mjg2ODYwMjk0MjA1ODQxNzYzNjM1OTI1ODMwMjYzMTExNjg%3D.2"}, {"sender": "xcsadc", "created_at": "2019-04-12T12:38:54.823472+00:00", "text": "hi hi hi"}]}]

expected result

https://scontent-lax3-1.cdninstagram.com/vp/edddf95178aca7bf75930ab8698ee45b/5D45203B/t51.2885-15/fr/e15/s1080x1080/55823673_114448266206459_7321604432125975069_n.jpg?_nc_ht=scontent-lax3-1.cdninstagram.com&ig_cache_key=Mjg2ODYwNDMxNzU3OTI1MTE1NTAxNjQ1NTk5MDkwOTMzNzY%3D.2

https://scontent-lax3-1.cdninstagram.com/vp/e985406d6eac06bb11c2d6052c1821a2/5D508106/t51.2885-15/e15/s640x640/56218099_577906226037731_8663356006073884002_n.jpg?_nc_ht=scontent-lax3-1.cdninstagram.com&ig_cache_key=Mjg2ODYwMjk0MjA1ODQxNzYzNjM1OTI1ODMwMjYzMTExNjg%3D.2

the updated one

{"sender": "ncccy", "created_at": "2019-01-28T17:09:29.216184+00:00", "media": "https://scontent-lax3-1.cdninstagram.com/vp/57c43d748xcasc1abf58c890c5a6df042/5D199AE8/t51.2885-15/e15/p480x480/49913269_2181952555454636_8892094125900591548_n.jpg?_nc_ht=scontent-lax3-1.cdninstagram.com&ig_cache_key=Mjg1NjgsdasdAwNjgxNTk1OTY0OTIwMTA1NTMzNDQ%3D.2"},

Solution

  • Try with this:

    First of all, we are going to remove all characters that do not form part of a valid url plus quotes and spaces. That will remove emojis that seem to cause trouble with boost regexes on notepad++ on some circustances.

    Our first replacement will be:

    Search: [^a-zA-Z0-9_\-.~:\/?#\[\]@!$&'()*+,;=%"\s]

    Replace by: (leave empty)

    Replace all

    (That previous step may not be needed on future versiones on notepad++)

    After the clean up, we do the following replacement:

    Search: (?i)(?:(?:(?!https?:).(?!https?:))*?"sender"\s*+:\s*+"([^"]*)"|\G)(?:.(?!"sender"\s*+:\s*+))*?(https?:.*?(?=[^a-zA-Z0-9_\-.~:\/?#\[\]@!$&'()*+,;=%]|https?:))|.*

    Replacement: (?{1}\n\n\1\t\2:(?{2}\t\2)

    Replace all

    This should work even with "text" attributes that have several urls inside. The urls will be separated by tabulators.

    So after applying the previous procedure to this data:

    [{"participants": ["minanageh379", "xcsadc"], "conversation": [{"sender": "minanageh379", "created_at": "2019-04-12T12:51:56.560361+00:00", "media": "https://scontent-lax3-1.cdninstagram.com/vp/edddf95178aca7bf75930ab8698ee45b/5D45203B/t51.2885-15/fr/e15/s1080x1080/55823673_114448266206459_7321604432125975069_n.jpg?_nc_ht=scontent-lax3-1.cdninstagram.com&ig_cache_key=Mjg2ODYwNDMxNzU3OTI1MTE1NTAxNjQ1NTk5MDkwOTMzNzY%3D.2   http://foo.barhttps://bar.foo"}, {"sender": "minanageh379", "created_at": "2019-04-12T12:51:51.923138+00:00", "text": "sd"}, {"sender": "minanageh379", "created_at": "2019-04-12T12:51:41.689524+00:00", "text": "sdsa"}, {"sender": "xcsadc", "created_at": "2019-04-12T12:50:57.283147+00:00", "text": "👩‍❤️‍💋‍👩"}, {"sender": "xcsadc", "created_at": "2019-04-12T12:39:35.248517+00:00", "text": "czx"}, {"sender": "xcsadc", "created_at": "2019-04-12T12:39:34.352752+00:00", "text": "dsad"}, {"sender": "xcsadc", "created_at": "2019-04-12T12:39:30.889023+00:00", "media": "https://scontent-lax3-1.cdninstagram.com/vp/e985406d6eac06bb11c2d6052c1821a2/5D508106/t51.2885-15/e15/s640x640/56218099_577906226037731_8663356006073884002_n.jpg?_nc_ht=scontent-lax3-1.cdninstagram.com&ig_cache_key=Mjg2ODYwMjk0MjA1ODQxNzYzNjM1OTI1ODMwMjYzMTExNjg%3D.2"}, {"sender": "xcsadc", "created_at": "2019-04-12T12:38:54.823472+00:00", "text": "hi hi hi"}, {"sender": "no_media_no_text", "created_at": "2019-04-12T12:38:54.823472+00:00"}, {"sender": "url_inside_text", "created_at": "2019-04-12T12:38:54.823472+00:00", "text": "Hi! {check} this url: \"http://foo.bar\" another url: https://new.url.com/ yet another one: https://google.com/"}, {"sender": "ncccy", "created_at": "2019-01-28T17:09:29.216184+00:00", "media": "https://scontent-lax3-1.cdninstagram.com/vp/57c43d748xcasc1abf58c890c5a6df042/5D199AE8/t51.2885-15/e15/p480x480/49913269_2181952555454636_8892094125900591548_n.jpg?_nc_ht=scontent-lax3-1.cdninstagram.com&ig_cache_key=Mjg1NjgsdasdAwNjgxNTk1OTY0OTIwMTA1NTMzNDQ%3D.2"}, {"sender": "ny", "created_at": "2017-10-22T20:49:50.042588+00:00", "media": "https://scontent-lax3-1.cdninstagram.com/vp/19d94ea45c2102a0f7c97838ef546b93/5D14B3C3/t51.2885-15/e15/22708873_149637425772501_5029503881546039296_n.jpg?_nc_ht=scontent-lax3-1.cdninstagram.com&ig_cache_key=Mjc4MzA3MDIyMTI3NDE3Njc3NTQxNTM1NTI2MjQyMjIyMDg%3D.2"}, {"sender": "xcsadc", "created_at": "2019-04-12T12:39:35.248517+00:00", "text": "czx"}]}]
    

    We get:

    minanageh379    https://scontent-lax3-1.cdninstagram.com/vp/edddf95178aca7bf75930ab8698ee45b/5D45203B/t51.2885-15/fr/e15/s1080x1080/55823673_114448266206459_7321604432125975069_n.jpg?_nc_ht=scontent-lax3-1.cdninstagram.com&ig_cache_key=Mjg2ODYwNDMxNzU3OTI1MTE1NTAxNjQ1NTk5MDkwOTMzNzY%3D.2    http://foo.bar  https://bar.foo
    
    xcsadc  https://scontent-lax3-1.cdninstagram.com/vp/e985406d6eac06bb11c2d6052c1821a2/5D508106/t51.2885-15/e15/s640x640/56218099_577906226037731_8663356006073884002_n.jpg?_nc_ht=scontent-lax3-1.cdninstagram.com&ig_cache_key=Mjg2ODYwMjk0MjA1ODQxNzYzNjM1OTI1ODMwMjYzMTExNjg%3D.2
    
    url_inside_text http://foo.bar  https://new.url.com/    https://google.com/
    
    ncccy   https://scontent-lax3-1.cdninstagram.com/vp/57c43d748xcasc1abf58c890c5a6df042/5D199AE8/t51.2885-15/e15/p480x480/49913269_2181952555454636_8892094125900591548_n.jpg?_nc_ht=scontent-lax3-1.cdninstagram.com&ig_cache_key=Mjg1NjgsdasdAwNjgxNTk1OTY0OTIwMTA1NTMzNDQ%3D.2
    
    ny  https://scontent-lax3-1.cdninstagram.com/vp/19d94ea45c2102a0f7c97838ef546b93/5D14B3C3/t51.2885-15/e15/22708873_149637425772501_5029503881546039296_n.jpg?_nc_ht=scontent-lax3-1.cdninstagram.com&ig_cache_key=Mjc4MzA3MDIyMTI3NDE3Njc3NTQxNTM1NTI2MjQyMjIyMDg%3D.2
    

    It may happen that you may get duplicated urls if those are duplicated on the original input (on same or different attributes).

    Once processed you can remove duplicates with this regex:

    Search: (?i)\t(https?:\S++)(?=[^\n]+\1)

    Replace by: (nothing)

    Replace All