I have a text file that I want to extract links from.
The problem is that the text file is only one line with a lot of links!
Or that when I open it in Notepad it shows it in a lot files but not organized.
Sample text:
[{"participants": ["minanageh379", "xcsadc"], "conversation": [{"sender": "minanageh379", "created_at": "2019-04-12T12:51:56.560361+00:00", "media": "https://scontent-lax3-1.cdninstagram.com/vp/edddf95178aca7bf75930ab8698ee45b/5D45203B/t51.2885-15/fr/e15/s1080x1080/55823673_114448266206459_7321604432125975069_n.jpg?_nc_ht=scontent-lax3-1.cdninstagram.com&ig_cache_key=Mjg2ODYwNDMxNzU3OTI1MTE1NTAxNjQ1NTk5MDkwOTMzNzY%3D.2"}, {"sender": "minanageh379", "created_at": "2019-04-12T12:51:51.923138+00:00", "text": "sd"}, {"sender": "minanageh379", "created_at": "2019-04-12T12:51:41.689524+00:00", "text": "sdsa"}, {"sender": "xcsadc", "created_at": "2019-04-12T12:50:57.283147+00:00", "text": "👩❤️💋👩"}, {"sender": "xcsadc", "created_at": "2019-04-12T12:39:35.248517+00:00", "text": "czx"}, {"sender": "xcsadc", "created_at": "2019-04-12T12:39:34.352752+00:00", "text": "dsad"}, {"sender": "xcsadc", "created_at": "2019-04-12T12:39:30.889023+00:00", "media": "https://scontent-lax3-1.cdninstagram.com/vp/e985406d6eac06bb11c2d6052c1821a2/5D508106/t51.2885-15/e15/s640x640/56218099_577906226037731_8663356006073884002_n.jpg?_nc_ht=scontent-lax3-1.cdninstagram.com&ig_cache_key=Mjg2ODYwMjk0MjA1ODQxNzYzNjM1OTI1ODMwMjYzMTExNjg%3D.2"}, {"sender": "xcsadc", "created_at": "2019-04-12T12:38:54.823472+00:00", "text": "hi hi hi"}]}]
expected result
the updated one
{"sender": "ncccy", "created_at": "2019-01-28T17:09:29.216184+00:00", "media": "https://scontent-lax3-1.cdninstagram.com/vp/57c43d748xcasc1abf58c890c5a6df042/5D199AE8/t51.2885-15/e15/p480x480/49913269_2181952555454636_8892094125900591548_n.jpg?_nc_ht=scontent-lax3-1.cdninstagram.com&ig_cache_key=Mjg1NjgsdasdAwNjgxNTk1OTY0OTIwMTA1NTMzNDQ%3D.2"},
Try with this:
First of all, we are going to remove all characters that do not form part of a valid url plus quotes and spaces. That will remove emojis that seem to cause trouble with boost regexes on notepad++ on some circustances.
Our first replacement will be:
Search: [^a-zA-Z0-9_\-.~:\/?#\[\]@!$&'()*+,;=%"\s]
Replace by: (leave empty)
Replace all
(That previous step may not be needed on future versiones on notepad++)
After the clean up, we do the following replacement:
Search: (?i)(?:(?:(?!https?:).(?!https?:))*?"sender"\s*+:\s*+"([^"]*)"|\G)(?:.(?!"sender"\s*+:\s*+))*?(https?:.*?(?=[^a-zA-Z0-9_\-.~:\/?#\[\]@!$&'()*+,;=%]|https?:))|.*
Replacement: (?{1}\n\n\1\t\2:(?{2}\t\2)
Replace all
This should work even with "text" attributes that have several urls inside. The urls will be separated by tabulators.
So after applying the previous procedure to this data:
[{"participants": ["minanageh379", "xcsadc"], "conversation": [{"sender": "minanageh379", "created_at": "2019-04-12T12:51:56.560361+00:00", "media": "https://scontent-lax3-1.cdninstagram.com/vp/edddf95178aca7bf75930ab8698ee45b/5D45203B/t51.2885-15/fr/e15/s1080x1080/55823673_114448266206459_7321604432125975069_n.jpg?_nc_ht=scontent-lax3-1.cdninstagram.com&ig_cache_key=Mjg2ODYwNDMxNzU3OTI1MTE1NTAxNjQ1NTk5MDkwOTMzNzY%3D.2 http://foo.barhttps://bar.foo"}, {"sender": "minanageh379", "created_at": "2019-04-12T12:51:51.923138+00:00", "text": "sd"}, {"sender": "minanageh379", "created_at": "2019-04-12T12:51:41.689524+00:00", "text": "sdsa"}, {"sender": "xcsadc", "created_at": "2019-04-12T12:50:57.283147+00:00", "text": "👩❤️💋👩"}, {"sender": "xcsadc", "created_at": "2019-04-12T12:39:35.248517+00:00", "text": "czx"}, {"sender": "xcsadc", "created_at": "2019-04-12T12:39:34.352752+00:00", "text": "dsad"}, {"sender": "xcsadc", "created_at": "2019-04-12T12:39:30.889023+00:00", "media": "https://scontent-lax3-1.cdninstagram.com/vp/e985406d6eac06bb11c2d6052c1821a2/5D508106/t51.2885-15/e15/s640x640/56218099_577906226037731_8663356006073884002_n.jpg?_nc_ht=scontent-lax3-1.cdninstagram.com&ig_cache_key=Mjg2ODYwMjk0MjA1ODQxNzYzNjM1OTI1ODMwMjYzMTExNjg%3D.2"}, {"sender": "xcsadc", "created_at": "2019-04-12T12:38:54.823472+00:00", "text": "hi hi hi"}, {"sender": "no_media_no_text", "created_at": "2019-04-12T12:38:54.823472+00:00"}, {"sender": "url_inside_text", "created_at": "2019-04-12T12:38:54.823472+00:00", "text": "Hi! {check} this url: \"http://foo.bar\" another url: https://new.url.com/ yet another one: https://google.com/"}, {"sender": "ncccy", "created_at": "2019-01-28T17:09:29.216184+00:00", "media": "https://scontent-lax3-1.cdninstagram.com/vp/57c43d748xcasc1abf58c890c5a6df042/5D199AE8/t51.2885-15/e15/p480x480/49913269_2181952555454636_8892094125900591548_n.jpg?_nc_ht=scontent-lax3-1.cdninstagram.com&ig_cache_key=Mjg1NjgsdasdAwNjgxNTk1OTY0OTIwMTA1NTMzNDQ%3D.2"}, {"sender": "ny", "created_at": "2017-10-22T20:49:50.042588+00:00", "media": "https://scontent-lax3-1.cdninstagram.com/vp/19d94ea45c2102a0f7c97838ef546b93/5D14B3C3/t51.2885-15/e15/22708873_149637425772501_5029503881546039296_n.jpg?_nc_ht=scontent-lax3-1.cdninstagram.com&ig_cache_key=Mjc4MzA3MDIyMTI3NDE3Njc3NTQxNTM1NTI2MjQyMjIyMDg%3D.2"}, {"sender": "xcsadc", "created_at": "2019-04-12T12:39:35.248517+00:00", "text": "czx"}]}]
We get:
minanageh379 https://scontent-lax3-1.cdninstagram.com/vp/edddf95178aca7bf75930ab8698ee45b/5D45203B/t51.2885-15/fr/e15/s1080x1080/55823673_114448266206459_7321604432125975069_n.jpg?_nc_ht=scontent-lax3-1.cdninstagram.com&ig_cache_key=Mjg2ODYwNDMxNzU3OTI1MTE1NTAxNjQ1NTk5MDkwOTMzNzY%3D.2 http://foo.bar https://bar.foo
xcsadc https://scontent-lax3-1.cdninstagram.com/vp/e985406d6eac06bb11c2d6052c1821a2/5D508106/t51.2885-15/e15/s640x640/56218099_577906226037731_8663356006073884002_n.jpg?_nc_ht=scontent-lax3-1.cdninstagram.com&ig_cache_key=Mjg2ODYwMjk0MjA1ODQxNzYzNjM1OTI1ODMwMjYzMTExNjg%3D.2
url_inside_text http://foo.bar https://new.url.com/ https://google.com/
ncccy https://scontent-lax3-1.cdninstagram.com/vp/57c43d748xcasc1abf58c890c5a6df042/5D199AE8/t51.2885-15/e15/p480x480/49913269_2181952555454636_8892094125900591548_n.jpg?_nc_ht=scontent-lax3-1.cdninstagram.com&ig_cache_key=Mjg1NjgsdasdAwNjgxNTk1OTY0OTIwMTA1NTMzNDQ%3D.2
ny https://scontent-lax3-1.cdninstagram.com/vp/19d94ea45c2102a0f7c97838ef546b93/5D14B3C3/t51.2885-15/e15/22708873_149637425772501_5029503881546039296_n.jpg?_nc_ht=scontent-lax3-1.cdninstagram.com&ig_cache_key=Mjc4MzA3MDIyMTI3NDE3Njc3NTQxNTM1NTI2MjQyMjIyMDg%3D.2
It may happen that you may get duplicated urls if those are duplicated on the original input (on same or different attributes).
Once processed you can remove duplicates with this regex:
Search: (?i)\t(https?:\S++)(?=[^\n]+\1)
Replace by: (nothing)
Replace All