How to get the required data from a string using a regexp or any string manipulation method

I have many JSON responses from an Instagram API that contain data like this:

"bio": "🔥🔥5-yr online store 🎀 Real pictures💯💯 🎀Mirror Quality 🔝🔝1:1 🎀Whatsapp/Viber +861776345378 📥spikydudewonderland@gmail.com ✈️✈️Worldwide Shipping",

More examples:

"bio": "Девочки это наша новая страничка.Только копии Lux, искателям дешевых подделок не беспокоить. По всем вопросам viber,whatsapp +79128743333 Лианна"
"bio": "Recruitment Agents👑👑👑👑The most powerful manufacturers,we have thebest quality.📱Wechat:13255996580💜📱Whatsapp：+8618820784535
"bio": "🌸 เข้าช้อปทุกวันจ้า🌸 ซื้อกับวี้ได้ของแท้แน่นอนค่า🌸 แบรนด์อื่นสอบถามได้ค่า🌸 ดรีวิว@reviewkayasisshopp🌸 LINE ID : @kux1427k (มี @ ด้วยจ้า)

How can I extract data such as WhatsApp/Viber Тел: +79858662461 and the email address spikydudewonderland@gmail.com from these strings, using a regexp or any other string manipulation method?

I only want the contact numbers or IDs (WhatsApp, LINE, WeChat, Viber, etc.) and the email addresses.

My API call runs inside a loop, and each iteration returns a JSON response like the ones above. After that I store the data in Excel.

Some responses are entirely in English and some are in other languages, which makes the data hard to extract. How can I do this? Please help.

Solution

This regex seems to do an acceptable job:

(?i)([\w.]+@[\w.]+)|(?:(?:\b|[,/]\s*)(?:whatsapp|viber|wechat))+\b\s*[:：]?\s*([()+\d -]+\d)|\bline(?:\sid)?\s*(?:[:：]\s*)?@?(\w+)|((?:\+\d+[ -]?)?(?:\(\d+\)[ -]?)?\d[\d -]{5,}\d)

This captures emails in capture group 1, WhatsApp/Viber/WeChat numbers in group 2, LINE IDs in group 3, and any other phone-like number (not preceded by one of those app names) in group 4.
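As a rough illustration of how the four groups could be told apart in Python 3, here is a minimal sketch. The function name `extract_contacts`, the `APPS` tuple, and the `'phone'` label for group 4 are hypothetical choices made for this example; only the pattern itself comes from the answer.

```python
import re

# The four-group pattern quoted above, split across lines for readability.
CONTACT_RE = re.compile(
    r'(?i)([\w.]+@[\w.]+)'
    r'|(?:(?:\b|[,/]\s*)(?:whatsapp|viber|wechat))+\b\s*[:：]?\s*([()+\d -]+\d)'
    r'|\bline(?:\sid)?\s*(?:[:：]\s*)?@?(\w+)'
    r'|((?:\+\d+[ -]?)?(?:\(\d+\)[ -]?)?\d[\d -]{5,}\d)'
)

# App names the second alternative looks for (hypothetical helper constant).
APPS = ('whatsapp', 'viber', 'wechat')


def extract_contacts(bio):
    """Return a list of (kind, value) pairs found in one bio string."""
    found = []
    for m in CONTACT_RE.finditer(bio):
        email, app_number, line_id, bare_number = m.groups()
        if email:
            found.append(('email', email))
        elif app_number:
            # One number may follow several app names ("Whatsapp/Viber"),
            # so report it once per name that appears in the matched text.
            matched = m.group().lower()
            for app in APPS:
                if app in matched:
                    found.append((app, app_number))
        elif line_id:
            found.append(('line', line_id))
        elif bare_number:
            found.append(('phone', bare_number))
    return found


bio = ('🔥🔥5-yr online store 🎀 Real pictures💯💯 🎀Mirror Quality 🔝🔝1:1 '
       '🎀Whatsapp/Viber +861776345378 📥spikydudewonderland@gmail.com '
       '✈️✈️Worldwide Shipping')
for kind, value in extract_contacts(bio):
    print(kind, value)
```

Note that group 1 only checks for "something@something", so text that merely looks like an email may also be reported; the results are best treated as candidates to review.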

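Usage example (the pattern here is a simpler three-group variant of the regex above, without the catch-all phone number group):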

import re

text = '🔥🔥5-yr online store 🎀 Real pictures💯💯 🎀Mirror Quality 🔝🔝1:1 🎀Whatsapp/Viber +861776345378 📥spikydudewonderland@gmail.com ✈️✈️Worldwide Shipping'
pattern = r'(?i)([\w.]+@[\w.]+)|(?:(?:\b|[,/]\s*)(?:whatsapp|viber|wechat))+\b\s*[:：]?\s*(\+?\d+)|\bline(?:\sid)?\s*(?:[:：]\s*)?(@\w+)'

for mobj in re.finditer(pattern, text):
    if mobj.group(1):
        # Group 1: email address
        print('email:', mobj.group(1))
    elif mobj.group(2):
        # Group 2: a number that may follow several app names
        # (e.g. "Whatsapp/Viber"), so check the whole match for each name.
        t = mobj.group().lower()
        if 'whatsapp' in t:
            print('whatsapp:', mobj.group(2))
        if 'viber' in t:
            print('viber:', mobj.group(2))
        if 'wechat' in t:
            print('wechat:', mobj.group(2))
    elif mobj.group(3):
        # Group 3: LINE id, including the leading "@"
        print('line:', mobj.group(3))
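Since the question mentions looping over API responses and saving the results to Excel, the same idea could be wrapped up roughly as follows. This is only a sketch under assumptions: it supposes each response is a JSON object with a top-level "bio" key (the question shows only a fragment), it writes CSV, which Excel can open, rather than a native workbook, and `bio_to_rows` and the sample `responses` are made up for illustration.

```python
import csv
import io
import json
import re

# Same three-group pattern as in the usage example above.
PATTERN = re.compile(
    r'(?i)([\w.]+@[\w.]+)'
    r'|(?:(?:\b|[,/]\s*)(?:whatsapp|viber|wechat))+\b\s*[:：]?\s*(\+?\d+)'
    r'|\bline(?:\sid)?\s*(?:[:：]\s*)?(@\w+)'
)


def bio_to_rows(response_text):
    """Parse one JSON response and return (kind, value) rows for its "bio"."""
    # Assumption: "bio" sits at the top level of the JSON object.
    bio = json.loads(response_text).get('bio', '')
    rows = []
    for m in PATTERN.finditer(bio):
        if m.group(1):
            rows.append(('email', m.group(1)))
        elif m.group(2):
            names = m.group().lower()
            rows.extend((app, m.group(2))
                        for app in ('whatsapp', 'viber', 'wechat')
                        if app in names)
        elif m.group(3):
            rows.append(('line', m.group(3)))
    return rows


# Hypothetical responses standing in for what the API loop returns.
responses = [
    '{"bio": "Whatsapp/Viber +861776345378 spikydudewonderland@gmail.com"}',
    '{"bio": "LINE ID : @kux1427k"}',
]

# An in-memory buffer keeps the sketch self-contained; to write a real file,
# use open('contacts.csv', 'w', newline='', encoding='utf-8') instead.
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(['kind', 'value'])
for text in responses:
    writer.writerows(bio_to_rows(text))

print(out.getvalue())
```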

Regex explanation:

(?i)   case insensitive
([\w.]+@[\w.]+)  something that looks like an email
|      or
(?:    a list of...
   (?:\b|[,/]\s*)   each at a word boundary, or after a "," or "/"
   (?:whatsapp|viber|wechat)  ...whatsapp/viber/wechat
)+\b\s*
[:：]?\s*   possibly followed by a colon
(\+?\d+)   and of course the number
|      or
\bline(?:\sid)?\s*(?:[:：]\s*)?(@\w+)   something that looks like a line id
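The breakdown above can also live inside the pattern itself. Below is a minimal sketch using Python's `re.VERBOSE` flag, which ignores whitespace outside character classes and allows `#` comments; the name `CONTACT_RE` is just an illustrative choice, and `re.IGNORECASE` replaces the inline `(?i)`.

```python
import re

CONTACT_RE = re.compile(r'''
    ([\w.]+@[\w.]+)                  # group 1: something that looks like an email
  |
    (?:                              # a list of app names ...
        (?:\b|[,/]\s*)               #   each at a word boundary or after "," or "/"
        (?:whatsapp|viber|wechat)    #   ... whatsapp/viber/wechat
    )+\b\s*
    [:：]?\s*                        # possibly followed by a colon
    (\+?\d+)                         # group 2: the number
  |
    \bline(?:\sid)?\s*(?:[:：]\s*)?  # "line" or "line id", optional colon
    (@\w+)                           # group 3: the line id
''', re.IGNORECASE | re.VERBOSE)

sample = 'Whatsapp/Viber +861776345378 📥spikydudewonderland@gmail.com'
for m in CONTACT_RE.finditer(sample):
    print(m.groups())
```

This should behave the same as the one-line pattern in the usage example, because that pattern contains no literal spaces or `#` characters that verbose mode would reinterpret.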