Tags: python, regex, pandas, twitter, mention

Extracting @mentions from tweets using findall in Python (giving incorrect results)


I have a CSV file that looks something like this:

text
RT @CritCareMed: New Article: Male-Predominant Plasma Transfusion Strategy for Preventing Transfusion-Related Acute Lung Injury... htp://…
#CRISPR Inversion of CTCF Sites Alters Genome Topology & Enhancer/Promoter Function in @CellCellPress htp://.co/HrjDwbm7NN
RT @gvwilson: Where's the theory for software engineering? Behind a paywall, that's where. htp://.co/1t3TymiF3M #semat #fail
RT @sciencemagazine: What’s killing off the sea stars? htp://.co/J19FnigwM9 #ecology
RT @MHendr1cks: Eve Marder describes a horror that is familiar to worm connectome gazers. htp://.co/AEqc7NOWoR via @nucAmbiguous htp://…

I want to extract all the mentions (starting with '@') from the tweet text. So far I have done this

import pandas as pd
import re

mydata = pd.read_csv("C:/Users/file.csv")
X = mydata.ix[:,:]
X=X.iloc[:,:1] #I have multiple columns so I'm selecting the first column only that is 'text'

for i in range(X.shape[0]):
    result = re.findall("(^|[^@\w])@(\w{1,25})", str(X.iloc[:i,:]))

print(result);

There are two problems here. First, str(X.iloc[:1,:]) gives me ['CritCareMed'], which is not right, as it should give me ['CellCellPress'], and str(X.iloc[:2,:]) again gives me ['CritCareMed'], which is also wrong. The final result I'm getting is:

[(' ', 'CritCareMed'), (' ', 'gvwilson'), (' ', 'sciencemagazine')]

It doesn't include the mention in the 2nd row or either of the two mentions in the last row. What I want should look something like this:

[image: desired output]

How can I achieve these results? This is just sample data; my original data has lots of tweets, so is this approach OK?


Solution

  • You can use the str.findall method to avoid the for loop. Use a negative lookbehind, (?<![@\w]), in place of (^|[^@\w]), which forms an extra capture group that you don't need in your regex:

    df['mention'] = df.text.str.findall(r'(?<![@\w])@(\w{1,25})').apply(','.join)
    df
    #                                                text   mention
    #0  RT @CritCareMed: New Article: Male-Predominant...   CritCareMed
    #1  #CRISPR Inversion of CTCF Sites Alters Genome ...   CellCellPress
    #2  RT @gvwilson: Where's the theory for software ...   gvwilson
    #3  RT @sciencemagazine: What’s killing off the se...   sciencemagazine
    #4  RT @MHendr1cks: Eve Marder describes a horror ...   MHendr1cks,nucAmbiguous
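
To see why the extra capture group matters, here is a minimal sketch using only the standard `re` module on the last sample tweet from the question. The variable names `with_group` and `with_lookbehind` are made up for illustration. With two capture groups, `re.findall` returns tuples; with the lookbehind, only the name is captured.

```python
import re

# Last sample tweet from the question, copied as-is.
tweet = ("RT @MHendr1cks: Eve Marder describes a horror that is familiar "
         "to worm connectome gazers. htp://.co/AEqc7NOWoR via @nucAmbiguous htp://…")

# Original pattern: (^|[^@\w]) is a capture group too, so findall returns tuples.
with_group = re.findall(r"(^|[^@\w])@(\w{1,25})", tweet)

# Lookbehind variant: only the name is captured, so findall returns plain strings.
with_lookbehind = re.findall(r"(?<![@\w])@(\w{1,25})", tweet)

print(with_group)       # [(' ', 'MHendr1cks'), (' ', 'nucAmbiguous')]
print(with_lookbehind)  # ['MHendr1cks', 'nucAmbiguous']

# An '@' preceded by a word character (e.g. in an email address) is skipped.
print(re.findall(r"(?<![@\w])@(\w{1,25})", "mail me at user@example.com"))  # []
```

The tuples explain the `(' ', 'CritCareMed')` shape of the result in the question: the first element is whatever the leading group matched.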
    

    Also, X.iloc[:i,:] returns a data frame, so str(X.iloc[:i,:]) gives you the printed representation of that data frame, which is very different from the text in a cell: long values are cut off with "...", so a mention near the end of a tweet is lost, and the slice :i in your loop never reaches the last row. To extract the actual string from the text column, you can use X.text.iloc[0]. A better way to iterate through a column is items (named iteritems before pandas 2.0):

    import re
    for index, s in df.text.items():
        result = re.findall(r"(?<![@\w])@(\w{1,25})", s)
        print(','.join(result))
    
    #CritCareMed
    #CellCellPress
    #gvwilson
    #sciencemagazine
    #MHendr1cks,nucAmbiguous
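
The difference between the string representation of a data frame and the contents of a cell can be checked directly. The sketch below assumes pandas is installed and rebuilds the first two sample tweets by hand instead of reading the CSV. The names `MENTION`, `table_text`, `from_table` and `from_cell` are hypothetical, and the column width is pinned to 50 so the truncation is explicit.

```python
import re
import pandas as pd

# First two sample tweets from the question, in a one-column frame.
df = pd.DataFrame({"text": [
    "RT @CritCareMed: New Article: Male-Predominant Plasma Transfusion "
    "Strategy for Preventing Transfusion-Related Acute Lung Injury... htp://…",
    "#CRISPR Inversion of CTCF Sites Alters Genome Topology & "
    "Enhancer/Promoter Function in @CellCellPress htp://.co/HrjDwbm7NN",
]})

MENTION = re.compile(r"(?<![@\w])@(\w{1,25})")

# str() of a DataFrame slice is the printed table, with long cells cut off.
with pd.option_context("display.max_colwidth", 50):
    table_text = str(df.iloc[:2, :])

from_table = MENTION.findall(table_text)         # searches the truncated table
from_cell = MENTION.findall(df["text"].iloc[1])  # searches the full tweet text

print(from_table)  # ['CritCareMed']
print(from_cell)   # ['CellCellPress']
```

The second tweet's mention sits past the truncation point, which is why the loop in the question never finds it.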