DataFrame: Creating a new column based on words in another Column


Novice programmer here seeking help. I have a DataFrame that looks like this:

       Message  
0  "Blah blah $AAPL"
1  "Blah blah $ABT"      
2  "Blah blah $amzn"     
3  "Blah blah $AMZN"
4  "Blah blah $KO"
5  "Blah blah $fb"
6  "Blah blah $GOOGL"
7  "Blah blah $BA"    
8  "Blah blah $BMY"   

My desired output is a new column containing the cashtag used in each tweet, converted to uppercase regardless of how it was written. In this example it would be:

       Message            Cashtag
0  "Blah blah $AAPL"      "$AAPL"
1  "Blah blah $ABT"       "$ABT"
2  "Blah blah $amzn"      "$AMZN"
3  "Blah blah $AMZN"      "$AMZN"
4  "Blah blah $KO"        "$KO"
5  "Blah blah $fb"        "$FB"
6  "Blah blah $GOOGL"     "$GOOGL"
7  "Blah blah $BA"        "$BA"
8  "Blah blah $BMY"       "$BMY" 

How can I achieve my desired output?


Solution

  • This will pull the first cashtag out of each message:

    df['Cashtag'] = df['Message'].str.extract(r'(\$[A-Za-z]{1,5})', expand=False)

    The {1,5} quantifier matches one to five letters after the dollar sign, so five-letter tickers such as $GOOGL are captured in full. Messages with no cashtag get NaN. Check out the docs for Series.str.extract.

    Better yet, so that you can group by cashtag later, I'd recommend also converting the result to upper case:

    df['Cashtag'] = df['Message'].str.extract(r'(\$[A-Za-z]{1,5})', expand=False).str.upper()
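
To try this end to end, here is a minimal sketch that rebuilds the sample DataFrame from the question and applies the extraction. The five-letter limit in the pattern and the `counts` variable are assumptions made for illustration; adjust the pattern if your tickers follow different rules.

```python
import pandas as pd

# Sample messages copied from the question.
df = pd.DataFrame({
    "Message": [
        "Blah blah $AAPL",
        "Blah blah $ABT",
        "Blah blah $amzn",
        "Blah blah $AMZN",
        "Blah blah $KO",
        "Blah blah $fb",
        "Blah blah $GOOGL",
        "Blah blah $BA",
        "Blah blah $BMY",
    ]
})

# Assumption: a cashtag is "$" followed by one to five ASCII letters.
pattern = r"(\$[A-Za-z]{1,5})"

# Extract the first cashtag in each message, then normalise it to upper case.
# Rows without a match are left as NaN.
df["Cashtag"] = df["Message"].str.extract(pattern, expand=False).str.upper()

# Because the case is normalised, "$amzn" and "$AMZN" land in the same group.
counts = df.groupby("Cashtag").size()

print(df)
print(counts)
```

Note that `str.extract` only returns the first match per row. If a tweet can mention several cashtags, `Series.str.findall` with the same pattern returns all of them as a list.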