
Create a cell array in matlab


I have a file of tweets that I have read into MATLAB using dataread, and I have stored each line in a 30x1 cell array. Is there a way to take each hashtag out, store them in their own cell array, and then find the average length of a hashtag? Any help would be greatly appreciated.


Solution

  • You have the right idea, I think, with your regexp call. I will just clarify a few things. If you want the text in every hashtag in the tweet, you would want to use regexp to search for the pound sign (#) and include every character after that, until you reach the end of the word, e.g.

    text = '#this #is a #test';
    regexpi(text,'\<#[a-z0-9_]*\>','match')
    ans = 
        '#this'    '#is'    '#test'
    

    where regexpi is a case-insensitive regexp, and the regex searches for '#' followed by any number of letters, digits, or underscores (which are, I believe, the valid hashtag characters). The 'match' option makes regexpi return the matched text itself rather than the match indices.

    If you don't want the actual hashtag in the final text, you could use regex look-behinds to return only the text. For instance:

    regexpi(text,'\<(?<=#)[a-z0-9_]*\>','match')
    ans = 
        'this'    'is'    'test'
    

    I think, technically, a hashtag must start with a letter, so this regex would return potentially invalid hashtags. It's not difficult to sort that out though.
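
    To tie this back to the original question, here is a minimal sketch of one way to collect the hashtags from a cell array of tweets and average their lengths. The variable names (tweets, tagsPerTweet, allTags, avgLength) and the sample data are made up for illustration, and the pattern assumes a hashtag is a '#' followed by a letter and then any letters, digits, or underscores, which may not match Twitter's exact rules.

    ```matlab
    % Hypothetical sample data standing in for the 30x1 cell array of tweets
    tweets = {'#this #is a #test'; 'no tags here'; 'learning #matlab today'};

    % Look-behind for '#', then require a letter, then letters/digits/underscores.
    % This is an assumed definition of a valid hashtag, not an official one.
    pattern = '(?<=#)[a-z][a-z0-9_]*';

    % With a cell array input, regexpi returns one cell of matches per tweet
    tagsPerTweet = regexpi(tweets, pattern, 'match');

    % Flatten the nested cells into a single 1-by-N cell array of hashtags
    allTags = [tagsPerTweet{:}];

    % Length of each hashtag (without the '#'), then the average
    tagLengths = cellfun(@numel, allTags);
    avgLength = mean(tagLengths);   % 4 for this sample data
    ```

    Tweets without any hashtag contribute an empty cell, which disappears when the results are concatenated. If no tweet contains a hashtag at all, mean returns NaN, so you may want to check isempty(allTags) first.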