
Extract strings from a list of patterned strings and convert them into a DataFrame in Python


I have a list of strings that all follow the same pattern, like this:

['"Bandcamp" (2014)\t\t\t\t\ttv-mini-series',
'"ByMySide" (2012){The Happening (#1.3)}\t\t\t\t\ttwitter-hashtag-in-title',
'"Elmira" (2014)\t\t\t\t\telmira-new-york',
'"Elmira" (2014){The Happening (#1.3)}\t\t\tfriend',
...]

I am trying to extract substrings from each line and build a data frame from them, like this:

Movie    Year Keyword
Bandcamp 2014 tv-mini-series
ByMySide 2012 twitter-hashtag-in-title
Elmira   2014 elmira-new-york
Elmira   2014 friend
...

Solution

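  • Use re.findall with three capture groups (title, year, keyword), then pass the resulting tuples to pd.DataFrame: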

    >>> import re
    >>> import pandas as pd
    >>> a
    ['"Bandcamp" (2014)\t\t\t\t\ttv-mini-series', '"ByMySide" (2012){The Happening (#1.3)}\t\t\t\t\ttwitter-hashtag-in-title', '"Elmira" (2014)\t\t\t\t\telmira-new-york', '"Elmira" (2014){The Happening (#1.3)}\t\t\tfriend']
    >>> # Groups: quoted title, year in parentheses, keyword after the run of tabs.
    >>> pattern = re.compile(r'"(\w+)" \((\d+)\).*\t{2,5}(\S+)')
    >>> data = []
    >>> for x in a:
    ...     data.append(pattern.findall(x)[0])
    ...
    >>> pd.DataFrame(data, columns=['Movie', 'Year', 'Keyword'])
          Movie  Year                   Keyword
    0  Bandcamp  2014            tv-mini-series
    1  ByMySide  2012  twitter-hashtag-in-title
    2    Elmira  2014           elmira-new-york
    3    Elmira  2014                    friend
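
The loop above raises an IndexError on any line that does not match, and \w+ only accepts titles without spaces. A minimal sketch of a more tolerant variant follows. It assumes every valid line has a double-quoted title, a four-digit year in parentheses, and a keyword after at least one tab. The names LINE_RE and parse_lines are hypothetical, not part of the original answer:

```python
import re

# Hypothetical pattern; named groups document what each part captures.
LINE_RE = re.compile(
    r'"(?P<Movie>[^"]+)"'       # title between double quotes (spaces allowed)
    r'\s+\((?P<Year>\d{4})\)'   # four-digit year in parentheses
    r'.*?'                      # optional episode info, e.g. {The Happening (#1.3)}
    r'\t+(?P<Keyword>\S+)$'     # keyword after the run of tabs
)


def parse_lines(lines):
    """Split lines into parsed records and lines that did not match."""
    records, skipped = [], []
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            # Keep unmatched lines for inspection instead of raising.
            skipped.append(line)
            continue
        record = match.groupdict()
        record["Year"] = int(record["Year"])  # store the year as a number
        records.append(record)
    return records, skipped
```

The returned records are plain dictionaries, so pd.DataFrame(records, columns=['Movie', 'Year', 'Keyword']) should build the same table as above, with Year as an integer column. Collecting skipped lines separately makes it easier to spot entries that do not follow the expected format.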