python regex urlparse

Python: extract text from string


I am trying to extract the search text from URL requests. Not every parsed dict contains a key with the text, and when I apply {k: v[0] for k, v in parse_qs(str).items()} to the URLs I lose a lot of requests, so I tried str = urllib.unquote(u[0]) instead. After that I get strings like

смотреть лучше не бывает&clid=1955453&win=176
Jade+Jantzen&ie=utf-8&oe=utf-8&gws_rd=cr&ei=FQB0V9WbIoahsAH5zZGACg
как+скрыть+лопоухость&newwindow=1&biw=1366&bih=657&source=lnms&sa=X&sqi=2&pjf=1&ved=0ahUKEwju5cPJy83NAhUPKywKHVHXBesQ_AUICygA&dpr=1
смотреть лучше не бывает&clid=1955453&win=176
2&clid=1976874&win=85&msid=1467228292.64946.22901.24595&text=как выбрать смартфон
маскаи гейла&lr=10750&clid=1985551-210&win=213

And I want to get

смотреть лучше не бывает
Jade Jantzen
как скрыть лопоухость
смотреть лучше не бывает
как выбрать смартфон
маскаи гейла

Is there any way to extract that?


Solution

  • Just split by & and take the first part:

    txt = urllib.unquote(u[0]).split("&")[0]
    

    Also, don't use str as a variable name: it shadows the built-in type of the same name in Python.

    EDIT: Unfortunately, the line 2&clid=1976874&win=85&msid=1467228292.64946.22901.24595&text=как выбрать смартфон follows a different pattern from the others: the wanted text is the value of the text parameter at the end, not the first segment. No single split handles it together with the rest. I was tempted to use a regex that matches Cyrillic characters, but then Jade Jantzen wouldn't match. So for this one line, where the desired text is at the end, something like

    txt = urllib.unquote(u[0]).split("=")[-1]
    

    would work. Still, you haven't given any actual criteria for what counts as the desired text. As humans, we can see how to turn each of these sample strings into what you want, but without clear rules about what to match we can't provide a complete solution.

    I'm aware that some (but only some) of the lines have "+" in place of " ". That can probably be handled with .replace("+", " ").
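The pieces above can be combined into one helper. This is only a sketch under an assumed rule that the question never states: if a text= parameter is present, its value is the wanted text; otherwise the first &-separated part is. The function name extract_query_text is hypothetical. It is written for Python 3, where unquote lives in urllib.parse (the urllib.unquote used above is the Python 2 location), and it uses unquote_plus, which also turns "+" into a space.

```python
from urllib.parse import unquote_plus


def extract_query_text(fragment):
    """Return the search text from one query fragment.

    Assumed rule (not given in the question): a ``text=`` parameter wins
    if present; otherwise the first ``&``-separated part is used.
    """
    parts = fragment.split("&")
    for part in parts:
        key, sep, value = part.partition("=")
        if sep and key == "text":
            # Decode percent-escapes and turn "+" into spaces.
            return unquote_plus(value)
    # No text= parameter: fall back to the first segment.
    return unquote_plus(parts[0])


samples = [
    "смотреть лучше не бывает&clid=1955453&win=176",
    "Jade+Jantzen&ie=utf-8&oe=utf-8&gws_rd=cr&ei=FQB0V9WbIoahsAH5zZGACg",
    "2&clid=1976874&win=85&msid=1467228292.64946.22901.24595&text=как выбрать смартфон",
]
for s in samples:
    print(extract_query_text(s))
```

Splitting before decoding means an encoded "&" or "=" inside the text cannot be mistaken for a separator. If the real data follows a different rule than the one assumed here, only the loop needs to change.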