I am trying to extract the search text from URL requests. Not every parsed dict contains a key with the text, and when I apply {k: v[0] for k, v in parse_qs(str).items()} to the URLs I lose a lot of requests, so I tried str = urllib.unquote(u[0]) instead.
After that I get strings like
смотреть лучше не бывает&clid=1955453&win=176
Jade+Jantzen&ie=utf-8&oe=utf-8&gws_rd=cr&ei=FQB0V9WbIoahsAH5zZGACg
как+скрыть+лопоухость&newwindow=1&biw=1366&bih=657&source=lnms&sa=X&sqi=2&pjf=1&ved=0ahUKEwju5cPJy83NAhUPKywKHVHXBesQ_AUICygA&dpr=1
смотреть лучше не бывает&clid=1955453&win=176
2&clid=1976874&win=85&msid=1467228292.64946.22901.24595&text=как выбрать смартфон
маскаи гейла&lr=10750&clid=1985551-210&win=213
And I want to get
смотреть лучше не бывает
Jade Jantzen
как скрыть лопоухость
смотреть лучше не бывает
как выбрать смартфон
маскаи гейла
Is there any way to extract that?
Just split on & and take the first part:
txt = urllib.unquote(u[0]).split("&")[0]
Also, don't use str as a variable name: it is the name of a built-in type in Python.
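As a quick check, the one-liner can be wrapped in a small function and run on one of the sample strings. This is only a sketch: it assumes Python 3, where unquote lives in urllib.parse (the urllib.unquote call above is the Python 2 spelling), and first_part is a made-up helper name.

```python
from urllib.parse import unquote  # Python 3; in Python 2 this is urllib.unquote


def first_part(raw):
    """Percent-decode raw and return everything before the first '&'."""
    return unquote(raw).split("&")[0]


# Sample string taken from the question
print(first_part("маскаи гейла&lr=10750&clid=1985551-210&win=213"))
```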
EDIT:
Unfortunately, the line 2&clid=1976874&win=85&msid=1467228292.64946.22901.24595&text=как выбрать смартфон has a different pattern from the others, and there is no common rule that handles it together with them. I was tempted to use a regex to match Cyrillic characters, but then Jade Jantzen wouldn't match. So for this one line, where the desired text is at the end, something like
txt = urllib.unquote(u[0]).split("=")[-1]
would work. Still, you haven't given any actual criteria for the desired text. As humans we can see how to turn this specific sample into what you want, but without clear rules for what to match we can't provide a complete solution.
I'm aware that some (again, only some) of the lines have "+" in place of " ". This can be handled with .replace("+", " ").
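If it helps, the pieces above can be combined into one heuristic. The rule used here is a guess based only on the six sample lines, not a criterion the question states: prefer an explicit text= parameter when one is present, otherwise take the part before the first &. The function name extract_query_text and the key default are hypothetical, and the sketch assumes Python 3. It splits before decoding so that an encoded & inside the query does not break the split, and it uses unquote_plus, which decodes percent-escapes and turns "+" into a space in one step.

```python
from urllib.parse import unquote_plus  # Python 3


def extract_query_text(raw, key="text"):
    """Guess the search text in a raw (still encoded) query fragment.

    Assumed rule, inferred from the sample lines only:
    1. if a later part looks like key=value, return that value;
    2. otherwise return the part before the first '&'.
    """
    parts = raw.split("&")
    for part in parts[1:]:
        name, sep, value = part.partition("=")
        if sep and name == key:
            return unquote_plus(value)
    return unquote_plus(parts[0])


samples = [
    "смотреть лучше не бывает&clid=1955453&win=176",
    "Jade+Jantzen&ie=utf-8&oe=utf-8&gws_rd=cr",
    "2&clid=1976874&win=85&text=как выбрать смартфон",
]
for s in samples:
    print(extract_query_text(s))
```

Note that unquote_plus also converts a literal "+" that the user actually typed (for example in "c++") into a space if it was not percent-encoded, so this is not safe for every input.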