python python-3.x indexing python-unicode list-manipulation

Split unicode by character into list


I have made a program that reads a selection of names from a web scrape. The result is then turned into a Unicode string, for example:

StevensJohn:-:
WasouskiMike:-:
TimebombTim:-:
etc

Is there any way to split this string into a list with one entry per name, like this?

example_list = ["StevensJohn", "WasouskiMike", "TimebombTim"] 

This needs to be dynamic, because the number of names and the names themselves depend on what the web scrape returns.

Any input would be appreciated.

Code

results = unicode("""
Hospitality
Customer Care
Wick , John 12:00-20:00
Wick , John 10:00-17:00
Obama , Barack 06:00-14:00
Musk , Elon 07:00-15:00
Wasouski , Mike 06:30-14:30
 Production
Fries
Piper , Billie 12:00-20:00
Tennent , David 06:30-14:30
Telsa, Nikola 11:45-17:00
Beverages & Desserts in a Dual Lane Drive-thru with a split beverage cell
Timebomb , Tim 06:30-14:30
Freeman , Matt 08:00-16:00
Cool , Tre 11:45-17:00
Sausage
Prestly , Elvis 06:30-14:30
Fat , Mike 06:30-14:30
Knoxville , Johnny 06:00-14:00
Man , Wee 05:00-12:00
Heartness , Jack 09:00-16:00
Breakfast BOP
Schofield , Phillip 06:30-14:15
Burns , George 06:30-14:15
Johnson , Boris 06:30-14:30
Milliband, Edd 06:30-14:30
Trump , Donald 10:00-17:00
Biden , Joe 08:00-16:00
Tempering & Prep
Clinton , Hillary 11:00-19:00

""")

for span in results:
    results = results.replace(',', '')
    results = results.replace(" ", "")
    results = results.replace("/r","")
    results = results.replace(":-:", "\r")
    results = ''.join([i for i in results if not i.isdigit()])
    print(results)

Solution

  • Your edit reveals that this is really an XY problem. Successively stripping out small substrings will inevitably run into corner cases where a substring should be removed in some places but not in others. A common alternative approach is to use regular expressions.

    import re
    # Capture surname and forename around the comma; the trailing time range anchors the match
    matches = [''.join([m.group(1), m.group(2)]) for m in re.finditer(r"([A-Za-z']+)\s*,\s*([A-Za-z'.]+)\s+\d+:\d+-\d+:\d+", results)]
    

    Demo: https://ideone.com/1syge8

    A much better solution still is to use the structure of the surrounding HTML to extract only the specific spans you need; most modern websites rely on CSS selectors for formatting, and these are also quite useful for scraping. But since we can't see the original page from which you extracted this string, this is entirely speculative.
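
To make the regular-expression approach concrete, here is a minimal sketch run against a shortened sample of the text from the question. It assumes every name line has the shape "Surname , Forename HH:MM-HH:MM"; the names `NAME_PATTERN` and `extract_names` are made up for this illustration, and the pattern would need adjusting for hyphenated or multi-word names.

```python
import re

# Shortened sample of the scraped text from the question. Assumed format:
# "Surname , Forename HH:MM-HH:MM" lines mixed with section headings.
results = """
Hospitality
Customer Care
Wick , John 12:00-20:00
Obama , Barack 06:00-14:00
 Production
Telsa, Nikola 11:45-17:00
Beverages & Desserts in a Dual Lane Drive-thru with a split beverage cell
Timebomb , Tim 06:30-14:30
"""

# Surname, optional whitespace around the comma, forename, then a time range.
# Headings have no comma followed by a time range, so they never match.
NAME_PATTERN = re.compile(r"([A-Za-z']+)\s*,\s*([A-Za-z'.]+)\s+\d+:\d+-\d+:\d+")


def extract_names(text):
    """Return 'SurnameForename' strings in order of appearance."""
    return [m.group(1) + m.group(2) for m in NAME_PATTERN.finditer(text)]


example_list = extract_names(results)
print(example_list)
# ['WickJohn', 'ObamaBarack', 'TelsaNikola', 'TimebombTim']
```

Someone who appears on more than one shift (like "Wick , John" in the full data) will show up once per shift. If each name should appear only once, `list(dict.fromkeys(example_list))` removes the repeats while keeping the original order.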