How to correctly split column by delimiter?


I have to split the column game on its dash delimiter.

df: 
                                                 game                               home_team                       away_team
0                         Bordj Menail – Hamra Annaba                            Bordj Menail                    Hamra Annaba
1                                  CA Batna – US Souf                                CA Batna                         US Souf
2                                     Eulma – Ouargla                                   Eulma                         Ouargla
1860                            Bella Vista – Miramar                             Bella Vista                         Miramar
1861                 U.A.N.L.- Tigres W – Club Leon W                      U.A.N.L.- Tigres W                     Club Leon W
1862                               Queretaro – Toluca                               Queretaro                          Toluca
0                           Sport Recife - Imperatriz               Sport Recife - Imperatriz                            None
1                                    ABC - America RN                        ABC - America RN                            None
2                           Frei Paulistano - Nautico               Frei Paulistano - Nautico                            None
3                             Botafogo PB - Confianca                 Botafogo PB - Confianca                            None

I am trying

df[team_cols] = df['game'].str.split(' – ', expand=True, n=1)

But this only works for some of the rows, as shown above.

When I open the file in Excel, I can see that the delimiter "appears" differently

e.g.

Sport Recife â Sport Recife ## Here delimiter is a special character?
Bordj Menail – Hamra Annaba

How can I split the values? And what is this behaviour?


Solution

  • It is unclear exactly what you mean, but I would split on a regular expression that matches either dash variant:

    import pandas as pd
    
    data = {
        'game': [
            'Bordj Menail – Hamra Annaba',
            'CA Batna – US Souf',
            'Eulma – Ouargla',
            'Bella Vista – Miramar',
            'U.A.N.L.- Tigres W – Club Leon W',
            'Queretaro – Toluca',
            'Sport Recife - Imperatriz',
            'ABC - America RN',
            'Frei Paulistano - Nautico',
            'Botafogo PB - Confianca'
        ]
    }
    
    df = pd.DataFrame(data)
    
    # Split the game column on a dash that has whitespace on both sides,
    # so the hyphen inside 'U.A.N.L.- Tigres W' is not treated as a delimiter
    pattern = r'\s+[-–â]\s+'
    team_cols = ['home_team', 'away_team']
    df[team_cols] = df['game'].str.split(pattern, expand=True, n=1)
    
    # Print the result
    print(df)

    which gives

                                   game           home_team     away_team
    0       Bordj Menail – Hamra Annaba        Bordj Menail  Hamra Annaba
    1                CA Batna – US Souf            CA Batna       US Souf
    2                   Eulma – Ouargla               Eulma       Ouargla
    3             Bella Vista – Miramar         Bella Vista       Miramar
    4  U.A.N.L.- Tigres W – Club Leon W  U.A.N.L.- Tigres W   Club Leon W
    5                Queretaro – Toluca           Queretaro        Toluca
    6         Sport Recife - Imperatriz        Sport Recife    Imperatriz
    7                  ABC - America RN                 ABC    America RN
    8         Frei Paulistano - Nautico     Frei Paulistano       Nautico
    9           Botafogo PB - Confianca         Botafogo PB     Confianca
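
As for the behaviour itself: the rows that fail to split most likely use a different dash character from the one in your pattern, even though the two look alike on screen. The sketch below is one way to check this, not a definitive diagnosis. It lists the code point of every punctuation character in the strings; `delimiter_report` is a hypothetical helper name, and the two sample strings stand in for your real column. The last lines show one plausible source of the "â" seen in Excel, assuming the file is UTF-8 and Excel decoded it as Windows-1252.

```python
import unicodedata
from collections import Counter


def delimiter_report(values):
    """Count every character that is neither alphanumeric nor whitespace.

    `values` is any iterable of strings (for example df['game']).
    Returns a Counter keyed by (character, code point, Unicode name).
    """
    counts = Counter()
    for text in values:
        for ch in text:
            if not ch.isalnum() and not ch.isspace():
                key = (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
                counts[key] += 1
    return counts


# Sample data standing in for the real column
games = [
    "Bordj Menail \u2013 Hamra Annaba",  # en dash
    "Sport Recife - Imperatriz",         # ASCII hyphen-minus
]
for (ch, code_point, name), n in delimiter_report(games).items():
    print(repr(ch), code_point, name, n)

# Assumed scenario: the UTF-8 bytes of an en dash decoded as Windows-1252
garbled = "\u2013".encode("utf-8").decode("cp1252")
print(repr(garbled))  # a three-character string that starts with 'â'
```

If the report shows more than one dash-like code point, each of them needs to be in the character class of the split pattern. If the garbling only shows up in Excel and not in pandas, it is probably a display issue in Excel rather than a problem with the data.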