python-3.x pandas dataframe dummy-variable

Use get_dummies on columns with return character separated values

I have a data frame in which a column has values as if it were a list, but separated by a return character (\n) instead of a comma. I tried to use the get_dummies function as below but without success.

Is it possible to use the get_dummies function directly? Or the need to replace the return character with a comma?

# import xlsx:
parques = pd.read_excel('Tabelão.xlsx')

# get_dummies:
parques = pd.get_dummies(parques, columns = ['Atividades', 'Configuração'])

# Dataframe example:
Atividades = ['esportes\nrecreação infantil\ncontemplação', 'contemplação\nrecreação infantil\nesporte', 'contemplação\nrecreação infantil', 'contemplação\nrecreação infantil\neventos culturais']
Configuração = ['relevo plano\nriacho\nlagos\nbosque\nrede de lojas', 'beria-rio\nedificações\nesplandanadas\nrede de lojas', 'bosque\nrede de caminhos\nrecantos ', 'relevo predominantemente plano\nlago\nriacho']
Nome = ['Parque Julien Rien', 'Parque da Residência', 'Feliz Lusitânia', 'Parque Barigüi']

parques = pd.DataFrame([Nome, Atividades, Configuração])

parques = parques.T

parques.columns = ['Nome', 'Atividades', 'Configuração']

Result: columns with all values concatenated.

Solution

You are going to have to clean up your data quite a bit in order to get the get_dummies function to work properly. The best way to use get_dummies is to have tidy data so that one row is one observation. In this case I have one row being either one Acitivity or one park feature the park has. So taking your example this is what I did

# Dataframe example:
Atividades = ['esportes\nrecreação infantil\ncontemplação', 
              'contemplação\nrecreação infantil\nesporte',
              'contemplação\nrecreação infantil', 
              'contemplação\nrecreação infantil\neventos culturais']
Configuracao = ['relevo plano\nriacho\nlagos\nbosque\nrede de lojas', 
                'beria-rio\nedificações\nesplandanadas\nrede de lojas', 
                'bosque\nrede decaminhos\nrecantos ', 
                'relevo predominantemente plano\nlago\nriacho']
Nome = ['Parque Julien Rien', 'Parque da Residência', 
        'Feliz Lusitânia','Parque Barigüi']
#splits the strings on the \n symbol to create lists of attributes for each park
Atividades = [x.split('\n') for x in Atividades]
Configuracao = [x.split('\n') for x in Configuracao]

#this tidys the data so that one row is one observation which 
#makes using get_dummies easier
list_df = []
i = 0
for name in Nome:
    for y in range(len(Atividades[i])):
        list_df.append([name, Atividades[i][y]])
    for x in range(len(Configuracao[i])):
        list_df.append([name, Configuracao[i][x]])
    i += 1
#creates the dataframe from the list of lists and then turns it into a
#dummy dataframe where the park name is the index value and a column has
#a 1 or 0 if the park has that attribute
test_df = pd.DataFrame(list_df, columns=['park_name', 'attributes'])
dummies = pd.get_dummies(test_df, columns=['attributes']).groupby(['park_name']).sum()

Which gives this output cleaned up as best as i can for display here:

               beria-rio    bosque contemplação edificações esplandanadas
park_name                   
Feliz Lusitânia         0   1   1   0   0
Parque Barigüi          0   0   1   0   0
Parque Julien Rien      0   1   1   0   0
Parque da Residência    1   0   1   1   1