Tags: python, excel, pandas, csv, data-cleaning

How to delete duplicated elements within the cells of a CSV column


I need help deleting duplicated elements in the language column, i.e. values that appear more than once within the same cell, using Python.

Here is my CSV data as a DataFrame:

import pandas as pd

f = pd.DataFrame({'Movie': ['name1','name2','name3','name4'],
                  'Year': ['1905', '1905','1906','1907'],
                  'Id': ['tt0283985', 'tt0283986','tt0284043','tt3402904'],
                  'language':['Mandarin,Mandarin','Mandarin,Cantonese,Mandarin','Mandarin,Cantonese','Cantonese,Cantonese']})

f looks like this:

   Movie  Year         Id   language
0  name1  1905  tt0283985  Mandarin,Mandarin
1  name2  1905  tt0283986  Mandarin,Cantonese,Mandarin
2  name3  1906  tt0284043  Mandarin,Cantonese
3  name4  1907  tt3402904  Cantonese,Cantonese

And the result should be like this:

   Movie  Year         Id             language
0  name1  1905  tt0283985            Mandarin
1  name2  1905  tt0283986            Mandarin,Cantonese
2  name3  1906  tt0284043            Mandarin,Cantonese
3  name4  1907  tt3402904            Cantonese

I am having trouble writing a function to delete the duplicated values in the language column. Thanks in advance!


Solution

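  • Split each cell on the commas, drop the repeated items, and join what is left. Use dict.fromkeys rather than set for the deduplication: a set does not guarantee order, so it could return Cantonese,Mandarin for row 1, while dict.fromkeys keeps each item in first-seen order:

    f['language'].str.split(',').map(lambda x: ','.join(dict.fromkeys(x)))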
    

    Output:

    0              Mandarin
    1    Mandarin,Cantonese
    2    Mandarin,Cantonese
    3             Cantonese
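
For reference, here is a self-contained sketch of the same idea that also writes the result back into the DataFrame. The helper name dedupe_cell and its sep parameter are made up for illustration and are not part of pandas. It assumes that items should keep their first-seen order, that whitespace around an item can be stripped, and that non-string cells (such as NaN for a missing value) should be left untouched.

```python
import pandas as pd

f = pd.DataFrame({
    'Movie': ['name1', 'name2', 'name3', 'name4'],
    'Year': ['1905', '1905', '1906', '1907'],
    'Id': ['tt0283985', 'tt0283986', 'tt0284043', 'tt3402904'],
    'language': ['Mandarin,Mandarin', 'Mandarin,Cantonese,Mandarin',
                 'Mandarin,Cantonese', 'Cantonese,Cantonese'],
})


def dedupe_cell(cell, sep=','):
    """Hypothetical helper: drop repeated items in a delimited string."""
    # Assumption: non-string cells (e.g. NaN) are returned unchanged.
    if not isinstance(cell, str):
        return cell
    # Assumption: surrounding whitespace is not significant.
    items = [item.strip() for item in cell.split(sep)]
    # dict.fromkeys keeps only the first occurrence of each key, in order.
    return sep.join(dict.fromkeys(items))


# Assign the cleaned column back so that f itself is updated.
f['language'] = f['language'].map(dedupe_cell)
print(f['language'].tolist())
# ['Mandarin', 'Mandarin,Cantonese', 'Mandarin,Cantonese', 'Cantonese']
```

A named function is a little longer than the one-line lambda, but it leaves room for handling missing values and stray spaces, which often turn up in real CSV files.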