Search code examples
pythonpandasdataframedata-wrangling

Python - Data Transformation


I am trying to convert this "dataframe" into this other "dataframe" in python:

From this:

job_posting,Job_description,rating,company_name,big_data,spark,hadoop,power bi,excel,python,azure,Founded,Location
1,blalbblablabla,3,Exson,1,0,1,,,,0,2007,US
2,blalbblablabla,4,1,,,,1,1,,0,2010,EU
3,blalbblablabla,0,Wine20,1,1,,,,1,1,2000,LA

To this:

job_posting,Job_description,rating,company_name,technologies_required,Founded,Location
1,blalbblablabla,3,Exson,big_data, hadoop,2007,US
2,blalbblablabla,4,1,power bi, excel,2010,EU
3,blalbblablabla,0,Wine20,big_data, spark, python, azure,2000,LA

This is what it could look like: enter image description here

Any help will be appreciated.

::::::::::::::::::::::::::::::::::::::::::::::::::

New revised code. @Timeless thanks!

# Create "technologies_required"
df['technologies_required'] = df.apply(lambda x: ", ".join(x.index[x.eq(1) & ~x.index.isin(['job_posting', 'for_example_rating'])]), axis=1)

# Reorder columns
new_df = df[['job_posting', 'for_example_rating', 'technologies_required']]

Now, it doesn't matter if any other column includes a number 1 in their rows.

Unfortunately I don't know if this is good or bad practice :-(


Solution

  • You can use apply to boolean-index the columns names :

    techs = df.apply(lambda x: ", ".join(x.index[x.eq(1)]), axis=1)
    #techs = df.apply(lambda x: ", ".join(x.eq(1).loc[lambda s: s].index), axis=1) #variant
    
    out = df[["job_posting"]].join(techs.rename("technologies_required"))
    

    Output :

    print(out)
    
       job_posting technologies_required
    0         5674       big_data, spark
    1        13037                 excel
    2         4377        powerbi, azure
    

    Update :

    There is a little problem in this code: If job_posting or any other column (not techs) has a value of 1, then that column is included as well.

    lcols = ["job_posting", "lob_description", "rating", "company_name"]
    rcols = ["Founded", "Location"]
    
    techs = df.drop(columns=lcols+rcols).apply(lambda x: ", ".join(x.index[x.eq(1)]), axis=1)
    
    out = pd.concat([df[lcols], techs.rename("technologies_required"), df[rcols]], axis=1)
    

    Output :

    print(out)
    
       job_posting lob_description  rating company_name           technologies_required  Founded Location
    0            1  blalbblablabla       3        Exson                big data, hadoop     2007       US
    1            2  blalbblablabla       1            1                 excel, power bi     2010       EU
    2            3  blalbblablabla       0       Wine20  azure, big data, python, spark     2000       LA