Search code examples
pythonunix

Data Restructuring using python in unix


I am trying to split the data and rearrange the data in a CSV file. My data looks something like this

1:100011159-T-G,CDD3-597,G,G
1:10002775-GA,CDD3-597,G,G
1:100122796-C-T,CDD3-597,T,T
1:100152282-CAAA-T,CDD3-597,C,C
1:100011159-T-G,CDD3-598,G,G
1:100152282-CAAA-T,CDD3-598,C,C  

and I want a table that looks like this:

ID 1:100011159-T-G 1:10002775-GA 1:100122796-C-T 1:100152282-CAAA-T
CDD3-597 GG GG TT CC
CDD3-598 GG CC

I have written the following code:

import pandas as pd


input_file = "trail_berry.csv"
output_file = "trail_output_result.csv"

# Read the CSV file without header
df = pd.read_csv(input_file, header=None)
print(df[0].str.split(',', n=2, expand=True))
# Extract SNP Name, ID, and Alleles from the data
df[['SNP_Name', 'ID', 'Alleles']] = df[0].str.split(',', n=-1, expand=True)

# Create a new DataFrame with unique SNP_Name values as columns
result_df = pd.DataFrame(columns=df['SNP_Name'].unique(), dtype=str)

# Populate the new DataFrame with ID and Alleles data
for _, row in df.iterrows():
    result_df.at[row['ID'], row['SNP_Name']] = row['Alleles']

# Reset the index
result_df.reset_index(inplace=True)
result_df.rename(columns={'index': 'ID'}, inplace=True)

# Fill NaN values with an appropriate representation (e.g., 'NULL' or '')
result_df = result_df.fillna('NULL')

# Save the result to a new CSV file
result_df.to_csv(output_file, index=False)

# Print a message indicating that the file has been saved
print("Result has been saved to {}".format(output_file))

but this has been giving me the following error:

Traceback (most recent call last):
  File "berry_trail.py", line 11, in <module>
    df[['SNP_Name', 'ID', 'Alleles']] = df[0].str.split(',', n=-1, expand=True)
  File "/nas/longleaf/home/svennam/.local/lib/python3.5/site-packages/pandas/core/frame.py", line 3367, in __setitem__
    self._setitem_array(key, value)
  File "/nas/longleaf/home/svennam/.local/lib/python3.5/site-packages/pandas/core/frame.py", line 3389, in _setitem_array
    raise ValueError('Columns must be same length as key')

Can someone please help, I am having hard time figuring this out.Thanks in advance! ValueError: Columns must be same length as key


Solution

  • What are you trying to do is called pivoting, so use DataFrame.pivot():

    import pandas as pd
    
    df = pd.read_csv("your_file.csv", header=None)
    
    out = (
        df.pivot(index=1, columns=0, values=2)
        .rename_axis(index="ID", columns=None)
        .reset_index()
    )
    print(out)
    

    Prints:

             ID 1:100011159-T-G 1:10002775-GA 1:100122796-C-T 1:100152282-CAAA-T
    0  CDD3-597              GG            GG              TT                 CC
    1  CDD3-598              GG           NaN             NaN                 CC
    

    EDIT: With your updated input:

    import pandas as pd
    
    df = pd.read_csv("your_file.csv", header=None)
    
    df["tmp"] = df[[2, 3]].agg("".join, axis=1)  # <-- join the two columns together
    
    out = (
        df.pivot(index=1, columns=0, values="tmp")
        .rename_axis(index="ID", columns=None)
        .reset_index()
    )
    print(out)