Search code examples
pythonpandasmulti-index

Multi Index just repeats columns instead of merging them


I have a csv file that looks like this:

Val1        Val2              Val3        Val4
Row1  Row2  Row3  Row4  Row5  Row6  Row7  Row8
1     2     3     4     5     6     7     8 

I read it as df=pd.read_csv('file.csv', header=[0,1])

That produces the following:

Val1  Unnamed_1 Unnamed_2  Val2   Unnamed_3 ...
Row1  Row2      Row3       Row4  Row5       ...
1     2         3          4     5          ... 

Namely, it fills the empty headers at level 0 with "Unnamed_x". I was trying to manually correct that to get what I wanted, which is val1 spans Row1 and Row2, then Val2 rows 3, 4, and 5, etc. For that I used the following inelegant solution:

tups = []

for col in df.columns:
    if "Unnamed" not in col[0]:
        tups.append(col)
        current = col[0]
    else:
        cols = (current, col[1])
        tups.append(col)

df.columns = pd.MultiIndex.from_tuples(tups, names=['Vals', 'Rows'])

However, that simply copies the columns over, namely:

Val1  Val1  Val2  Val2  Val2  Val3  Val3  Val4
Row1  Row2  Row3  Row4  Row5  Row6  Row7  Row8
1     2     3     4     5     6     7     8 

and when I call df['Val1'], then I, expectedly, get the error that multi index have to be unique.

I am very confused because this seems a relatively straightforward thing to do, but for some reason it isn't working. I made sure to closely follow the documentation in the official pandas website. I would appreciate if someone can help me figure out how to fix this; or even better, if there is a way to directly tell pandas that the empty columns in the header should have the same value as the value to the left of them.


Solution

  • I tried a slightly different approach: reading the headers and the content and then sticking them together. No issues with the MultiIndex for me

    headers = pd.read_csv(StringIO("""Val1,,Val2,,,Val3,Val4,
    Row1,Row2,Row3,Row4,Row5,Row6,Row7,Row8
    1,2,3,4,5,6,7,8"""),header=None,nrows=2)
    
    df=pd.read_csv(StringIO("""Val1,,Val2,,,Val3,Val4,
    Row1,Row2,Row3,Row4,Row5,Row6,Row7,Row8
    1,2,3,4,5,6,7,8"""),header=None,skiprows=2)
    
    newCols = pd.MultiIndex.from_frame(headers.T.ffill())
    
    df.columns = newCols
    
    >>> df
    0 Val1      Val2           Val3 Val4
    1 Row1 Row2 Row3 Row4 Row5 Row6 Row7 Row8
    2    1    2    3    4    5    6    7    8
    >>> df['Val1']
    1 Row1 Row2
    2    1    2
    >>>