python, pandas, spss

Converting sav to pandas df misses last column


I am converting an SPSS .sav file into a pandas dataframe using the following code:

import pandas as pd
import savReaderWriter as spss

raw_data = spss.SavReader(filename, returnHeader = True)
raw_data_list = list(raw_data)
df = pd.DataFrame(raw_data_list)

This code works well, except that the final column is not included in the dataframe.

I am converting a huge (and very inefficient) table with 70,484 columns and 3,609 rows. All of the rows are there, but only 70,483 of the columns appear in the pandas dataframe.

What is going wrong here?


Solution

  • Check the first row of your .sav file

    Suppose you want to read data into Pandas as a dataframe, and the file has the following format, where the header row names fewer columns than each data row contains:

    a b c d
    0 1 2 3 4 5
    1 2 3 4 5 6
    

    When you read it with Pandas, you get the following dataframe:

        a b c d
    0 1 2 3 4 5
    1 2 3 4 5 6
    

    When I execute print df.columns, I get something like:

    Index([u'a', u'b', u'c', u'd'], dtype='object')
    

    And when I execute print df.iloc[0], I get:

    a  2
    b  3
    c  4
    d  5
    Name: (0, 1)
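
    The behaviour described above can be reproduced without a file on disk. The following is a minimal sketch, assuming a comma-separated version of the sample data held in a hypothetical `text` string; it is an illustration, not the asker's actual data.

```python
import io

import pandas as pd

# Hypothetical sample: the header names 4 columns, each data row has 6 fields.
text = "a,b,c,d\n0,1,2,3,4,5\n1,2,3,4,5,6\n"

df = pd.read_csv(io.StringIO(text))

# pandas treats the two unnamed leading fields as a two-level row index,
# so only four data columns remain under the names a, b, c, d.
print(df.index.nlevels)
print(df.columns.tolist())
print(df.iloc[0])
```

    Nothing is dropped here: the two extra fields are moved into the row index, which is why the first row is labelled (0, 1).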
    

    I guess you would like to have a dataframe like this:

    a b c d col1 col2
    0 1 2 3 4    5
    1 2 3 4 5    6
    

    Possible solution:

    One way to do this is to read the data twice: once with the first row (the original column names) skipped, and once reading only the column names (with all the data rows skipped).

    # file_path is a placeholder for the path to the delimited text file
    df = pd.read_csv(file_path, header=None, skiprows=1)
    columns = pd.read_csv(file_path, nrows=0).columns.tolist()
    columns
    Output
    ['a', 'b', 'c', 'd']
    

    Now find the number of missing columns and use a list comprehension to create names for them:

    num_missing_cols = len(df.columns) - len(columns)
    new_cols = ['col' + str(i+1) for i in range(num_missing_cols)]
    df.columns = columns + new_cols
    df
    
       a  b  c  d  col1  col2
    0  0  1  2  3     4     5
    1  1  2  3  4     5     6
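
    Put together, the two-pass approach might look like the sketch below. It assumes a comma-separated copy of the sample data in a hypothetical `text` string, and the helper name `read_with_padded_header` and its `prefix` parameter are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical sample: the header names 4 columns, each data row has 6 fields.
text = "a,b,c,d\n0,1,2,3,4,5\n1,2,3,4,5,6\n"


def read_with_padded_header(text, prefix="col"):
    """Read delimited text whose header is shorter than its data rows."""
    # Pass 1: skip the header line and read every data field.
    df = pd.read_csv(io.StringIO(text), header=None, skiprows=1)
    # Pass 2: read only the header names, with no data rows.
    names = pd.read_csv(io.StringIO(text), nrows=0).columns.tolist()
    # Generate names (col1, col2, ...) for the fields the header does not cover.
    extra = len(df.columns) - len(names)
    df.columns = names + [f"{prefix}{i + 1}" for i in range(extra)]
    return df


df = read_with_padded_header(text)
print(df)
```

    If the header turned out to be longer than the data rows, the assignment to df.columns would raise a ValueError, so this sketch only covers the case where the header is too short.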