Tags: python, pandas, numpy, data-science, list-comprehension

Python List Comprehension - numpy array


The shape of the NumPy array created from a list comprehension is incorrect when I loop over more than 9 rows. Please help me correct it and explain why this is happening. The code is below.

import pandas as pd
import numpy as np

sep_payment = pd.DataFrame({"Creditor":['Axis','RBL_CC','KOTAK_PL','KOTAK_CC','Cashe','SBI','HDFC_Jumbo','HDFC_CC','SCB','Tata Capital','Flex_Salary'],"Priority":[1,2,3,4,5,6,7,8,9,10,11],"Payment_Status":['Pending','Pending','Pending','Pending','Pending','Pending','Pending','Pending','Pending','Pending','Pending'],"Credit_Status":['Pending','Pending','Pending','Pending','Pending','Pending','Pending','Pending','Pending','Pending','Pending'],"Payment_Date":['-','-','-','-','-','-','-','-','-','-','-'],"Time Taken in Days":[2,5,5,2,5,2,5,5,5,5,2]})

# List comprehension Looped with range 9 NO ERRORS | Output (9, 6)
subb= sep_payment.iloc[1].to_string(index=False).split()
subb
subb2 = [sep_payment.iloc[i].to_string(index=False).split() for i in range(9)]
subb2
data= np.array(subb2)
print(data.shape)

# List comprehension Looped with range 10 ERROR in THE SHAPE printed | Output (10,)
subb= sep_payment.iloc[1].to_string(index=False).split()
subb
subb2 = [sep_payment.iloc[i].to_string(index=False).split() for i in range(10)]
subb2
data= np.array(subb2)
print(data.shape)



Solution

  • The issue is caused by the space in the Creditor value "Tata Capital" (row 10 of your data). Calling split() with no arguments splits on any whitespace, so that one name is broken into two tokens.

    In part 1:

    Your first snippet breaks each row's string into exactly 6 tokens, because none of the values in the first 9 rows contains a space. This results in a NumPy array of shape (9, 6): 9 rows and 6 columns, as expected.

    subb2 = [sep_payment.iloc[i].to_string(index=False).split() for i in range(9)]
    subb2
    
    [['Axis', '1', 'Pending', 'Pending', '-', '2'],
     ['RBL_CC', '2', 'Pending', 'Pending', '-', '5'],
     ['KOTAK_PL', '3', 'Pending', 'Pending', '-', '5'],
     ['KOTAK_CC', '4', 'Pending', 'Pending', '-', '2'],
     ['Cashe', '5', 'Pending', 'Pending', '-', '5'],
     ['SBI', '6', 'Pending', 'Pending', '-', '2'],
     ['HDFC_Jumbo', '7', 'Pending', 'Pending', '-', '5'],
     ['HDFC_CC', '8', 'Pending', 'Pending', '-', '5'],
     ['SCB', '9', 'Pending', 'Pending', '-', '5']]
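
To see why whitespace matters here, the following minimal sketch compares split() with splitlines() on a single row. The string row_text is a hypothetical stand-in for what to_string(index=False) returns for one row (one value per line); the real output may include padding, so treat this as an illustration rather than the exact text pandas produces.

```python
# Hypothetical stand-in for one row rendered as text, one value per line.
row_text = "Tata Capital\n10\nPending\nPending\n-\n5"

# split() with no arguments splits on ANY whitespace, including the
# space inside "Tata Capital", so the name becomes two tokens.
tokens = row_text.split()
print(len(tokens), tokens)

# splitlines() only splits on line breaks, so the name stays intact.
values = row_text.splitlines()
print(len(values), values)
```

The first call yields 7 tokens and the second yields 6 values, which is the difference between the rows that work and the one that does not.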
    

    In part 2:

    In the second snippet, nine of the rows still split into 6 tokens, BUT the tenth row splits into 7 because of the space in Tata Capital. When you convert this ragged list into a NumPy array, you get 10 rows as expected but no second dimension: each element of the array is a Python list object counted as a single item, so the shape is (10,).

    This is because a NumPy ndarray NEEDS to have the same number of elements along each axis, i.e. every row must have the same length.

    subb2 = [sep_payment.iloc[i].to_string(index=False).split() for i in range(10)]
    subb2
    
    [['Axis', '1', 'Pending', 'Pending', '-', '2'],
     ['RBL_CC', '2', 'Pending', 'Pending', '-', '5'],
     ['KOTAK_PL', '3', 'Pending', 'Pending', '-', '5'],
     ['KOTAK_CC', '4', 'Pending', 'Pending', '-', '2'],
     ['Cashe', '5', 'Pending', 'Pending', '-', '5'],
     ['SBI', '6', 'Pending', 'Pending', '-', '2'],
     ['HDFC_Jumbo', '7', 'Pending', 'Pending', '-', '5'],
     ['HDFC_CC', '8', 'Pending', 'Pending', '-', '5'],
     ['SCB', '9', 'Pending', 'Pending', '-', '5'],
     ['Tata', 'Capital', '10', 'Pending', 'Pending', '-', '5']] #<-- CHECK THIS ROW (7 items, not 6)
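
A quick way to confirm this kind of problem is to check the length of every row before building the array. The sketch below uses a three-row subset of the lists above (the names regular and ragged are illustrative). Note that dtype=object is passed explicitly for the ragged case: older NumPy versions built the 1-D object array implicitly, while NumPy 1.24 and later raise a ValueError for ragged input unless dtype=object is given.

```python
import numpy as np

# Two rows with 6 tokens each: a regular, rectangular list of lists.
regular = [
    ['HDFC_CC', '8', 'Pending', 'Pending', '-', '5'],
    ['SCB', '9', 'Pending', 'Pending', '-', '5'],
]

# Adding the row that was split into 7 tokens makes the list ragged.
ragged = regular + [['Tata', 'Capital', '10', 'Pending', 'Pending', '-', '5']]

# Row lengths reveal the odd one out.
print([len(row) for row in ragged])

# Rectangular input gives a proper 2-D array.
print(np.array(regular).shape)

# Ragged input can only be stored as a 1-D array of list objects.
# dtype=object is required on NumPy >= 1.24; without it a ValueError is raised.
print(np.array(ragged, dtype=object).shape)
```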
    

    Solution:

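    Use df.to_numpy() directly to get the NumPy array, instead of converting each row to a string and splitting it. This keeps each cell as a single value regardless of any spaces it contains.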

    data = sep_payment.to_numpy()
    data
    
    # array([['Axis', 1, 'Pending', 'Pending', '-', 2],
    #        ['RBL_CC', 2, 'Pending', 'Pending', '-', 5],
    #        ['KOTAK_PL', 3, 'Pending', 'Pending', '-', 5],
    #        ['KOTAK_CC', 4, 'Pending', 'Pending', '-', 2],
    #        ['Cashe', 5, 'Pending', 'Pending', '-', 5],
    #        ['SBI', 6, 'Pending', 'Pending', '-', 2],
    #        ['HDFC_Jumbo', 7, 'Pending', 'Pending', '-', 5],
    #        ['HDFC_CC', 8, 'Pending', 'Pending', '-', 5],
    #        ['SCB', 9, 'Pending', 'Pending', '-', 5],
    #        ['Tata Capital', 10, 'Pending', 'Pending', '-', 5],
    #        ['Flex_Salary', 11, 'Pending', 'Pending', '-', 2]], dtype=object)
    
    data.shape
    #(11, 6)
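
If you specifically need every cell as a string, as your split-based version produced, one possible approach is to convert the DataFrame with astype(str) before calling to_numpy(). This is a sketch on a hypothetical two-row subset of your data (the name small is illustrative), not a claim about what your downstream code requires.

```python
import pandas as pd

# Hypothetical two-row subset of the original DataFrame.
small = pd.DataFrame({
    "Creditor": ["SCB", "Tata Capital"],
    "Priority": [9, 10],
    "Payment_Status": ["Pending", "Pending"],
    "Credit_Status": ["Pending", "Pending"],
    "Payment_Date": ["-", "-"],
    "Time Taken in Days": [5, 5],
})

# Convert every cell to a string, then export as a 2-D array.
# No splitting is involved, so "Tata Capital" stays in one cell.
as_text = small.astype(str).to_numpy()

print(as_text.shape)
print(as_text[1])
```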