python python-3.x pandas filter pandas-loc

Loc filter and exclude null values

1. vat.loc[(vat['Sum of VAT'].isin([np.nan, 0])) &
2.        (vat['Comment'] == "Transactions 0DKK") &
3.        (vat['Type'].isin(['Bill', 'Bill Credit'])) &
4.        (vat['Maximum of Linked Invoice'].notnull()), 'Comment'] = 'Linked invoice'
5. vat[vat["Comment"] == "Linked invoice"]

Hi all,

I have a problem with the line:

(vat['Maximum of Linked Invoice'].notnull()

It seems not to be working properly when I'm trying to exclude all of the null values in the rows. In fact, it does not exclude the null values and instead, it is included in the output from the data frame. The rest of the syntax works perfectly. I have tried using different syntax but the null values are still included in the column 'Maximum of Linked Invoice'. I don't understand why it doesn't work?

Hi again,

I've done some more research and it seems that the csv file, when imported, had 62107 non-null values for the column 'Maximum of Linked Invoice', but this incorrect, when opening the csv_file and checking, it did have thousands of blanks in the rows, but why has it not been read as null values when imported? Have you seen anything like this before?

Please see the info below

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62108 entries, 0 to 62107
Data columns (total 35 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   External ID                          62108 non-null  object 
 1   Document Number                      62107 non-null  object 
 2   Transaction Number                   62107 non-null  object 
 3   Maximum of Linked Invoice            62107 non-null  object 
 4   Type                                 62107 non-null  object 
 5   Date                                 62107 non-null  object 
 6   Period                               62107 non-null  object 
 7   Terms                                62107 non-null  object 
 8   Maximum of Due Date/Receive By       50885 non-null  object 
 9   Company Name                         62107 non-null  object 
 10  Customer VAT Registration Number     62107 non-null  object 
 11  Bill to City                         62107 non-null  object 
 12  Bill to State                        62107 non-null  object 
 13  Bill to Country                      62107 non-null  object 
 14  Bill to Zip                          62107 non-null  object 
 15  Source System                        62107 non-null  object 
 16  Source System Identifier             62107 non-null  object 
 17  City                                 62107 non-null  object 
 18  State/Province                       62107 non-null  object 
 19  Country                              62107 non-null  object 
 20  Zip                                  62107 non-null  object 
 21  Currency                             62107 non-null  object 
 22  Memo (Main)                          62107 non-null  object 
 23  Maximum of GMAX Tax Code             24189 non-null  object 
 24  Maximum of NetSuite Tax Item         59815 non-null  object 
 25  Maximum of Coupa Tax Code            0 non-null      float64
 26  Maximum of External System Tax Code  0 non-null      float64
 27  Maximum of Tax Code (Consolidated)   59815 non-null  object 
 28  FOP Type                             62107 non-null  object 
 29  Sum of Assets                        60680 non-null  float64
 30  Sum of Accounts Payable              3741 non-null   float64
 31  Sum of Other Liabilities             57066 non-null  float64
 32  Sum of Income                        60290 non-null  float64
 33  Sum of Expense                       300 non-null    float64
 34  Sum of VAT                           56269 non-null  float64
dtypes: float64(8), object(27)
memory usage: 16.6+ MB

Solution

If anyone is reading this then I have found an answer. There is nothing wrong with my syntax, but the problem lies with the CSV file itself. The reason why pandas read the column 'Maximum of Linked Invoice' as 62107 non-null, is because there was a space embedded within each row in that column. The only thing I saw at first were blank rows, but this was inaccurate. So, I urge you to check the CSV file to avoid any time-consuming efforts to solve these types of tricky problems.

And this is the solution for code line 4:

(~vat['Maximum of Linked Invoice'].isin([np.nan, ' '])