I tried to build a small reproducible example of my issue, without success; it is attached at the end of the question.
I am currently reading 27 Excel files, all formatted in the same way (the columns appear to be consistent across files).
I am appending this data into a single DataFrame of 9828 rows and three columns.
One column, called "Target", should contain only floats or integers.
In practice, the cells of the Target column are read as:
None
To work out which rows of the Target column contain a float, a None, or a string, I wrote a very inefficient check that splits the DataFrame by data type.
df_t is a DataFrame with 3 columns, one of which is called Target.
That column holds 9398 non-null objects; the other two hold 9828 non-null objects each.
df_string = pd.DataFrame()
df_float = pd.DataFrame()
df_null = pd.DataFrame()
for i in range(len(df_t.Target)):
    if type(df_t.Target.loc[i]) == str:
        df_string = df_string.append(df_t.loc[i])
    if type(df_t.Target.loc[i]) == int or type(df_t.Target.loc[i]) == float:
        df_float = df_float.append(df_t.loc[i])
    else:
        df_null = df_null.append(df_t.loc[i])
I can't understand why:

1. there is a duplicate: df_string and df_null contain the same values;
2. the float DataFrame still contains nan values.

My expected output was to see the rows split cleanly into the three DataFrames by type.
The most elegant way to extract nan values was shown in this answer, but I think my issue is slightly different:
df.loc[~df.index.isin(df.dropna().index)]
I also tried to find resources and read the documentation to understand why pd.read_excel() behaves this way, but I didn't find anything useful.
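For reference, a minimal sketch of the dropna/index pattern mentioned above, on a made-up toy frame (the values here are illustrative, not my real data):

```python
import numpy as np
import pandas as pd

# toy frame standing in for df_t (values are made up)
df = pd.DataFrame({"Target": [1.5, np.nan, "abc", 3]})

# rows whose Target is NaN, via a plain boolean mask
nan_rows = df[df["Target"].isna()]

# the dropna/index pattern from the linked answer selects the same rows
same_rows = df.loc[~df.index.isin(df.dropna().index)]
```

Both select the same rows; the boolean-mask form is usually clearer when only one column matters.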
Here is the code I wrote to make the question reproducible:
import pandas as pd
import numpy as np

a = pd.Series(np.random.uniform(1.00, 100.00, 9000), name="Target")
b = pd.Series(np.full(400, np.nan), name="Target")
c = pd.Series(['None'], name="Target")
start_dict = {"Target": [0, 2, 3.5]}
df_t = pd.DataFrame(start_dict)
df_t = pd.concat([df_t, a.to_frame()], axis=0, ignore_index=True)
df_t = pd.concat([df_t, b.to_frame()], axis=0, ignore_index=True)
df_t = pd.concat([df_t, c.to_frame()], axis=0, ignore_index=True)
Here is my actual output when I print df_string and df_null:
Here, in yellow, is the reference cell in the xlsx file:
type is not safe to use here: since the columns have mixed types, pandas stores them with the object dtype for all rows. However, you can do this in another way.
Write a function that categorizes each value and returns a label. A slightly modified version of this question should do:
def categorize(x):
    # pd.isna works on a scalar; x.isna() does not exist here
    if pd.isna(x):
        return "null"
    try:
        float(x)
    except (ValueError, TypeError):
        return "string"
    return "float"
Now all you have to do is apply the function to the Target column and split on the labels (note that Series.apply takes no axis argument, and calling append inside the function would not update the outer DataFrames):
labels = df_t['Target'].apply(categorize)
df_null = df_t[labels == "null"]
df_string = df_t[labels == "string"]
df_float = df_t[labels == "float"]
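As a quick sanity check, a label-then-mask split of this kind can be exercised on a toy frame (all names and values here are illustrative):

```python
import numpy as np
import pandas as pd

def categorize(x):
    # classify a single cell as "null", "float", or "string"
    if pd.isna(x):
        return "null"
    try:
        float(x)
    except (ValueError, TypeError):
        return "string"
    return "float"

df = pd.DataFrame({"Target": [1, 2.5, np.nan, "foo"]})
labels = df["Target"].apply(categorize)

df_float = df[labels == "float"]
df_null = df[labels == "null"]
df_string = df[labels == "string"]

# no NaN leaks into the float split, and no row lands in two splits
assert df_float["Target"].notna().all()
assert len(df_float) + len(df_null) + len(df_string) == len(df)
```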