Using Featuretools, I want to convert the values of a certain feature to their ranks.
That is the whole question. If anyone can help me, please answer.
First, the following code uses the rank function of pandas and displays the result. I believe this result is correct.
import pandas as pd
df = pd.DataFrame({'col1': [50, 80, 100, 80, 90, 100, 150],
                   'col2': [0.3, 0.05, 0.1, 0.1, 0.4, 0.7, 0.9]})
print(df.rank(method="dense",ascending=True))
However, when I create a custom primitive and run the following code, the results are different. Why does this happen? Please fix my code if it is wrong. Thank you very much for your help.
from featuretools.primitives import TransformPrimitive
from featuretools.variable_types import Numeric
import pandas as pd

class Rank(TransformPrimitive):
    name = 'rank'
    input_types = [Numeric]
    return_type = Numeric

    def get_function(self):
        def rank(column):
            return column.rank(method="dense", ascending=True)
        return rank
df = pd.DataFrame({'col1': [50, 80, 100, 80, 90, 100, 150],
                   'col2': [0.3, 0.05, 0.1, 0.1, 0.4, 0.7, 0.9]})
import featuretools as ft
es = ft.EntitySet(id="test_es",
                  entities=None,
                  relationships=None)
es.entity_from_dataframe(entity_id="data",
                         dataframe=df,
                         index="index",
                         variable_types=None,
                         make_index=True,
                         time_index=None,
                         secondary_time_index=None,
                         already_sorted=False)
feature_matrix, feature_defs = ft.dfs(entities=None,
                                      relationships=None,
                                      entityset=es,
                                      target_entity="data",
                                      cutoff_time=None,
                                      instance_ids=None,
                                      agg_primitives=None,
                                      trans_primitives=[Rank],
                                      groupby_trans_primitives=None,
                                      allowed_paths=None,
                                      max_depth=2,
                                      ignore_entities=None,
                                      ignore_variables=None,
                                      primitive_options=None,
                                      seed_features=None,
                                      drop_contains=None,
                                      drop_exact=None,
                                      where_primitives=None,
                                      max_features=-1,
                                      cutoff_time_in_index=False,
                                      save_progress=None,
                                      features_only=False,
                                      training_window=None,
                                      approximate=None,
                                      chunk_size=None,
                                      n_jobs=-1,
                                      dask_kwargs=None,
                                      verbose=False,
                                      return_variable_types=None,
                                      progress_callback=None,
                                      include_cutoff_time=False)
feature_matrix
Here is the result.
However, when I tried the following code, I was able to get the correct data. Why are the answers different?
import pandas as pd
df = pd.DataFrame({'col1': [50, 80, 100, 80, 90, 100, 150],
                   'col2': [0.3, 0.05, 0.1, 0.1, 0.4, 0.7, 0.9]})
print(df.rank(method="dense",ascending=True))
pd.set_option('display.max_columns', 2000)
import featuretools as ft
es = ft.EntitySet()
es.entity_from_dataframe(entity_id='data',
                         dataframe=df,
                         index='index')
fm, fd = ft.dfs(entityset=es,
                target_entity='data',
                trans_primitives=[Rank])
fm
NEW ANSWER:
Based on your updated code, the problem arises because you are setting n_jobs=-1. When you do this, Featuretools distributes the calculation of the feature matrix across multiple workers behind the scenes. To do so, it splits the dataframe into pieces and sends the pieces to the workers, each of which computes the transform feature values for its own piece only.
This creates a problem for the Rank primitive you have defined, because this primitive requires all of the data to be present to produce a correct answer. For situations like this, you need to set uses_full_entity=True when defining the primitive, which forces Featuretools to include all of the data when the primitive function is called to compute the feature values.
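The effect of splitting the data can be reproduced with pandas alone. The following is a minimal sketch, not what Featuretools does internally: the two-way split at row 4 is an arbitrary assumption chosen for illustration, and the real chunk boundaries depend on how the work is distributed.

```python
import pandas as pd

col1 = pd.Series([50, 80, 100, 80, 90, 100, 150], name="col1")

# Rank computed over the whole column at once.
full_rank = col1.rank(method="dense", ascending=True)

# Rank computed independently on each piece, standing in for separate workers.
# The split point is hypothetical; it only illustrates the effect.
chunks = [col1.iloc[:4], col1.iloc[4:]]
chunked_rank = pd.concat(
    [chunk.rank(method="dense", ascending=True) for chunk in chunks]
)

# Side-by-side comparison: each piece restarts its ranking from 1.
comparison = pd.DataFrame(
    {"col1": col1, "full": full_rank, "chunked": chunked_rank}
)
print(comparison)
```

Because each piece only sees its own rows, the ranks in the "chunked" column restart from 1 in the second piece and no longer agree with the ranks in the "full" column.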
If you update the Rank primitive definition as follows, you will get the correct answer:
class Rank(TransformPrimitive):
    name = 'rank'
    input_types = [Numeric]
    return_type = Numeric
    uses_full_entity = True

    def get_function(self):
        def rank(column):
            return column.rank(method="dense", ascending=True)
        return rank
OLD ANSWER:
In the custom primitive function you define, the parameters you pass to rank are different from the parameters you use when you call rank directly on the DataFrame.
When calling directly on the DataFrame you are using the following parameters:
.rank(method="min", ascending=False, numeric_only=True)
In the custom primitive function you are using different values:
.rank(method="dense", ascending=True)
If you update the primitive function to use the same parameters, the results you get from Featuretools should match what you get when calling rank directly on the DataFrame.
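To see how much the two parameter sets differ, here is a small pandas-only sketch that applies both to the col1 values from the question. The variable names are illustrative, and numeric_only is left out because the example works on a single numeric Series.

```python
import pandas as pd

col1 = pd.Series([50, 80, 100, 80, 90, 100, 150], name="col1")

# Parameters used inside the custom primitive.
dense_ascending = col1.rank(method="dense", ascending=True)

# Parameters used in the direct call on the DataFrame.
min_descending = col1.rank(method="min", ascending=False)

# Side-by-side comparison of the two rankings.
comparison = pd.DataFrame(
    {"col1": col1, "dense_asc": dense_ascending, "min_desc": min_descending}
)
print(comparison)
```

With method="dense" tied values share a rank and the next rank follows without a gap, while method="min" leaves a gap after ties; ascending=False additionally reverses the order, so the two columns disagree on every row.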