I am having trouble understanding the RCF algorithm, particularly what kind of data it expects and what pre-processing should be done beforehand. For example, I have the following data/features (with example values) for about 500K records:
The results of my RCF model (trained on 500K records with 57 features: amount, 30 dummied countries, and 26 dummied categories) are extremely focused on the amount feature (e.g., all anomalies are above approx. 1000.00, with absolutely no variation based on country or category).
I also normalized the amount field, and those results are not much stronger. In fact, it's safe to say the results are terrible and I am clearly missing something here.
Overall, I am looking for some guidance on getting the features right (again: 1 amount field and 2 categorical fields, dummied to 1s and 0s, resulting in about 57 fields). I'm wondering if I would be better off with something like k-means.
EDIT: Some context here... I am wondering:
1) Weighting - Is there a way to give more weight to certain variables (i.e., one of the categorical variables is more important than the other)? For example, I am using Country and Category as key attributes and want to weight Category more heavily than Country.
2) Context - How can I ensure outliers are considered in the context of their peers (the categorical data)? For example, a transaction of $5000 for an "airfare" expense is not an outlier for that category, but would be for any other. I could create N separate models, but that would get messy and cumbersome, right?
I looked through most of the available documentation (https://docs.aws.amazon.com/sagemaker/latest/dg/rcf_how-it-works.html) and cannot find anything that addresses this.
Thank you so much for your help in advance!
EDIT: Not sure it's critical at this point, where I don't even have semi-reasonable results, but I have used the following hyperparameters:
num_samples_per_tree=256,
num_trees=100
I have never used Amazon RCF, but in general tree-based models do not perform particularly well with one-hot encoding (or dummy encoding). I would instead use a numeric encoding (assigning integers from 1 to len(categories)) or a binary encoder (the same idea, but with binary digits). This should allow the trees to make more meaningful splits on those variables.
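A minimal pandas sketch of what I mean by numeric encoding, using hypothetical column names standing in for your amount/country/category fields:

```python
import pandas as pd

# Hypothetical sample mimicking the fields in the question
df = pd.DataFrame({
    "amount": [120.0, 5000.0, 43.5],
    "country": ["US", "DE", "US"],
    "category": ["airfare", "airfare", "meals"],
})

# Numeric encoding: one integer column per categorical field,
# instead of ~56 dummy columns
for col in ["country", "category"]:
    df[col + "_code"] = df[col].astype("category").cat.codes

# 3 features total instead of 57
features = df[["amount", "country_code", "category_code"]]
```

This collapses your ~57 columns down to 3, so the amount feature no longer competes against 56 near-constant binary columns when the trees pick splits.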
As for hyperparameters, it is hard to say: num_samples_per_tree depends on the ratio of outliers you expect to have, while num_trees affects the amount of data in each partition, and therefore the size of each individual tree, so it depends on the size of your dataset.
Try changing these things, and if you see no improvement you can try different approaches. Honestly, I would suggest DBSCAN over k-means, but to my knowledge both require defining some distance or measure between your points, which is not trivial since you have a mix of categorical and numeric variables.
EDIT:
1 - No, I don't think there's a way to weight features in RCF; as far as I know, there usually isn't in any tree-based algorithm. However, if you use distance-based methods (hierarchical clustering, k-means, etc.), you can define your own distance metric that weights your features differently.
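To make the weighted-metric idea concrete, here is an untested sketch with hierarchical clustering, where a category mismatch is (arbitrarily) penalized 3x more than a country mismatch. The weights and toy data are placeholders you would tune yourself:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

# Toy rows: [scaled amount, country code, category code]
X = np.array([
    [0.1, 0, 0],
    [0.1, 1, 0],   # different country, same category
    [0.1, 0, 2],   # same country, different category
])

# Hypothetical weights: Category counts 3x as much as Country
W_AMOUNT, W_COUNTRY, W_CATEGORY = 1.0, 1.0, 3.0

def weighted_distance(a, b):
    return (W_AMOUNT * abs(a[0] - b[0])
            + W_COUNTRY * (a[1] != b[1])
            + W_CATEGORY * (a[2] != b[2]))

D = pdist(X, metric=weighted_distance)
Z = linkage(D, method="average")
labels = fcluster(Z, t=2.0, criterion="distance")
```

With these weights, the country mismatch (distance 1.0) stays inside the cluster while the category mismatch (distance 3.0) gets split out, which is exactly the kind of control you asked about.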
2 - Well, that's what the algorithm is for: it is supposed to find outliers based on the joint distribution of all features, not just one. If the encoding is right, a $5000 "airfare" record should look normal among other airfare records while a $5000 "meals" record should not, without training separate models per category.
You can also try Isolation Forest if you want. It does not require any distance metric, and in my opinion it is easier to understand than RCF.
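For what it's worth, the sklearn version is a few lines; here's a sketch on synthetic data shaped like yours (small amounts plus a couple of extreme ones, integer-coded categoricals):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# 500 "normal" rows: modest amounts, 30 countries, 26 categories
normal = np.column_stack([
    rng.uniform(10, 200, 500),    # amount
    rng.integers(0, 30, 500),     # country code
    rng.integers(0, 26, 500),     # category code
])
outliers = np.array([[5000.0, 3, 7],
                     [8000.0, 12, 1]])
X = np.vstack([normal, outliers])

iso = IsolationForest(n_estimators=100, contamination=0.01, random_state=0)
labels = iso.fit_predict(X)   # -1 = anomaly, 1 = normal
```

The contamination parameter plays a role similar to the outlier ratio you would tune num_samples_per_tree around in RCF: it sets what fraction of points get flagged.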