python pyspark record-linkage python-dedupe

Increase max_components variable in dedupe library


How can I increase the default value of the max_components variable?

By default, max_components is set to 30000. I need to increase this limit because every time I run deduplication on the same datasets, I get different results.

I think the total number of clusters in my data is greater than 30000.


Solution

  • Answer from GitHub

    Issue in the dedupe GitHub repository: "Increase max_components = 30000"

    If you are getting different results while using the same saved settings file, then what you are reporting is a bug. If you are getting different results after training on different training data (or even on the same training data), that is expected, because at various points dedupe uses a random sample to learn good rules.

    In either case, I doubt that max_components is related. But, if you want to change it, fork the code and change it.
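
To tell which of the two cases in the answer applies, it can help to compare the clusters from two runs directly, ignoring the order of clusters and of records within a cluster. The following is a minimal sketch, not part of dedupe: `normalize_clusters` and `compare_runs` are hypothetical helper names, and the input shape (a list of `(record_ids, scores)` pairs) is an assumption about what your clustering step returns, so adapt it if your output differs.

```python
def normalize_clusters(clustered):
    """Convert [(record_ids, scores), ...] into an order-independent form.

    Assumption: each item is a pair whose first element is an iterable of
    record IDs. Scores are ignored, only cluster membership is compared.
    """
    return frozenset(frozenset(record_ids) for record_ids, _scores in clustered)


def compare_runs(run_a, run_b):
    """Report whether two runs produced the same clusters, and what differs."""
    a = normalize_clusters(run_a)
    b = normalize_clusters(run_b)
    return {
        "identical": a == b,
        "only_in_a": a - b,  # clusters found only in the first run
        "only_in_b": b - a,  # clusters found only in the second run
    }


# Made-up outputs for illustration only.
run_1 = [((1, 2, 3), (0.9, 0.8, 0.9)), ((4, 5), (0.7, 0.7))]
run_2 = [((5, 4), (0.7, 0.7)), ((3, 1, 2), (0.9, 0.9, 0.8))]  # same clusters, reordered
run_3 = [((1, 2), (0.9, 0.9)), ((3,), (1.0,)), ((4, 5), (0.7, 0.7))]

print(compare_runs(run_1, run_2)["identical"])  # True
print(compare_runs(run_1, run_3)["identical"])  # False
```

If two runs that load the same saved settings file are not identical under this comparison, that matches the bug case described in the answer. If the runs retrain each time, differences are the expected case.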