dataset benchmarking fuzzy

Datasets for benchmarking fuzzy clustering methods with millions of data points


We want to test the performance of some fuzzy clustering algorithms that our collaborators have developed. We are interested in large 2D datasets on which we could benchmark these algorithms. Do you know where one can find such datasets?


Solution

  • One excellent dataset is the one provided by this website. StackExchange provides an anonymized dump of all publicly available data found on their sites here: https://archive.org/details/stackexchange

    You can read about the data schema here: https://meta.stackexchange.com/questions/2677/database-schema-documentation-for-the-public-data-dump-and-sede

    I have a copy of the data from a year ago, and it has over 16 million records for StackOverflow.com alone; the dump covers all of their sites.
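
Since the question asks for 2D data, here is a minimal sketch of how one might stream 2D points out of one of the dump's XML files without loading it all into memory. It assumes the file consists of `<row ... />` elements carrying numeric attributes, and that `Score` and `ViewCount` are a usable pair of attributes (check the schema link above for what each file really contains). The function name `iter_points` and the inline sample are hypothetical and only for illustration.

```python
import io
import xml.etree.ElementTree as ET


def iter_points(source, x_attr="Score", y_attr="ViewCount"):
    """Yield (x, y) float pairs from <row .../> elements.

    Rows missing either attribute are skipped. The attribute names are
    assumptions; pick whichever numeric pair suits the benchmark.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "row":
            x = elem.get(x_attr)
            y = elem.get(y_attr)
            if x is not None and y is not None:
                yield float(x), float(y)
        # Discard the attributes we have already read to limit memory use.
        elem.clear()


# Tiny made-up sample in the assumed row format (not real dump data).
SAMPLE = (
    b'<posts>'
    b'<row Id="1" Score="5" ViewCount="100" />'
    b'<row Id="2" Score="3" />'
    b'<row Id="3" Score="-1" ViewCount="42" />'
    b'</posts>'
)

points = list(iter_points(io.BytesIO(SAMPLE)))
print(points)
```

For the real dump you would pass a path to an extracted XML file instead of the in-memory sample, and feed the resulting pairs to the clustering code in batches.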