dataset, data-visualization, sampling

How can I identify a subset of a dataset which is indicative of the dataset as a whole?


I have two datasets: one with a list of businesses and one with a list of reviews for those businesses (the business ID is the key linking them). The review dataset is large, with ~4 million entries, and each business may have anywhere from 0 to hundreds of reviews. I want to create a word cloud or unique-word counter for each business, but there are too many reviews for my computer to handle locally. Is there a way to make the dataset smaller without compromising its integrity? For example, can I choose a maximum of 50 reviews for each business?


Solution

  • What you are looking for is a representative sample, that is, one drawn without selection bias. There are several methods for selecting such a sample; see https://humansofdata.atlan.com/2017/07/6-sampling-techniques-choose-representative-subset/ for some ideas.
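
As one possible illustration of the "maximum of 50 reviews per business" idea from the question, here is a minimal sketch in Python that draws a uniform random sample per business using reservoir sampling. It is not taken from the linked article. The function name `sample_reviews_per_business`, the `(business_id, review_text)` input shape, and the `seed` parameter are assumptions made for the example. Because it reads reviews one at a time, the full review dataset never has to fit in memory.

```python
import random
from collections import defaultdict


def sample_reviews_per_business(reviews, max_per_business=50, seed=0):
    """Keep a uniform random sample of at most `max_per_business` reviews per business.

    `reviews` is any iterable of (business_id, review_text) pairs (an assumed
    shape), so it can be a generator reading a file row by row.
    """
    rng = random.Random(seed)
    samples = defaultdict(list)  # business_id -> sampled review texts
    seen = defaultdict(int)      # business_id -> reviews seen so far

    for business_id, text in reviews:
        seen[business_id] += 1
        bucket = samples[business_id]
        if len(bucket) < max_per_business:
            # Still under the cap: keep every review.
            bucket.append(text)
        else:
            # Reservoir sampling (Algorithm R): the new review replaces a
            # random slot with probability max_per_business / seen[business_id].
            j = rng.randrange(seen[business_id])
            if j < max_per_business:
                bucket[j] = text
    return dict(samples)


# Tiny made-up demo: one business with 200 reviews, one with a single review.
demo = [("b1", f"review {i}") for i in range(200)] + [("b2", "only review")]
sampled = sample_reviews_per_business(demo, max_per_business=50)
print(len(sampled["b1"]), len(sampled["b2"]))  # prints: 50 1
```

Businesses with fewer reviews than the cap keep all of them, and businesses with no reviews simply do not appear in the result. Since the word clouds are built per business, capping each business separately should not bias any individual cloud, though the sampled set as a whole no longer reflects how reviews are distributed across businesses.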