How can I identify a subset of a dataset which is indicative of the dataset as a whole?

I have two datasets: one with a list of businesses and one with a list of reviews for those businesses (the business ID is the key that links them). The review dataset is large, with ~4 million entries, and each business may have as few as 0 reviews or as many as several hundred. I want to create a word cloud or unique word counter for each business, but there are too many reviews for my computer to handle locally. Is there a way to make the dataset smaller without compromising its integrity? For example, can I choose a maximum of 50 reviews for each business?

Solution

What you are looking for is a representative sample, that is, one drawn without selection bias. There are several methods for selecting such a sample; see https://humansofdata.atlan.com/2017/07/6-sampling-techniques-choose-representative-subset/ for some ideas.
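
As one possible illustration of the "at most 50 reviews per business" idea from the question, here is a minimal sketch that draws a uniform random sample per business using reservoir sampling. It is only one way to do this, not the only one. The function name `sample_reviews_per_business`, the `(business_id, review_text)` pair layout, and the demo data are assumptions made for the example and do not come from your dataset. It uses only the standard library and reads the reviews one at a time, so the full review file never has to sit in memory.

```python
import random
from collections import defaultdict


def sample_reviews_per_business(reviews, max_per_business=50, seed=0):
    """Keep a uniform random sample of at most max_per_business reviews per business.

    `reviews` is any iterable of (business_id, review_text) pairs (an assumed
    layout), so it can be a generator that streams rows from disk.
    """
    rng = random.Random(seed)      # fixed seed so the sample is reproducible
    samples = defaultdict(list)    # business_id -> sampled review texts
    seen = defaultdict(int)        # business_id -> number of reviews seen so far

    for business_id, text in reviews:
        seen[business_id] += 1
        bucket = samples[business_id]
        if len(bucket) < max_per_business:
            # Still under the cap: keep every review.
            bucket.append(text)
        else:
            # Reservoir sampling: the n-th review replaces a random kept one
            # with probability max_per_business / n.
            j = rng.randrange(seen[business_id])
            if j < max_per_business:
                bucket[j] = text
    return dict(samples)


# Made-up demo data: one business with 200 reviews, one with a single review.
demo_reviews = [("b1", f"review {i}") for i in range(200)] + [("b2", "only review")]
sampled = sample_reviews_per_business(demo_reviews, max_per_business=50)
```

Businesses with fewer reviews than the cap keep all of them, and businesses with no reviews simply do not appear in the result. Note that capping per business keeps each business's own sample unbiased, but if you later pool the samples across businesses, heavily reviewed businesses will be under-weighted relative to the full dataset.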
