c++ algorithm random prng

How many random number generators should I use?


If different classes of an application need to extract one or more random numbers, where should a random number generator be initialized in order to produce good random sequences?

In particular, I need to build some decision trees in order to train a random forest. The construction of each decision tree involves the following steps:

  1. The dataset (organized on multiple rows of data) is loaded.
  2. Some rows in this dataset are randomly selected in order to build a new dataset. This new dataset will be gradually split during the growth of the tree.
  3. This new dataset is used in order to grow a decision tree: the creation of each node needs the random selection of a few rows of this new dataset (before creating one node, you have to randomly generate some small different subsets of this new dataset).

The three steps listed above are performed for the construction of each decision tree, so random numbers are generated many times during the procedure. For example, the second step should ensure that each decision tree is trained with a dataset slightly different from the initial one, so the random number generator should avoid generating equal datasets (or, in any case, the likelihood of this occurring should be very low).

In essence, in this procedure we can identify two sources of randomness:

  • the generation of N random datasets, each used to train a single decision tree;
  • the M random extractions from a given dataset that must be performed before each node is created.
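
To make the two sources concrete, here is a minimal C++ sketch of both kinds of draw. The function names `bootstrap_rows` and `node_subset` are hypothetical, the choice of `std::mt19937` is an assumption, and sampling with replacement in the first function is also an assumption (the description above only says rows are "randomly selected"). Both functions take the engine by reference, so the caller decides how many engines exist and who owns them.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// First source of randomness: choose the rows used to train one tree.
// Assumption: rows are drawn with replacement (a bootstrap sample),
// so the result has n_rows entries and may contain duplicates.
std::vector<std::size_t> bootstrap_rows(std::size_t n_rows, std::mt19937& rng) {
    assert(n_rows > 0);
    std::uniform_int_distribution<std::size_t> pick(0, n_rows - 1);
    std::vector<std::size_t> rows(n_rows);
    for (auto& r : rows) {
        r = pick(rng);
    }
    return rows;
}

// Second source of randomness: before creating a node, choose m distinct
// positions out of the n rows currently available at that node.
std::vector<std::size_t> node_subset(std::size_t n, std::size_t m, std::mt19937& rng) {
    assert(m <= n);
    std::vector<std::size_t> idx(n);
    std::iota(idx.begin(), idx.end(), std::size_t{0});  // 0, 1, ..., n-1
    std::shuffle(idx.begin(), idx.end(), rng);          // random permutation
    idx.resize(m);                                      // keep the first m
    return idx;
}
```

Because neither function creates its own engine, the same code works whether the forest class shares one engine with every tree or hands each tree its own.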

How many random number generators should I use? Since I have a class that implements the random forest, and another class that implements the decision tree, I thought I'd initialize a random number generator in the first class (the first source of randomness), and another random number generator in the second class (the second source of randomness). Is this correct?

In general, what are the guidelines for choosing the correct number of pseudo-random number generators?


Solution

  • It depends on how repeatable you need the sequence to be. For example, if you can't guarantee the order in which the rand() calls are made, but need to generate the same sequence each time for testing, then you need a separate seed/generator for each of these sources of randomness.

    If you don't care about repeatability, then just have one generator with one seed, and let it run.
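
    As a rough sketch of the repeatable case, assuming the C++11 `<random>` engines are used instead of rand(): each tree can get its own engine, derived from one master seed plus the tree's index, so a tree's random sequence does not depend on the order in which other trees draw numbers. The helper name `make_tree_rng` is hypothetical, and `std::seed_seq` only mixes the inputs; it does not formally guarantee that the resulting streams are statistically independent.

```cpp
#include <cstdint>
#include <random>

// Hypothetical helper: build the engine for one tree from a master seed
// and the tree's index. The same (master_seed, tree_index) pair always
// yields an engine in the same state, regardless of call order elsewhere.
std::mt19937 make_tree_rng(std::uint32_t master_seed, std::uint32_t tree_index) {
    std::seed_seq seq{master_seed, tree_index};  // mixes both values into the seed material
    return std::mt19937(seq);
}
```

    In the non-repeatable case, a single engine seeded once (for example from `std::random_device`) and passed by reference to every class would play the role of the "one generator, one seed" described above.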