r, t-test, statistical-test

Choosing healthy controls from a very large dataset


I would like to run an unpaired t-test. I have a very large dataset of 500,000 participants, and only 21 of them have a disease. How can I choose my healthy controls from this large dataset?

Any thoughts would help. I am using R for the analysis.


Solution

  • You need a random sample of healthy participants that is the same size as your disease group, that is, a random sample of size 21. The sample function will help you. You may also want to reproduce the same ratio of men to women. For example, if the disease group has 10 men and 11 women, then you would sample the two genders separately: 10 healthy men and 11 healthy women.

    In short, it would be best to use sample to randomly draw, from the large pool of healthy participants, a control group that replicates the characteristics of your 21 patients.
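
    The sampling described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the data frame dat and its columns id, sex, disease and outcome are hypothetical stand-ins (simulated here, and much smaller than 500,000 rows), so substitute your own names. Only base R functions (sample, sample.int, table, t.test) are used.

```r
set.seed(42)  # fix the seed so the random draw is reproducible

# Simulated stand-in for the real data (all column names are assumptions)
n <- 5000
dat <- data.frame(
  id      = seq_len(n),
  sex     = sample(c("M", "F"), n, replace = TRUE),
  disease = FALSE,
  outcome = rnorm(n)
)
dat$disease[sample(n, 21)] <- TRUE  # mark 21 participants as cases

cases   <- dat[dat$disease, ]
healthy <- dat[!dat$disease, ]

# Number of cases of each sex
n_by_sex <- table(cases$sex)

# For each sex, draw that many healthy participants without replacement.
# sample.int on positions avoids the surprise sample() gives for a length-1 pool.
control_ids <- unlist(lapply(names(n_by_sex), function(s) {
  pool <- healthy$id[healthy$sex == s]
  pool[sample.int(length(pool), n_by_sex[[s]])]
}))
controls <- healthy[healthy$id %in% control_ids, ]

# Unpaired t-test of cases versus sampled controls (Welch version by default)
result <- t.test(cases$outcome, controls$outcome)
print(result)
```

    Note that t.test runs Welch's unequal-variance test by default; pass var.equal = TRUE if you specifically want the classical Student's version. The same per-group sampling could be extended to other characteristics such as age bands, provided each stratum of the healthy pool contains enough participants.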