Search code examples
sqlrdbplyr

randomly sample a database using sample_n() with dbplyr


Is it possible to select a random (or pseudo-random) subset from a database using a function like dplyr::sample_n() but in dbplyr or another R package that runs SQL queries?

The purpose is to test queries on small batches before running a time-intensive data pull from the database.


Solution

  • This seems to be working for our MySQL server:

    dbGetQuery(con, 
    "SELECT * 
    FROM data_table_name 
    ORDER BY RAND() 
    LIMIT 500;")