I'm trying to find rows in some HuggingFace datasets which have a certain word or phrase in a specific column (e.g., "output"). I've written some basic Python code to do this, but I noticed an issue at the API level when trying to get a slice of the data that contains every row with one of the words from my phrase. (As I understand it, with the API you can't even search for rows that contain both words—only OR searching.)
Specifically, when making calls to the HuggingFace Datasets-server API /search endpoint:
A search for used, a word which we'd expect to see frequently in the data: https://datasets-server.huggingface.co/search?dataset=teknium%2FGPT4-LLM-Cleaned&config=default&split=train&offset=0&length=100&query=used
A search for spices, an arbitrary less common word: https://datasets-server.huggingface.co/search?dataset=teknium%2FGPT4-LLM-Cleaned&config=default&split=train&&offset=0&length=100&query=spices
I expected these searches to return every row containing the specified word in any column (and, as a bonus, stemmed variants, e.g. "use" for "used"). (Note that "num_rows_total" at the end of the response shows how many rows were found in total for each search, while each page is limited to 100 rows.) However, as you can see, the first search did not return all the rows containing "used", so I wonder whether any search made through this type of call actually returns all the relevant results (which is essential).
I've noticed this same type of behavior with other datasets too. Often no rows are returned for searches that I know should return rows. They're all large datasets (10k to 1m rows) which were "Auto-converted to Parquet" according to their listings on the HuggingFace website.
I'd like to avoid looping over every row in the entire dataset for every phrase that I decide to look up this way, and I don't want to download these datasets either. Is there a modification that I can make to the API call to make it work? Or maybe a different call would work better?
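For the "both words" case, one workaround I can sketch is to run the OR search and then keep only the rows whose target column contains every word, on the client side. This is only a sketch: the helper name rows_with_all_words is made up, and it assumes each returned item carries its data in a "row" dict keyed by column name, which is my reading of the response rather than something I have confirmed. It matches whole words exactly, with no stemming.

```python
import re


def rows_with_all_words(rows, column, phrase):
    """Keep only the items whose `column` contains every word of `phrase`.

    Assumption: each item looks like {"row_idx": ..., "row": {column: text}}.
    Matching is case-insensitive, whole-word, and does no stemming.
    """
    wanted = [word.lower() for word in phrase.split()]
    matched = []
    for item in rows:
        text = str(item.get("row", {}).get(column, "")).lower()
        tokens = set(re.findall(r"\w+", text))
        if all(word in tokens for word in wanted):
            matched.append(item)
    return matched
```

This only narrows down what the search already returned, so it cannot recover rows that the search itself missed.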
Lines recently added to the documentation explain:
If the result has partial: true it means that the search couldn’t be run on the full dataset because it’s too big. Indeed, the indexing for /search can be partial if the dataset is bigger than 5GB. In that case, it only uses the first 5GB.
The exact same thing is true of /filter, and the numbers returned by /size are affected as well.
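A minimal sketch of how one might page through /search and notice a partial index, in case it helps. The query parameters are the ones from the URLs in the question, and "num_rows_total" and "partial" are the fields mentioned above; the "rows" key and the function names (build_search_url, fetch_json, search_all) are assumptions made for illustration, not something taken from the documentation quoted here.

```python
import json
import urllib.parse
import urllib.request

SEARCH_URL = "https://datasets-server.huggingface.co/search"


def build_search_url(dataset, config, split, query, offset=0, length=100):
    """Build a /search URL with the same parameters as in the question."""
    params = {
        "dataset": dataset,
        "config": config,
        "split": split,
        "offset": offset,
        "length": length,
        "query": query,
    }
    return SEARCH_URL + "?" + urllib.parse.urlencode(params)


def fetch_json(url):
    """Download one page of results and decode it as JSON."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def search_all(dataset, config, split, query, fetch=fetch_json, page_size=100):
    """Collect every page of results and report whether the index was partial.

    Assumption: each response holds its matches under a "rows" key.
    """
    rows = []
    partial = False
    offset = 0
    while True:
        page = fetch(build_search_url(dataset, config, split, query, offset, page_size))
        # Remember if any page says the search ran on a truncated index.
        partial = partial or bool(page.get("partial", False))
        batch = page.get("rows", [])
        rows.extend(batch)
        offset += len(batch)
        # Stop on an empty page or once everything reported has been read.
        if not batch or offset >= page.get("num_rows_total", 0):
            break
    return rows, partial


# Example call (needs network access):
# rows, partial = search_all("teknium/GPT4-LLM-Cleaned", "default", "train", "used")
# if partial:
#     print("Only part of the dataset was indexed; results are incomplete.")
```

If partial comes back true, the result list cannot be treated as complete, however many pages are read.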