python, amazon-s3, nlp, gensim, doc2vec

S3 object as gensim LineSentence


Is it possible to use a txt or jsonl file in an s3 bucket as the corpus_file input for a gensim Doc2Vec model? I am looking for something of the form:

Doc2Vec(corpus_file="s3://bucket_name/subdir/sample.jsonl")

When I run the above line, I get the following error:

TypeError: Parameter corpus_file must be a valid path to a file, got 's3://bucket_name/subdir/sample.jsonl' instead.

I have also tried creating an iterator object that iterates through the file and yields its lines, and passing it as the corpus_file argument. But I get the same TypeError.

Please note that I am specifically looking to use the corpus_file argument instead of the documents argument.


Solution

  • The corpus_file mode requires random-seek access to the file: every worker thread opens its own unique file view over a distinct range of the file. That access pattern is not well supported by S3's HTTP-GET style of access.

    To use corpus_file mode, download the file to a local volume whose filesystem offers efficient seek access (see the first sketch below).

    Or, supply the data as a corpus iterable (second sketch below) - which can re-iterate over a remote streamed file multiple times, but won't achieve the same high thread utilization. (From an iterable, even if you have 16+ cores, you'll usually get optimal throughput with no more than 6-12 worker threads – even if you've eliminated I/O & expensive in-iterable preprocessing from the setup. The exact optimal number of workers depends on other model parameters – it's especially sensitive to vector_size, negative, & window.)
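
    A minimal sketch of the download-first approach, assuming boto3 is installed and AWS credentials are configured; the bucket name, key, and local path are placeholders standing in for your own:

        import boto3
        from gensim.models.doc2vec import Doc2Vec

        s3 = boto3.client("s3")

        # Copy the corpus onto a local filesystem that supports efficient seeks.
        # "bucket_name" and "subdir/sample.jsonl" are placeholder values.
        local_path = "/tmp/sample.jsonl"
        s3.download_file("bucket_name", "subdir/sample.jsonl", local_path)

        # Note: corpus_file expects LineSentence format (space-separated tokens,
        # one document per line, with tags assigned from line numbers), so a raw
        # JSONL file would need converting to that format first.
        model = Doc2Vec(corpus_file=local_path, vector_size=100, workers=16)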
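
    And a minimal sketch of the iterable alternative, using the smart_open package (a gensim dependency) to stream from S3; the URI is a placeholder, and plain whitespace tokenization stands in for whatever parsing your JSONL lines actually need:

        from smart_open import open as remote_open
        from gensim.models.doc2vec import Doc2Vec, TaggedDocument

        class S3TaggedCorpus:
            """Re-iterable corpus: re-opens the S3 object on every pass,
            so gensim can stream it once per training epoch."""
            def __init__(self, uri):
                self.uri = uri

            def __iter__(self):
                with remote_open(self.uri, "r") as fin:
                    for lineno, line in enumerate(fin):
                        # Placeholder tokenization; real JSONL would need parsing.
                        yield TaggedDocument(words=line.split(), tags=[lineno])

        corpus = S3TaggedCorpus("s3://bucket_name/subdir/sample.jsonl")

        # This goes to the documents parameter, not corpus_file, and throughput
        # plateaus at modest worker counts, as noted above.
        model = Doc2Vec(documents=corpus, vector_size=100, workers=8)

    A class (rather than a bare generator) is used so the corpus can be iterated more than once - gensim needs one pass for vocabulary building plus one per training epoch.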