Tags: filesystems, nltk, webassembly, pyodide

Pyodide filesystem for NLTK resources: missing files


I am trying to use NLTK in the browser with Pyodide. Pyodide starts correctly, loads NLTK, and prints its version.

However, although the resource download appears to succeed, calling nltk.sent_tokenize(str) raises an error saying that NLTK can't find the "punkt" resource.

I suspect the downloaded resource is lost somewhere, but I didn't understand well how Pyodide / WebAssembly manage files. Any insights?

Screenshot of the error: "Resource punkt not found. Attempted to load tokenizers/punkt/PY3/english.pickle. Searched in: - '/nltk_data'"

Simple version:

import nltk
pkg = "punkt"  # resource required by sent_tokenize
nltk.download(pkg)
for sent in nltk.sent_tokenize("Test string"):
    print(sent)

More detailed version, specifying the download directory and the server URL:

import nltk
pkg = "punkt"
downloader = nltk.downloader.Downloader(server_index_url="https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml") 
downloader.download(pkg, download_dir='/nltk_data')
downloader.status(pkg)
for sent in nltk.sent_tokenize("Test string"):
    print(sent)

Full sample code:

<!DOCTYPE html>
<html>
  <body>
    <script type="text/javascript" src="https://cdn.jsdelivr.net/pyodide/v0.18.0/full/pyodide.js"></script>
    <script type="text/javascript">
      // init Pyodide
      async function pyodide_loader() {
        let pyodide_premise = loadPyodide({
          indexURL: "https://cdn.jsdelivr.net/pyodide/v0.18.0/full/",
        });
        let pyodide = await pyodide_premise;
        await pyodide.loadPackage("micropip");
        await pyodide.loadPackage("nltk");
        return pyodide_premise;
      }
      let pyodideReadyPromise = pyodide_loader();

      
      // run Python code and load NLTK
      async function load_packages() {
        let pyodide = await pyodideReadyPromise;
        let output = pyodide.runPython(`
print(f"*** import nltk")
import nltk
print(f"*** NLTK version {nltk.__version__=} imported, downloading resources now")

pkg = "punkt"
nltk.download(pkg)

str = "Just for testing"
for sent in nltk.sent_tokenize(str):
    print(sent)
      `);
      }
      load_packages()
    </script>
  </body>
</html>

Solution

  • The short answer is that downloading files from Python currently doesn't work in Pyodide, because http.client, requests, etc. require POSIX sockets, which are not supported in the browser VM.

    It's curious that nltk.download doesn't raise an error, though; it should have.

    The workaround is to download the needed resources manually, for instance with the JavaScript fetch API, as illustrated in this comment:

    from js import fetch
    
    # top-level await: run this snippet with pyodide.runPythonAsync, not runPython
    response = await fetch("<url>")
    js_buffer = await response.arrayBuffer()
    py_buffer = js_buffer.to_py()  # this is a memoryview
    stream = py_buffer.tobytes()  # now we have a bytes object
    
    # that we can finally write under the appropriate path
    with open("<file_path>", "wb") as fh:
        fh.write(stream)
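
NLTK resources such as punkt are distributed as zip archives, so the fetched bytes usually still have to be unpacked before NLTK can load them. The following is a minimal sketch of that step, not a confirmed recipe: the helper name unpack_nltk_zip is hypothetical, and it assumes the archive contains a top-level punkt/ folder, so that extracting it under /nltk_data/tokenizers yields the tokenizers/punkt/PY3/english.pickle path named in the error message.

```python
import io
import os
import zipfile


def unpack_nltk_zip(zip_bytes: bytes, target_dir: str) -> list:
    """Extract an in-memory zip archive under target_dir.

    Returns the names of the archive members, which is handy for checking
    that the expected files (e.g. punkt/PY3/english.pickle) are present.
    """
    # Create the target directory (and its parents) if it does not exist yet.
    os.makedirs(target_dir, exist_ok=True)
    # Wrap the bytes in a file-like object so zipfile can read them.
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        archive.extractall(target_dir)
        return archive.namelist()


# Hypothetical usage inside Pyodide, once `stream` holds the fetched bytes
# (assumption: '/nltk_data' is the directory NLTK searches, as in the error):
# unpack_nltk_zip(stream, "/nltk_data/tokenizers")
```

Unpacking directly from memory avoids writing the zip file to the virtual filesystem first; whether the resulting layout matches what NLTK expects should be checked against the returned member names.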
    

    "I didn't understand well how Pyodide / WebAssembly manage files."

    By default, it's a virtual filesystem (MEMFS) that gets reset at each page load. You can access it with standard Python tools (open, os, etc.). If necessary, you can also mount a persistent filesystem.
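
To illustrate that ordinary Python file APIs are enough, here is a small sketch that creates a directory, writes a file, and lists it using only os and open. The function name write_and_list and the marker.txt file are made up for the example, and a temporary directory is used as the base so the same code also runs outside the browser; in Pyodide the files would live in the in-memory filesystem and disappear on page reload.

```python
import os
import tempfile


def write_and_list(base_dir: str) -> list:
    """Create a nested directory, write a file into it, and list the directory."""
    data_dir = os.path.join(base_dir, "nltk_data", "tokenizers")
    # Same calls as on a regular disk; under Pyodide they operate on MEMFS.
    os.makedirs(data_dir, exist_ok=True)
    with open(os.path.join(data_dir, "marker.txt"), "w") as fh:
        fh.write("hello from the virtual filesystem")
    return sorted(os.listdir(data_dir))


# A temporary directory keeps the sketch runnable outside the browser too.
with tempfile.TemporaryDirectory() as tmp:
    print(write_and_list(tmp))
```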