I am trying to use NLTK in the browser via Pyodide. Pyodide starts fine, loads NLTK, and prints its version.
However, although the package download appears to succeed, calling nltk.sent_tokenize(str) raises an error saying that NLTK can't find the "punkt" package.
My guess is that the downloaded resource gets lost somewhere, but I don't really understand how Pyodide / WebAssembly manages files. Any insights?
Simple version:
import nltk
nltk.download("punkt")
for sent in nltk.sent_tokenize("Test string"):
    print(sent)
Version with more details, specifying the download directory and the server URL:
import nltk
pkg = "punkt"
downloader = nltk.downloader.Downloader(server_index_url="https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml")
downloader.download(pkg, download_dir='/nltk_data')
downloader.status(pkg)
for sent in nltk.sent_tokenize("Test string"):
    print(sent)
Full sample code:
<!DOCTYPE html>
<html>
<body>
<script type="text/javascript" src="https://cdn.jsdelivr.net/pyodide/v0.18.0/full/pyodide.js"></script>
<script type="text/javascript">
// init Pyodide
async function pyodide_loader() {
    let pyodide_premise = loadPyodide({
        indexURL: "https://cdn.jsdelivr.net/pyodide/v0.18.0/full/",
    });
    let pyodide = await pyodide_premise;
    await pyodide.loadPackage("micropip");
    await pyodide.loadPackage("nltk");
    return pyodide_premise;
}
let pyodideReadyPromise = pyodide_loader();
// run Python code and load NLTK
async function load_packages() {
    let pyodide = await pyodideReadyPromise;
    let output = pyodide.runPython(`
print(f"*** import nltk")
import nltk
print(f"*** NLTK version {nltk.__version__=} imported, downloading resources now")
pkg = "punkt"
nltk.download(pkg)
str = "Just for testing"
for sent in nltk.sent_tokenize(str):
    print(sent)
`);
}
load_packages()
</script>
</body>
</html>
The short answer is that downloading files from Python currently won't work in Pyodide, because http.client, requests, etc. require POSIX sockets, which are not supported in the browser VM.
It's curious that nltk.download doesn't raise an error, though; it should have.
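Since the download does not fail loudly, one way to confirm that nothing was actually written is to look for the resource on disk. The following is only a minimal sketch: `find_resource` is a hypothetical helper (not part of NLTK), and the candidate paths assume the usual `tokenizers/punkt` layout of NLTK data.

```python
import os


def find_resource(search_paths, candidates):
    """Return the first existing path matching any candidate, or None.

    search_paths: directories to look in (for NLTK, the list nltk.data.path).
    candidates:   relative paths to try under each directory.
    """
    for root in search_paths:
        for relative in candidates:
            full = os.path.join(root, relative)
            if os.path.exists(full):
                return full
    return None
```

In Pyodide this could be called as `find_resource(nltk.data.path, ["tokenizers/punkt", "tokenizers/punkt.zip"])` right after `nltk.download("punkt")`; a result of `None` would indicate that the download silently did nothing.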
The workaround is to download the needed resources manually, for instance with the JavaScript fetch API, as illustrated in this comment:
from js import fetch

# Top-level await needs an async context, e.g. pyodide.runPythonAsync
response = await fetch("<url>")
js_buffer = await response.arrayBuffer()
py_buffer = js_buffer.to_py()  # this is a memoryview
stream = py_buffer.tobytes()  # now we have a bytes object
# that we can finally write under the appropriate path
with open("<file_path>", "wb") as fh:
    fh.write(stream)
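NLTK resources such as punkt are distributed as zip archives, so the fetched bytes usually need to be unpacked rather than written as a single file. Below is a rough sketch of that step using only the standard library; `install_zip` is a hypothetical helper, and the `/nltk_data/tokenizers` target in the trailing comments is an assumption based on the download directory used in the question.

```python
import io
import os
import zipfile


def install_zip(zip_bytes, target_dir):
    """Extract an in-memory zip archive under target_dir.

    Returns the list of member names that were extracted.
    """
    os.makedirs(target_dir, exist_ok=True)
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        archive.extractall(target_dir)
        return archive.namelist()


# Possible use inside Pyodide (assumptions: `stream` holds the bytes
# of punkt.zip obtained with fetch, and /nltk_data is the data root):
#
#   install_zip(stream, "/nltk_data/tokenizers")
#   import nltk
#   nltk.data.path.append("/nltk_data")  # make NLTK search this directory
```

Appending to `nltk.data.path` is included because `/nltk_data` is not, as far as I know, one of NLTK's default search locations.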
I didn't understand well how Pyodide / WebAssembly manage files.
By default it's a virtual file system (MEMFS) that gets reset at each page load. You can access it with standard Python tools (open, os, etc.). If necessary, you can also mount a persistent file system.
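To illustrate that the in-memory file system behaves like an ordinary one from Python's point of view, here is a small sketch that writes a file, reads it back, and lists its directory. The `write_and_read` helper and the `memfs_demo` directory name are made up for the example; the same code runs unchanged outside the browser.

```python
import os
import tempfile


def write_and_read(directory, name, text):
    """Write text to directory/name, then return (content, directory listing)."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, name)
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(text)
    with open(path, encoding="utf-8") as fh:
        content = fh.read()
    return content, sorted(os.listdir(directory))


demo_dir = os.path.join(tempfile.gettempdir(), "memfs_demo")
content, listing = write_and_read(demo_dir, "hello.txt", "hi")
print(content, listing)
```

In Pyodide, anything written this way lives in memory only, so it disappears on the next page load unless a persistent file system has been mounted.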