rust, language-detection, cld2

Within-string detection of multiple human languages in Rust?


In Python, the pycld2 package can detect changes of language within a single string, according to an example I found.

I have tested that example and it works. I then replaced the text with different English and French passages, and it worked again.

But I want to do this in Rust.

On the Rust cld2 page it says "DEPRECATED in favor of whatlang, which is native Rust and smaller. If you have a compelling use-case for this code, please open an issue."

I've now tried whatlang in Rust. By default it doesn't seem able to split a string into sections by detected language, and nothing on that page mentions such a capability.

This capability does seem to have been an integral part of the Python CLD project since at least 2013:

An option to identify which parts (byte ranges) of the text contain which language, in case the application needs to do further language-specific processing. From Python, pass the optional returnVectors=True argument to get the byte ranges, ...
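To make the byte-range idea concrete on the Rust side, here is a minimal std-only sketch of the result shape I'm after. It is not how CLD2 or any crate works internally: `Segment`, `label_ranges` and `toy_classifier` are hypothetical names, the sentence splitting is naive, and the classifier is a stand-in for a real detector that would be plugged in as the closure.

```rust
use std::ops::Range;

/// A labelled byte range of the input (hypothetical type, for illustration).
#[derive(Debug, PartialEq)]
struct Segment {
    bytes: Range<usize>,
    label: &'static str,
}

/// Split `text` after each '.', '!' or '?', label every chunk with `classify`,
/// and merge neighbouring chunks that received the same label.
fn label_ranges(text: &str, classify: impl Fn(&str) -> &'static str) -> Vec<Segment> {
    let mut segments: Vec<Segment> = Vec::new();
    let mut start = 0;
    for (i, ch) in text.char_indices() {
        // Work in byte offsets so the ranges can slice the UTF-8 string directly.
        let end = i + ch.len_utf8();
        let at_boundary = matches!(ch, '.' | '!' | '?') || end == text.len();
        if !at_boundary {
            continue;
        }
        let chunk = &text[start..end];
        if !chunk.trim().is_empty() {
            let label = classify(chunk);
            // Extend the previous segment if the label did not change.
            let merged = match segments.last_mut() {
                Some(prev) if prev.label == label => {
                    prev.bytes.end = end;
                    true
                }
                _ => false,
            };
            if !merged {
                segments.push(Segment { bytes: start..end, label });
            }
        }
        start = end;
    }
    segments
}

/// Toy stand-in for a real detector: "fr" if a few French function words occur.
fn toy_classifier(chunk: &str) -> &'static str {
    let lower = chunk.to_lowercase();
    let is_french = lower
        .split(|c: char| !c.is_alphabetic())
        .any(|w| matches!(w, "le" | "la" | "est" | "je" | "vous"));
    if is_french { "fr" } else { "en" }
}

fn main() {
    let text = "Hello there. How are you? Je suis là. Vous êtes ici!";
    for seg in label_ranges(text, toy_classifier) {
        println!("{:?} {} {:?}", seg.bytes, seg.label, &text[seg.bytes.clone()]);
    }
}
```

Per-sentence labelling like this would miss language switches inside a sentence, which is exactly the part I'd like a library to handle.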

Is there any way of accomplishing in-string language differentiation in Rust?


Solution

  • Promising answer: Lingua-rs, section "10.6 Detection of multiple languages in mixed-language texts".

    It is described as an experimental feature, but it works out of the box.

    First results are actually very impressive.