
API calls from NLTK, Gensim, Scikit Learn


I plan to use NLTK, Gensim, and Scikit-Learn for some NLP/text mining, but I will be using these libraries to work with my organization's data. The question is: while using these libraries, do they make API calls to external services to process the data, or does the data otherwise leave the Python shell to be processed? It is a security question, so I was wondering if someone has any documentation for reference.

Appreciate any help on this.


Solution

  • Generally with NLTK, Gensim, and scikit-learn, the algorithms are implemented in the libraries' own source code and run locally on your data, without sending it elsewhere for processing.
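
    For example, a typical scikit-learn text-processing step like the following is pure in-process computation (the document strings here are just placeholders):

        from sklearn.feature_extraction.text import TfidfVectorizer

        # Placeholder documents standing in for your org's data.
        docs = [
            "internal memo about quarterly results",
            "meeting notes for project planning",
        ]

        # fit_transform() tokenizes and weights the text entirely in
        # local memory; nothing is sent over the network.
        X = TfidfVectorizer().fit_transform(docs)
        print(X.shape)  # (2, number-of-unique-terms)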

    I've never noticed any documentation or functionality in these packages mentioning a reliance on a remote/cloud service, nor seen users discussing any such reliance.

    However, each is a large library, with many functions I've never reviewed, and with many contributors adding new options. And I don't know whether the project leads have stated an explicit commitment to never rely on external services.

    So a definitive, permanent answer may not be possible. To the extent such security is a concern for your project, you should carefully review the documentation, and even source code, for those functions/classes/methods you're using. (None of these projects would intentionally hide a reliance on outside services.)

    You could also develop, test, and deploy the code on systems whose ability to contact outside services is limited by firewalls – so that you could detect and block any undisclosed or inadvertent communication with outside machines.
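
    A lighter-weight variant of the same idea is to monkey-patch Python's socket module inside a test harness, so that any attempted outbound connection fails loudly. This is a sketch of a common testing trick, not a feature of any of these libraries:

        import socket

        _original_socket = socket.socket

        class _GuardedSocket(socket.socket):
            """A socket subclass that refuses outbound connections."""
            def connect(self, address):
                raise RuntimeError(f"Blocked outbound connection attempt to {address!r}")

        def block_network():
            # Code creating sockets after this call fails on connect().
            socket.socket = _GuardedSocket

        def restore_network():
            socket.socket = _original_socket

    Calling block_network() before running your NLP code would then surface any undisclosed network activity as an immediate error, rather than letting it proceed silently.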

    Note also that each of these libraries in turn relies on other public libraries. If your concern also extends to the potential for careless or intentionally malicious data-exfiltration code, you would want to do a deeper analysis of these libraries and all the other libraries they bring in. (Simply trusting the top-level documentation could be insufficient.)
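
    As a starting point for such an analysis, you can enumerate each library's declared dependencies from your own environment with the standard library's importlib.metadata (the names below are the usual PyPI distribution names):

        from importlib import metadata

        # List the declared (direct) dependencies of each top-level package.
        # Each of these may in turn pull in further transitive dependencies,
        # so a full supply-chain review would need to recurse through them.
        for pkg in ("nltk", "gensim", "scikit-learn"):
            print(pkg, "->", metadata.distribution(pkg).requires)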

    Also, each of these libraries has utility functions which, on explicit user demand, download example datasets or shared non-code resources (like lists of stopwords or lexicons). Using such functions doesn't upload any of your data elsewhere, but may leak the fact that you're using specific functionality. The firewall-based approach mentioned above could interfere with such download steps. Under a situation of maximum vigilance/paranoia, you might want to pay special attention to the use and behavior of such extra-download methods, to be sure they're not doing any more than they should to change the local environment or execute/replace other library code.
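
    For instance, NLTK's stopword lists are fetched once, on explicit request, and then read from local disk on every later use:

        import nltk

        # Explicit one-time download of the stopword lists; this contacts
        # NLTK's data servers, but uploads none of your own data.
        nltk.download("stopwords")

        from nltk.corpus import stopwords

        # After the download, this reads purely from the local nltk_data folder.
        print(stopwords.words("english")[:5])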

    Finally, by sticking to widely-used packages/functions, and to somewhat older versions that have remained continuously available, you may benefit from a bit of "community assurance" that a package's behavior is well understood, without surprising dependencies or vulnerabilities. That is, many other users will have already given those code paths some attention, analysis, and real usage, so any problems may have already been discovered, disclosed, and fixed.