Tags: web-crawler, artificial-intelligence, open-source

How can I protect open source against (mis)use by AI?


As of 2023, a whole range of (generative) AI systems is available for public use; usually, they reconstruct the most likely sequence of symbols for a given context.

Which technical means exist to keep well-behaved (generative) AI from using my open source code?

For web content, one can set robots.txt / the robots META tag, which (again: well-behaved) crawlers honour.
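For illustration, this is the kind of mechanism I mean; a minimal robots.txt entry and its per-page META-tag equivalent (the /src/ path is only an example):

    # robots.txt at the site root: ask all well-behaved crawlers to stay out of /src/
    User-agent: *
    Disallow: /src/

    <!-- per-page equivalent, placed in the HTML <head> -->
    <meta name="robots" content="noindex, nofollow">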

Which effective means exist that AI crawlers would honour?

(I've rephrased this question, since my earlier wording was often misunderstood. I'm OK with anything technical that works - whether metadata, keywords, license files or headers, whatever. No legal advice wanted!)


Solution

  • In reality you cannot effectively safeguard your code (or anything else) from being used and consumed by AI systems. (That's the short and pessimistic answer.)

    I still want to give it a shot:


    On the technical side:

    Most AI systems pick up data by crawling the internet and by parsing existing corpora: language files, source-code repositories, data sets, databases, and so on. Of course, this becomes much easier if you are Google or Microsoft and have a search engine with a full cache of the internet sitting in the basement.

    To my knowledge there are no specific markers or similar (along the lines of the robots.txt parallel you mention) that tell an AI crawler to back off and leave the premises. It is a nice idea, and it might gain traction within the next few years. However, it requires a standardized way to store this metadata together with many different types of data on many different platforms and environments.
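    That said, if an AI operator publishes a dedicated crawler user agent and promises to honour robots.txt, the closest thing to such a marker today is an ordinary robots.txt rule aimed at that user agent. A minimal sketch, assuming crawler names such as GPTBot or CCBot (check each operator's documentation for the names they actually use):

        # robots.txt: block specific AI / data-set crawlers, keep the rest of the site crawlable
        User-agent: GPTBot
        Disallow: /

        User-agent: CCBot
        Disallow: /

        # everyone else may still crawl normally
        User-agent: *
        Allow: /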

    For now, I guess the only way to limit usage is to lock your code (or other information/text/data) up in a closed repository with "members only" access. Carefully reading through the license terms of GitHub and other services out there could be interesting -- just because something is closed doesn't necessarily mean that the owner doesn't use the contents for AI purposes. (GitHub might be fine, but I haven't checked, so I really don't know.)
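    If your code already lives on GitHub, flipping a repository to private can be scripted through the regular REST API (PATCH /repos/{owner}/{repo}). A minimal sketch, assuming a personal access token with the repo scope; the owner, repository and token values below are placeholders:

        import requests

        # hypothetical values - substitute your own repository and token
        OWNER, REPO, TOKEN = "your-user", "your-repo", "ghp_your_token"

        # setting "private": true switches the repository to members-only visibility
        resp = requests.patch(
            f"https://api.github.com/repos/{OWNER}/{REPO}",
            headers={
                "Authorization": f"Bearer {TOKEN}",
                "Accept": "application/vnd.github+json",
            },
            json={"private": True},
        )
        resp.raise_for_status()
        print(resp.json()["private"])  # True once the switch has taken effect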

    You might be able to find some nice services out there that are particularly aware of these issues. That would probably be my best advice.


    On the legal side (already mentioned):

    Even if you picked a specific license and added an addendum stipulating specific limitations on AI usage, that is no guarantee of anything.

    Of course, the owner of an AI system would be in the wrong if they violated your license terms, but you would need to prove your case. In an international world, this becomes even more difficult.


    Wrapping up:

    You would need to combine the two dimensions: a technical dimension that keeps AI systems from picking up your stuff, plus a legal dimension on top in the form of a suitable license, which you can pull out of your hat in case disaster strikes.

    Just my two cents. Hopefully this provides you with some ideas on how to solve the problem.

    Cheers! :-)