What I have so far is putting the text into CDATA sections, and dealing with the possibility of the CDATA terminator (]]>) appearing in the text by splitting it across multiple adjacent CDATA sections.
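For reference, here is a rough sketch of the splitting I mean. It is written in Python only for brevity (the real generator would be Perl), and `to_cdata` is just a made-up helper name, not part of any library:

```python
import xml.etree.ElementTree as ET

CDATA_START = "<![CDATA["
CDATA_END = "]]>"

def to_cdata(text: str) -> str:
    """Wrap text in CDATA, splitting every "]]>" across two adjacent sections."""
    # "]]" stays in the section being closed, ">" starts the next section.
    safe = text.replace(CDATA_END, "]]" + CDATA_END + CDATA_START + ">")
    return CDATA_START + safe + CDATA_END

original = "a ]]> b < c & d"
document = "<value>" + to_cdata(original) + "</value>"

# The parser joins the adjacent CDATA sections back into one text value.
parsed = ET.fromstring(document).text
```

With this input, the parser should hand back the original string, but the raw file is not pleasant to read or edit by hand once a terminator has been split.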
I'm not sure about this, but XML parsers can fail to preserve newlines inside CDATA sections, correct? That would mean escaping them somehow as well...
I want to generate these XML files using Perl, and parse them with C++ (using expat), Java, and C#.
Most importantly, I want the resulting files to be somewhat human-readable and human-modifiable. Does anyone know of an encoding scheme that fits these needs? I am using this to store data for a database, so it needs to accept arbitrary text and, upon parsing, return exactly the same text.
XML already supports this. You do not need to do anything special, and you certainly do not need to use CDATA. Just use a decent library, make sure you are using UTF-8 encoding, and add a text node; the library will escape characters such as <, > and & for you. If something is "losing" newlines, then it's a bug. XML already has an "encoding" (escaping) that is relatively human-readable. It's also standard, which makes it much more useful than inventing your own.
See, for example, https://stackoverflow.com/a/1140802/181772
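A minimal sketch of that round trip, using Python's standard `xml.etree.ElementTree` purely as an illustration (the element name `value` and the sample string are made up; a Perl, Java or C# library should behave the same way):

```python
import xml.etree.ElementTree as ET

# Sample text with markup characters, a CDATA terminator, non-ASCII and newlines.
original = "1 < 2 && \"quotes\" ]]> caf\u00e9\nsecond line\n\tindented"

# Writing: just assign the text; the library escapes it on serialization.
element = ET.Element("value")
element.text = original
serialized = ET.tostring(element, encoding="utf-8")  # bytes, with XML declaration

# Reading: the parser undoes the escaping and returns the text node.
restored = ET.fromstring(serialized).text

# Line-ending normalization: a literal CR LF in the file is read as LF.
crlf_text = ET.fromstring("<value>a\r\nb</value>").text
```

Two caveats worth knowing: a conforming parser normalizes literal CR LF and lone CR to LF, so a carriage return only survives if it is written as a character reference (&#13;), and XML 1.0 cannot represent most control characters (NUL, for instance) at all. If the database text can contain those, they need separate handling.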