
How to split words in Camel Case with special capital words inside?


I am trying to make configuration names easier to understand for my deep learning model. The first thing I am supposed to do is to split the configuration names into tokens.
The input is like:

allow-nonxdr-writes
io.native.lib.available
ha.zookeeper.parent-znode
min_file_size
ProxyStatus
ProxyFCGIBackendType
SessionDBDCookieRemove
DBDriver
SSLOCSPDefaultResponder

The corresponding output should be:

allow nonxdr writes
io native lib available
ha zookeeper parent znode
min file size
Proxy Status
Proxy FCGI Backend Type
Session DBD Cookie Remove
DB Driver
SSL OCSP Default Responder

As shown above, the format of the configuration names varies, since they come from different software from different organizations. The first four names can simply be split on a delimiter such as ., -, or _. The last five are much harder to handle: if I split them purely by the camel-case principle (a new word starts at each capital letter), acronyms such as FCGI, DBD, and DB are split incorrectly.
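To make the failure concrete, here is a minimal sketch of a naive camel-case splitter. The function name `naive_camel_split` is only illustrative and is not taken from any library; it assumes that every capital letter starts a new word.

```python
import re

def naive_camel_split(name):
    # Naive rule: a word is one capital letter followed by lowercase letters,
    # or a run of lowercase letters.
    return re.findall(r"[A-Z][a-z]*|[a-z]+", name)

print(naive_camel_split("ProxyStatus"))           # ['Proxy', 'Status']
print(naive_camel_split("ProxyFCGIBackendType"))  # ['Proxy', 'F', 'C', 'G', 'I', 'Backend', 'Type']
print(naive_camel_split("DBDriver"))              # ['D', 'B', 'Driver']
```

Plain camel-case names come out fine, but every acronym is broken into single letters.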

Is there a good practice for handling this problem? Is building a dictionary manually the only way to solve it?

BTW, this situation only occurs with configuration names in Apache Httpd.


Solution

  • The following regex pattern seems to get close:

    [-._]|(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])
    

    Explanation:

    [-._]                     split on -, ., or _
    |                         OR
    (?<=[a-z])(?=[A-Z])       split when lowercase precedes and uppercase follows
    |                         OR
    (?<=[A-Z])(?=[A-Z][a-z])  split when uppercase precedes and an uppercase-lowercase pair follows
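As a rough sketch of how this pattern could be applied, the snippet below uses Python's `re.split`. The choice of Python and the helper name `split_name` are assumptions, since the original does not name a language. Splitting on zero-width matches requires Python 3.7 or later.

```python
import re

# Delimiters, lower->upper boundaries, and the end of an uppercase run
# that is followed by a capitalized word.
SPLIT_RE = re.compile(r"[-._]|(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])")

def split_name(name):
    # Drop empty strings that appear when two delimiters are adjacent.
    return [tok for tok in SPLIT_RE.split(name) if tok]

for name in ["allow-nonxdr-writes", "ha.zookeeper.parent-znode",
             "min_file_size", "ProxyFCGIBackendType",
             "SessionDBDCookieRemove", "DBDriver"]:
    print(" ".join(split_name(name)))
# allow nonxdr writes
# ha zookeeper parent znode
# min file size
# Proxy FCGI Backend Type
# Session DBD Cookie Remove
# DB Driver
```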
    

    The only test input which does not match what you expect is:

    SSLOCSPDefaultResponder
    

    My regex gives:

    SSLOCSP Default Responder
    

    The reason for this is that there is no clear rule by which we would know that a break should occur between SSL and OCSP. If you want that logic, you might need to keep a dictionary of known "words" around which there should be additional splits.
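One possible way to add such a dictionary is sketched below, again in Python. The set `KNOWN_ACRONYMS` and the helpers `split_acronym_run` and `tokenize` are hypothetical names, and the acronym list is only what the examples in the question suggest; a real list would have to be maintained by hand. After the regex split, each all-uppercase token is broken into known acronyms by greedy longest-prefix matching and is left unchanged if that fails.

```python
import re

SPLIT_RE = re.compile(r"[-._]|(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])")

# Hypothetical, hand-maintained list of acronyms seen in the config names.
KNOWN_ACRONYMS = {"SSL", "OCSP", "FCGI", "DBD", "DB"}

def split_acronym_run(token, known=KNOWN_ACRONYMS):
    """Split an all-uppercase token into known acronyms, longest prefix first."""
    if not token.isupper():
        return [token]
    parts, i = [], 0
    while i < len(token):
        for j in range(len(token), i, -1):   # try the longest candidate first
            if token[i:j] in known:
                parts.append(token[i:j])
                i = j
                break
        else:
            return [token]                   # no known prefix: keep token whole
    return parts

def tokenize(name):
    tokens = []
    for tok in SPLIT_RE.split(name):
        if tok:
            tokens.extend(split_acronym_run(tok))
    return tokens

print(" ".join(tokenize("SSLOCSPDefaultResponder")))  # SSL OCSP Default Responder
print(" ".join(tokenize("ProxyFCGIBackendType")))     # Proxy FCGI Backend Type
```

The greedy match does not backtrack, so an uppercase run that could only be segmented by choosing a shorter first acronym is returned unsplit. For a small, curated acronym list this is usually acceptable.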