Automate Splitting a PEM File into multiple Certs


I need to figure out a way to automate splitting a PEM file into multiple PEM files. I was thinking of using a batch script that reads the PEM file and splits it every time it finds:

-----BEGIN CERTIFICATE-----

To

-----END CERTIFICATE-----

However this seems a bit "hacky." I was hoping OpenSSL would have a tool that might be able to do this but I can't seem to find anything.

What would be the best way of doing this?


Solution

  • I was hoping OpenSSL would have a tool that might be able to do this

    I'm not aware of an OpenSSL function or OpenSSL tool to do it. Looking at the sources, PEM_bytes_read_bio may be the function to do it, but it's not documented, so I'm not certain. (The function name begins with capital letters, PEM_*. The lowercase pem_* functions are private and should not be used.)

    If you have the OpenSSL sources handy, then the source code for the parsing routine is in <openssl src>/crypto/pem/pem_lib.c. That's where PEM_bytes_read_bio is implemented.


    However this seems a bit "hacky."

    Well, it's not so much hacky as it is that you have to roll up your sleeves and code it up. You might be able to use Bison and Flex to create a parser and lexer. How you call it from the shell is a different story. With a lexer, I think you can parse a PEM object in O(n).


    I need to figure out a way to automate the process of splitting a PEM file into multiple PEM files... What would be the best way of doing this?

    I wrote something similar for Crypto++ at PEM Pack. It added support for PEM encoded keys, including encrypted keys. Crypto++ is a C++ library, but the same general algorithm should work well with your language of choice.

    The routine of interest in Crypto++ is called PEM_NextObject, and it's located in the source file pem-rd.cpp. You can find the source files at the bottom of the page in a ZIP file. PEM_NextObject looks for four tokens:

    • The leading -----BEGIN
    • The ----- that follows it
    • The trailing -----END
    • The ----- that follows it

    I used four indexes, one for each token. I would read 64+1 bytes at a time because OpenSSL breaks its output lines at 64 characters. I would read a line into a string and concatenate the string onto an accumulator. I would then use find to locate the token in the accumulator (some hand waving, because they were secure strings). If I did not find a particular token, I would read another line.

    When searching for the tokens, the search for the first token started at position 0. Each later search started just past the previous match. For example, the search for index two began at index one plus the size of its token, and the search for index three began at index two plus the size of its token. If a token was not found, I read another line and searched only that line plus the 10 characters preceding it, in case the token spanned the previous read and the current read.
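    The "rewind" idea can be sketched in Python as follows. This is only an illustration under my own assumptions: the function name find_token_streaming and the constants CHUNK and REWIND are hypothetical, plain strings stand in for secure strings, and it is not the Crypto++ code.

```python
import io

CHUNK = 65    # 64 characters plus the line break, as described above
REWIND = 10   # characters re-scanned before each newly read chunk


def find_token_streaming(stream, token, chunk=CHUNK, rewind=REWIND):
    """Return the offset of `token` in `stream`, or -1 if it never appears.

    Each search covers only the newly read chunk plus `rewind` characters
    before it, so a token split across two reads is still found.
    `rewind` must be at least len(token) - 1 for that to hold.
    """
    acc = ""                                   # accumulator of everything read so far
    while True:
        piece = stream.read(chunk)
        if not piece:                          # end of stream, token never seen
            return -1
        search_from = max(0, len(acc) - rewind)
        acc += piece
        idx = acc.find(token, search_from)     # idx is an offset from the start of acc
        if idx != -1:
            return idx
```

    Because the result is an offset into the accumulator rather than a position inside the latest chunk, it stays meaningful as more data is appended.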

    I used indexes rather than iterators because an iterator can be invalidated when the container grows, and the concatenation would have caused that. Fortunately, an index was always valid because it was simply an offset from the beginning of the string. You may not have this problem in bash (or whatever you choose).

    If I read to the end of a stream without finding all four indexes, then I threw an error.

    If I found all four indexes, then I had something that claimed to be PEM encoded. I discarded any leading characters, and trimmed trailing whitespace. So the PEM object was located at (Index1) to (Index4 + 5) (+5 for the trailing -----).
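    For a file that is already in memory, the four-index search can be sketched in Python like this. The names next_pem_object and split_pem are hypothetical, and returning None when no further -----BEGIN exists is my own choice so that a loop can terminate; it is a sketch of the idea, not a port of PEM_NextObject.

```python
BEGIN, END, DASHES = "-----BEGIN", "-----END", "-----"


def next_pem_object(text, start=0):
    """Return (pem_text, next_offset), or None when no further BEGIN exists."""
    i1 = text.find(BEGIN, start)                 # index 1: leading -----BEGIN
    if i1 == -1:
        return None
    i2 = text.find(DASHES, i1 + len(BEGIN))      # index 2: ----- closing the BEGIN line
    i3 = text.find(END, i2 + len(DASHES)) if i2 != -1 else -1     # index 3: -----END
    i4 = text.find(DASHES, i3 + len(END)) if i3 != -1 else -1     # index 4: final -----
    if i4 == -1:
        raise ValueError("incomplete PEM object starting at offset %d" % i1)
    end = i4 + len(DASHES)                       # Index4 + 5, past the trailing dashes
    return text[i1:end], end                     # leading characters are discarded


def split_pem(text):
    """Collect every PEM object found in `text`, in order."""
    objects, pos = [], 0
    while True:
        found = next_pem_object(text, pos)
        if found is None:
            return objects
        obj, pos = found
        objects.append(obj)
```

    Each string returned by split_pem could then be written to its own file to get one certificate per file.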

    Because I might have parsed an invalid PEM object (i.e., -----BEGIN FOO----- and -----END BAR-----), I needed another routine to classify the type of PEM object that was parsed. That function is called PEM_GetType.
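    A classification step along these lines might look like the Python sketch below. The name pem_type and the allowed label characters are my assumptions for illustration; PEM_GetType in Crypto++ may work differently.

```python
import re

# Label characters are assumed to be upper-case letters, digits and spaces.
_PEM_RE = re.compile(
    r"\A-----BEGIN ([A-Z0-9 ]+)-----.*-----END ([A-Z0-9 ]+)-----\Z",
    re.DOTALL,
)


def pem_type(obj):
    """Return the label (e.g. 'CERTIFICATE') if BEGIN and END agree, else None."""
    m = _PEM_RE.match(obj)
    if m is None or m.group(1) != m.group(2):
        return None          # not PEM-shaped, or mismatched labels such as FOO/BAR
    return m.group(1)
```

    A caller that only wants certificates would keep the objects for which pem_type returns "CERTIFICATE" and reject or skip the rest.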

    The algorithm should work well because it's not egregious from an algorithm analysis point of view, and PEM objects are usually small (less than 2K or 4K). I think the analysis is O(n + m*10), where n is the number of characters and m is the number of lines in the file. The m*10 is based on scanning a 64 character line looking for a token with a 10 character "rewind", reading another line, and then scanning for the token again. Recall I "rewind" a bit in case the token spans lines.

    This algorithm performs OK even if the file is large and contains no PEM object. I'm pretty certain it runs in O(n + m*10) in the worst case, too. If n >> m, then it's essentially an O(n) function because the m*10 term is dominated by n.

    You might also be interested in How to split a PEM file on Server Fault and Where is the PEM file format specified? on Stack Overflow.


    -----BEGIN CERTIFICATE----- to -----END CERTIFICATE-----

    While you show a certificate, there are other types of objects. For example, public keys and encrypted private keys. If you need to decrypt an encrypted key, then you will need to lift/borrow/use OpenSSL's EVP_BytesToKey.

    EVP_BytesToKey is kind of non-standard, so it becomes a copy/paste operation to ensure interoperability. I seem to recall EVP_BytesToKey is equivalent to PKCS#5 derivation if the number of bytes produced by EVP_BytesToKey is 16 or less. If 17 or more are produced, then OpenSSL uses a "non-standard" extension.
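    For reference, the derivation can be sketched in a few lines of Python with hashlib. This follows my reading of the OpenSSL documentation for EVP_BytesToKey (each block is the hash of the previous block, the password, and the salt, concatenated until enough bytes exist). The function name evp_bytes_to_key and the defaults of MD5 with a count of 1 are assumptions for illustration, so verify the output against OpenSSL before relying on it.

```python
import hashlib


def evp_bytes_to_key(password, salt, key_len, iv_len, hash_name="md5", count=1):
    """Derive (key, iv) from bytes `password` and `salt`.

    Block i is HASH^count(block[i-1] + password + salt); blocks are
    concatenated until key_len + iv_len bytes are available.
    """
    derived = b""
    block = b""
    while len(derived) < key_len + iv_len:
        block = hashlib.new(hash_name, block + password + salt).digest()
        for _ in range(count - 1):             # extra iterations re-hash the digest
            block = hashlib.new(hash_name, block).digest()
        derived += block
    return derived[:key_len], derived[key_len:key_len + iv_len]
```

    With MD5 and 16 bytes or fewer requested, only the first block is used, which is a single hash of password plus salt. Requests for more bytes pull in the chained blocks, which is where the non-standard extension comes in.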


    If you are interested in testing, then take a look at pem-create-keys.sh. It creates malformed PEM encoded keys (not certificates). For example, it will concatenate multiple keys without line breaks, it will delete one of the trailing dashes, and it will delete one of the trailing dashes and then concatenate another key.
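    If you would rather generate such inputs from your own test harness, the three malformations mentioned above can be approximated in Python as below. The helper names and the SAMPLE payload are made up for illustration (the payload is a placeholder, not a real key), and this is not a translation of pem-create-keys.sh.

```python
SAMPLE = (
    "-----BEGIN RSA PRIVATE KEY-----\n"
    "QUJDREVGRw==\n"                     # placeholder payload, not a real key
    "-----END RSA PRIVATE KEY-----\n"
)


def joined_without_breaks(pem, copies=2):
    """Concatenate several objects with no line break between them."""
    return pem.strip() * copies


def missing_trailing_dash(pem):
    """Delete one of the trailing dashes from the END line."""
    return pem.strip()[:-1] + "\n"


def missing_dash_then_concat(pem):
    """Delete a trailing dash, then append another object directly after it."""
    return pem.strip()[:-1] + pem
```

    A parser under test should split the first input into two objects and should reject, or at least not mis-split, the other two.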