Search code examples
sqlitetcl

Efficient method of splitting strings into segments of equal length?


Is there a fast way to split a string into segments of equal length in SQLite and/or Tcl?

My goal is to take a string in memory and write the splits to rows in a table. Sometimes the string is passed via a channel and other times it is concatenated from fragments already held in SQLite tables.

I've been using a CTE with substr and the best I can get is about 1.6 seconds for a string of about 1 MB into 4216 pieces/rows.

The simplest split attempt example is

select
   sg.value, substr(:textContent, sg.value, 250)
from
   generate_series(1, :tcLen, 250) sg
;

It takes only 0.129 seconds to combine those 4216 pieces from the table using group_concat and that includes a substr on every row to select either all or part of each row's string.

Thank you for any guidance you may be able to provide.

EDIT:

I now see that this answer mentions :textutil::spit::splitn in tcllib. Will give it a try; should've thought of looking there but assumed SQL would be quicker.

Using the text utility as below and inserting within the loop, reduced the 1.6 seconds to around 0.26 seconds.

foreach row [::textutil::split::splitn $textContent $bufferRowMaxLength] {
  set charLength [expr {min([string length $row], $bufferRowMaxLength)}]
  db eval {
    insert into mem.pt_buffers values ( ... );
    insert into mem.pt_pointer values ( ... );
  }
}

Solution

  • textutil::split::splitn is very fast, even though it's just the obvious Tcl script for that solution, about a factor of 2 slower than the best I could do with c. Shouldn't really be too surprising, Tcl is really good at dealing with strings.

    One alternative I tried (based on a trick Donal shared ages ago), was chunking it with regexp:

    set chunks [regexp -all -inline .{1,$bufferMaxRowLength} $textContent]
    

    but this was surprisingly slow.

    Here are all my results:

    package require textutil
    package require jitc 0.5.3
    
    set str [string repeat a 1048576]
    
    set bufferRowMaxLength  250
    
    set split_cdef {
        code {
            #include <string.h>
    
            OBJCMD(split) {
                int         code = TCL_OK;
                Tcl_Obj**   chunkobjs = NULL;
                Tcl_Obj*    res = NULL;
    
                enum {A_cmd, A_STR, A_CHUNK, A_objc};
                CHECK_ARGS_LABEL(finally, code, "str chunk");
                int len;
                const char* str = Tcl_GetStringFromObj(objv[A_STR], &len);
                int chunk;
                TEST_OK_LABEL(finally, code, Tcl_GetIntFromObj(interp, objv[A_CHUNK], &chunk));
                const int chunks = (len + chunk - 1) / chunk;
                chunkobjs = ckalloc(chunks * sizeof(chunkobjs[0]));
                for (int i=0; i<chunks; i++) {
                    const int start = i * chunk;
                    const int end = start + chunk;
                    const int clen = end < len ? chunk : len - start;
                    chunkobjs[i] = Tcl_NewStringObj(str + start, clen);
                }
                replace_tclobj(&res, Tcl_NewListObj(chunks, chunkobjs));
                Tcl_SetObjResult(interp, res);
    
            finally:
                replace_tclobj(&res, NULL);
                if (chunkobjs) ckfree(chunkobjs);
                return code;
            }
    
            OBJCMD(foreach_chunk) {
                int         code = TCL_OK;
    
                enum {A_cmd, A_STR, A_CHUNK, A_VAR, A_SCRIPT, A_objc};
                CHECK_ARGS_LABEL(finally, code, "str chunk var script");
                int len;
                const char* str = Tcl_GetStringFromObj(objv[A_STR], &len);
                Tcl_WideInt chunk;
                TEST_OK_LABEL(finally, code, Tcl_GetWideIntFromObj(interp, objv[A_CHUNK], &chunk));
                const int chunks = (len + chunk - 1) / chunk;
                for (int i=0; i<chunks; i++) {
                    const int start = i * chunk;
                    const int end = start + chunk;
                    const int clen = end < len ? chunk : len - start;
                    if (NULL == Tcl_ObjSetVar2(interp, objv[A_VAR], NULL, Tcl_NewStringObj(str + start, clen), TCL_LEAVE_ERR_MSG)) {
                        code = TCL_ERROR;
                        goto finally;
                    }
                    TEST_OK_LABEL(finally, code, Tcl_EvalObjEx(interp, objv[A_SCRIPT], 0));
                }
    
            finally:
                return code;
            }
        }
    }
    
    jitc::bind jitc_split         $split_cdef split
    jitc::bind jitc_foreach_chunk $split_cdef foreach_chunk
    
    set chunk   .{1,$bufferRowMaxLength}
    
    puts "regexp: [timerate {
        set pieces [llength [regexp -all -inline $chunk $str]]
    }], pieces: $pieces"
    
    puts "splitn: [timerate {
        set pieces  [llength [::textutil::split::splitn $str $bufferRowMaxLength]]
    }], pieces: $pieces"
    
    puts "jitc split: [timerate {
        set pieces [llength [jitc_split $str $bufferRowMaxLength]]
    }], pieces: $pieces"
    
    puts "jitc foreach_chunk: [timerate {
        set pieces  0
        jitc_foreach_chunk $str $bufferRowMaxLength chunk {
            incr pieces
        }
    }], pieces: $pieces"
    
    regexp: 348330.7 µs/# 3 # 2.871 #/sec 1044.992 net-ms, pieces: 4195
    splitn: 567.176 µs/# 1764 # 1763.1 #/sec 1000.498 net-ms, pieces: 4195
    jitc split: 235.065 µs/# 4255 # 4254.1 #/sec 1000.201 net-ms, pieces: 4195
    jitc foreach_chunk: 463.043 µs/# 2160 # 2159.6 #/sec 1000.173 net-ms, pieces: 4195
    

    That the regexp implementation was so much slower than the others suggests there is some room for improvement in its implementation, some non-obvious inefficiency.

    But for me the takeaway is that, as long as you have the constraint of assembling a Tcl list with the chunk contents, then ::textutil::split::splitn (or the equivalent few lines of Tcl) is probably the right way to go. The added complexity of a c implementation doesn't win enough in my book to justify itself, and the gains would almost certainly be completely washed out by any work done on the chunks themselves.

    Note that the c implementations aren't equivalent to the Tcl or regexp ones here - they split into chunks of $bufferMaxRowLength bytes, not characters. Assuming the value being split is a string (rather than a bytearray, in which case the implementations are equivalent), then the chunks produced by the c implementations are chunks of the bytes representing the string in CESU-8 encoding (like UTF-8, but encoding the null byte as 0xC0 0x80, and using surrogate pairs for characters outside of the BMP). Those byte chunk boundaries could split the encoding of a single character, so the chunks themselves would not be a valid encoding of a string. Their purpose is as a stand-in for the minimum work needed to compute this result, and any implementation that splits on characters would have to do their work plus account for character boundaries, so they're valid as a lower-bound estimate of the work.