
Using a SQLite zeroblob as an append-only text buffer?


I'm trying to build a piece table for editing text in SQLite (through Tcl) and need an append-only text buffer. (I asked a related question earlier today and deleted it after realizing I had simply failed to properly encode the text being written to SQLite, or to set the encoding on the blob channel itself; so I apologize if I annoyed any of the 18 people kind enough to read that long question.)

It appeared that the way to do this is incremental blob I/O; so I added a zeroblob of a fixed size, opened a blob channel to it, and started writing text to it. It is finally working for multi-byte characters.

At some point, the segments of the buffer will have to be read and concatenated into a new document, along with segments from another buffer. There is the buffer itself, in binary UTF-8, and a table of pieces in which each row has a start position and a length, currently all in characters rather than bytes, because the piece table also relates to the document, and all document events are expressed as a character position and length. (I won't bore you with the details here, but without tracking character position and length it cannot be determined which pieces are impacted by a new event; at least, I don't know how it can be. A rough sketch of that lookup follows the table below.)

sqlite> select * from pieces;
id  buffer  start  len
--  ------  -----  ---
1     o     0      5  
2     a     68     10 
3     o     28     13 
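
For illustration, here is a rough sketch of that lookup: walk the pieces in document order, accumulating character lengths, until the total passes the event's character position. (The name pieceAt is just for illustration, not part of my code.)

# Given a flat list of {buffer start len} triples in document order,
# return {piece-index offset-into-piece} for character position $pos,
# or -1 if $pos is past the end of the document.
proc pieceAt {pieceList pos} {
  set docOffset 0
  set i 0
  foreach {buffer start len} $pieceList {
    if {$pos < $docOffset + $len} {
      return [list $i [expr {$pos - $docOffset}]]
    }
    incr docOffset $len
    incr i
  }
  return -1
}
# e.g. [pieceAt {o 0 5 a 68 10 o 28 13} 7] returns {1 2}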

My question has two parts.

  1. Is using a blob to hold text as an append-only buffer a reasonable approach and is there a better option in SQLite?

My plan, after flipping sides many times, is to track only character values in the piece table, because that makes write operations faster and keeps the piece table smaller (in columns, not rows). When the buffer is "rolled up", two things take place: a new buffer is created out of the surviving segments, and a new text document is created from those same segments. It would be efficient, I guess, to extract those segments from the buffer based on start byte and byte length. But the blob can be read into Tcl as text and the string constructed from the piece table's character start and length values (please see Method 1 below); then that string can be written back to SQLite as binary. Or the blob can be cast as text and substr() used in SQLite to concatenate the pieces, converting the result back to a blob as well (please see Method 3 below).

  2. So, question 2 is: which is better, tracking start and length in both characters and bytes at each write event, or making these conversions at the time of roll-up? And, if the latter is preferable, is it better to read the blob into Tcl or to cast it as text in SQLite?

I know that "better" questions are closed, but by better I mean most efficient: reading the blob into memory in Tcl and looping through it, appending pieces to a list and finally joining them, versus casting the blob as text and doing about the same in SQL with substr() and group_concat(). There may be another option of looping in Tcl, using chan seek on the start byte and chan read numChars for the character length (please see the incomplete Method 2). In that case, it might be necessary to track only the start byte and not the byte length; but how efficient is seeking and reading compared with the other methods?

Perhaps tracking both bytes and characters isn't an issue. The byte position is the end of the blob, because this is always an append-only buffer; and, according to the SQLite documentation, a blob does not have to be read into memory (as a string does) to determine its length. (EDIT: That previous statement is inaccurate: although it is true of blob length itself, the length of a zeroblob will always be the original size; thus, the length of the blob is not the ending position of the last write to it.) And the byte length of a piece is [string length [encoding convertto utf-8 $string]].
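
Given that, the append position has to be tracked explicitly. A rough sketch of an append that records where each write landed (the used column is hypothetical, not in the schema below; id 5 matches the demo row):

# Appends $string at the current end of data, tracked in a
# hypothetical 'used' column, and returns {byteStart byteLen}.
proc appendPiece {fdBlob string} {
  set nBytes [string length [encoding convertto utf-8 $string]]
  set offset [db onecolumn {select used from textblobs where id = 5}]
  chan seek $fdBlob $offset start
  chan puts -nonewline $fdBlob $string
  db eval {update textblobs set used = used + $nBytes where id = 5}
  return [list $offset $nBytes]
}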

There are several items to consider: more data to store, which operations need to be quickest, and whether performing the work in SQLite, where I suspect it is done at a lower level in C code, is going to be most efficient. All of that is beyond my knowledge and experience, apart from building it three ways and seeing how each performs.
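
As a start, each method can be wrapped in a proc and compared with Tcl's built-in time command, which reports microseconds per iteration; a sketch, with method1 standing in for any of the methods below:

proc method1 {} {
  # placeholder body; substitute the Method 1 loop from below
  string repeat x 100
}
set usec [lindex [time {method1} 1000] 0]
chan puts stdout "method1: $usec microseconds per run"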

Thank you for considering my question and I'd appreciate any guidance you may be able to provide.

package require sqlite3
set dbname blobs
if { [catch {sqlite3 db $dbname} result] } {
  chan puts stdout "result: $result"
  exit
}
chan puts stdout "Successfully opened database named '${dbname}'."
chan puts stdout "Version is [ db version ]."

db eval {
   create table if not exists textblobs (
      id integer primary key,
      textblob blob
   );
   insert into textblobs values(5, zeroblob(2000));
   create table pieces (
      id integer primary key,
      start integer,
      len integer
   );
   insert into pieces values
      (1,0,5),
      (2,61,10),
      (3,28,13)
   ;
}
set fdBlob [db incrblob textblobs textblob 5]
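# NOTE -translation binary also resets -encoding, so -encoding
# utf-8 must come after it to keep the channel decoding UTF-8.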
chan configure $fdBlob -translation binary -encoding utf-8 -buffering none

set testStr {--inserted Hebrew בְּרֵאשִׁ֖ית of char length 50--}
# NOTE These would be four different
# events passed to this code.
chan puts -nonewline $fdBlob "Hello there"
chan puts -nonewline $fdBlob $testStr
chan puts -nonewline $fdBlob ", friends"
chan puts -nonewline $fdBlob "!"

chan puts stdout "\$testStr: $testStr"
chan puts stdout "\[string length \$testStr\]: [string length $testStr]"
chan puts stdout "\[string length \[encoding convertto utf-8 \$testStr]\]:\
    [string length [encoding convertto utf-8 $testStr]]"

chan puts stdout "\[chan seek \$fdBlob 0 start\]: [chan seek $fdBlob 0 start]"
chan puts stdout "copy \$fdBlob stdout -size 2000 written below"
chan copy $fdBlob stdout -size 2000
chan puts stdout ""

chan puts stdout "\[chan seek \$fdBlob 0 start\]: [chan seek $fdBlob 0 start]"
set readblob [chan read $fdBlob]
chan puts stdout "\[chan read \$fdBlob\]: $readblob"
#chan puts stdout "\[encoding convertfrom utf-8 \$readblob\]:\
#   [encoding convertfrom utf-8 $readblob]"

chan puts stdout "\[string length \$readblob\]: [string length $readblob]"
#chan puts stdout "\[string bytelength \$readblob\]:\
#   [string bytelength $readblob]"

# Method 1
# Result of query is {0 5 61 10 28 13}.
set SQL {
select
  group_concat(start || ' ' || len, ' ') as pcMap
from pieces
order by id;
}

db eval $SQL {
  foreach {start len} $pcMap {
    lappend pieces [string range $readblob $start [expr {$start + $len - 1}]]
  }
}
chan puts stdout "Tcl pieces: [join $pieces {}]"

# Method 2
# Does not work as-is because chan seek
# is based on bytes, not characters; but one
# could build the piece-table map from the
# start byte and character length, and loop
# in Tcl as in Method 1, chan seeking to the
# start byte and chan reading the char length
# (a rough sketch follows the demo below).
chan seek $fdBlob 62 start
chan puts stdout "chan read of 10 characters from seek 62:\
   [chan read $fdBlob 10]"
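
# A rough sketch of a complete Method 2, assuming the piece map
# instead carried {byteStart charLen} pairs (values below derived
# by hand for the data above: byte starts 0, 73, 28). Each seek
# must land on a UTF-8 character boundary for the read to decode.
# (Sketch only; not run in the results shown below.)
set pieces2 {}
foreach {byteStart charLen} {0 5 73 10 28 13} {
  chan seek $fdBlob $byteStart start
  lappend pieces2 [chan read $fdBlob $charLen]
}
chan puts stdout "Method 2 pieces: [join $pieces2 {}]"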

# Method 3
# Could also build the piece table based on bytes,
# because substr works on blobs, the difference
# being that start and length refer to bytes
# rather than characters; and there would be no
# need to cast blobs as text. The issue may be
# that blobs cannot be searched when this text
# needs to be searched; or, perhaps, the search
# terms can be converted from text to binary and
# the blobs searched that way.
set SQL {
select group_concat(piece,'') as pieces
from
 (
  select substr(cast (textblob as text), start+1, len) as piece
  from textblobs, pieces
 );
}
db eval $SQL {
  chan puts stdout "SQL pieces: $pieces"
}
chan close $fdBlob
db close

Results:

Successfully opened database named 'blobs'.
Version is 3.42.0.
id: 5, textblob: 
$testStr: --inserted Hebrew בְּרֵאשִׁ֖ית of char length 50--
[string length $testStr]: 50
[string length [encoding convertto utf-8 $testStr]]: 62
[chan seek $fdBlob 0 start]: 
copy $fdBlob stdout -size 2000 written below
Hello there--inserted Hebrew בְּרֵאשִׁ֖ית of char length 50--, friends!
[chan seek $fdBlob 0 start]: 
[chan read $fdBlob]: Hello there--inserted Hebrew בְּרֵאשִׁ֖ית of char length 50--, friends!
[string length $readblob]: 1988
Tcl pieces: Hello, friends! בְּרֵאשִׁ֖ית
chan read of 10 characters from seek 62: length 50-
SQL pieces: Hello, friends! בְּרֵאשִׁ֖ית

Solution

    1. Is using a blob to hold text as an append-only buffer a reasonable approach and is there a better option in SQLite?

    If you want SQLite to index your string in some way (e.g., with the full-text search engine), putting it in a blob won't help. If you can know the whole string before writing it into a blob, it's probably best to do that; on most modern computers, strings of up to, say, 100MB won't stress things in the slightest. That's quite a lot of text, and it is per string/blob.
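
    For instance, in the Tcl interface the whole string can be written in a single statement; $var binds a value as text, while @var binds it as a byte array (blob). A sketch against the textblobs table above:

    set txt "Hello, friends!"
    set bytes [encoding convertto utf-8 $txt]
    db eval {update textblobs set textblob = @bytes where id = 5}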

    2. Which is better: tracking start and length in both characters and bytes at each write event, or making these conversions at the time of roll-up? And, if the latter is preferable, is it better to read the blob into Tcl or cast it as text in SQLite?

    You've got database columns; you can keep anything in them that you find helpful. If sometimes you need byte ranges and sometimes you need character ranges, keep both. It makes for a bit more complexity and time cost when building, but as long as you expect many more queries than writes, it makes sense to have the writes bear the cost.

    But do check whether you need to reinvent your own indexing scheme beyond what the built-in Full Text Search extension's virtual tables offer. (No idea how well they handle Hebrew text.)
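
    A minimal FTS5 sketch (the docs table name is illustrative, and this assumes your SQLite build includes FTS5):

    db eval {
      create virtual table if not exists docs using fts5(body);
      insert into docs(body) values('Hello, friends!');
    }
    db eval {select body from docs where docs match 'friends'} {
      chan puts stdout "FTS match: $body"
    }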