Search code examples
sqlitetcl

Using incrblob to determine byte location in a BLOB segment based on a character location in text string?


My question concerns this scenario. I'm storing text as a BLOB using incremental blob I/O of SQLite in the Tcl interface. Another table in the database contains rows of pointer data that point to segments of this BLOB. Since Tcl channels seek to byte locations and not character locations (which I assume is common, I'm not criticizing Tcl), I need to track the byte_start, char_start, byte_length, char_length of each pointer.

A request comes in to Tcl that indicates an existing pointer (not the actual buffer data, but only the pointer itself) needs to be split into two pointers, starting at some character location. The request comes from the UI and that code knows nothing about the byte locations, and really nothing about the BLOB.

Therefore, I need to extract the segment of the BLOB, convert it to text, and determine the byte length of at least on of the two new pieces, and use that information to determine the start byte and byte length of both new pointers.

I was going to use something like below in SQLite but, according to Dr. Hipp, the whole BLOB must be read into memory to use substr on a BLOB.

select
   octet_length(
      substr(
         cast(
            substr(
               buffer,
               byte_start,
               byte_length
            ) as text
         ),
         1,
         :len
      )
   )
from
   mem.pt_buffers
where
       doc_id = :doc_id
   and buffer_id = :buffer_id_s
;

Therefore, I'm using incrblob in Tcl which this example illustrates. It appears to generate the correct results but it seems like a lot of work, too. For example, the content is being extracted from the BLOB only to determine the byte postion of the character position of the split point. The content is not used otherwise, nor is the BLOB being modified. If there were not the issue of byte position being unknown to the UI, the content would not need to be extracted and the operation would be arithmetic only.

My questions are:

  1. Am I doing this the hard way; is there a simpler approach?
  2. Can/should more of it be done directly in SQLite?

Thank you for considering my question.

package require sqlite3
sqlite3 db
db eval {create table test (id integer, data blob);}
db eval {insert into test values (1, zeroblob(100));}
puts [db eval {select id, cast(data as text) from test;}]
# 1 {}

# Previously, a request had to come in to append this string
# to the BLOB; to the non-zero portion, anyway. This is re-
# quired set-up for the question.
set fdBlob [db incrblob main test data 1]
chan configure $fdBlob -translation binary -buffering none
set bindata [encoding convertto utf-8\
     {This is some הַ / נָּבִ֑יא text cast as a BLOB.}]
chan puts -nonewline $fdBlob $bindata

puts [db eval {
   select
      length(data),
      length(cast(data as text)),
      length(:bindata)
   from
      test
   ;
}]
# 100 47 57

# Pre-split pointer data:
# piece  char_start char_length byte_start byte_length
# -----  ---------- ----------- ---------- -----------
# orig        0          47           0         57

# Request comes in to split the pointer at the 27th character
# into two pointers: characters 0-26 and 27-end.

# Retrieve the segment of the BLOB. Have to retrieve the full
# piece because do not know where to split the bytes. Convert
# from binary to text. Note [chan read numChars] reads chars,
# but since channel is configured as binary, same as bytes.
chan seek $fdBlob 0 start
set data [encoding convertfrom utf-8 [chan read $fdBlob 57]]

# Split the piece. Need only one or the other, not both.
set frontEnd [string range $data 0 26]
set tailEnd [string range $data 27 end]
puts "\"$frontEnd\" \"$tailEnd\""
# "This is some הַ / נָּבִ֑יא " "text cast as a BLOB."

# Convert the substring back to binary and determine byte length.
set frontEndByteLen [string length [encoding convertto utf-8 $frontEnd]]
set tailEndByteLen [expr {57-$frontEndByteLen}]
puts "Front-end data: byte_start: 0 byte_length $frontEndByteLen"
puts "Tail-end data: byte_start: $frontEndByteLen byte_length $tailEndByteLen"
# Front-end data: byte_start: 0 byte_length 37
# Tail-end data: byte_start: 37 byte_length 20

# Test it out by seeking to the start byte and extracting the tail-end.
chan seek $fdBlob $frontEndByteLen start
set data [encoding convertfrom utf-8 [chan read $fdBlob $tailEndByteLen]]
puts $data
# text cast as a BLOB.

# Then the pointer table will have these new pieces inserted.
# piece  char_start char_length byte_start byte_length
# -----  ---------- ----------- ---------- -----------
# front       0          27           0         37
# tail       27          20          37         20

EDIT: It took awhile for me little brain to process the closing statement of the answer, though it is simple enough:

Another consideration is that, while chan seek operates on byte offsets, chan read works in characters, so it might be easier to fetch the pieces that way (particularly if it's the first part you want - just seek to 0 and read that many characters)

I knew that as commented in the original post but did not have the sense to employ it properly. All that was needed was to re-configure the channel to -encoding utf-8, chan seek on bytes and then chan read on chars. Therefore, the front-end segment can be read directly in incrblob i/o without the need to read the full piece and split using string range in Tcl. I was stupidly thinking that the blob has to read as binary. That eliminates a bit of work.

# Version 2
set frontEnd_byte_start 0
set frontEnd_char_length 27
chan configure $fdBlob -encoding utf-8 -buffering none
chan seek $fdBlob $frontEnd_byte_start start
set frontEndText [chan read $fdBlob $frontEnd_char_length]
set frontEnd_byte_length [string length [encoding convertto utf-8 $frontEndText]]
set tailEnd_byte_start [expr {$frontEnd_byte_start + $frontEnd_byte_length}]
puts "frontEnd text: $frontEndText"
puts "frontEnd byte length: $frontEnd_byte_length"
puts "tailEnd byte start: $tailEnd_byte_start"

chan seek $fdBlob $tailEnd_byte_start start
set data [chan read $fdBlob 20]
puts $data

# frontEnd text: This is some הַ / נָּבִ֑יא 
# frontEnd byte length: 37
# tailEnd byte start: 37
# text cast as a BLOB.

Solution

  • Your example is how I would approach it, particularly if the character offset is coming from a Tcl UI. Since sqlite is in-process there aren't any considerations of network latency such - it's really just whether the character offset calculations are being performed by sqlite's code or Tcl's. Tcl is really pretty good at that (being really into strings and everything).

    Particularly if the UI is Tcl, since the character indexes will agree under situations like surrogate pairs, etc.

    It may be more efficient to just select out the blob value, split with string range commands and update the row, but that will depend on specifics on how the channel wrapping of the db value is implemented and what complexity it implies. Best to benchmark your actual case to know. Another consideration is that, while chan seek operates on byte offsets, chan read works in characters, so it might be easier to fetch the pieces that way (particularly if it's the first part you want - just seek to 0 and read that many characters)