Search code examples
algorithmbitbranchless

Finding YAML single quotes branchlessly


So I have the following problem, in YAML single quotes are a strange beast.

Problem

Short primer: a YAML single quotes spans one line, and has no escape character. Instead the ' character is doubled ('') to escape it.

'I''m a string' # parses to Scalar value of "I'm a string"

However, if the string is empty, then it's not treated as escape.

'' # parses to Scalar value "" (empty string)

Which can be a problem. Consider following line:

{'':''''} # parses to Map {"": "\'"}

When applied to a bitmask you get

bytes { ' ' : ' ' ' ' }
bitmask 0 1 1 0 1 1 1 1 0
relevant quotes 1 1 1 1

But in another example

''' '  # Parses to Scalar "\' "

Gets following bitmaks

bytes ' ' ' '
bitmask 1 0 1 1 1
relevant quotes 1 1

In second example the 3rd and 4th byte are ignored because they are inside an existing single quote string i.e they are treated as '' escape for '. And this logic depends on how many single quotes were beforehand.

How to express this logic branchlessly if bit state depends on previous bytes being in/out single quote?

What I tried

I tried several ways to calculate, theses bit values, but I'm kinda stuck, the furtherest I got was logic that can select the bits correctly in case #1. Logic is bellow in pseudo Rust code

fn find_relevant_bytes(bytes: u8) -> u8 {
   let  low_part  = bytes & bytes >> 1;
   let  high_part = bytes & bytes << 1;
   
   return low_part ^ high_part;
}

So what you get if you run them is

find_relevant_bytes(0b0110_1111) // returns `0b0110_1001` as expected
find_relevant_bytes(0b00010111) // returns `0b00000101` instead of `0b00010001`

However this will have problem in second case. So I used some odd/even tricks

let even = bytes & 0x55; // i.e. 0101_0101
let odd = bytes & 0xAA;  // i.e. 1010_1010
let remove_pairs = even ^ (odd << 1);

But these are position-dependent. I'd need to unify it somewhat because a '''' can start on odd or even positions. Not so much annuling, as getting the one ' that follows them.

Naive solution

The naive solution to the problem is somewhat like following:

use crate::QuoteState::*;

fn main() {
    println!("{:#010b}", find_quotes(0b10011111));
}

pub enum QuoteState {
    OutsideQuote,
    InQuote,
    InQuoteQuote,
}

pub fn find_quotes(mask: u8) -> u8 {
    let mut quotes_state = QuoteState::OutsideQuote;
    let mut pos = 0;
    let mut quotes = 0;

    loop {
        if pos >= 8 {
            break
        }
        
        let found_quote = mask & (1<< pos) != 0;
        quotes_state = match quotes_state {
            // If not in quote, switch to quote state
            OutsideQuote if found_quote => {
                quotes |= 1 << pos;
                InQuote
            },
            // If in quote and found a quote, switch to possible end quote state
            // or escape quote state
            InQuote if found_quote => {
                InQuoteQuote
            }
            // If no quote after a quote was encountered then it's end of the quote
            InQuoteQuote if !found_quote => {
                quotes |= 1 << (pos - 1);
                OutsideQuote
            },
            // If quote after quote, ignore.
            InQuoteQuote if found_quote => {
                InQuote
            },
            e => e
        };
        
        pos += 1;
    }
    // Deal with last quote state assuming 9th bit is not quote.
    if let InQuoteQuote = quotes_state {
        quotes |= 1 << 7;
    }
    quotes
}

Playground link

It's my approximation of the formal version, but it shows the problems of YAML single quotes, they depend on look-ahead, or depend on remembering what characters were encountered.


Solution

  • Disclaimer

    To solve this issue, I had to relax the rules a bit. My goal is still the same find the relevant bits for quote but to solve this issue:

    Which can be a problem. Consider following line:

    {'':''''} # parses to Map {"": "\'\'"}
    
    bytes { ' ' : ' ' ' ' }
    bitmask 0 1 1 0 1 1 1 1 0
    relevant quotes 1 1 1 1

    Getting

    bytes { ' ' : ' ' ' ' }
    bitmask 0 1 1 0 1 1 1 1 0
    relevant quotes 1 1 1 1 1 1

    Would work as well. These two solutions can be transformed from one into another. You can convert edges using the compute_quote_mask function described below, and you can convert consecutive bits via (bits && !(bits <<1)) || (bits && !(bits >> 1))

    Exclusion of the third option

    There can be even or odd number of quotes, zero quotes notwithstanding, they do not count in any way. To find the odd start quotes and odd end quotes, we need to eliminate even quotes. We'll handle this later.

    So:

    let odd_quotes = quotes ^ even_quotes
    

    Once we find the odd quotes, we can find start quotes and end quotes.

    let odd_ends = odd_quotes & !(odd_quotes << 1)
    let odd_starts = odd_quotes & !(odd_quotes >> 1)
    

    Once we have an ends/starts, we can convert them from edges of consecutive bits (e.g. 01000100) into consecutive bits (e.g. 00111100) using compute_quote_mask function:

    fn compute_quote_mask(quote_bits: u64) -> u64 {
        let mut quote_mask: u64 = quote_bits ^ (quote_bits << 1);
        quote_mask = quote_mask ^ (quote_mask << 2);
        quote_mask = quote_mask ^ (quote_mask << 4);
        quote_mask = quote_mask ^ (quote_mask << 8);
        quote_mask = quote_mask ^ (quote_mask << 16);
        quote_mask = quote_mask ^ (quote_mask << 32);
        quote_mask
    }
    
    

    Assume for a moment we have the following bytes

    characters ' ' ' ' ' ' ' '
    quotes 1 1 1 0 1 1 1 1 1
    odd_ends 1 1
    end_quote_mask 1 1 1 1 1 1
    odd_starts 1 1
    starts_quote_mask 1 1 1 1
    desired outcome 1 1 1 1 1 1 1 1 1

    As we can see, our desired outcome is a union of end_quote_mask, start_quote with rightmost 1 just being this union shifted by 1 to the left.

    let quotes = end_quote_mask | starts_quote_mask;
    let quotes = quotes | quotes << 1;
    

    How to find even quotes?

    This is all well and good, but how do we find even quotes then?

    Basically, read up on Parsing Gigabytes of JSON per Second.

    In there, there is an interesting method to find odd double quotes. We can generalize it to find both odd and even quotes.

        fn scan_for_mask(bits: u64, prev_iteration_odd: &mut u32, mask: u64) -> u64 {
            let start_edges = bits & !(bits << 1);
            let prev_iter_odd = u64::from(*prev_iteration_odd);
    
            // flip lowest if we have an odd-length run at the end of the prior iteration
            let even_start_mask = EVEN_BITS ^ prev_iter_odd;
            let even_starts = start_edges & even_start_mask;
            let odd_starts = start_edges & !even_start_mask;
            let even_carries = bits.wrapping_add(even_starts);
    
            // must record the carry-out of our odd-carries out of bit 63; this
            // indicates whether the sense of any edge going to the next iteration
            // should be flipped
            let (mut odd_carries, iter_ends_odd_backslash) = bits.overflowing_add(odd_starts);
    
            odd_carries |= prev_iter_odd;
            // push in a bit zero as a potential end
            // if we had an odd-numbered run at the
            // end of the previous iteration
            *prev_iteration_odd = u32::from(iter_ends_odd_backslash);
            let even_carry_ends = even_carries & !bits;
            let odd_carry_ends = odd_carries & !bits;
            let even_start_odd_end = even_carry_ends & mask;
            let odd_start_even_end = odd_carry_ends & !mask;
            even_start_odd_end | odd_start_even_end
        }
    

    Without going too much into the weeds of this method (read the figure 3 it goes into much more details), it relies on adding a start edge to the quotes bit mask and seeing if the carry from the sum of even subset and bit masks ends on odd bits or vice versa.

    Basically, if carry is odd digit wise, it will start on even digits and end on odd digits and vice versa. However, this can be made work for even carries, since it will start on even(or odd) and end on even (or odd) digits respectively.

    We have even ends now what?

    Sadly, to get even starts, we need to do more. A lot more. I've banged my head against this problem for a solid month. If anyone has a branchless way to find it, I will love if you can share it.

    That said, one month of head banging against this problem, led to an epiphany that I don't need to solve it.

    Rather than getting the end of even quotes, then finding the start of even quotes, then computing all even quotes, I can select all even quotes based on end quotes!

    Here it is:

    pub fn select_consecutive_bits_branchless(input: u64, mask: u64) -> u64 {
        let mut result = 0;
    
        result |= input & mask;
    
        let mut a = input & 0x7FFF_FFFF_FFFF_FFFF;
        result |= (result >> 1) & a;
    
        a &= a << 1;
        result |= ((result >> 1) & a) >> 1;
    
        a &= a << 2;
        result |= ((result >> 1) & a) >> 3;
    
        a &= a << 4;
        result |= ((result >> 1) & a) >> 7;
    
        a &= a << 8;
        result |= ((result >> 1) & a) >> 15;
    
        a &= a << 16;
        result |= ((result >> 1) & a) >> 31;
    
        result
    }
    

    If you give this method an input 0111_0110 and mask 0100_000 it will select all consecutive bits, on the right side of any 1 in mask.

    Now we have even bits! Which is the final part of the puzzle!