How to get char range from byte range


I have an external library whose string representation is equivalent to &[char].

Some of its editing interfaces accept a range of type CharRange = Range<usize>, where the offsets count chars.

On the other hand, some other Rust libraries I use take a range of type ByteRange = Range<usize>, where the offsets count bytes (u8).
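
To make the difference between the two kinds of offset concrete, here is a small sketch. The helper name char_and_byte_offsets and the sample string are made up for illustration; it only relies on str::char_indices from the standard library.

```rust
/// Hypothetical helper: pairs each char's char offset with its byte offset.
fn char_and_byte_offsets(text: &str) -> Vec<(usize, usize)> {
    text.char_indices()
        .enumerate()
        .map(|(char_idx, (byte_idx, _))| (char_idx, byte_idx))
        .collect()
}

/// Usage: 'a' (1 byte), U+00E9 (2 bytes), U+6F22 (3 bytes), 'b' (1 byte).
fn example_offsets() {
    let text = "a\u{e9}\u{6f22}b";
    // The two offsets agree only while every preceding char is ASCII.
    assert_eq!(
        char_and_byte_offsets(text),
        vec![(0, 0), (1, 1), (2, 3), (3, 6)]
    );
    assert_eq!(text.chars().count(), 4); // length in chars
    assert_eq!(text.len(), 7); // length in bytes
}
```

So the char range 2..4 in this string corresponds to the byte range 3..7.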


Currently I am using an O(n) algorithm, and it has become a performance bottleneck.

Is there an efficient data structure for converting between the two?

use std::ops::Range;

type CharRange = Range<usize>;
type ByteRange = Range<usize>;

fn byte_range_to_char_range(text: &str, byte_range: ByteRange) -> CharRange {
    // Count the chars before each byte offset (two scans from the start).
    // Slicing panics if an offset is not on a char boundary.
    let start = text[..byte_range.start].chars().count();
    let end = text[..byte_range.end].chars().count();
    start..end
}

fn char_range_to_byte_range(text: &str, char_range: CharRange) -> ByteRange {
    // Walk to the n-th char; an offset past the last char maps to text.len().
    let start = text.char_indices().nth(char_range.start).map(|(i, _)| i).unwrap_or(text.len());
    let end = text.char_indices().nth(char_range.end).map(|(i, _)| i).unwrap_or(text.len());
    start..end
}

Solution

  • You can improve it slightly by not iterating from the very start again, but it's probably not worth it unless your texts are very long:

    use std::ops::Range;
    type CharRange = Range<usize>;
    type ByteRange = Range<usize>;
    
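    pub fn byte_range_to_char_range(text: &str, byte_range: ByteRange) -> CharRange {
        // Count the chars before the range, then only the chars inside it.
        // Slicing panics if an offset is not on a char boundary.
        let start = text[..byte_range.start].chars().count();
        let size = text[byte_range.start..byte_range.end].chars().count();
        start..start + size
    }
    
    pub fn char_range_to_byte_range(text: &str, char_range: CharRange) -> ByteRange {
        let mut iter = text.char_indices();
        // An offset past the last char maps to text.len().
        let start = iter.nth(char_range.start).map(|(i, _)| i).unwrap_or(text.len());
        // `nth` has already consumed the char at `start`, so skip `len - 1` more.
        // An empty range is handled separately to avoid underflow in `len - 1`.
        let end = match char_range.end.saturating_sub(char_range.start) {
            0 => start,
            len => iter.nth(len - 1).map(|(i, _)| i).unwrap_or(text.len()),
        };
        start..end
    }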
    

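    Beyond that, a single conversion on a plain &str can't do any better than O(n): UTF-8 is a variable-width encoding (1 to 4 bytes per char), so the byte offset of the n-th char is only known after scanning the text before it.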
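
That limit applies to a one-off conversion on a bare &str. If many conversions are made against the same unchanged text, one possible workaround (not part of the answer above, and only a sketch) is to pay the O(n) scan once and keep a table of checkpoints. The CharIndex type, its method names, and the STRIDE value below are all hypothetical; only standard library calls are used.

```rust
use std::ops::Range;

/// Number of chars between checkpoints (an arbitrary choice for this sketch).
const STRIDE: usize = 64;

/// Hypothetical helper, not part of any library: remembers the byte offset
/// of every `STRIDE`-th char so a lookup scans at most `STRIDE` chars.
pub struct CharIndex {
    /// `checkpoints[k]` is the byte offset of char number `k * STRIDE`.
    checkpoints: Vec<usize>,
}

impl CharIndex {
    /// One-time O(n) pass over the text.
    pub fn new(text: &str) -> Self {
        let checkpoints = text
            .char_indices()
            .step_by(STRIDE)
            .map(|(byte_idx, _)| byte_idx)
            .collect();
        CharIndex { checkpoints }
    }

    /// Char offset -> byte offset; offsets past the end map to `text.len()`.
    pub fn char_to_byte(&self, text: &str, char_idx: usize) -> usize {
        match self.checkpoints.get(char_idx / STRIDE) {
            // Jump to the checkpoint, then walk the remaining chars.
            Some(&base) => text[base..]
                .char_indices()
                .nth(char_idx % STRIDE)
                .map_or(text.len(), |(i, _)| base + i),
            None => text.len(),
        }
    }

    /// Byte offset -> char offset. Panics if `byte_idx` is not on a char
    /// boundary or is greater than `text.len()`.
    pub fn byte_to_char(&self, text: &str, byte_idx: usize) -> usize {
        // Binary search for the last checkpoint at or before `byte_idx`.
        let n = self.checkpoints.partition_point(|&b| b <= byte_idx);
        let k = n.saturating_sub(1);
        let base = self.checkpoints.get(k).copied().unwrap_or(0);
        k * STRIDE + text[base..byte_idx].chars().count()
    }

    pub fn char_range_to_byte_range(&self, text: &str, r: Range<usize>) -> Range<usize> {
        self.char_to_byte(text, r.start)..self.char_to_byte(text, r.end)
    }

    pub fn byte_range_to_char_range(&self, text: &str, r: Range<usize>) -> Range<usize> {
        self.byte_to_char(text, r.start)..self.byte_to_char(text, r.end)
    }
}
```

The index is only valid for the exact text it was built from, so it has to be rebuilt (or updated) after every edit. Whether that pays off depends on how often the text changes compared with how often ranges are converted; a smaller STRIDE trades memory for shorter scans.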