Search code examples
parsingrustreferencetokenabstract-syntax-tree

References vs Rc's to values in Context for tokenization


I am creating an algebraic tokenizer/parser where implementations of individual token prototypes (such as operators, functions, constants, ...) are contained in a Context struct so that I don't have to clone tokens in the abstract syntax tree, but only references to those contained in the Context.

My question is about using Rcs or references to keep track of tokens. When should I use one or the other? In this case I don't see a real difference, apart from some extra code syntax in the Rc case.

Are there any major downsides in terms of performance?

Playground for reference


Solution

  • With Rc you will be able to build the sequence of tokens in a function and return it, because the tokens own (in a shared fashion) the operators they need.

    fn build_with_rc() -> Vec<TokenWithRc> {
        let add = Operator(String::from("+"));
        let cx = ContextWithRc {
            operators: vec![Rc::new(add)],
        };
        vec![
            TokenWithRc::Number(1.0),
            TokenWithRc::Operator(cx.operators[0].clone()),
            TokenWithRc::Number(2.0),
            TokenWithRc::Operator(cx.operators[0].clone()),
            TokenWithRc::Number(3.0),
        ]
    }
    

    On the other hand, with lifetimes, you cannot express an equivalent function.

    fn build_with_lifetime() -> Vec<TokenWithLifetime<'???>> { // which lifetime?
        // ...
    }
    

    but you could consider passing the context

    fn build_with_lifetime<'a>(
        cx: &'a ContextWithLifetime<'a>
    ) -> Vec<TokenWithLifetime<'a>> {
        vec![
            TokenWithLifetime::Number(1.0),
            TokenWithLifetime::Operator(cx.operators[0]),
            TokenWithLifetime::Number(2.0),
            TokenWithLifetime::Operator(cx.operators[0]),
            TokenWithLifetime::Number(3.0),
        ]
    }
    

    But, since the sequence of tokens keeps a reference to the context, you won't be able to mutate the context while the sequence of tokens exists.

        let add = Operator(String::from("+"));
        let mut cx = ContextWithLifetime {
            operators: vec![&add],
        };
        let tokens = build_with_lifetime(&cx);
        let sub = Operator(String::from("-"));
        cx.operators.push(&sub); // !!! ERROR
        // cannot borrow `cx.operators` as mutable
        // because it is also borrowed as immutable
        println!("{tokens:?}");
    

    In the end, it depends on the intended usage of the context. If it is build once for all and can stay immutable till the end of the algorithm using the tokens, then the lifetime solution is perfect. On the other hand, if the context is supposed to be mutated while the tokens still exists, then the Rc solution is probably better.

    An alternative way would be to use indices (used with cx.operators) instead of references, but you need to be extra-careful when mutating this vector, especially if you remove some operators.

    You asked about any downsides in terms of performances. In theory, Rc implies one more indirection when accessing the structure, and a more fragmented memory layout (hard to prefetch), because each structure is heap-allocated independently. However, in practice, the difference may not be noticeable because the problem is a graph (AST) traversal in which nothing is predictable (in terms of prefetch) when travelling from one node to another. Randomly jumping through well-packed operators in an array is not very different from jumping though many fragmented allocations (except in terms of TLB miss eventually); it would be important in the case of a perfectly regular traversal, where prefetch could help a lot.