Tags: performance, parsing, rust, biblatex

Parsing a biblatex (.bib) file and converting it to the needed Vec takes a huge amount of time


I want to broaden my horizons in terms of programming languages and, thus, am trying to build a little helper app for managing my .bib file in Rust.

Now I'm stuck on a problem for which I couldn't find a solution so far.

I wrote a module that reads my .bib file, parses it, and processes the entries into a vector containing one inner vector per bibliographic entry, each holding the needed fields. The output should look like this (printed with println! and {:#?}):

[
        [
            "Grandsire",
            "The METAFONTtutorial",
            "2004",
            "online",
            "grandsire_the_metafonttutorial_2004",
        ],
        [
            "Gruber",
            "Daring Fireball",
            "2001",
            "online",
            "gruber_markdown",
        ],
        [
            "Schmuli (ed.)",
            "How TeX macros actually work",
            "emtpy",
            "online",
            "how_tex_macros_actually_work",
        ],
        [
            "Skibinski",
            "Automated JATS XML to PDF conversion",
            "2018",
            "online",
            "skibinski_automated_jats_xml_to_pdf_conversion_2018",
        ],
],

My code is able to produce that output, but with a test file containing around 500 separate (and simple) biblatex entries it takes a huge amount of time!

When I run the bin with $ time bibfilebin /path/to/testbib.bib it takes about 10 seconds to process the whole thing!

That seems waaay too long for a language like Rust, known for its speed, processing a simple plain text file.

I'm sure the mistake comes from my very limited knowledge of Rust in particular and of real programming languages in general, since I'm not a trained programmer.

It may be due to the many loops/iterators in the code or a programming mistake which reads the file over and over again for every entry. But, so far, I couldn't find the source of it.

Here is my code; it uses the additional crates biblatex and sarge:

use bib::BibiData;
use cliargs::*;

pub mod cliargs {
    use core::panic;
    use std::path::PathBuf;

    use sarge::prelude::*;

    sarge! {
        // Name of the struct
        ArgumentsCLI,
    }

    pub struct PosArgs {
        pub bibfilearg: PathBuf,
    }

    impl PosArgs {
        pub fn parse_pos_args() -> Self {
            let (_, pos_args) =
                ArgumentsCLI::parse().expect("Could not parse positional arguments");
            Self {
                bibfilearg: if pos_args.len() > 1 {
                    PathBuf::from(&pos_args[1])
                } else {
                    panic!("No path to bibfile provided as argument")
                },
            }
        }
    }
}

pub mod bib {
    use super::PosArgs;
    use std::{fs, path::PathBuf};

    use biblatex::{self, Bibliography};
    use biblatex::{ChunksExt, Type};

    pub struct BibiMain {
        pub bibfile: PathBuf,           // path to bibfile
        pub bibfilestring: String,      // content of bibfile as string
        pub bibliography: Bibliography, // parsed bibliography
        pub citekeys: Vec<String>,      // list of all citekeys
    }

    impl BibiMain {
        pub fn new() -> Self {
            let bibfile = PosArgs::parse_pos_args().bibfilearg;
            let bibfilestring = fs::read_to_string(&bibfile).unwrap();
            let bibliography = biblatex::Bibliography::parse(&bibfilestring).unwrap();
            let citekeys = Self::get_citekeys(&bibliography);
            Self {
                bibfile,
                bibfilestring,
                bibliography,
                citekeys,
            }
        }

        pub fn get_citekeys(bibstring: &Bibliography) -> Vec<String> {
            let mut citekeys: Vec<String> =
                bibstring.iter().map(|entry| entry.to_owned().key).collect();
            citekeys.sort_by_key(|name| name.to_lowercase());
            citekeys
        }
    }

    #[derive(Debug)]
    pub struct BibiData {
        pub entry_list: BibiDataSets,
    }

    impl BibiData {
        pub fn new() -> Self {
            let citekeys = BibiMain::new().citekeys;
            Self {
                entry_list: BibiDataSets::from_iter(citekeys),
            }
        }
    }

    #[derive(Debug)]
    pub struct BibiDataSets {
        pub bibentries: Vec<Vec<String>>,
    }

    impl FromIterator<String> for BibiDataSets {
        fn from_iter<T: IntoIterator<Item = String>>(iter: T) -> Self {
            let bibentries = iter
                .into_iter()
                .map(|citekey| BibiEntry::new(&citekey))
                .collect();
            Self { bibentries }
        }
    }

    #[derive(Debug)]
    pub struct BibiEntry {
        pub authors: String,
        pub title: String,
        pub year: String,
        pub pubtype: String,
        pub citekey: String,
    }

    impl BibiEntry {
        pub fn new(citekey: &str) -> Vec<String> {
            vec![
                Self::get_authors(citekey),
                Self::get_title(citekey),
                Self::get_year(citekey),
                Self::get_pubtype(citekey),
                citekey.to_string(),
            ]
        }

        fn get_authors(citekey: &str) -> String {
            let biblio = BibiMain::new().bibliography;
            let authors = {
                if biblio.get(&citekey).unwrap().author().is_ok() {
                    let authors = biblio.get(&citekey).unwrap().author().unwrap();
                    if authors.len() > 1 {
                        let authors = format!("{} et al.", authors[0].name);
                        authors
                    } else if authors.len() == 1 {
                        let authors = authors[0].name.to_string();
                        authors
                    } else {
                        let editors_authors = format!("empty");
                        editors_authors
                    }
                } else {
                    if biblio.get(&citekey).unwrap().editors().is_ok() {
                        let editors = biblio.get(&citekey).unwrap().editors().unwrap();
                        if editors.len() > 1 {
                            let editors = format!("{} (ed.) et al.", editors[0].0[0].name);
                            editors
                        } else if editors.len() == 1 {
                            let editors = format!("{} (ed.)", editors[0].0[0].name);
                            editors
                        } else {
                            let editors_authors = format!("empty");
                            editors_authors
                        }
                    } else {
                        let editors_authors = format!("empty");
                        editors_authors
                    }
                }
            };
            authors
        }

        fn get_title(citekey: &str) -> String {
            let biblio = BibiMain::new().bibliography;
            let title = {
                if biblio.get(&citekey).unwrap().title().is_ok() {
                    let title = biblio
                        .get(&citekey)
                        .unwrap()
                        .title()
                        .unwrap()
                        .format_verbatim();
                    title
                } else {
                    let title = format!("no title");
                    title
                }
            };
            title
        }

        fn get_year(citekey: &str) -> String {
            let biblio = BibiMain::new().bibliography;
            let year = biblio.get(&citekey).unwrap();
            let year = {
                if year.date().is_ok() {
                    let year = year.date().unwrap().to_chunks().format_verbatim();
                    let year = year[..4].to_string();
                    year
                } else {
                    let year = format!("emtpy");
                    year
                }
            };
            year
        }

        fn get_pubtype(citekey: &str) -> String {
            let biblio = BibiMain::new().bibliography;
            let pubtype = biblio.get(&citekey).unwrap().entry_type.to_string();
            pubtype
        }
    }
}

fn main() {
    let entry_vec = BibiData::new().entry_list;

    println!("{:#?}", entry_vec);
}

I'm aware that there might be some real beginner's mistakes.

Therefore, I appreciate every kind of help or even just a hint. First of all, to solve the problem, but also to help me learn the concepts and ways of coding in Rust.


Solution

  • As suggested, I'll provide a (for the moment, partial) answer:

    In all get_... functions I replaced the direct call of BibiMain::new() with a parameter biblio: &Bibliography, as @MindSwipe explained.

    E.g. get_title now looks like this (the old variable is commented out):

            fn get_title(citekey: &str, biblio: &Bibliography) -> String {
                // let biblio: &Bibliography = &BibiMain::new().bibliography;
                let title = {
                    if biblio.get(&citekey).unwrap().title().is_ok() {
                        let title = biblio
                            .get(&citekey)
                            .unwrap()
                            .title()
                            .unwrap()
                            .format_verbatim();
                        title
                    } else {
                        let title = format!("no title");
                        title
                    }
                };
                title
            }
    

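    Since BibiMain::new() reads and parses the whole file, this change stops the file from being re-read and re-parsed in every getter. It solves most of the issue and makes startup much faster!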
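To make the difference visible, here is a minimal sketch that only counts how often the bibliography gets loaded. It does not use the biblatex crate: `Bibliography`, `load`, `title_of`, and `compare` are hypothetical stand-ins, and `load` plays the role of `fs::read_to_string` plus `Bibliography::parse`. In my original code every entry called four getters, and each getter called `BibiMain::new()`, so 500 entries meant about 2000 extra reads and parses.

```rust
use std::cell::Cell;

/// Hypothetical stand-in for `biblatex::Bibliography`: (citekey, title) pairs.
struct Bibliography {
    entries: Vec<(String, String)>,
}

/// Hypothetical stand-in for reading and parsing the .bib file.
/// Each call bumps `parses`, so repeated work becomes visible.
fn load(parses: &Cell<usize>) -> Bibliography {
    parses.set(parses.get() + 1);
    let entries = [
        ("grandsire_the_metafonttutorial_2004", "The METAFONTtutorial"),
        ("gruber_markdown", "Daring Fireball"),
    ]
    .iter()
    .map(|&(key, title)| (key.to_string(), title.to_string()))
    .collect();
    Bibliography { entries }
}

/// Looks up a title, falling back to "no title" like the original getter.
fn title_of(biblio: &Bibliography, citekey: &str) -> String {
    biblio
        .entries
        .iter()
        .find(|(key, _)| key.as_str() == citekey)
        .map(|(_, title)| title.clone())
        .unwrap_or_else(|| "no title".to_string())
}

/// Original pattern: the getter loads the whole bibliography itself.
fn get_title_reloading(citekey: &str, parses: &Cell<usize>) -> String {
    let biblio = load(parses);
    title_of(&biblio, citekey)
}

/// Fixed pattern: the getter borrows an already parsed bibliography.
fn get_title_borrowing(citekey: &str, biblio: &Bibliography) -> String {
    title_of(biblio, citekey)
}

/// Returns how many times `load` ran with (reloading, borrowing) getters.
fn compare(citekeys: &[&str]) -> (usize, usize) {
    let reloading = Cell::new(0);
    for key in citekeys {
        get_title_reloading(key, &reloading);
    }

    let borrowing = Cell::new(0);
    let biblio = load(&borrowing); // parse exactly once, up front
    for key in citekeys {
        get_title_borrowing(key, &biblio);
    }
    (reloading.get(), borrowing.get())
}

fn main() {
    let keys = ["grandsire_the_metafonttutorial_2004", "gruber_markdown"];
    let (reloading, borrowing) = compare(&keys);
    println!("loads with reloading getters: {}", reloading);
    println!("loads with borrowing getters: {}", borrowing);
}
```

With reloading getters the load count grows with the number of lookups, while with borrowing getters it stays at one no matter how many entries are processed.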

    Furthermore, I merged the FromIterator logic into the BibiData struct and passed the needed fields as parameters:

    impl BibiData {
        // Borrow the parsed bibliography and the citekeys instead of rebuilding them.
        pub fn new(biblio: &Bibliography, citekeys: &[String]) -> Self {
            Self {
                entry_list: {
                    let bibentries = citekeys
                        .iter()
                        .map(|citekey| BibiEntry::new(citekey, biblio))
                        .collect();
                    BibiDataSets { bibentries }
                },
            }
        }
    }
    

    Everything works fine now and is instantly ready.
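Not needed for the speed-up, but the getters could probably be tightened as well. The sketch below shows one possible way to replace the `is_ok()` check followed by `unwrap()` with combinators. It is only an illustration under assumptions: `Entry` and `MissingField` are hypothetical stand-ins with plain string fields, not the real `biblatex::Entry`, whose `title()` and `date()` return richer types. One deliberate difference from my code: `year[..4]` panics on a date shorter than four bytes, while `str::get(..4)` returns `None`, so the sketch falls back to the placeholder instead.

```rust
/// Hypothetical error type standing in for the crate's retrieval error.
#[derive(Debug)]
struct MissingField;

/// Hypothetical stand-in for a parsed entry (not the real `biblatex::Entry`).
struct Entry {
    title: Option<String>,
    date: Option<String>,
}

impl Entry {
    fn title(&self) -> Result<&str, MissingField> {
        self.title.as_deref().ok_or(MissingField)
    }

    fn date(&self) -> Result<&str, MissingField> {
        self.date.as_deref().ok_or(MissingField)
    }
}

/// Returns the title, or "no title" when the field is missing.
fn get_title(entry: &Entry) -> String {
    entry
        .title()
        .map(str::to_string)
        .unwrap_or_else(|_| "no title".to_string())
}

/// Returns the first four bytes of the date, or "empty" when the date
/// is missing or too short to contain a four-digit year.
fn get_year(entry: &Entry) -> String {
    entry
        .date()
        .ok()
        .and_then(|date| date.get(..4)) // `None` instead of a panic on short input
        .map(str::to_string)
        .unwrap_or_else(|| "empty".to_string())
}

fn main() {
    let entry = Entry {
        title: Some("The METAFONTtutorial".to_string()),
        date: Some("2004".to_string()),
    };
    println!("{} ({})", get_title(&entry), get_year(&entry));
}
```

Each getter does a single lookup and has a single fallback path, which keeps the placeholder strings in one place per function.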

    Thanks to everybody!