rust filter substring pattern-matching

In Rust, can I filter lines of a String that contain codes like "1.2.3" or "1.2.3.4"?


I am building a program to count the number of distinct codes that occur across multiple PDFs. Extracting the text from the PDFs is not a problem. I just need to filter the lines based on whether they contain the codes, which are always 3 or 4 numbers separated by full stops (never fewer than 3 or more than 4). Below is my current attempt, but the filter I applied produces far too many false positives, such as "AO3" and "08.5".

use pdf_extract::extract_text;
use std::ffi::OsStr;
use std::fs;
use std::path::PathBuf;

fn parse_pdf_data(pdf_path: &PathBuf) {
    let text: String = extract_text(pdf_path).unwrap();

    let decimal_lines: Vec<&str> = text
        .lines()
        .filter(|line: &&str| {
            line.to_string()
                .contains(['.', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9'])
        })
        .collect();

    // maybe check length of each line is either 5 or 7?

    decimal_lines
        .iter()
        .for_each(|line: &&str| println!("{line}"));
}

fn main() {
    let pdfs: Vec<PathBuf> = fs::read_dir("DO_NOT_COMMIT")
        .unwrap()
        .map(|file: Result<fs::DirEntry, std::io::Error>| file.unwrap().path())
        .filter(|path: &PathBuf| path.extension() == Some(OsStr::new("pdf")))
        .collect();

    pdfs.iter().for_each(|pdf: &PathBuf| {
        println!("Parsing data from {:?}", pdf);
        parse_pdf_data(pdf);
    });
}

Is there an easier/better way of doing this rather than filtering by those characters, then checking the length for 3 or 4 numbers with full stops (length of 5 or 7)?


Solution

  • Regex is the easiest way to do it:

    use regex::Regex; // 1.11.1
    
    const TEXT: &str = "foobar
    1
    1.2
    1.2.3
    1.2.3.4
    ";
    
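    fn main() {
        // Anchored pattern: three single digits separated by full stops,
        // optionally followed by a fourth.
        let re = Regex::new(r"^[0-9]\.[0-9]\.[0-9](\.[0-9])?$").unwrap();
        let decimal_lines: Vec<&str> = TEXT
            .lines()
            .filter(|line| re.is_match(line))
            .collect();
        println!("{decimal_lines:?}");
    }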
    

    The pattern above only matches single-digit numbers (e.g. 1.2.3 but not 12.34.56). To match numbers with multiple digits, replace the pattern with: r"^[0-9]+\.[0-9]+\.[0-9]+(\.[0-9]+)?$"

    Explanation of the regular expression:

    • ^…$ matches the whole line from the beginning (^) to the end ($); without these anchors the regexp would also match strings such as "abc1.2.3def"
    • [0-9] matches a single digit ([0-9]+ matches a sequence of one or more digits)
    • \. matches a literal . character (note that a . without the \ would match any single character)
    • (…)? makes whatever is inside optional (so here the last \.[0-9] is optional).
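
The anchored pattern accepts a line only when the entire line is a code. The question mentions lines that "contain" codes and counting distinct codes, so here is a minimal sketch of that variant using only the standard library instead of the regex crate. This is a swapped-in technique, not the approach from the answer above. The names is_code and count_codes are hypothetical, and the sketch assumes that codes are separated from surrounding text by whitespace and may carry leading or trailing ASCII punctuation such as a comma.

```rust
use std::collections::BTreeMap;

/// Returns true when `token` is 3 or 4 dot-separated groups of ASCII digits,
/// e.g. "1.2.3" or "12.34.56.78" (the same shape as the multi-digit pattern).
fn is_code(token: &str) -> bool {
    let mut groups = 0;
    for part in token.split('.') {
        // An empty part means a leading, trailing, or doubled full stop.
        if part.is_empty() || !part.bytes().all(|b| b.is_ascii_digit()) {
            return false;
        }
        groups += 1;
    }
    groups == 3 || groups == 4
}

/// Counts how often each code occurs in `text`, checking every
/// whitespace-separated token rather than whole lines.
fn count_codes(text: &str) -> BTreeMap<&str, usize> {
    let mut counts = BTreeMap::new();
    for token in text.split_whitespace() {
        // Assumption: punctuation around a code (e.g. "1.2.3," or "(1.2.3)")
        // is not part of the code, so strip it before checking.
        let token = token.trim_matches(|c: char| c.is_ascii_punctuation());
        if is_code(token) {
            *counts.entry(token).or_insert(0) += 1;
        }
    }
    counts
}

fn main() {
    let text = "AO3 see 1.2.3, then 08.5\n1.2.3.4\nitem 1.2.3 again\n1.2.3.4.5";
    let counts = count_codes(text);
    println!("{} distinct codes", counts.len());
    for (code, n) in &counts {
        println!("{code}: {n}");
    }
}
```

In the question's parse_pdf_data, the same check could replace the character-based filter, for example by passing the extracted text to count_codes. If codes can be glued to other characters without whitespace, an unanchored regex with word boundaries is probably the better tool.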