I am building a program to count the number of distinct codes that occur across multiple PDFs. I have already extracted the text from the PDFs, so that part is not a problem. I just need to filter the lines based on whether they contain the codes, which are 3 or 4 numbers separated by full stops. There will always be 3 or 4 numbers, never more than 4 or fewer than 3. Below is my current attempt, but the filter I applied produces far too many false positives, such as "AO3" and "08.5".
use pdf_extract::extract_text;
use std::ffi::OsStr;
use std::fs;
use std::path::PathBuf;

fn parse_pdf_data(pdf_path: &PathBuf) {
    let text: String = extract_text(pdf_path).unwrap();
    let decimal_lines: Vec<&str> = text
        .lines()
        .filter(|line: &&str| {
            line.to_string()
                .contains(['.', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9'])
        })
        .collect();
    // maybe check length of each line is either 5 or 7?
    decimal_lines
        .iter()
        .for_each(|line: &&str| println!("{line}"));
}

fn main() {
    let pdfs: Vec<PathBuf> = fs::read_dir("DO_NOT_COMMIT")
        .unwrap()
        .map(|file: Result<fs::DirEntry, std::io::Error>| file.unwrap().path())
        .filter(|path: &PathBuf| path.extension() == Some(OsStr::new("pdf")))
        .collect();
    pdfs.iter().for_each(|pdf: &PathBuf| {
        println!("Parsing data from {:?}", pdf);
        parse_pdf_data(pdf);
    });
}
Is there an easier/better way of doing this rather than filtering by those characters, then checking the length for 3 or 4 numbers with full stops (length of 5 or 7)?
Regex is the easiest way to do it:
use regex::Regex; // 1.11.1
const TEXT: &str = "foobar
1
1.2
1.2.3
1.2.3.4
";
fn main() {
    let re = Regex::new(r"^[0-9]\.[0-9]\.[0-9](\.[0-9])?$").unwrap();
    let decimal_lines: Vec<&str> = TEXT
        .lines()
        .filter(|line| re.is_match(line))
        .collect();
    println!("{decimal_lines:?}");
}
The above will only match single digits (e.g. 1.2.3 but not 12.34.56). If you want to match multiple digits, replace the regexp with: r"^[0-9]+\.[0-9]+\.[0-9]+(\.[0-9]+)?$"
Explanation of the regular expression:
^…$ matches the whole line from the beginning (^) to the end ($), otherwise the regexp would match strings like e.g. "abc1.2.3def"
[0-9] matches a single digit ([0-9]+ matches a sequence of one or more digits)
\. matches a literal . character (note that a . without the \ would match any single character)
(…)? makes whatever is inside optional (so here the last \.[0-9] is optional).
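If you would rather not pull in the regex crate, the same rule can be checked by hand with the standard library. This is only a sketch of an alternative, not the regex approach itself: is_code and count_codes are hypothetical helper names, the check is meant to mirror the multi-digit pattern (3 or 4 non-empty groups of ASCII digits separated by full stops), and trimming each line first is my own assumption, on the guess that text extracted from a PDF may carry stray whitespace.

```rust
use std::collections::BTreeMap;

/// Returns true if `line` consists of exactly 3 or 4 non-empty groups of
/// ASCII digits separated by full stops, e.g. "1.2.3" or "12.34.56.78".
/// (Hypothetical helper, intended to mirror the multi-digit regex.)
fn is_code(line: &str) -> bool {
    let mut groups = 0;
    for part in line.split('.') {
        // Every group must be non-empty and contain digits only.
        if part.is_empty() || !part.bytes().all(|b| b.is_ascii_digit()) {
            return false;
        }
        groups += 1;
    }
    groups == 3 || groups == 4
}

/// Counts how many times each distinct code occurs among the lines of `text`.
/// Assumption: lines are trimmed before checking, which the regex does not do.
fn count_codes(text: &str) -> BTreeMap<&str, usize> {
    let mut counts = BTreeMap::new();
    for line in text.lines().map(str::trim).filter(|l| is_code(l)) {
        *counts.entry(line).or_insert(0) += 1;
    }
    counts
}

fn main() {
    // Sample input only; in the real program this would be the extracted PDF text.
    let text = "foobar\nAO3\n08.5\n1.2.3\n12.34.56.78\n1.2.3\n1.2.3.4.5\n";
    let counts = count_codes(text);
    for (code, n) in &counts {
        println!("{code}: {n}");
    }
    println!("{} distinct codes", counts.len());
}
```

Collecting into a map also gets you the number of distinct codes directly (counts.len()), which seems to be the end goal in the question. If the codes can be embedded in longer lines rather than standing alone, this line-based check would not be enough and a regex without the ^ and $ anchors is probably the simpler route.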