Regex to find files containing the TODO keyword in comments (e.g. //TODO), but not as a variable name or string, when scanning a list of directories

I'm trying to write an application that looks through a directory and flags all files (in the directory or its subdirectories) that contain the TODO keyword (the one that gets highlighted in color in a code editor; I am using Visual Studio Code).

I have most of the code running; only the last bit is puzzling me. Because my regex accepts 'TODO' as a whole word, it also picks up files that have TODO as a variable name or as string content, e.g.

var todo = 'TODO' or var TODO = 'abcdefg'

so it is messing up my test cases. How do we write a robust TODO regex that picks up just the TODO keyword (e.g. //TODO or // TODO) and ignores the other uses (in variables, strings, etc.)? I also don't want to hardcode // or anything similar in the regex, as I would prefer it to be as cross-language as possible (e.g. // (single-line) or /* (multi-line) for JavaScript, # for Python, etc.).

Here is my code:

import * as fs from 'fs'; 
import * as path from 'path';

const args = process.argv.slice(2);
const directory = args[0];

// Using recursion, we find every file with the desired extension, even if it's deeply nested in subfolders.
// Returns a list of file paths
const getFilesInDirectory = (dir, ext) => {
  if (!fs.existsSync(dir)) {
    console.log(`Specified directory: ${dir} does not exist`);
    return []; // keep the return type a list so callers can concat or iterate safely
  }

  let files = [];
  fs.readdirSync(dir).forEach(file => {
    const filePath = path.join(dir, file);
    const stat = fs.lstatSync(filePath); // lstat does not follow symbolic links, so a symlinked directory is not recursed into

    // If we hit a directory, recurse into it. If we hit a file (base case), add it to the array of files
    if (stat.isDirectory()) {
      const nestedFiles = getFilesInDirectory(filePath, ext);
      files = files.concat(nestedFiles);
    } else {
      if (path.extname(file) === ext) {
        files.push(filePath);
      }
    }
  });

  return files;
};



const checkFilesWithKeyword = (dir, keyword, ext) => {
  if (!fs.existsSync(dir)) {
    console.log(`Specified directory: ${dir} does not exist`);
    return []; // no directory means no matching files
  }

  const allFiles = getFilesInDirectory(dir, ext);
  const checkedFiles = [];

  allFiles.forEach(file => {
    const fileContent = fs.readFileSync(file, 'utf8'); // read as text instead of a raw Buffer

    // We want full words, so we use full word boundary in regex.
    const regex = new RegExp('\\b' + keyword + '\\b');
    if (regex.test(fileContent)) {
      // console.log(`Your word was found in file: ${file}`);
      checkedFiles.push(file);
    }
  });

  console.log(checkedFiles);
  return checkedFiles;
};

checkFilesWithKeyword(directory, 'TODO', '.js');

Help is greatly appreciated!!

Solution

I don't think there is a reliable way to exclude TODO in variable names or string values across languages. You'd need to parse each language properly, and scan for TODO in comments.
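Short of real parsing, one rough option is to require that a comment marker appears earlier on the same line as the keyword. The sketch below is only an illustration: `COMMENT_START_PATTERNS` and `buildCommentRegex` are made-up names, the table is incomplete, and it does hardcode markers per language, which the question hoped to avoid. It also still matches a marker that sits inside a string, such as `'// TODO'`.

```javascript
// Hypothetical, incomplete table: regex source for "a comment starts here", per file extension.
const COMMENT_START_PATTERNS = {
  '.js': '//|/\\*|^\\s*\\*', // line comment, block comment start, or a " * " continuation line
  '.py': '#',
};

// Escape regex metacharacters so the keyword is matched literally.
const escapeRegExp = (s) => s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');

// Build a per-line regex: comment start, then anything, then the keyword as a whole word.
// Returns null for extensions missing from the table so the caller can decide what to do.
const buildCommentRegex = (keyword, ext) => {
  const start = COMMENT_START_PATTERNS[ext];
  if (!start) return null;
  return new RegExp('(?:' + start + ').*\\b' + escapeRegExp(keyword) + '\\b');
};

const jsRegex = buildCommentRegex('TODO', '.js');
console.log(jsRegex.test('// TODO blah'));        // comment line
console.log(jsRegex.test("let TODO = 'stuff';")); // variable, no comment marker
```

Like the regex further down, this is meant to be tested against one line at a time.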

You can do an approximation that you can tweak over time:

- For variable names, you'd need to exclude TODO = assignments and any other kind of use, such as TODO.length.
- For string values, you could exclude 'TODO' and "TODO", and even "Something TODO today", by looking for matching quotes. What about a multi-line string with backticks?

This is a start using a bunch of negative lookaheads:

const input = `Test Case:
// TODO blah
// TODO do "stuff"
/* stuff
 * TODO
 */
let a = 'TODO';
let b = 'Something TODO today';
let c = "TODO";
let d = "More stuff TODO today";
let TODO = 'stuff';
let l = TODO.length;
let e = "Even more " + TODO + " to do today";
let f = 'Nothing to do';
`;
let keyword = 'TODO';
const regex = new RegExp(
  // exclude TODO in string value with matching quotes:
  '^(?!.*([\'"]).*\\b' + keyword + '\\b.*\\1)' +
  // exclude TODO.property access:
  '(?!.*\\b' + keyword + '\\.\\w)' +
  // exclude TODO = assignment
  '(?!.*\\b' + keyword + '\\s*=)' +
  // final TODO match
  '.*\\b' + keyword + '\\b'
);
input.split('\n').forEach((line) => {
  let m = regex.test(line);
  console.log(m + ': ' + line);
});

Output:

false: Test Case:
true: // TODO blah
true: // TODO do "stuff"
false: /* stuff
true:  * TODO
false:  */
false: let a = 'TODO';
false: let b = 'Something TODO today';
false: let c = "TODO";
false: let d = "More stuff TODO today";
false: let TODO = 'stuff';
false: let l = TODO.length;
false: let e = "Even more " + TODO + " to do today";
false: let f = 'Nothing to do';
false:
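Because the regex is anchored with ^ and written for a single line, it has to be applied line by line rather than to the whole file content, as the question's checkFilesWithKeyword does. A minimal sketch of how that could look, assuming the hypothetical helper names `escapeRegExp`, `buildTodoRegex` and `findTodoLines` (none of them are part of the code above):

```javascript
// Escape regex metacharacters so the keyword is matched literally.
const escapeRegExp = (s) => s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');

// Same composition as the regex above, with the keyword escaped first.
const buildTodoRegex = (keyword) => {
  const k = escapeRegExp(keyword);
  return new RegExp(
    '^(?!.*([\'"]).*\\b' + k + '\\b.*\\1)' + // not inside matching quotes
    '(?!.*\\b' + k + '\\.\\w)' +             // not a property access
    '(?!.*\\b' + k + '\\s*=)' +              // not an assignment
    '.*\\b' + k + '\\b'                      // the keyword as a whole word
  );
};

// Return the 1-based numbers of the lines that match the keyword regex.
const findTodoLines = (content, keyword) => {
  const regex = buildTodoRegex(keyword);
  const hits = [];
  content.split(/\r?\n/).forEach((line, i) => {
    if (regex.test(line)) hits.push(i + 1);
  });
  return hits;
};

const sample = "// TODO a\nlet TODO = 1;\n# TODO b\nlet s = 'TODO';";
console.log(findTodoLines(sample, 'TODO')); // [ 1, 3 ]
```

In the question's loop, the check would then become something like `findTodoLines(fs.readFileSync(file, 'utf8'), keyword).length > 0`.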

Explanation of composition of regular expression:

^ - start of string (in our case start of line due to split)
exclude TODO in string value with matching quotes:
- (?! - negative lookahead start
- .* - greedy scan (scan over all chars, but still match what follows)
- (['"]) - capture group for either a single quote or a double quote
- .* - greedy scan
- \b - word boundary before keyword (expect keyword enclosed in non-word chars)
- add keyword here
- \b - word boundary after keyword
- .* - greedy scan
- \1 - back reference to capture group (either a single quote or a double quote, but the one captured above)
- ) - negative lookahead end
exclude TODO.property access:
- (?! - negative lookahead start
- .* - greedy scan
- \b - word boundary before keyword
- add keyword here
- \.\w - a dot followed by a word char, such as .x
- ) - negative lookahead end
exclude TODO = assignment:
- (?! - negative lookahead start
- .* - greedy scan
- \b - word boundary before keyword
- add keyword here
- \s*= - optional spaces followed by =
- ) - negative lookahead end
final TODO match:
- .* - greedy scan
- \b - word boundary (expect keyword enclosed in non-word chars)
- add keyword here
- \b - word boundary
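Since this is an approximation, it helps to know where it goes wrong before tweaking it. The sketch below uses the same regex written as a literal with the keyword fixed to TODO; the three test lines are made-up examples, not taken from the question.

```javascript
// Same regex as above, written as a literal with the keyword fixed to TODO.
const regex = /^(?!.*(['"]).*\bTODO\b.*\1)(?!.*\bTODO\.\w)(?!.*\bTODO\s*=).*\bTODO\b/;

const cases = [
  'return TODO;',              // identifier used as a value
  "// it's TODO, isn't it",    // apostrophes on both sides of the keyword in a comment
  '// TODO.md needs updating', // keyword followed by a dot in a comment
];

const results = cases.map((line) => regex.test(line));
console.log(results); // [ true, false, false ]
```

The first line is a false positive, because no lookahead covers a plain use of the identifier. The other two are false negatives, because the quote and property-access lookaheads cannot tell comment text from code.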

Learn more about regular expressions: https://twiki.org/cgi-bin/view/Codev/TWikiPresentation2018x10x14Regex