Tags: c++, header-files, dependency-graph, dependency-tree

Extract an autonomous chunk of the dependency graph of a huge C++ project?


Consider the Chromium codebase. It's huge, around 4 GB of pure code, if I'm not mistaken. But however humongous it may be, it's still modular by nature, and it implements a lot of interesting features in its internals.

What I mean is, for example, I'd like to extract the websocket implementation from the sources, but it's not easy to do by hand. OK, if we go to https://github.com/chromium/chromium/tree/main/net/websockets we'll see lots of header files. To compile the code as a "library" we're going to need them plus their implementations in the .cpp files. But the trick is that these header files include other header files in other directories of the Chromium project. And those in turn include others...

BUT if there are no circular dependencies, we should be able to get to the root of this tree, where header files won't include anything (or will include already compiled libraries). That should mean that all the files needed for this dependency subtree are in place, so we can compile a chunk of the original codebase separately from the rest of it.

That's the idea. At least in theory.

Does anyone know how it could be done? I've found this repo and this repo, but they only show the dependency graph and do not have the functionality to extract a tree from it.

There should be a tool for this already, I suppose. It's just hard to phrase it for Google. Or perhaps I'm mistaken and this approach wouldn't really work?


Solution

  • Your compiler is almost surely capable of extracting this dependency information so that it can be used to help the build system figure out incremental builds. In gcc, for instance, we have the -MMD flag.

    Suppose we have four compilation units: ball.cpp, football.cpp, basketball.cpp, and hockey.cpp. Each source file includes a header file of the same name. Also, football.h and basketball.h each include ball.h.

    If we run

    g++ -MMD   -c -o football.o football.cpp
    g++ -MMD   -c -o basketball.o basketball.cpp
    g++ -MMD   -c -o hockey.o hockey.cpp
    g++ -MMD   -c -o ball.o ball.cpp
    

    then this will produce, in addition to the object files, some files with names like basketball.d that contain dependency information like

    basketball.o: basketball.cpp basketball.h ball.h
    

    It's simple enough to read these into, say, a Python script, and then take the union of all the dependencies of the files you want to include.
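A minimal sketch of such a script follows. The function names (parse_dep_text, collect_deps, load_dep_files) are hypothetical, and it assumes the .d files use the plain Make rule format shown above, with no spaces or colons inside file paths. It also joins backslash-continued lines, which the compiler emits when a dependency list gets long.

```python
import re
from pathlib import Path


def parse_dep_text(text):
    """Parse Make-style rules ("target: dep dep ...") into {target: set(deps)}."""
    # Join backslash-continued lines so each rule sits on one logical line.
    joined = re.sub(r"\\\r?\n", " ", text)
    rules = {}
    for line in joined.splitlines():
        target, sep, deps = line.partition(":")
        if not sep or not target.strip():
            continue  # not a rule line (blank line or stray text)
        rules.setdefault(target.strip(), set()).update(deps.split())
    return rules


def collect_deps(rules, pattern):
    """Union of the dependencies of every target whose name matches `pattern`."""
    wanted = set()
    for target, deps in rules.items():
        if re.search(pattern, target):
            wanted |= deps
    return wanted


def load_dep_files(directory="."):
    """Read every *.d file in `directory` and merge the parsed rules."""
    rules = {}
    for path in Path(directory).glob("*.d"):
        for target, deps in parse_dep_text(path.read_text()).items():
            rules.setdefault(target, set()).update(deps)
    return rules


if __name__ == "__main__":
    # Print every file that a target with "ball" in its name depends on.
    for name in sorted(collect_deps(load_dep_files(), "ball")):
        print(name)
```

Keeping the parsing separate from the file reading makes it easy to try the logic on a string before pointing it at a real build directory.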


    EDIT: In fact, Python may even be overkill. In the situation above, if you wanted to get all dependencies for anything containing the word "ball", you could do something like

    $ cat *.d | awk -F: '$1 ~ "ball" { print $2 }' | xargs -n 1 echo | sort | uniq
    

    which will output

    ball.cpp
    ball.h
    basketball.cpp
    basketball.h
    football.cpp
    football.h
    

    If you're not used to reading UNIX pipelines, this:

    • Concatenates all the *.d files in the current directory;
    • Goes through them line-by-line, splitting each line into fields delimited by : characters;
    • Prints out the second field (i.e. the list of dependencies) for any line where the first field (i.e. the target) matches the regex "ball";
    • Splits the results into individual lines;
    • Sorts the resulting lines; and
    • Throws out any duplicates.

    You can see that this produced a list of everything the ball-related files depend on, but skipped hockey.cpp and hockey.h, which aren't dependencies of any file with "ball" in its name. (Of course, in your case you might use "websockets" instead of "ball", and if there is some directory structure instead of everything being in the root directory, you may have to do a bit to compensate for that.)
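Once you have the list, actually extracting the chunk amounts to copying those files somewhere else while keeping their relative paths, so that the #include lines still resolve. A rough sketch of that step, assuming the dependency paths are relative to the project root (or absolute paths inside it); the extract_files helper and its arguments are hypothetical names, not part of any existing tool:

```python
import shutil
from pathlib import Path


def extract_files(dependencies, source_root, dest_root):
    """Copy each dependency found under source_root into dest_root,
    preserving its path relative to source_root.

    Returns (copied, skipped) as two lists of path strings.
    """
    source_root = Path(source_root).resolve()
    dest_root = Path(dest_root)
    copied, skipped = [], []
    for dep in sorted(dependencies):
        # Joining with an absolute `dep` simply yields `dep` itself.
        src = (source_root / dep).resolve()
        try:
            rel = src.relative_to(source_root)
        except ValueError:
            # The file lives outside the project tree; leave it alone.
            skipped.append(dep)
            continue
        if not src.is_file():
            skipped.append(dep)
            continue
        dst = dest_root / rel
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)
        copied.append(rel.as_posix())
    return copied, skipped
```

Anything that ends up in the skipped list is a dependency you would have to provide some other way, for example as an already compiled library.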