C/C++ exposure of functions / methods vs Java

The world of Minecraft Modding has made me curious about differences in mechanisms between Java and C/C++ libraries to allow methods / functions in the libraries to be invoked externally.

My understanding is that Minecraft Modding came about due to the ability to decompile / reflect over Java in order to reverse engineer classes and methods that can be invoked from the library. I believe that the Java class specification includes quite a lot of metadata about the structure of classes allowing code to be used in ways other than intended.

There are some obfuscation tools around that try to make it harder to reverse engineer Java but overall it seems to be quite difficult to prevent.

I don't have the depth of knowledge in C/C++ to know to what degree the same can be done there.

C/C++ code, by contrast, is compiled natively ahead of time. The end result is machine code specific to that platform. C/C++ has the notion of exporting functions so that they can be called from outside the library or executable. Some libraries also have an entry point.

Typically when connecting to external functions there is a header file to list what functions are available to code against from the library.

I would assume there would need to be a mechanism to map an exposed function to the address within the library / executable machine code assembly so the function calls get made in the right place.

Typically connecting the function calls together with the address is the job of the linker. The linker still needs to somehow know where to find these functions.

This makes me wonder if it is fundamentally possible to invoke non exported functions. If so would this require the ability to locate their address and understand their parameter format?

Function calls in C/C++, as I understand it, are typically made by placing the parameters in registers for simple functions, or in an argument array for more complex functions.

I don't know if the practice of invoking non-public APIs in native code is common or if the inherent difficulty in doing so makes native code pretty safe from this kind of use.

Solution

First of all, there are tools (of varying quality and capabilities) to reverse engineer compiled machine code back to the original language [or another language, for that matter]. The biggest problem when doing this is that, in languages such as C and C++, the names of the members of a structure are not kept in the compiled code, and nested types often become "flat", so what is originally:

 struct user
 {
    std::string name;
    int age;
    int score;
 };

will become:

 struct s0
 {
     char *f0;
     char *f1;
     int f2;
     int f3;
 };

[Note of course that std::string may be implemented in a dozen different ways, and the "two pointers" is just one plausible variant]
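To illustrate why the names disappear: compiled code only ever refers to a field by its byte offset from the start of the object. Here is a minimal sketch, assuming a simplified `user` whose name is a plain `const char *` rather than `std::string` (so the layout stays simple); `read_int_at` and `age_via_offset` are hypothetical helpers made up for this example.

```cpp
#include <cstddef>   // offsetof, std::size_t
#include <cstring>   // std::memcpy

// Simplified stand-in for the struct above (assumption: the name is a
// plain pointer rather than std::string, to keep the layout simple).
struct user
{
    const char *name;
    int age;
    int score;
};

// Roughly what the machine code does: read an int located `offset`
// bytes from the start of some object. No field name is involved.
int read_int_at(const void *object, std::size_t offset)
{
    int value;
    std::memcpy(&value, static_cast<const char *>(object) + offset, sizeof value);
    return value;
}

// Someone who has worked out the offset can read "age" without ever
// seeing the original declaration.
int age_via_offset(const user &u)
{
    return read_int_at(&u, offsetof(user, age));
}
```

Here `offsetof` plays the role of the knowledge a reverse engineer has to recover by hand; in the shipped binary only the resulting number remains.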

Of course, if there is a header file describing how the library works, you can use the data structures in that to get better type information. Likewise, if there is debug information in the file, it can be used to form data structures and variable names in a much better way. But someone who wants to keep these things private will (most often) not ship the code with debug symbols, and only publish the actual necessary parts to call the public functionality.

But if you understand how these are used [or read some code that, for example, displays a "user"], you can figure out which field is the name, which is the age and which is the score.

Understanding what is an array and what is separate fields can also be difficult. Which is it:

 struct
 {
    int x, y, z;
 };

 int arr[3];
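The two really can be indistinguishable in memory. A small sketch, assuming a typical compiler that puts no padding between the three ints (the `static_assert` checks that assumption); `point` and `as_array` are names invented for the example:

```cpp
#include <cstring>   // std::memcpy

struct point
{
    int x, y, z;
};

// Assumption: no padding, so both forms occupy exactly three ints.
static_assert(sizeof(point) == sizeof(int[3]),
              "sketch assumes no padding between the ints");

// Copy the raw bytes of a point into an int array. If the layouts
// match, out[0], out[1], out[2] end up holding x, y and z.
void as_array(const point &p, int (&out)[3])
{
    std::memcpy(out, &p, sizeof out);
}
```

From the bytes alone there is nothing to tell the two apart; only the way the surrounding code uses them (looping over an index versus always touching fixed offsets) gives a hint.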

Several years ago, I started on a patience card game (similar to "Solitaire"). To do that, I needed a way to display cards on the screen. So I thought "well, there's one for the existing Solitaire on Windows, I bet I can figure out how to use that", and indeed, I did. I could draw the Queen of Clubs or Two of Spades, as I wished. I never finished the actual game-play part, but I certainly managed to load the card-drawing functionality from a non-public shared library. Not rocket science by any means (there are people who do this for commercial games with thousands of functions and really complex data structures - this had two or three functions that you needed to call), but I didn't spend much time on it either: a couple of hours, if I remember right, from coming up with the idea to having something that "works".

As for the second part of your question, plugin interfaces (such as filter plugins for Photoshop, or transitions in video editors) are very often implemented as "shared libraries" (aka "dynamic link libraries", DLLs).

There are functions in the OS to load a shared library into memory, and to query for functions by their name. The interface of these functions is (typically) pre-defined, so a function pointer prototype in a header-file can be used to form the actual call.
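As a rough sketch of that load-and-query step on a POSIX system, the example below uses `dlopen` and `dlsym` (Windows offers `LoadLibrary` and `GetProcAddress` for the same purpose). To stay self-contained it looks up `strlen` in the libraries already loaded with the program instead of loading a real plugin; `strlen_fn` and `call_strlen_by_name` are names invented for the example, and older glibc versions may need linking with `-ldl`.

```cpp
#include <dlfcn.h>   // dlopen, dlsym, dlclose (POSIX)
#include <cstddef>   // std::size_t

// The prototype has to be known in advance (normally from a header);
// the OS only hands back an untyped address.
using strlen_fn = std::size_t (*)(const char *);

// Look up "strlen" by name at run time and call it.
// Returns -1 if the handle or the symbol cannot be obtained.
long call_strlen_by_name(const char *text)
{
    // nullptr asks for a handle covering the running program and the
    // libraries loaded with it; a plugin host would pass a file name.
    void *handle = dlopen(nullptr, RTLD_NOW);
    if (!handle)
        return -1;

    long result = -1;
    if (void *symbol = dlsym(handle, "strlen"))
    {
        // Turn the untyped address into a callable function pointer.
        auto fn = reinterpret_cast<strlen_fn>(symbol);
        result = static_cast<long>(fn(text));
    }

    dlclose(handle);
    return result;
}
```

A plugin host works the same way, except that it passes the plugin's file name to `dlopen` and asks for whatever function names the plugin interface prescribes.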

As long as the compiler for the shared library and the application code are using the same ABI (application binary interface), all should work out when it comes to how arguments are passed from the caller to the function. It's not like the compiler just randomly uses whatever register it fancies: the parameters are passed in a well-defined order, and which register is used for what is defined by the ABI specification for a given processor architecture and platform. [It gets even more complicated if you have to know the contents of data structures, and there are different versions of such structures. Say, for example, someone has a std::string that contains two pointers (start and end), and for whatever reason the design is changed to be one pointer and a length: both the application code and the shared library need to be compiled with the same version of std::string, or bad things will happen!]
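The "bad things" can be made concrete with a small sketch. `string_v1` and `string_v2` below are hypothetical stand-ins for the two designs just described, not the layout of any real `std::string`, and the sketch assumes pointers and `std::size_t` have the same size (checked by the `static_assert`).

```cpp
#include <cstddef>   // std::size_t
#include <cstring>   // std::memcpy

// Hypothetical "old" design: two pointers (start and end).
struct string_v1
{
    const char *begin;
    const char *end;
};

// Hypothetical "new" design: one pointer and a length.
struct string_v2
{
    const char *data;
    std::size_t length;
};

static_assert(sizeof(string_v1) == sizeof(string_v2),
              "sketch assumes pointer and size_t have the same size");

// Length as the old design computes it.
std::size_t length_v1(const string_v1 &s)
{
    return static_cast<std::size_t>(s.end - s.begin);
}

// What code built against the new design sees when it is handed the
// bytes of an old-design object: the same memory, read differently.
string_v2 misread_as_v2(const string_v1 &s)
{
    string_v2 out;
    std::memcpy(&out, &s, sizeof out);
    return out;
}
```

The data pointer happens to survive, but the "length" is really the end pointer reinterpreted as a number, so any code that trusts it would read far past the end of the text.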

Non-public API functions CAN be called, but they wouldn't be discoverable by calling the query for finding a function by name - you'd have to figure out some other way - for example by knowing that "this function is 132 bytes on from the function XYZ", and of course, you wouldn't have the function prototype either.
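A minimal sketch of the idea, with everything in one program so that it stays runnable: `hidden_add` stands in for a function a library does not export, `discovered_address` stands in for the reverse-engineering work of finding where it lives, and `guessed_prototype` is the caller's own guess at the signature. All of these names are made up for the example, and it assumes a platform where a function address fits in `std::uintptr_t`, as on mainstream desktop systems.

```cpp
#include <cstdint>   // std::uintptr_t

namespace
{
// Stand-in for a non-exported function: internal linkage, so no other
// file could find it by name.
int hidden_add(int a, int b)
{
    return a + b;
}
}

// Stand-in for the reverse-engineering step: in real life this number
// would come from a disassembler or from "known function + N bytes".
std::uintptr_t discovered_address()
{
    return reinterpret_cast<std::uintptr_t>(&hidden_add);
}

// Call whatever lives at `address`, trusting our own guess at the
// prototype. A wrong guess means undefined behaviour, not an error.
int call_by_address(std::uintptr_t address, int a, int b)
{
    using guessed_prototype = int (*)(int, int);
    auto fn = reinterpret_cast<guessed_prototype>(address);
    return fn(a, b);
}
```

Nothing checks the address or the prototype, which is why this kind of call is fragile: a new build of the library can move the function or change its parameters without any warning.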

There is of course the added complication that Java bytecode is portable across many different processor architectures, whereas machine code only works on a defined set of processors: code for x86 works on Intel and AMD processors (and maybe a few others), code for ARM works on chips that implement the ARM instruction set, and so on. You have to compile the C or C++ code for the given processor.