Java: manipulating bytecode / program instructions / detecting self-modifying code

Basically, I’m trying to create a malware detection program in Java that detects self-modifying code. The program should run a JAR file and identify whether it contains self-modifying code.

One way I thought of doing it was to get the initial bytecode of a .class file and then compare it against the bytecode of the running application. The bytecode of a running class should be the same as the initial bytecode; if it differs at some point, that would mean the program modifies its own structure.

The question is: how do I get the bytecode of a running application? I want to get the bytecode every 0.1 seconds and compare it against the initial bytecode.

Is there any way to get it?

I tried using a Java agent and ASM, but I could only get the bytecode before the program is executed, because the Java agent runs before the program’s main method is executed.

import org.objectweb.asm.ClassReader;
import org.objectweb.asm.ClassWriter;

import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.IllegalClassFormatException;
import java.lang.instrument.Instrumentation;
import java.security.ProtectionDomain;

public class asm {

    // Java agent entry point: the JVM calls this before the application's main method
    public static void premain(String agentArgs, Instrumentation inst) {
        inst.addTransformer(new ClassFileTransformer() {

            @Override
            public byte[] transform(ClassLoader classLoader, /*class name*/ String s, Class<?> aClass,
                                    ProtectionDomain protectionDomain, byte[] bytes)
                    throws IllegalClassFormatException {

                if ("other/Stuff".equals(s)) {
                    // ASM code
                    ClassReader reader = new ClassReader(bytes);
                    ClassWriter writer = new ClassWriter(reader, 0);
                    //ClassPrinter visitor = new ClassPrinter(writer);
                    //reader.accept(visitor, 0);
                    // Feed the class into the writer; without an accept call the writer has visited nothing
                    reader.accept(writer, 0);
                    return writer.toByteArray();
                }
                // Returning null tells the JVM to leave the class unchanged
                return null;
            }
        });
    }
}

This code uses a Java agent and ASM. What I need to know, however, is how to get the bytecode of an application while it is being executed. Also, if someone could suggest a different approach to identifying self-modifying code in Java, I would appreciate it.

Thanks in advance.

Solution

There are some fundamental misconceptions in your question. First of all:

If you suspect code to contain malware, don’t run it!

There is a research field of analyzing a malware’s behavior in a sandbox but since this requires careful measures to ensure that the software can’t cause any harm, that should be left to experts, in their environments.

Standard malware detection software works by analyzing code without (or before) executing it. Which leads to the question what to search for:

Malware doesn’t need to contain self modifying code to be malware

The characteristic of malware is that it performs unintended, harmful actions. For these to have an effect, the program needs to perform I/O or start other software on your computer: to damage files, it needs file I/O; to send spam or attack other computers, it needs network I/O; to perform actions not covered by the Java API, it needs to load a native library or launch an external process.

In contrast, in Java, modifying your own code has no effect. The modified code can’t do anything the original code couldn’t and if the modification happens on-the-fly, it doesn’t even have any persistent side-effect on your computer environment. So if the code attempting to modify its own code is indeed malware, it could perform the desired actions directly without that indirection.

Besides that, your idea of repeatedly checking the code is doomed to fail as the JVMs don’t store the original byte code. The code is stored in an implementation dependent way, being optimized for efficient execution. Therefore, when an Agent asks the JVM via the Instrumentation API for the code of a class, it will not return the original code but an equivalent code created by converting back the internal form of the code.

This is indicated by the following statement from the documentation of Instrumentation.retransformClasses:

The initial class file bytes represent the bytes passed to ClassLoader.defineClass or redefineClasses (before any transformations were applied), however they might not exactly match them. The constant pool might not have the same layout or contents. The constant pool may have more or fewer entries. Constant pool entries may be in a different order; however, constant pool indices in the bytecodes of methods will correspond. Some attributes may not be present. Where order is not meaningful, for example the order of methods, order might not be preserved.

So you can’t just compare the byte arrays; you have to parse the class file to model its semantics and compare that to the result of a previous parse operation. Converting the JVM’s internal code representation to a class file doesn’t come for free either. Add the parsing and analysis on top, and doing all of that for every class every 0.1 seconds is a tough job.
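To make the capturing step concrete, here is a minimal sketch of an agent that asks the JVM for the current class file bytes of an already loaded class. The class name `SnapshotAgent` and its helper methods are hypothetical. The sketch assumes the agent JAR's manifest declares `Premain-Class` and `Can-Retransform-Classes: true`. It only captures the bytes; as explained above, they are the JVM's reconstituted form, so a byte-for-byte comparison against the file on disk is not reliable.

```java
import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.Instrumentation;
import java.lang.instrument.UnmodifiableClassException;
import java.security.ProtectionDomain;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper: records the class file bytes the JVM hands to transformers.
public class SnapshotAgent implements ClassFileTransformer {

    private static final SnapshotAgent INSTANCE = new SnapshotAgent();
    private static volatile Instrumentation instrumentation;

    // Latest bytes seen per class, keyed by internal name (e.g. "other/Stuff")
    private final Map<String, byte[]> snapshots = new ConcurrentHashMap<>();

    // Called by the JVM before main() when started with -javaagent:...
    public static void premain(String agentArgs, Instrumentation inst) {
        instrumentation = inst;
        // 'true' registers the transformer as retransformation-capable
        inst.addTransformer(INSTANCE, true);
    }

    @Override
    public byte[] transform(ClassLoader loader, String className, Class<?> classBeingRedefined,
                            ProtectionDomain protectionDomain, byte[] classfileBuffer) {
        if (className != null) { // the name can be null for some JVM-generated classes
            snapshots.put(className, classfileBuffer.clone());
        }
        return null; // observe only, never modify the class
    }

    // Returns the last recorded bytes for a class, or null if none were seen.
    public byte[] snapshotOf(String internalName) {
        byte[] bytes = snapshots.get(internalName);
        return bytes == null ? null : bytes.clone();
    }

    // Asks the JVM to run the transformers again for a loaded class, then returns what was recorded.
    public static byte[] capture(Class<?> target) throws UnmodifiableClassException {
        instrumentation.retransformClasses(target);
        return INSTANCE.snapshotOf(target.getName().replace('.', '/'));
    }
}
```

The transformer is also invoked on ordinary class loading, so the map holds the bytes from load time until `capture` replaces them with the retransformation input. Any comparison between two such snapshots would still need the parsing step described above.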

In the end, there’s no need for that. You can control, via startup options, which agents will be started and whether new agents can be attached later. Without unknown agents, there will be no illegitimate use of the Instrumentation API and thus no modified code.
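As a rough illustration of auditing the startup options, the following sketch lists the JVM arguments that request an agent. The class name `StartupAgentAudit` is hypothetical. It only sees options passed at launch (`-javaagent:`, `-agentlib:`, `-agentpath:`), not agents attached afterwards. On HotSpot, late attachment can, as far as I know, be switched off with `-XX:+DisableAttachMechanism`.

```java
import java.lang.management.ManagementFactory;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper: reports which agents were requested on the command line.
public class StartupAgentAudit {

    // Keeps only the JVM options that load a Java or native agent at startup.
    public static List<String> agentOptions(List<String> jvmArgs) {
        return jvmArgs.stream()
                .filter(arg -> arg.startsWith("-javaagent:")
                        || arg.startsWith("-agentlib:")
                        || arg.startsWith("-agentpath:"))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Arguments passed to the JVM itself (not the arguments of main)
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        List<String> agents = agentOptions(jvmArgs);
        if (agents.isEmpty()) {
            System.out.println("No agents requested at startup");
        } else {
            System.out.println("Agents requested at startup: " + agents);
        }
    }
}
```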

Since truly detecting malware requires a static code analysis that finds out which APIs are used (e.g. I/O, ProcessBuilder, etc.), it’s easy to check for use of the Instrumentation API or ClassLoaders as well. Besides non-standard APIs (which should always raise warning flags), these are the only possible ways to get new code into the JVM (newer Java versions add MethodHandles.Lookup.defineClass, which can be checked for the same way).
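A very small static check of this kind can be sketched without running the code at all: read the constant pool of a class file and list the classes it references. The class name `ApiUsageScanner` and the watch list are assumptions for illustration only. A real scanner would also have to look at string constants, because reflection can hide class names from this check.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper: lists watched API classes referenced by a class file.
public class ApiUsageScanner {

    // Illustrative watch list: entries ending in '/' match a whole package, others one class
    static final List<String> WATCH_LIST = List.of(
            "java/lang/instrument/", "java/net/", "java/nio/file/",
            "java/lang/ClassLoader", "java/lang/ProcessBuilder", "java/lang/Runtime",
            "java/lang/invoke/MethodHandles$Lookup");

    // Returns the internal names of all classes referenced in the constant pool.
    public static Set<String> referencedClasses(byte[] classFile) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(classFile));
        if (in.readInt() != 0xCAFEBABE) throw new IOException("not a class file");
        in.readUnsignedShort(); // minor version
        in.readUnsignedShort(); // major version
        int count = in.readUnsignedShort();
        String[] utf8 = new String[count];
        int[] classNameIndex = new int[count];
        for (int i = 1; i < count; i++) {
            int tag = in.readUnsignedByte();
            switch (tag) {
                case 1: utf8[i] = in.readUTF(); break;                     // Utf8
                case 7: classNameIndex[i] = in.readUnsignedShort(); break; // Class
                case 8: case 16: case 19: case 20: in.skipBytes(2); break; // String, MethodType, Module, Package
                case 15: in.skipBytes(3); break;                           // MethodHandle
                case 3: case 4: case 9: case 10: case 11:
                case 12: case 17: case 18: in.skipBytes(4); break;         // Integer, Float, refs, NameAndType, dynamic
                case 5: case 6: in.skipBytes(8); i++; break;               // Long, Double take two slots
                default: throw new IOException("unknown constant pool tag " + tag);
            }
        }
        Set<String> names = new TreeSet<>();
        for (int index : classNameIndex) {
            if (index != 0 && utf8[index] != null) names.add(utf8[index]);
        }
        return names;
    }

    // Returns the referenced classes that match the watch list.
    public static Set<String> flagged(byte[] classFile) throws IOException {
        Set<String> hits = new TreeSet<>();
        for (String name : referencedClasses(classFile)) {
            for (String entry : WATCH_LIST) {
                if (name.equals(entry) || (entry.endsWith("/") && name.startsWith(entry))) {
                    hits.add(name);
                    break;
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java ApiUsageScanner path/to/Some.class
        System.out.println(flagged(Files.readAllBytes(Path.of(args[0]))));
    }
}
```

A hit only says that a class mentions a watched API, not that the use is malicious.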

The tougher job is to find out which of these API uses are legitimate and which are a true sign of malware, and to calculate the potential danger of unknown software. But that’s exactly the challenge real malware detection software has to take on.