For a program I'm working on, I need to extract the instructions from an ELF binary compiled for the RISC-V architecture. This is how I'm trying to extract them:
void dumpCode(FILE *file, Elf32_Phdr *segm, Elf32_Ehdr *header)
{
    char *fileptr;
    struct stat statbuf;
    int *opcode_ptr;
    unsigned int i, vaddr, offset;
    int fd = fileno(file);

    if (fstat(fd, &statbuf)) {
        fprintf(stderr, "[-] Error while stating the file!\n");
        goto fail;
    }
    fileptr = (char *)mmap(0, statbuf.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (MAP_FAILED == fileptr) {
        fprintf(stderr, "[-] Error mapping the file!\n");
        goto fail;
    }
    offset = (0 == segm->p_offset ? header->e_ehsize + header->e_phnum * header->e_phentsize : segm->p_offset); // Mark 1
    opcode_ptr = (int *)(fileptr + offset);
    vaddr = (0 == segm->p_offset ? segm->p_vaddr + header->e_ehsize + header->e_phnum * header->e_phentsize : segm->p_vaddr); // Mark 2
    for (i = 0; i < segm->p_filesz / 4; i++, vaddr += 4) { // Mark 3
        unsigned char *opcode = getOpcode(*opcode_ptr++);
        if (1 == disas(opcode, vaddr)) {
            free(opcode);
            break;
        }
        free(opcode);
    }
    munmap(fileptr, statbuf.st_size);
fail:
    close(fd);
}
To test my function, first I wrote a simple assembly program:
.global _start
_start:
    addi a0, x0, 1
    la a1, str
    addi a2, x0, 6
    addi a7, x0, 64
    ecall
    addi a0, x0, 0
    addi a7, x0, 93
    ecall
.data
str: .ascii "Hello\n"
As a second test file, I wrote a different program, this time in C:
#include <stdio.h>
#include <math.h>
int main(void)
{
printf("%.5f", sqrt(2.0));
return 0;
}
The first test file was assembled and linked using: riscv32-linux-gnu-as -o test1.o test1.s; riscv32-linux-gnu-ld -o test1 test1.o
The second test file was compiled directly with gcc: riscv32-linux-gnu-gcc -o test2 test2.c -lm
Returning to the dumpCode function, I've marked three lines.
How can I calculate the right number of bytes I need to process from the ELF file? If that number includes padding, can I ignore the padding somehow?
offset = (0 == segm->p_offset
? header->e_ehsize + header->e_phnum * header->e_phentsize
: segm->p_offset); // Mark 1
This is wrong. The documentation is very specific:
p_offset This member holds the offset from the beginning of the file at which the first byte of the segment resides.
The data starts at segm->p_offset. No ifs or buts.
If you are seeing 0, I suspect it is either because the segment's p_type is not PT_LOAD and it has no contents in the file (p_filesz is 0, so the offset in the file doesn't mean anything), or because the segment is supposed to contain the ELF header (so offset 0 isn't wrong).
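As a rough sketch of what that means for your code: check the segment type and flags first, then use p_offset and p_filesz exactly as given, only validating them against the size of the mapping. The helper names is_code_segment and segment_bytes are made up for this example, and I'm assuming the whole file is mapped at map with length map_size.

```c
#include <elf.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if the segment is loaded from the file,
 * is marked executable, and actually occupies bytes in the file. */
static int is_code_segment(const Elf32_Phdr *ph)
{
    return ph->p_type == PT_LOAD && (ph->p_flags & PF_X) != 0 && ph->p_filesz > 0;
}

/* Hypothetical helper: returns a pointer to the first byte of the segment
 * inside the mapped file, or NULL if p_offset/p_filesz fall outside it.
 * The offset is taken as-is, with no special case for p_offset == 0. */
static const uint8_t *segment_bytes(const uint8_t *map, size_t map_size,
                                    const Elf32_Phdr *ph)
{
    if (ph->p_offset > map_size || ph->p_filesz > map_size - ph->p_offset)
        return NULL;
    return map + ph->p_offset;
}
```

With this, the virtual address of the byte at segment_bytes(...) + k is simply p_vaddr + k, so Mark 2 needs no special case either.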
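There is no distinction to the CPU between instructions and non-instructions. Any 4 bytes (or 2 bytes, if the compressed extension is in use) could possibly be an instruction. Even 00000000 is something the CPU will try to decode if it is told to execute it; on RISC-V that pattern happens to be defined as an illegal instruction. An instruction is whatever the program counter points to. You could try to figure out where the program counter can point, but in general that's equivalent to the halting problem, and therefore undecidable.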
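One caveat for your Mark 3 loop, which steps 4 bytes at a time: if the binary was built with the compressed (C) extension, which many riscv32-linux-gnu toolchains enable by default, instructions can be 2 bytes long, and the length is encoded in the lowest bits of the first halfword. A minimal sketch of that check follows; rv_insn_length is a name I made up, and it deliberately gives up on the longer-than-32-bit encodings.

```c
#include <stdint.h>

/* Hypothetical helper: given the first (lowest-addressed) 16 bits of a
 * RISC-V instruction, return its length in bytes.
 *   bits [1:0] != 11                      -> 16-bit compressed instruction
 *   bits [1:0] == 11 and bits [4:2] != 111 -> 32-bit instruction
 * Anything else is a longer encoding, which this sketch does not handle
 * and reports as 0. */
static unsigned rv_insn_length(uint16_t low_half)
{
    if ((low_half & 0x3u) != 0x3u)
        return 2;
    if ((low_half & 0x1cu) != 0x1cu)
        return 4;
    return 0;
}
```

A decoding loop would then advance both the file pointer and vaddr by the returned length instead of a fixed 4, and stop (or resynchronise) when it gets 0.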
There may be debug information or symbols that say which parts of the file are padding, but since the CPU doesn't care, neither does the main part of the ELF file.
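If the section header table is still present (it is optional and can be stripped), one way to narrow things down is to look only at sections flagged SHF_EXECINSTR, such as .text, rather than at the whole executable segment. The following is only a sketch under a few assumptions: the file is fully mapped at map, the host has the same byte order as the target, and find_exec_sections and struct code_range are names invented for this example. Bytes inside such a section can still be alignment padding or embedded data, so this reduces the problem but does not solve it.

```c
#include <elf.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical result type: one executable section's location. */
struct code_range {
    uint32_t offset; /* file offset of the section's first byte */
    uint32_t size;   /* size of the section in bytes */
    uint32_t addr;   /* virtual address of the first byte */
};

/* Hypothetical helper: collect up to max_out sections that have file
 * contents and are flagged executable. Returns the number of such
 * sections found, 0 if there is no section header table, or -1 if the
 * headers point outside the mapped file. */
static int find_exec_sections(const uint8_t *map, size_t map_size,
                              struct code_range *out, int max_out)
{
    Elf32_Ehdr eh;
    int count = 0;

    if (map_size < sizeof eh)
        return -1;
    memcpy(&eh, map, sizeof eh);

    if (eh.e_shoff == 0 || eh.e_shnum == 0)
        return 0; /* section headers stripped */
    if (eh.e_shentsize < sizeof(Elf32_Shdr) || eh.e_shoff > map_size ||
        (size_t)eh.e_shnum * eh.e_shentsize > map_size - eh.e_shoff)
        return -1;

    for (unsigned i = 0; i < eh.e_shnum; i++) {
        Elf32_Shdr sh;
        memcpy(&sh, map + eh.e_shoff + (size_t)i * eh.e_shentsize, sizeof sh);

        if (sh.sh_type != SHT_PROGBITS || !(sh.sh_flags & SHF_EXECINSTR))
            continue;
        if (sh.sh_offset > map_size || sh.sh_size > map_size - sh.sh_offset)
            return -1;
        if (count < max_out) {
            out[count].offset = sh.sh_offset;
            out[count].size = sh.sh_size;
            out[count].addr = sh.sh_addr;
        }
        count++;
    }
    return count;
}
```

Each returned range can then be fed to the disassembly loop in place of the segment's p_offset and p_filesz.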