assembly x86 reverse-engineering cpu-registers

Register usage tracking x86

I have a list of assembly instructions for an application, and I'd like to know what registers are free to use and what registers are in use at any index of the list.

How can I know when a register is being used and when it's freed (free to be used again) ? My goal is to reach the registers that are really free.

Here is my assumption to solve the problem, it might sound stupid as I have so limited knowledge of assembly.

Terminology : Read (source), Write (destination)

Flag all instructions as WRITES TO or READS FROM for each register (use http://ref.x86asm.net/coder32.html for all information as described here).
Track all read & write operations to the register and find out when it's free to use. Example:
- Register X is being written by an instruction at 5.
- Register X is being read by an instruction at 8.
- Register X is being written by an instruction at 15.
- I can then say that Register X is not being read between 8-15 and being reassigned at 15. It means that it's free between instruction 8 and 15 but is busy between 5-8.

Does it solve the problem or make any sense? I'm open to other solutions as well.

Update after comments :

I see that JMPS/calls/conditional moves mess up all of this. Just to keep it safe (safe = free registers are really free), how about doing something like this: When I see any each jump/call/conditional move to the outside, I mark all registers as "being read" with a maximally pessimistic assumption as harold describes. I believe that I'd have safer results in that case, even though it won't be good as registers would be in busy state most of the time. Would you be able to agree that my results would be safe in that way?

Instructions:

Instruction at 1 : Writes to Register X.
Instruction at 3 : Conditional JMP / MOV / CALL.
Instruction at 5 : Writes to Register X.
Instruction at 8 : Reads from Register X.
Instruction at 15 : Writes to Register X.

Results :

Between 1-3 : Register X is busy as I count outside relocation at 3 as READ.
Between 3-5 : Register X is free as no more read.
Between 5-8 : Register X is busy as there's a read operation at 8.
Between 8-15 : Register X is free.

Update 2

I'll instead split the application into basic blocks where each block represents a chunk of code between jumps (also conditional), calls and returns. The jump statements will be the end of the block. I'll then analyze the each block assuming that all of the registers are in use at the start. I'll probably miss lots of free registers, but when I get one, I'll know that that one is a really free one =)

Update 3

I'm still trying to improve the solution based on feedbacks (thanks harold).

I've read liveness analysis and as I understand it is recommended to analyze from the end of the application to the start. But I don't know the end of the application in a compiled assembly as mentioned as halt problem in the comments below, so I'll do the other way around with future branches.

Mark all instructions as described. Read = Source, Write = Destination. Conditional moves are counted as read to the source and write to the destination.
Separate all instructions into blocks. Split them with any conditional or direct branch exits.
Link all blocks to each other, each block has may_continue_with container that holds pointers to branches that it might continue with.
Ask a block if a register X is free at A index. Block checks if first access to the register X after index A is a read or write operation. First, it checks its own instructions, if it does not find any instruction operating on the register, it asks the same question to the future branches. (future branches ask further if they don't have any instruction that accesses register X either). The block returns :
- false if any of future branches has read access first
- true if all of the future branches has write access first (only one of the branches is not enough as we do not know which branch will be executed).

Solution

Step 1 is easy, just a lot of work to actually implement (requires parsing of x86 machine code), there are some libraries that may help with that.

Step 2 is not as easy as you make it sound, because of control flow. Liveness analysis is, under the assumption that control flow is simple, is a well known problem. There are some extra problems in this case though.

A procedure may be called whose input registers have not been determined yet. That's mostly due to unlucky ordering, processing procedures depth-first reduces the problem. But there is not necessarily any order for the procedures such that this problem does not arise, because of (mutual) recursion. You may do interprocedural liveness analysis to solve that.
An unknown procedure may be called (virtual methods or explicit calls through function pointers). I don't think you can do much except make a maximally pessimistic assumption.
Even intraprocedural control flow may be unpredictable, such as when a switch is compiled to a jump table. Actually it's not guaranteed that variable jumps will be interprocedural in the first place, you can probably mostly assume that.. I really don't know what to do about that.

I'll then analyze the each block assuming that all of the registers are in use at the start.

That's a huge waste, which can easily be avoided by doing proper liveness analysis.