
Is Q-learning without a final state even possible?


I have to solve this problem with Q-learning. Well, actually, I have to evaluate a Q-learning-based policy on it.

I am a tourism manager.

I have n hotels; each can hold a different number of people.

For each person I place in a hotel I get a reward, based on which room I have chosen.

If I want, I can also murder the person, so they go in no hotel but give me a different reward. (OK, that's a joke... but the point is that I can have a self-transition, so the number of people in my rooms doesn't change after that action.)

  • my state is a vector containing the number of people in each hotel.

  • my action is a vector of zeros and ones that tells me where I put
    the new person.

  • my reward matrix holds the reward I get for each transition between
    states (including the self-transition); see the sketch below.
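For concreteness, here is a tiny sketch of how I picture this formulation (the sizes and numbers are made up):

```python
import numpy as np

# Made-up example: 3 hotels with different capacities.
capacities = np.array([2, 3, 1])

# State: how many people currently sit in each hotel.
state = np.array([0, 1, 0])

# Action: a one-hot vector saying where the new person goes;
# the all-zero vector is the self-transition (the person goes nowhere).
action = np.array([0, 1, 0])

# Taking the action just adds it to the state (capacity permitting),
# and the reward matrix maps the (state, next_state) pair to a reward.
next_state = state + action
```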

Now, since I can accept an unlimited number of people (i.e. I can fill the hotels but keep "killing" newcomers), how can I build the Q matrix? Without the Q matrix I can't get a policy, and so I can't evaluate it...

What am I seeing wrongly? Should I choose an arbitrary state as the final one? Have I missed the point entirely?


Solution

  • RL problems don't need a final state per se. What they need are rewards. So, as long as your transitions produce rewards and you discount future rewards with a factor gamma < 1 (so the infinite sum of returns stays finite), you are good to go, I think.

    I don't have a lot of experience with RL problems like this one. As a commenter suggests, this sounds like a really huge state space. If you are comfortable with a discrete approach, you would get a good start and learn something about your problem by limiting its scope (a finite number of people and hotels/rooms) and turning Q-learning loose on the smaller state matrix; the sketch at the end of this answer shows what that might look like.

    Or, you could jump straight into a method that can handle an infinite state space, such as a neural network used as a function approximator.

    In my experience, if you have the patience to try the smaller problem first, you will be better prepared to solve the bigger one next.
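To make the capped version concrete, here is a rough sketch of tabular Q-learning on a continuing task, i.e. with no terminal state at all. It is not a drop-in solution: the capacities, rewards, and hyperparameters are invented for illustration. The loop simply runs for a fixed step budget, and gamma < 1 keeps the value estimates bounded:

```python
import itertools
import random

import numpy as np

# --- Toy environment: all capacities and rewards are invented numbers ---
capacities = (2, 2)          # two tiny hotels
place_reward = (5.0, 3.0)    # reward for placing a person in hotel i
refuse_reward = 1.0          # reward for the self-transition ("the murder")

# Enumerate every occupancy vector up to the caps: the finite state space.
states = list(itertools.product(*(range(c + 1) for c in capacities)))
state_index = {s: i for i, s in enumerate(states)}
n_actions = len(capacities) + 1   # place in hotel 0..n-1, or refuse


def step(state, action):
    """Return (next_state, reward); the last action index means 'refuse'."""
    if action < len(capacities) and state[action] < capacities[action]:
        nxt = list(state)
        nxt[action] += 1
        return tuple(nxt), place_reward[action]
    # Hotel full or explicit refusal: self-transition, different reward.
    return state, refuse_reward


# --- Tabular Q-learning on a continuing task: no terminal state at all ---
alpha, gamma, epsilon = 0.1, 0.9, 0.1
Q = np.zeros((len(states), n_actions))

state = (0, 0)
for _ in range(50_000):                    # fixed step budget, not episodes
    s = state_index[state]
    if random.random() < epsilon:
        a = random.randrange(n_actions)
    else:
        a = int(np.argmax(Q[s]))
    next_state, r = step(state, a)
    ns = state_index[next_state]
    # Standard Q-learning update; gamma < 1 keeps the values bounded.
    Q[s, a] += alpha * (r + gamma * np.max(Q[ns]) - Q[s, a])
    state = next_state
    if random.random() < 0.01:             # occasional random restart so the
        state = random.choice(states)      # pre-full states keep being visited

# Greedy policy: which action to take at each occupancy level.
policy = {s: int(np.argmax(Q[state_index[s]])) for s in states}
print(policy)
```

If the small version behaves sensibly, the same update carries over to the neural-network route: replace the table lookup Q[s, a] with a parameterized function and the assignment with a gradient step, as in deep Q-learning.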