Understanding the solvePnP Algorithm

I'm having trouble understanding the Perspective-n-Point problem. A few questions:

What is s for? Why do we need a scale factor for the image point?
Is K[R|T] a "change of coordinates matrix" which moves p_w, the homogeneous world point, into the coordinate space of the 2D image plane?
I understand that [R|T] represents the "rotation and translation" of the camera relative to the corresponding world point p_w and that is what we are trying to solve for. What's particularly difficult about this? Can't we just say [R|T] =inv(K)s(p_c)inv(p_w)? I just did this with some basic matrix algebra.
I don't understand why PnP has multiple solutions... what are these multiple solutions exactly?

Thanks for any help!

Solution

The scale factor is needed because the image alone cannot tell you whether you are looking at a small object from a short distance or a large object from a greater distance.
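As a minimal illustration of this ambiguity (the focal length, principal point, and the helper name project are made-up for the sketch, not taken from the question), a point on a small, close object and a point on an object ten times larger and ten times farther away land on the same pixel:

```python
def project(point_cam, f=800.0, cx=320.0, cy=240.0):
    """Pinhole projection of a point given in camera coordinates (X, Y, Z).

    f, cx and cy are hypothetical intrinsics chosen only for this sketch.
    """
    X, Y, Z = point_cam
    # Dividing by Z is where the scale factor s disappears from the image.
    return (f * X / Z + cx, f * Y / Z + cy)


small_near = (0.1, 0.05, 1.0)   # point on a small object, 1 unit away
big_far = (1.0, 0.5, 10.0)      # same direction, 10x larger and 10x farther

pixel_near = project(small_near)
pixel_far = project(big_far)
print(pixel_near, pixel_far)
```

Both calls see the same ratios X/Z and Y/Z, so the pixel alone cannot distinguish the two points; s carries the depth that the image has lost.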

In the typical pinhole camera equation

s * p_c = K [R|T] p_w

s represents the Z coordinate of the point in the camera coordinate system.
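A short numeric check of that statement, assuming NumPy and made-up values for K, R, T and the world point (none of these numbers come from the question): projecting with K[R|T] gives a homogeneous 3-vector whose third component is s, and it equals the point's Z in the camera frame.

```python
import numpy as np

# Hypothetical intrinsics and pose, chosen only for illustration.
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)                          # camera axes aligned with the world axes
T = np.array([[0.0], [0.0], [5.0]])    # world origin is 5 units in front of the camera
Rt = np.hstack([R, T])                 # the 3x4 matrix [R|T]

p_w = np.array([1.0, 0.5, 0.0, 1.0])   # homogeneous world point

p_cam = Rt @ p_w        # the point expressed in camera coordinates (X, Y, Z)
s_p_c = K @ p_cam       # equals s * (u, v, 1)
s = s_p_c[2]            # the scale factor
p_c = s_p_c / s         # homogeneous image point (u, v, 1)

print("s =", s, " camera Z =", p_cam[2])
print("pixel =", p_c[:2])
```

Because the last row of K is (0, 0, 1), the third component of K @ p_cam is just the camera-frame Z, which is why s and Z coincide here.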

Right, K[R|t] is the projection matrix, which maps 3D coordinates in some object/world/global coordinate system to 2D image coordinates (up to the scale factor s), as in the equation above.
It is not that easy, because you usually don't know the point's coordinates in the camera coordinate system; you only know its 2D coordinates in the image coordinate system. (Also, p_w is a single column vector, so inv(p_w) does not exist.) The transformation from the camera coordinate system to the image coordinate system loses one dimension, and the scale factor, which itself depends on the unknown pose, makes the equation not exactly linear. That's why it is not so easy to compute.
Different algorithms use different approaches to supply the additional information needed for a solution. For example, the DLT (direct linear transform) method exploits the structure of the projection matrix. Besides analytic solutions, there are also many methods based on nonlinear optimization, for example the Levenberg-Marquardt algorithm used in OpenCV.
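To make the DLT idea concrete, here is a rough sketch assuming NumPy, known intrinsics K, and at least six non-coplanar correspondences. The function name dlt_pose and all numbers are hypothetical, and this is not how OpenCV implements solvePnP; it only shows how eliminating s turns each correspondence into two linear equations in the entries of the projection matrix.

```python
import numpy as np


def dlt_pose(world_pts, image_pts, K):
    """Estimate R, t from >= 6 non-coplanar 3D-2D correspondences (DLT sketch)."""
    rows = []
    for (X, Y, Z), (u, v) in zip(world_pts, image_pts):
        Pw = [X, Y, Z, 1.0]
        # Eliminating s from s*(u, v, 1) = P @ Pw leaves two linear equations in P.
        rows.append(Pw + [0.0] * 4 + [-u * c for c in Pw])
        rows.append([0.0] * 4 + Pw + [-v * c for c in Pw])
    A = np.array(rows)
    # The flattened P is the right singular vector of the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    P = Vt[-1].reshape(3, 4)
    M = np.linalg.inv(K) @ P              # proportional to [R|t]
    # Snap the 3x3 part to the nearest orthogonal matrix; its singular
    # values give the unknown overall scale of P.
    U, S, Vt_r = np.linalg.svd(M[:, :3])
    R = U @ Vt_r
    scale = S.mean()
    if np.linalg.det(R) < 0:              # P is only defined up to sign
        R, scale = -R, -scale
    t = M[:, 3] / scale
    return R, t


# Synthetic, noise-free check with made-up intrinsics and pose.
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
a = 0.3                                   # rotation about the Y axis, in radians
R_true = np.array([[np.cos(a), 0.0, np.sin(a)],
                   [0.0, 1.0, 0.0],
                   [-np.sin(a), 0.0, np.cos(a)]])
t_true = np.array([0.1, -0.2, 5.0])

rng = np.random.default_rng(0)
world_pts = rng.uniform(-1.0, 1.0, size=(10, 3))
cam_pts = world_pts @ R_true.T + t_true   # points in camera coordinates
proj = cam_pts @ K.T                      # rows are s * (u, v, 1)
image_pts = proj[:, :2] / proj[:, 2:3]    # divide by s to get pixels

R_est, t_est = dlt_pose(world_pts, image_pts, K)
print("rotation recovered:", np.allclose(R_est, R_true, atol=1e-5))
print("translation recovered:", np.allclose(t_est, t_true, atol=1e-5))
```

With noisy measurements a linear estimate like this is usually only a starting point, which is where the iterative methods mentioned above come in: they refine R and t by minimizing the reprojection error.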