In different assembly languages, MUL (x86) / MULT (MIPS) refer to multiplication. It is a black box for the programmer. I am interested in how a CPU actually accomplishes a multiplication, regardless of the architecture. Let's say I have two 16-bit values in my registers and I am the CPU, so I have to implement MUL using the other bit-fiddling instructions I have (and, or, xor, not, shl, shr, etc.). What shall I do?
The Wikipedia article on multiplication ALUs (http://en.wikipedia.org/wiki/Multiplication_ALU) lists different methods for doing multiplication in a digital circuit.
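The simplest of those methods is shift-and-add, which also maps directly onto the instructions you listed. Here's a minimal sketch in C (my own names, not taken from any particular ISA) where even the addition is built only from xor, and, and shl:

```c
#include <stdint.h>
#include <stdio.h>

/* "Add" built only from xor, and, shl: the carry-free sum is a ^ b,
   the carries are (a & b) << 1, repeat until no carries remain. */
static uint32_t add_bits(uint32_t a, uint32_t b)
{
    while (b != 0) {
        uint32_t carry = (a & b) << 1;
        a = a ^ b;
        b = carry;
    }
    return a;
}

/* Shift-and-add multiply of two 16-bit values into a 32-bit result:
   for each set bit i of b, add (a << i) into the running total. */
static uint32_t mul16(uint16_t a, uint16_t b)
{
    uint32_t result = 0;
    uint32_t shifted = a;          /* holds a << i */
    for (int i = 0; i < 16; i++) {
        if (b & 1)                  /* test the current bit of b */
            result = add_bits(result, shifted);
        shifted <<= 1;
        b >>= 1;
    }
    return result;
}

int main(void)
{
    printf("%u\n", mul16(1234, 5678));   /* prints 7006652 */
    return 0;
}
```

Hardware multipliers are essentially parallelized versions of this same partial-product-and-sum idea.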
When I worked on a project in college to add SIMD instructions to a DEC Alpha-like processor in Verilog, we implemented a Wallace tree multiplier, primarily because it runs in a fixed number of cycles and is easy to pipeline.
Apparently, Dadda multipliers are (nearly?) universally used in real-life CPU ALUs, including modern x86. Like Wallace multipliers, they can also be pipelined with fixed latency.
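Both Wallace and Dadda multipliers start from the same partial products and reduce them with layers of carry-save adders (3:2 compressors) until only two rows remain, which a normal carry-propagate adder then sums. The two schemes differ in how they group columns, which this software sketch doesn't capture; it only illustrates the carry-save reduction idea, with names of my own choosing:

```c
#include <stdint.h>
#include <stdio.h>

static uint32_t mul16_carry_save(uint16_t a, uint16_t b)
{
    uint32_t rows[16];
    int n = 0;

    /* Generate the partial products: one shifted copy of a
       for each set bit of b. */
    for (int i = 0; i < 16; i++)
        if (b & (1u << i))
            rows[n++] = (uint32_t)a << i;

    if (n == 0)
        return 0;

    /* Carry-save step: compress three rows into two without
       propagating carries.
         sum   = x ^ y ^ z
         carry = majority(x, y, z) << 1
       In hardware all these compressions happen in parallel layers. */
    while (n > 2) {
        uint32_t x = rows[n - 1], y = rows[n - 2], z = rows[n - 3];
        uint32_t sum   = x ^ y ^ z;
        uint32_t carry = ((x & y) | (x & z) | (y & z)) << 1;
        rows[n - 3] = sum;
        rows[n - 2] = carry;
        n--;
    }

    /* Final carry-propagate addition of the last two rows. */
    return (n == 1) ? rows[0] : rows[0] + rows[1];
}

int main(void)
{
    printf("%u\n", mul16_carry_save(1234, 5678));  /* prints 7006652 */
    return 0;
}
```

The point of the carry-save layers is that each one takes constant time regardless of operand width, so the expensive carry propagation only has to happen once, at the very end.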
EDIT: You mentioned using the other bit-fiddling instructions. On modern processors, multiplication would not be microcoded like this; it'd be way too slow and the processor would get slaughtered in benchmarks.