Until now I thought that a 32-bit processor can use 4 GiB of memory because 232 is 4 GiB, but this approach means processor have word size = 1 byte. So a process with 32-bit program counter can address 232 different memory words and hence we have 4 GiB.
But if a processor has word size larger than 1 byte, which is the case with most of processors now days I believe (My understanding is that word size is equal to the width of data bus, so a processor with 64-bit data bus must have a word size = 8 bytes).
Now same processor with 32 bit Program counter can address 2^32 different memory words, but in this case word size is 8 bytes hence it can address more memory which contradicts with 4 GiB thing, so what is wrong in my argument ?
Indeed a 32-bit program counter can address 232 different memory locations, but word-addressable memory is only used in architectures for very special purposes like DSPs or antique architectures in the past. Modern architectures for general computing all use byte-addressable memory.
But in practice, modern 32-bit ISAs have extensions to allow wider addresses, often introduced as a stopgap before a 64-bit version of the ISA was ready.
Your premise is incorrect. 32-bit architectures can address more than 4GB of memory, just like most (if not all) 8-bit microcontrollers can use more than 256 bytes of memory. 4 GiB was huge when 32-bit systems were new, but is now rather cramped, especially system-wide total memory (physical address space, vs. virtual address-space of a single process.)
Even moder in 32-bit byte-addressable architectures there are many ways to access more than 4GB of memory. For example 64-bit JVM can address 32GB of memory with 32-bit pointer using compressed Oops. See the Trick behind JVM's compressed Oops
32-bit x86 CPUs can also address 64GB (or more in later versions) of memory with PAE. It basically adds a another level of indirection in the TLB with a few more bits in the address. That allows the whole system to access more than 4GB of memory. However the pointers in applications are still 32-bit long so each process is still limited to 4GB at most. The analog on ARM is LPAE.
The 4GB address space of each process is often split into user and kernel space (before Meltdown), hence limited the user memory even more. There are several ways to workaround this