Quick sort "lomuto" partition algorithm: alternate implementation analysis

Please consider the quick sort "lomuto partition" scheme from the classic algorithms textbook Introduction to Algorithms by Cormen, Leiserson, Rivest & Stein.

PARTITION(A, p, r)
x = A[r]
i = p - 1
for j = p to r - 1
    if A[j] <= x
        i = i + 1
        exchange A[i] with A[j]
exchange A[i + 1] with A[r]
return i + 1

After implementing this algorithm, I found that it did not work properly and at best would be off by 1 element. I then proceeded to calculate a solution and finally came up with the following C code that works. I have commented in the code as to how the algorithm works.

int quickSortPartition(int *array, int beginIndex, int endIndex) // end index = array length - 1.
{
    int pivotIndex = beginIndex; // first element as pivot.
    int pivotValue = array[pivotIndex]; // initial pivot value.
    int i = beginIndex + 1; // start loop with i being 2nd index.
    while(i < endIndex) // loop running until end of array.
    {
        if(array[i] > array[i + 1]) // comparing the 2 elements ahead of pivot.
        {
            swap(array, i, i + 1); // swapping the 2 elements if prev element > next element.
        }
        if(array[i] < pivotValue) // comparing element at pivot index with the next index element.
        {
            swap(array, pivotIndex, i); // swapping if next element is less than element at pivot index.
            ++pivotIndex; // incrementing pivot index by 1 ONLY when swap occurs.
        }
        ++i; // drive loop.
    }
    if(array[pivotIndex] > array[pivotIndex + 1]) // at the very end, compare whether the element to the right of pivot is > element at pivot index.
    {
        swap(array, pivotIndex, pivotIndex + 1); // swapping if next element is less than element at pivot index.
        ++pivotIndex; // incrementing pivot index by 1 ONLY when swap occurs.
    }
    return pivotIndex; // returning new pivot index.
}

Is my implementation less efficient than that of the book's so called "lomuto partition"? Sure, they are both O(n), but what about in terms of the number atomic operations like assignments & comparisons in one iteration? Does it make a significant difference in efficiency on large scale cases? Why did the book's algorithm not work? What piece is it missing? I would also appreciate advice on how to further streamline my code.

Additional follow-up information:

I have implemented the book's code in python.

def quicksort(A, p, r):
    if p < r:
        q = partition(A, p, r)
        quicksort(A, p, q - 1)
        quicksort(A, q + 1, r)

def partition(A, p, r):
    x = A[r]
    i = p - 1
    for j in range(p, r - 1):
        if A[j] <= x:
            i = i + 1
            A[i], A[j] = A[j], A[i]
    A[i + 1], A[r] = A[r], A[i + 1]
    return i + 1

array = [2, 55, 43, 12, 65, 72, 41, 18, 6]
print(array)
quicksort(array, 0, len(array) - 1)
print(array)

The result:

unsorted: [2, 55, 43, 12, 65, 72, 41, 18, 6]
sorted: [2, 6, 41, 43, 12, 55, 65, 72, 18]

As you can see, the sorting failed. Also, if I were to use len(array) for the third parameter, there would be an overflow error.

As for my C code that uses my algorithm and works, this is my header file you may test.

#ifndef A1_H
#define A1_H

// am1n

#include<stdio.h>
#include<stdlib.h>

void printArray(int *array, int arrayLength)
{
    int i = 0;
    --arrayLength;
    for(i = 0; i < arrayLength; ++i)
    {
        printf("%d, ", array[i]);
    }
    printf("%d.\n", array[i]);
}

void swap(int *array, int a, int b)
{
    if(array[a] != array[b])
    {
        int tempValue = array[a];
        array[a] = array[b];
        array[b] = tempValue;
    }
}

int fastSortPartition(int *array, int beginIndex, int endIndex) // end index = array length - 1.
{
    int pivotIndex = beginIndex; // first element as pivot.
    int pivotValue = array[pivotIndex]; // initial pivot value.
    int i = beginIndex + 1; // start loop with i being 2nd index.
    while(i < endIndex) // loop running until end of array.
    {
        if(array[i] > array[i + 1]) // comparing the 2 elements ahead of pivot.
        {
            swap(array, i, i + 1); // swapping the 2 elements if prev element > next element.
        }
        if(array[i] < pivotValue) // comparing element at pivot index with the next index element.
        {
            swap(array, pivotIndex, i); // swapping if next element is less than element at pivot index.
            ++pivotIndex; // incrementing pivot index by 1 ONLY when swap occurs.
        }
        ++i; // drive loop.
    }
    if(array[pivotIndex] > array[pivotIndex + 1]) // at the very end, compare whether the element to the right of pivot is > element at pivot index.
    {
        swap(array, pivotIndex, pivotIndex + 1); // swapping if next element is less than element at pivot index.
        ++pivotIndex; // incrementing pivot index by 1 ONLY when swap occurs.
    }
    return pivotIndex; // returning new pivot index.
}

void fastSort(int *array, int beginIndex, int endIndex)
{
    if(beginIndex < endIndex)
    {
        int pivotIndex = fastSortPartition(array, beginIndex, endIndex);
        fastSort(array, beginIndex, pivotIndex - 1);
        fastSort(array, pivotIndex + 1, endIndex);
    }
}

#endif

Solution

Let N be the number of elements in the subarray being processed.
Let L be the number of items that eventually fall into the left part of the array – less than or equal to the pivot in the original algorithm, less than the pivot plus the pivot itself in your code.

Both routines must make N–1 steps through the array and N–1 comparisons to check every element and identify those less than or less-or-equal to the pivot. And the original routine makes exactly that number of comparisons. But your code does twice that much: in every iteration of the loop it compares array[i] to pivotValue AND array[i] to array[i + 1].

Both routines must also make almost exactly L swaps to group those smaller items together, plus up to two swaps for temporarily moving the pivot out of way and back to place. But your code may make up to N–1 additional swaps to push the 'big' elements rightwards with no significant gain.

So the answer is: no, your proposed code certainly is not more efficient than the original algorithm, neither in a number of comparisons nor in a number of swaps.

Your code does about twice as much work (elementary operations) as the Lomuto's method. But that's just a scaling factor; taking things asymptotically they are both linear, O(N).

It turns out I was right: your implementation was NOT the Lomuto algorithm – due to incorrect use of for j in range() you did not test the last item of the array, as the original routine requires.