Why is this C++ deque code 'less efficient' when I demodularize it?

This is a problem I faced when studying solutions to this problem on HackerRank. It basically boils down to the following: given an array A and an integer K, the problem asks you to find the maximum of each contiguous subarray of A with size K. There is some additional stuff (for each test case, this problem occurs T times, and one must read from input to obtain A), but that's the gist of it.

The site checks whether your answer is sufficiently efficient, and even if your code produces the correct output in the end, the submission may fail due to timeout. Now, when your first get to the code area, there is some predefined code for you:

#include <iostream>
#include <deque> 
using namespace std;

void printKMax(int arr[], int n, int k){
    //Write your code here.
}

int main(){

    int t;
    cin >> t;
    while(t>0) {
        int n,k;
        cin >> n >> k;
        int i;
        int arr[n];
        for(i=0;i<n;i++)
            cin >> arr[i];
        printKMax(arr, n, k);
        t--;
    }
    return 0;
}

All fine and dandy, the site already handles input for your benefit, and points you towards a modular approach. All you have to do is actually solve the contiguous subarray problem; I did it with the following function, which passes all test cases:

void printKMax(int arr[], int n, int k){
    // Queue will store up to k elements (moving window)
    // Elements stored will be indices of the array. Indices allow to compare with loop iterator for removal of oldest element as window moves
    // Most recent elements are pushed to the back (so they take longer to be popped)
    std::deque<int> Q(k); 

    // Iterate over the first window
    int i; 
    for (i = 0; i < k; ++i) { 
        // A new element (arr[i]) to be added at the back invalidates all elements added before it that are no greater (if they are smaller, they cannot be a maximum; if they are equal, the new element is more recent). Those elements are popped
        // This guarantees that the values corresponding to Q's indices will be increasing (strictly) from back to front. In particular, the front will be the maximum
        while ((!Q.empty()) && arr[i] >= arr[Q.back()]) 
            Q.pop_back();
        Q.push_back(i); 
    }

    // Iterate, letting the window move
    for (; i < n; ++i) { 
        // Print the largest element (window has not moved yet), followed by a space
        cout << arr[Q.front()] << " "; 

        // Use indices to determine if oldest element has gone out of the window
        if (Q.front() <= i - k) Q.pop_front();

        // Add new element, like before
        while ((!Q.empty()) && arr[i] >= arr[Q.back()]) 
            Q.pop_back(); 
        Q.push_back(i); 
    } 

    // Print the maximum element of last window, and end the line 
    cout << arr[Q.front()] << endl; 
}

However, looking back on it, I thought I could do (slightly?) better. For each test case, the input handling in main loops n times to fill the array arr, and then printKMax in turn also loops n times to create and slide the moving window. I thought I could combine these into a single loop, which I guess should be more efficient. I did it with the following code, which is basically the same as before but with printKMax incorporated into the main. I'll omit comments this time.

int main(){

    int t;
    cin >> t;
    while(t>0) {
        int i = 0, n, k;
        cin >> n >> k;
        int arr[n];

        std::deque<int> Q(k);

        for (; i < k; ++i) {
            cin >> arr[i];
            while ((!Q.empty()) && arr[i] >= arr[Q.back()]) 
                Q.pop_back();
            Q.push_back(i); 
        }

        for (; i < n; ++i) {
            cout << arr[Q.front()] << " "; 

            if (Q.front() <= i - k) Q.pop_front();

            cin >> arr[i];
            while ((!Q.empty()) && arr[i] >= arr[Q.back()]) 
                Q.pop_back(); 
            Q.push_back(i); 
        } 

        cout << arr[Q.front()] << endl;

        t--;
        }
    return 0;
}

This time, however, the code fails all but the simplest of test cases, in each case due to time limit constraints.

I understand that in practice modularization is good, but at this point my interest is more 'academic' than anything else. Is there something I'm missing about the actual program, or is this something related to how things are run behind the scenes that I'm unaware of?

EDIT: As suggested by PaulMcKenzie, I've used g++ to run both versions of the program, using one of the failed test cases as input. Both ran fine and produced the same output, but the one with printKMax ran in 3.97s while the one without ran in 5.44s. It takes about 37% longe.

Solution

You second version is slower, because you're interleaving your input and output.

Reading from cin causes any output buffered for cout to be flushed first. The idea is that there is a user that needs to see your prompt.

Since you switch between reading and writing many times for each test, it results in many I/O operations to write to the console, which is not so fast.

See: https://www.geeksforgeeks.org/fast-io-for-competitive-programming/ and https://en.cppreference.com/w/cpp/io/basic_ios/tie

you can fix it with cin.tie(NULL);