Tags: python, performance, pandas, timing

How do I diagnose an egregious difference in Python script (pandas aggregate) runtime between two machines?


I have the same Python script running on two machines, on the same input and giving the same output. It takes about 7 minutes to run on my laptop and 56 minutes to run on a server that I uploaded it to.

Obviously, the environment is different, but the difference in speed (8x!) was too egregious for me to just write it off as different environments. Both machines have plenty of memory, and the server has more cores than my laptop, but also more stuff running on it (plus I don't think the cores matter for this operation). It's normal for me to run stuff on both my laptop and the server, and I've never seen even a 2x difference in speed before, let alone an 8x one.

The script is pretty simple; most of the work is a single pandas.DataFrame.groupby().aggregate(lambda) call, but on a large (~400 MB) table.
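To illustrate the workload shape, here is a minimal sketch of that pattern (the column names and the lambda are illustrative; the real script from the question is not shown):

```python
import pandas as pd

# Toy stand-in for the question's ~400 MB table.
df = pd.DataFrame({
    "sv_id": [1, 1, 2, 2, 2],
    "value": [3, 4, 5, 6, 7],
})

# A Python-level lambda forces pandas down its per-group pure-Python
# aggregation path rather than the optimized Cython one.
result = df.groupby("sv_id")["value"].agg(lambda x: sum(x))
print(result)
# sv_id 1 -> 7, sv_id 2 -> 18
```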

How can I diagnose what's causing the speed difference?


Below is what I tried so far, but I know very little about this, so it may or may not be interesting.

cProfile

I tried running python -m cProfile in both environments to see if any particular function call was taking a long time, but it seemed to be across the board:

Laptop:

  ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       93   56.976    0.613  372.694    4.007 {method 'get_result' of 'pandas.lib.SeriesGrouper' objects}
 16338970   46.629    0.000  250.104    0.000 aggregate_sv_calls_by_sv_id.py:42(<lambda>)
 18442616   33.808    0.000   56.950    0.000 {sorted}
 18442645   25.395    0.000  172.113    0.000 series.py:1033(__iter__)
78056747/78056745   15.405    0.000   15.420    0.000 {isinstance}
 18446903   14.235    0.000   34.129    0.000 dtypes.py:68(is_dtype)
 18443264   13.515    0.000   21.058    0.000 internals.py:3806(dtype)
 18442666   13.447    0.000   30.854    0.000 common.py:2192(is_datetime_or_timedelta_dtype)
 18449428   13.250    0.000   13.250    0.000 {hasattr}
 18442793   13.001    0.000   19.134    0.000 internals.py:3833(internal_values)
...

Server:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
312253983  175.564    0.000  275.203    0.000 {isinstance}
       93  153.294    1.648 3323.806   35.740 groupby.py:1885(_aggregate_series_pure_python)
 16338970  151.336    0.000  749.431    0.000 aggregate_sv_calls_by_sv_id.py:42(<lambda>)
 18443129  146.950    0.000  607.143    0.000 internals.py:2482(make_block)
 18442874  136.363    0.000  529.415    0.000 series.py:120(__init__)
 18443241  125.725    0.000  248.772    0.000 generic.py:2674(__setattr__)
 18442596  106.353    0.000 1336.629    0.000 internals.py:3868(get_slice)
 18442617  106.296    0.000  172.363    0.000 {sorted}
 18442596   89.203    0.000 2105.284    0.000 series.py:689(_get_values)
 18443048   84.777    0.000   91.065    0.000 base.py:506(_get_attributes_dict)
...
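For reference, profiles like the tables above can be generated and sorted by total time with the standard-library cProfile and pstats modules; a sketch with a stand-in workload in place of the real script:

```python
import cProfile
import io
import pstats

def work():
    # Stand-in for the real aggregation script.
    return [sorted(range(1000)) for _ in range(100)]

pr = cProfile.Profile()
pr.enable()
work()
pr.disable()

# Print the ten most expensive functions by total time, matching the
# sort order of the tables above.
buf = io.StringIO()
pstats.Stats(pr, stream=buf).sort_stats("tottime").print_stats(10)
print(buf.getvalue())
```

The same report can be produced for a whole script with `python -m cProfile -o profile.out script.py` and then loading `profile.out` with `pstats.Stats`.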

Since the slowdown seemed to be across the board rather than concentrated in any one function I could isolate, I tried to reduce the problem to a minimal, complete, verifiable example... This was the best I got:

%timeit

Laptop:

In [5]: df = pd.DataFrame(np.random.randint(10, size=(10000, 50)))

In [6]: %timeit df.groupby(0).sum()
100 loops, best of 3: 5.54 ms per loop

In [7]: %timeit df.groupby(0).agg(lambda x: sum(x))
1 loops, best of 3: 124 ms per loop

In [8]: %timeit df.groupby(0, 1).agg(lambda x: sum(x))
1 loops, best of 3: 155 ms per loop

Server:

In [5]: df = pd.DataFrame(np.random.randint(10, size=(10000, 50)))

In [6]: %timeit df.groupby(0).sum()
100 loops, best of 3: 6.08 ms per loop

In [7]: %timeit df.groupby(0).agg(lambda x: sum(x))
1 loops, best of 3: 215 ms per loop

In [8]: %timeit df.groupby(0, 1).agg(lambda x: sum(x))
1 loops, best of 3: 327 ms per loop
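The same comparison can be run outside IPython with the standard-library timeit module, which makes it easy to script the benchmark identically on both machines; a sketch (iteration count is arbitrary):

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(10, size=(10000, 50)))

# Fast path: built-in Cython aggregation.
t_fast = timeit.timeit(lambda: df.groupby(0).sum(), number=3)

# Slow path: Python-level lambda evaluated once per group per column.
t_slow = timeit.timeit(lambda: df.groupby(0).agg(lambda x: sum(x)), number=3)

print("cython sum: %.3fs  python lambda: %.3fs  ratio: %.1fx"
      % (t_fast, t_slow, t_slow / t_fast))
```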

So this doesn't quite capture the 8x speed difference. It only shows about a 2x speed difference.

But a difference of this size (2x) is something I'm used to chalking up to environmental factors I may never be able to pin down, like other load on the server or different processors. If I'm willing to write off a 2x difference as just different machines, does that mean I also have to be willing to accept an 8x difference the same way? Or is there more interesting stuff to dig into here?


Conclusion:

My pandas version was 0.17.1 on my laptop and 0.18.1 on the server. I compared the same code running on my laptop between 0.17.1 and 0.18.1, and 0.17.1 ran about 4x as fast. So ~4x can be attributed to that, and ~2x can be attributed to platform differences between the laptop and server.
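Checking the library versions on each machine is the quickest way to confirm (or rule out) this kind of cause before comparing timings:

```python
import pandas as pd

# The version string alone often explains a performance gap
# (0.17.1 vs 0.18.1 in the question).
print(pd.__version__)

# Full report of pandas, its dependencies, and the platform.
pd.show_versions()
```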


Solution

  • I suspect you're looking at two, maybe even three different causes.

    By comparing the two profiles, I see:

    • the same routine (sorted), part of the standard distribution, executed around the same number of times, shows a 3X time difference. This routine is CPU- and memory-bound and needs nothing else, so this may indicate a platform difference (faster CPU, slower memory, massive page faulting, possibly swap thrashing). The same can be seen in the calls to aggregate_sv_calls_by_sv_id.py.

    • said routine executes one extra time on the server, which indicates a difference between runs of the same algorithm. This might be the same as the next element, or indicate some sort of indeterminacy.

    • what appears to be the same routine, and is executed the same nontrivial number of times (93) on laptop and server, is called {method 'get_result' of 'pandas.lib.SeriesGrouper' objects} here, and groupby.py:1885(_aggregate_series_pure_python) there. This means there's a difference in the software base.

    All in all, I'd go out on a limb and say that your software version is, per se, around 2.5 times faster than the server's. Then the server runs the script 3 times slower, and 2.5 * 3 gives roughly the 8X factor you observe, if it is reproducible. Otherwise, either there really are three causes and not two, with the above indeterminacy playing a part, or the 3X slowdown on the server is due to aleatory circumstances (e.g. system load, which is very likely).
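The platform-difference hypothesis in the first point can be tested independently of pandas: time a pure-Python, CPU/memory-bound routine such as sorted on both machines. A sketch (sizes and iteration counts are arbitrary):

```python
import random
import timeit

# A pandas-free benchmark: if this also runs ~3x slower on the server,
# the slowdown is the platform, not the software stack.
data = [random.random() for _ in range(200_000)]
t = timeit.timeit(lambda: sorted(data), number=5)
print("sorted 200k floats x5: %.2fs" % t)
```

Running this on both machines separates the platform factor from the pandas-version factor identified in the conclusion.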