Speed of lambda vs. inlining the function

I have speed issues with lambda functions. Here is the code:

Lit Simplifier::lit_diff_watches(const OccurClause& a, const OccurClause& b)
{
    set_seen_for_lits(b, 1);

    size_t num = 0;
    Lit toret = lit_Undef;
    const auto check_seen = [&] (const Lit lit) {
        if (seen[lit.toInt()] == 0) {
            toret = lit;
            num++;
        }
    };
    for_each_lit(a, check_seen);
    /*switch(a.ws.getType()) {
        case CMSat::watch_binary_t:
            check_seen(a.lit);
            check_seen(a.ws.lit2());
            break;

        case CMSat::watch_tertiary_t:
            check_seen(a.lit);
            check_seen(a.ws.lit2());
            check_seen(a.ws.lit3());
            break;

        case CMSat::watch_clause_t: {
            const Clause& clause = *solver->clAllocator->getPointer(a.ws.getOffset());
            for(const Lit lit: clause) {
                check_seen(lit);
            }
            break;
        }
    }*/
    set_seen_for_lits(b, 0);

    if (num == 1)
        return toret;
    else
        return lit_Undef;
}

The for_each_lit function's signature is:

void for_each_lit(
   const OccurClause& cl
    , std::function<void (const Lit lit)> func
);

The function lit_diff_watches runs millions of times, and it takes 3.3s on an example. However, when I uncomment the switch, and comment out for_each_line (which is a copy-paste of the switch), I get 1.7s for the same exact run. Note that 99% of the time, watch_binary_t or watch_tertiary_t happens, i.e. only a very few instructions should be executed per lit_diff_watches function call.

Can you please tell me what I'm doing wrong? The behavior is the same for both GCC 4.7 and current llvm-svn (25th of Nov 2013), the timing differences are minor. I'm guessing that the function call is not inlined but I'm not an expert. I would like to have this fixed, because this switch(..){..} is in many-many places in the code, and the use of lambdas & for_each_lit would significantly clean up the code. However, I cannot loose so much speed over this. 10-20% would be fine, but a slowdown of almost 2x is too much.

Solution

cbreak-work helped me on the #c++ freenode IRC channel. He suggested what "Kerrek SB" wrote above in the comments. I changed for_each_lit's declaration to:

template<class Function>
void for_each_lit(
    const OccurClause& cl
    ,  Function func
);

and the speed is now the same for both implementations. Apparently, the compiler couldn't do the link-time optimization to get the virtual function call out of the way, and so the overhead was huge.

Thanks for all the help to everyone!