Tags: php, session, web-crawler, google-crawlers

How to handle PHP sessions efficiently?


This is what my init.php, which is loaded across the whole website, looks like:

$suid = 0;
session_set_cookie_params(60, '/', '.' . $_SERVER['HTTP_HOST'], true);
session_save_path(getcwd() . '/a/');
if (!isset($_SESSION['id'])) {
    session_start(['cookie_lifetime' => 60]);
    $_SESSION['id'] = session_id();
    $_SESSION['start'] = date('d_m_Y_H_i');
    $_SESSION['ip'] = $_SERVER['REMOTE_ADDR'];
} elseif (isset($_SESSION['uid'])) {
    $suid = $_SESSION['uid'];
}

I'm currently testing PHP sessions, so I just set 60 seconds as the lifetime.

I was wondering why sessions were being created even though no one knows the domain yet, so I added the IP to the session data. I looked one of the IPs up and found this:

[Screenshot: a lookup of the logged IP address, identifying it as a Google crawler]

So it was the Google crawler bot. Since there are plenty more search engines and bots out there, I don't want these crawls to create session files and fill up my web space.

So my questions are:

1) Even when the test lifetime value (60 seconds) is over, the session file remains in the custom directory. I read this is because I set a custom directory. Is this true?

2) What would be an efficient way to delete all non-used/expired session files? Should I add $_SESSION['last_activity'] with a timestamp and let a cronjob look in my custom dir, get the session file data and calculate which sessions have expired, then delete them?

3) Should I avoid saving those unneeded sessions by those bot crawlers just looking for the string "bot" inside $_SERVER['HTTP_HOST'] or is there a better way to identify "non-human visitors"/crawlers?

I'd also appreciate any improvements/suggestions to my code at the top. I previously caused some Internal Server Errors because session_start() was called too often, as far as I can tell from the php-fpm slow logs.


Solution

1) Even when the test lifetime value (60 seconds) is over, the session file remains in the custom directory. I read this is because I set a custom directory. Is this true?

    No, the custom directory is picked up by the session GC and the files will be cleaned up. It just doesn't happen immediately.

2) What would be an efficient way to delete all non-used/expired session files? Should I add $_SESSION['last_activity'] with a timestamp and let a cronjob look in my custom dir, get the session file data and calculate which sessions have expired, then delete them?

    PHP 7.1 has session_gc(), which you can call from a cronjob and it will do everything necessary.
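For illustration, here's a minimal sketch of such a cronjob script, assuming the default files handler and the same custom save path and lifetime as in your init.php (the path below is a placeholder - adjust it to your setup):

<?php
// Run from cron, e.g.: */15 * * * * php /path/to/session-gc.php
// Save path and lifetime must match what the website itself uses.
session_save_path('/var/www/example/a');   // placeholder path
ini_set('session.gc_maxlifetime', '60');   // your test lifetime

session_start();          // GC needs an initialized session module
$purged = session_gc();   // PHP 7.1+: removes expired session files
session_destroy();        // discard the session this script created

echo 'Purged ' . (int) $purged . " expired session(s)\n";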

On older PHP versions, you'd rely on the probability-based GC by default, where cleanups are performed at random on session_start().
This may not be particularly efficient, but it has been the only universal solution for over a decade, so ...
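For context, this is the mechanism those versions use: on each session_start(), there is a gc_probability/gc_divisor chance that the GC runs and removes sessions idle for longer than gc_maxlifetime. The values below are just the common example settings, not a recommendation:

// A 1-in-100 chance that any given session_start() also triggers the
// garbage collector, which deletes sessions idle for longer than
// session.gc_maxlifetime seconds.
ini_set('session.gc_probability', '1');
ini_set('session.gc_divisor', '100');
ini_set('session.gc_maxlifetime', '1440');   // 24 minutes, the PHP default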

However, if your server runs Debian, it likely has session.gc_probability set to 0 and uses a Debian-specific cron script to do the cleanup at regular intervals. You will have problems with a custom directory in that case, and there are a few options:

    • Manually re-enable session.gc_probability.
    • Configure session.save_path directly in your php.ini, so the default cron script can pick it up.
• Don't use a custom dir. Given that you currently have getcwd().'/a/' - which is likely inside your web root - I'd say the default sessions dir on Debian is almost certainly a more secure location, so it would objectively be a better one.
• Write your own cronjob to do that, but you have to really know what you're doing. $_SESSION['last_activity'] is not even usable for this; the file access/modification times provided by the file-system itself are (see the sketch after this list).
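If you do go down that road, a rough sketch of such a cleanup script might look like this; the directory is a placeholder, and the sess_ file name prefix assumes PHP's default files session handler:

<?php
// Hypothetical cleanup cronjob for a custom session directory.
$dir         = '/var/www/example/a';   // must match session_save_path()
$maxlifetime = 60;                     // must match the session lifetime

foreach (glob($dir . '/sess_*') ?: [] as $file) {
    // filemtime() is refreshed whenever PHP writes the session data,
    // so it reflects the last activity - unlike $_SESSION contents,
    // which you could only read by parsing every single file.
    if (is_file($file) && filemtime($file) + $maxlifetime < time()) {
        unlink($file);
    }
}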

    3) Should I avoid saving those unneeded sessions by those bot crawlers just looking for the string "bot" inside $_SERVER['HTTP_HOST'] or is there a better way to identify "non-human visitors"/crawlers?

    You're thinking of $_SERVER['HTTP_USER_AGENT'], but no - this isn't a solution.

It's little known (or largely ignored, for convenience's sake), but the only way to do this correctly is to never start a session before login.

The annoyance of crawlers creating useless session files is a negligible issue; the real concern is a determined attacker's ability to fill up your session storage, use up all possible session IDs, or sidestep session.use_strict_mode. None of these attacks are easy to pull off, but they can result in DoS or session fixation, so they shouldn't be easily dismissed as possibilities either.
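To illustrate the approach (verify_credentials() and the user ID are placeholders, not a complete auth system):

<?php
// On anonymous pages: only resume a session that already exists;
// a first-time visitor (crawlers included) never creates one.
if (isset($_COOKIE[session_name()])) {
    session_start();
}

// In the login handler: the first session_start() happens only after
// the credentials have been verified.
if (verify_credentials($_POST['user'] ?? '', $_POST['pass'] ?? '')) {   // placeholder check
    session_start();
    session_regenerate_id(true);   // fresh ID on privilege change
    $_SESSION['uid'] = 123;        // placeholder user ID
}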

    P.S. Bonus tip: Don't use $_SERVER['HTTP_HOST'] - that's user input, from the HTTP Host header; it might be safe in this case due to how cookies work, but it should be avoided in general.
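For example, the cookie domain in your init.php could come from your own configuration instead of the request; SITE_DOMAIN here is a hypothetical constant you would define yourself:

// Define the canonical domain in your own config ...
define('SITE_DOMAIN', 'example.com');
// ... and use it instead of the user-controlled Host header.
session_set_cookie_params(60, '/', '.' . SITE_DOMAIN, true);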