Tags: java, image-processing, crop

Auto crop black borders from a scanned image by making stats about gray values (Java)


I'm writing a piece of code to automatically detect black, noisy borders on scanned images and crop them off. My algorithm is based on two variables: the mean gray value (of the pixels in a row/column) and the position (of a row/column in the image).

GRAY MEAN VALUE
Images are in grayscale: this means that every pixel has a gray value in the range 0 (black) to 255 (white).
For each row/column of pixels, I estimate the mean gray value for all the pixels in that row/column.
If the result is dark, then the current row/column is part of the border to cut off.
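
To make this concrete, here is a minimal Python/NumPy sketch of the per-row/per-column mean (my real code is Java; the file name scan.png and the array name im are just for illustration):

    import numpy
    from PIL import Image, ImageOps

    # Load the scan as a 2-D array of gray values (0 = black, 255 = white).
    im = numpy.array(ImageOps.grayscale(Image.open("scan.png")), dtype=numpy.uint8)

    row_means = im.mean(axis=1)  # one mean gray value per row
    col_means = im.mean(axis=0)  # one mean gray value per column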

POSITION
The position is the distance, in pixels, of a row/column from the top-left corner of the image.

Take a look at the following images for a better idea.
Thumbnail of a scanned image:
[scanned image thumbnail]
Resulting chart:
[chart of gray mean values]

Looking at the chart, it is very easy to estimate where the cropping points are, thanks to the following rule: most of the samples fall in a narrow light range (150-200), which is the actual paper, and then in the tails there is a quick change to dark values. Those quick changes are the cropping points. (Notice also that at the very end of the tails there can still be white for a few pixels, but this seldom happens.)
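
For illustration, a naive version of that rule in the same Python/NumPy sketch style scans the means from both ends and stops at the first value that looks like paper; the fixed gray threshold of 120 is an arbitrary guess on my part, and a hard-coded threshold is exactly what I'd like to avoid:

    def first_light(means, gray_threshold=120):
        # Walk from one end until the mean gray value looks like paper.
        for i, m in enumerate(means):
            if m > gray_threshold:
                return i
        return 0

    top    = first_light(row_means)
    bottom = len(row_means) - 1 - first_light(row_means[::-1])
    left   = first_light(col_means)
    right  = len(col_means) - 1 - first_light(col_means[::-1])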

I want to do this automatically: is there any statistical technique that can help me out?
PS: I'm a computer engineer and I've studied some statistics, but... too many years ago!

In the best-case scenario the code should work with any scanned image affected by the black-border problem but, realistically, I'll be satisfied if it works with these samples:
https://docs.google.com/folder/d/0B8ubCWBwsuOON3d1VVo4Z1AxWDA/edit


Solution

  • Preprocessing the image makes the statistics easier to work out. For your case, a morphological closing with a wide horizontal line, followed by Otsu thresholding (which is statistically optimal), makes the task a lot easier. The closing is interesting here because it makes the paper region distinctly lighter. You have two examples where the border region is fuzzy, i.e. it contains light parts too, but that doesn't make this step useless. After that, it is only a matter of summing by column and by row, and delimiting the border based on the mean and standard deviation of those sums: if the standardized sum (the sum minus the mean, divided by the standard deviation) is below a threshold x, then the row/column is outside of the paper. This gives you the top-left and bottom-right corners of the paper, which you use to crop the image. The easiest way to find those corners is to traverse the sums linearly, forward and backward, stopping as soon as the condition above is no longer met.

    For your images, x in the range [-1.5, -1] works (as do other nearby values; I tested around that range). I fixed the size of the horizontal line for the closing operator at 101 points. Here are the results (corner coordinates could be included if needed for comparison):

    [result images: detected paper bounding rectangles on the samples]

    The problem, as has been pointed out, is that some of these images also contain white borders, as in the next case (and those borders are connected to the paper). To handle that, once the image is binary, consider applying a morphological opening, as that will hopefully disconnect the components. You can use a large structuring element; I used one of dimensions 51 x 51, which is not that big for the size of your images. The main limitation is the implementation in the library you are using, as this step can get slow if the implementation is poor (scipy in particular does not have a fast one). After that, keep only the largest component and proceed as usual.

    [result images: bounding rectangles after the extra opening step]

    Sample code:

    import sys
    import numpy
    import cv2 as cv
    from PIL import Image, ImageOps, ImageDraw
    from scipy.ndimage import morphology, label
    
    
    # Load the scan as a grayscale image and as a uint8 array.
    img = ImageOps.grayscale(Image.open(sys.argv[1]))
    im = numpy.array(img, dtype=numpy.uint8)
    
    # Close with a wide (1 x 101) horizontal line to lighten the paper region.
    im = morphology.grey_closing(im, (1, 101))
    # Binarize with Otsu's (statistically optimal) threshold.
    t, im = cv.threshold(im, 0, 1, cv.THRESH_OTSU)
    
    # "Clean noise".
    im = morphology.grey_opening(im, (51, 51))
    # Keep largest component.
    lbl, ncc = label(im)
    largest = 0, 0
    for i in range(1, ncc + 1):
        size = len(numpy.where(lbl == i)[0])
        if size > largest[1]:
            largest = i, size
    for i in range(1, ncc + 1):
        if i == largest[0]:
            continue
        im[lbl == i] = 0
    
    
    # Sum the binary image by column and by row.
    col_sum = numpy.sum(im, axis=0)
    row_sum = numpy.sum(im, axis=1)
    col_mean, col_std = col_sum.mean(), col_sum.std()
    row_mean, row_std = row_sum.mean(), row_sum.std()
    
    row_standard = (row_sum - row_mean) / row_std
    col_standard = (col_sum - col_mean) / col_std
    
    def end_points(s, std_below_mean=-1.5):
        # Scan the standardized sums forward and backward, stopping at the
        # first value above the threshold, i.e. the first row/column that is
        # no longer considered border.
        i, j = 0, len(s) - 1
        for i, rs in enumerate(s):
            if rs > std_below_mean:
                break
        for j in range(len(s) - 1, i, -1):
            if s[j] > std_below_mean:
                break
        return (i, j)
    
    # Bounding rectangle.
    x1, x2 = end_points(col_standard)
    y1, y2 = end_points(row_standard)
    
    #img.crop((x1, y1, x2, y2)).save(sys.argv[2]) # Crop.
    result = img.convert('RGB')
    draw = ImageDraw.Draw(result)
    draw.line((x1, y1, x2, y1, x2, y2, x1, y2, x1, y1),
            fill=(0, 255, 255), width=15)
    result.save(sys.argv[2]) # Save with the bounding rectangle.