I am trying to consistently find the darkest region in a series of depth map images generated from a video. The depth maps are generated using the PyTorch implementation here.
Their sample run script generates a prediction of the same size as the input, where each pixel is a floating point value and the highest/brightest value is the closest. This is standard depth estimation using ConvNets.
The depth prediction is then normalized as follows to produce a PNG for review:
bits = 2
depth_min = prediction.min()
depth_max = prediction.max()
max_val = (2**(8*bits))-1
out = max_val * (prediction - depth_min) / (depth_max - depth_min)
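One thing worth checking in the normalization above: with bits = 2 the scaled values span 0 to 65535, which an 8-bit image cannot represent. A minimal sketch of the same scaling, assuming the PNG is meant to be 16-bit; the helper name normalize_depth and the fallback for a flat prediction are illustrative and not part of the original script:

```python
import numpy as np

def normalize_depth(prediction, bits=2):
    """Scale a float depth map to the full range of an unsigned integer image.

    Only bits=1 (uint8) and bits=2 (uint16) are handled in this sketch.
    """
    depth_min = float(prediction.min())
    depth_max = float(prediction.max())
    max_val = (2 ** (8 * bits)) - 1
    # Pick the dtype that matches the bit depth.
    dtype = np.uint8 if bits == 1 else np.uint16
    if depth_max - depth_min <= np.finfo(float).eps:
        # Assumption: a flat prediction maps to an all-zero image.
        return np.zeros(prediction.shape, dtype=dtype)
    out = max_val * (prediction - depth_min) / (depth_max - depth_min)
    return out.astype(dtype)
```

Keeping the array as uint16 (or normalizing with bits=1) before handing it to functions that expect 8-bit input avoids an out-of-range cast later on.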
I am attempting to identify the darkest region in each image in the video, with the assumption that this region has the most "open space".
I've tried several methods:
Using cv2 template matching and minMaxLoc
I created a template of np.zeros((100, 100)), then applied the template similarly to the docs:
img2 = out.copy().astype("uint8")
template = np.zeros((100, 100)).astype("uint8")
w, h = template.shape[::-1]
res = cv2.matchTemplate(img2,template,cv2.TM_SQDIFF)
min_val, max_val, min_loc, max_loc = cv2.minMaxLoc(res)
top_left = min_loc
bottom_right = (top_left[0] + w, top_left[1] + h)
val = out.max()
cv2.rectangle(out,top_left, bottom_right, int(val) , 2)
As you can see, this implementation is very inconsistent, with many false positives.
Using np.argmin(out, axis=1), which generates many indices. I take the first two and write the word MIN at those coordinates:
text = "MIN"
textsize = cv2.getTextSize(text, font, 1, 2)[0]
textX, textY = np.argmin(prediction, axis=1)[:2]
cv2.putText(out, text, (textX, textY), font, 1, (int(917*max_val), int(917*max_val), int(917*max_val)), 2)
This is less inconsistent but still lacking.
Using np.argwhere(prediction == np.min(prediction)), then writing the word MIN at the coordinates. I imagined this would give me the darkest pixel in the image, but this is not the case.
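One possible source of the misplaced labels: NumPy indexes arrays as (row, column), while OpenCV drawing functions take points as (x, y), that is (column, row). In addition, np.argmin(prediction, axis=1) returns one column index per row rather than a single coordinate pair. A small sketch of converting the position of the global minimum into an OpenCV-style point; the helper name darkest_pixel_xy and the sample array are made up for illustration:

```python
import numpy as np

def darkest_pixel_xy(image):
    """Return the first darkest pixel as an (x, y) point for OpenCV drawing calls."""
    flat_index = np.argmin(image)                 # index into the flattened array
    row, col = np.unravel_index(flat_index, image.shape)
    return int(col), int(row)                     # OpenCV order: x = column, y = row

image = np.array([[9, 9, 9, 9],
                  [9, 9, 9, 2],
                  [9, 9, 9, 9]])
print(darkest_pixel_xy(image))   # (3, 1)
```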
I've also thought of running a convolution with a 50x50 kernel, then taking the region with the smallest value as the darkest region.
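The box-filter idea can be sketched as follows. This is not taken from any of the attempts above; it assumes a single-channel image at least as large as the window, evaluates only windows that lie completely inside the image, and uses a summed-area table so that every window mean costs the same. The names darkest_window and size are hypothetical:

```python
import numpy as np

def darkest_window(image, size=50):
    """Return (x, y, mean) for the size x size window with the lowest mean value.

    (x, y) is the top-left corner of that window in OpenCV point order.
    """
    img = image.astype(np.float64)
    # Summed-area table with a leading row and column of zeros.
    sat = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    sat[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    # Sum of every window that lies completely inside the image.
    sums = (sat[size:, size:] - sat[:-size, size:]
            - sat[size:, :-size] + sat[:-size, :-size])
    row, col = np.unravel_index(np.argmin(sums), sums.shape)
    return int(col), int(row), float(sums[row, col] / (size * size))
```

Averaging over a window makes the result less sensitive to a handful of isolated dark pixels than a single-pixel minimum. In OpenCV, cv2.blur followed by cv2.minMaxLoc would be a comparable route, with the difference that border pixels are then padded rather than skipped.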
My question is: why are there inconsistencies and false positives, and how can I fix them? Intuitively this seems like a very simple thing to do.
UPDATE: Thanks to Hans for the idea. Please follow this link to download the output depths in PNG format.
The minimum is usually not a single point but a larger area. argmin finds the first x and y (top left corner) of this area:
In case of multiple occurrences of the minimum values, the indices corresponding to the first occurrence are returned.
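A tiny illustration of that behaviour on a synthetic array (the array is made up for the example): the minimum fills a 3x3 block, argmin reports the block's top-left corner, and the mean of all minimum positions lies in the middle of the block.

```python
import numpy as np

frame = np.full((6, 8), 50, dtype=np.uint8)
frame[2:5, 3:6] = 0                       # 3x3 block of minimum values

# argmin on the flattened array returns the first occurrence in row-major order.
first = tuple(int(v) for v in np.unravel_index(np.argmin(frame), frame.shape))

# Mean position of all pixels that share the minimum value.
rows, cols = np.nonzero(frame == frame.min())
center = (float(rows.mean()), float(cols.mean()))

print(first)    # (2, 3)  -> row, column of the top-left corner
print(center)   # (3.0, 4.0)
```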
What you need is the center of this minimum region. You can find it using moments. Sometimes there are multiple minimum regions, for instance in frame107.png. In this case we take the biggest one by finding the contour with the largest area.
We still have some jumping markers, because sometimes the minimum is only a tiny area, e.g. in frame25.png. Therefore we use a minimum area threshold min_area, i.e. we don't use the absolute minimum region but the region with the smallest value among all regions whose area is greater than or equal to that threshold.
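import glob

import cv2
import numpy as np

min_area = 500  # minimum number of pixels a gray level must cover

for file in glob.glob("*.png"):
    img = cv2.imread(file, cv2.IMREAD_GRAYSCALE)
    # Find the darkest gray level that covers at least min_area pixels.
    b = None
    for i in range(int(img.min()), 256):
        if np.count_nonzero(img == i) >= min_area:
            b = np.where(img == i, 1, 0).astype(np.uint8)
            break
    if b is None:
        continue  # no gray level is large enough in this frame
    # Of all regions with that gray level, keep the one with the largest area.
    contours, _ = cv2.findContours(b, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    max_contour = max(contours, key=cv2.contourArea)
    # The centroid of the region follows from its image moments.
    m = cv2.moments(max_contour)
    if m["m00"] == 0:
        continue  # degenerate contour without area
    x = int(m["m10"] / m["m00"])
    y = int(m["m01"] / m["m00"])
    # Mark the centroid; note that this overwrites the input file.
    out = cv2.circle(img, (x, y), 10, 255, 2)
    cv2.imwrite(file, out)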
frame107 with five regions where the image is 0, shown with enhanced gamma:
frame25 with a very small min region (red arrow); we take the fifth largest min region instead (white circle):
The result (for min_area=500) is still a bit jumpy in some places, but if you increase min_area further you'll get false results for frames with a very steeply descending (and hence small per value) dark area. Maybe you can use the time axis (frame number) to filter out frames where the location of the darkest region jumps back and forth within 3 frames.
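The time-axis idea could be sketched as a sliding median over the per-frame marker positions. This is an assumption about how such a filter might look, not something tested on the frames above; the function name smooth_positions and the default window of 3 frames are illustrative, and an odd window length is assumed:

```python
import numpy as np

def smooth_positions(points, window=3):
    """Median-filter a sequence of (x, y) positions along the frame axis."""
    pts = np.asarray(points, dtype=float)
    half = window // 2
    # Repeat the first and last position so every frame has a full window.
    padded = np.pad(pts, ((half, half), (0, 0)), mode="edge")
    # Stack the shifted copies and take the median across them per frame.
    windows = np.stack([padded[i:i + len(pts)] for i in range(window)])
    return np.median(windows, axis=0)

# One frame (index 2) jumps far away and is pulled back to its neighbours.
track = [(10, 10), (11, 10), (90, 80), (12, 11), (13, 11)]
smoothed = smooth_positions(track)
```

A median suppresses a single outlier frame without blurring the path the way a moving average would, but a jump that persists for two or more frames would pass through a window of 3.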