Appendix : Why similar probably works
similar touches the topic of Content-Based Image Retrieval (CBIR). Most
CBIR systems use color histograms with sophisticated statistical operations,
shape detection, or detection of contrast peaks.
similar simply makes condensed samples of a standard size (currently 4x4).
As expected, it finds identical images, but why does it find many altered
ones too ? How may this ability be enhanced ?
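The condensation step can be pictured as follows. This is only an
illustrative Python sketch (function name and data layout are assumptions,
not similar's actual implementation) :

```python
def condense(pixels, width, height, size=4):
    """Condense an image into a size x size sample by averaging
    the RGB values inside each cell of a size x size grid.
    pixels : flat row-major list of (r, g, b) tuples."""
    sample = []
    for sy in range(size):
        for sx in range(size):
            # pixel block covered by this sample cell
            x0, x1 = sx * width // size, (sx + 1) * width // size
            y0, y1 = sy * height // size, (sy + 1) * height // size
            rs = gs = bs = n = 0
            for y in range(y0, y1):
                for x in range(x0, x1):
                    r, g, b = pixels[y * width + x]
                    rs += r; gs += g; bs += b; n += 1
            sample.append((rs // n, gs // n, bs // n))
    return sample
```

Two images then get compared only via their 16-pixel samples, never via
their full pixel data.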
Since the system does not analyze the images at all, it has to depend on
a low probability of showing one of the very many possible false replies
and on a high probability of detecting a duplicate if there is one.
If the likelihood of a /random/ alert is very low, then an alert has a high
likelihood of being caused by some /systematic/ connection between the two
pictures. One of those systematic connections could be a common ancestor
of both images. This special type of connection is what we are looking for.
With tolerances that are not too wide, it really seems to be the dominant
one in color_double_diff comparison.
How likely is the color_diff comparison algorithm to match two completely
random images with N pixels at a tolerance of T ?
Neglecting effects at the ends of the scale :
P= 1/((256/T)^(N*3))
With N=16 and T=4 this results in 1/(2^288) or about 2e-87 .
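The arithmetic can be checked with a few lines of Python (a sketch of the
formula above, not part of similar) :

```python
# false-match probability of two random images :
# P = 1/((256/T)^(N*3)) with N pixels, 3 channels, tolerance T
N, T = 16, 4
P = 1.0 / float((256 // T) ** (N * 3))  # 1/64^48 = 1/2^288
print(P)  # about 2e-87
```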
But this calculation is much too optimistic. Images are not random, and that
is a very important reason why it works at all. If they were random noise,
then for purely statistical reasons the miniaturization of the images would
result in quite uniform samples.
Therefore let us now assume, quite roughly, that with the 16-pixel samples
the tolerance effectively distinguishes only about E/T different stages of
color. I found good reason in my test collection to believe that E is larger
than 32. Maybe I am wrong, as statistics was not my favorite topic of math.
Actually, in 125000 samples I found the most crowded color notch to be only
2 times as dense as the average. That would lead to E > 100 . Aliasing of
the histogram cannot be worse than a factor of three if T>=1 .
More problematic seems to be a slight bump in the curves of green and blue
where red reaches its maximum. Looks like there is some color correlation.
Arghh, how to calculate that effect ?
( see doc/collection_histogram_*.gif : horizontal line = average density )
Recalculating the probability of a false duplicate with E=32 : 1/(2^144) = 4.5e-44
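With E=32 effective color stages instead of 256, the same kind of Python
sketch reproduces this figure :

```python
# P = 1/((E/T)^(N*3)) with E effective color stages
N, E, T = 16, 32, 4
P = 1.0 / float((E // T) ** (N * 3))  # 1/8^48 = 1/2^144
print(P)  # about 4.5e-44
```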
Finding non-identical but similar images
One may wish to extend the search in two main directions :
Images from the same source which show the same world scene and objects
Images from different sources which show altered views of the same picture
The object recognition aspect is currently beyond the scope of similar.
Not surprisingly, quite a lot of very similar world scenes will show up as
altered versions of each other and may even be found during the search for
nearly identical images.
Another type of human-perceived similarity is the recognition of altered
2D pictures. These may be different scans of the same picture or
postprocessed versions of the same scan.
Some types of alteration of the same picture :
Annotations and limited manipulations
Global Color changes
Frames
Different outtakes
Annotations are handled by the blind spot feature. A certain number of
mismatches is tolerated and regarded as the result of limited manipulations.
A rough adaptation to global color changes is provided by match-method
color_double_diff which shifts the colors so that the samples' average
colors become the same.
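To make the mechanism concrete, here is a hypothetical Python sketch of a
color_double_diff-style comparison. The real implementation in similar
differs ; the function name, data layout and exact mismatch rule are
assumptions :

```python
def color_double_diff(a, b, T=4, B=4):
    """Compare two 16-pixel RGB samples. The average color
    difference between the samples is subtracted first, so a
    global color shift does not count as a mismatch. Up to B
    mismatching pixels (blind spots) are tolerated."""
    n = len(a)
    # per-channel average of (a - b) : the global shift vector
    shift = [sum(pa[c] - pb[c] for pa, pb in zip(a, b)) / n
             for c in range(3)]
    mismatches = 0
    for pa, pb in zip(a, b):
        # a pixel mismatches if any channel deviates from the
        # global shift by more than tolerance T
        if any(abs((pa[c] - pb[c]) - shift[c]) > T for c in range(3)):
            mismatches += 1
    return mismatches <= B
```

With this rule, two samples that differ only by a constant color vector
match perfectly, while scattered large deviations quickly exhaust the B
blind spots.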
The tests with color_scaled_diff did not justify the additional processing
time. Maybe one should try to determine possible gamma corrections. (how ?)
Frames are naughty because they influence 12 of the 16 sample pixels. Maybe
one should try to clip them, but how to decide in a fast and suitable way ?
Outtakes are even worse because they lack visible borders.
Both may need a kind of holographic (i.e. correlative) characterization
or partial image comparison, which currently goes beyond the scope of
similar.
The influence of B blind spots further reduces selectivity :
P= (N over B)/((E/T)^((N-B)*3))
where (N over B) is the binomial coefficient
N!/(B!*(N-B)!) = N*(N-1)*...*(N-B+1) / (B*(B-1)*...*2)
With E=32, T=4 and B=4 the calculation yields 1820/(2^108) = 5.5e-30 .
Even if we assume E to be as low as 16, it would still be 1820/(2^72) = 4e-19
(a lower E means a higher overall similarity of colors within an image
collection).
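Both figures can be reproduced with a short Python sketch (math.comb needs
Python 3.8 or newer) :

```python
import math

def false_match_prob(N=16, E=32, T=4, B=4, channels=3):
    # P = (N over B) / (E/T)^((N-B)*channels)
    return math.comb(N, B) / float((E // T) ** ((N - B) * channels))

print(false_match_prob())      # 1820/(2^108), about 5.5e-30
print(false_match_prob(E=16))  # 1820/(2^72),  about 4e-19
```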
So let us assume 1e-20 as an upper bound of the random fault likelihood.
With that builtin selectivity, the likelihood of a completely false
duplicate in a collection of 100000 images should be about 1e-15 .
This likelihood is valid only for faults that we would consider to be without
any reason. For a collection that concentrates on a certain topic, there
will be a lot of correlation of colors. For black and white images the
pessimistically estimated likelihood is about 1e-4 : a dangerously high
value if one has to deal with a large collection.
With T=1 and B=2 we get 120/(16^14) = 1.5e-15 for black and white images.
Since method color_double_diff allows an arbitrary shift by a color vector,
the reduction of selectivity is roughly estimated as 1 additional
blind spot (a blind spot, too, tolerates an arbitrary color vector).
That would be 4368/(2^66) = 6e-17 for color parameters T=4, B=4
and 560/(16^13) = 1.2e-13 for b/w parameters T=1, B=2 .
Remember that E=16 is quite a pessimistic value. These results should be
upper bounds on the undesired probability of uncorrelated false matches.
To estimate the likelihood for the whole cross-checking of a collection,
one has to consider the birthday paradox. According to hash lectures, one
may expect a first collision in a hash table with S slots after about
sqrt(2*S) tries. Our sample space may be seen as a hash table with at least
1/6e-17 slots (1/1.2e-13 for b/w). sqrt(2*S) is then 1.8e8 (4e6 for b/w).
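The sqrt(2*S) figures can be verified with a quick Python sketch :

```python
import math

# expected number of tries until the first collision in a
# hash table with S slots is roughly sqrt(2*S)
for P in (6e-17, 1.2e-13):   # color resp. b/w false-match probability
    S = 1.0 / P              # slots of the equivalent hash table
    print(math.sqrt(2 * S))  # about 1.8e8 resp. 4.1e6
```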
So, for the purpose of finding nearly identical images without embarrassing
false matches, the above parameters T and B should be sufficient at least
for a collection of a million images.
>>> not finished yet <<<
>>> intended color permutation test
>>> intended correlative characterization