Appendix Similar T : Why it probably works

similar touches the topic of Content Based Image Retrieval. Most CBIR
systems use color histograms with sophisticated statistical operations,
shape detection, or detection of contrast peaks. similar simply makes
condensed samples of a standard size (currently 4x4). As expected, it
finds identical images, but why does it find many altered ones too ?
How may this ability be enhanced ?

Since the system does not analyze the images at all, it has to depend
on a low probability of showing one of the very many possible false
replies and on a high probability of detecting a duplicate if there is
one. If the likelihood of a /random/ alert is very low, then an alert
has a high likelihood of being caused by some /systematic/ connection
between the two pictures. One of those systematic connections could be
a common ancestor of both images. This special connection type is what
we are looking for. With tolerances not too wide, it really seems to be
the dominant one in color_double_diff comparison.

How likely is the color_diff comparison algorithm to match two
completely random images with N pixels at a tolerance of T ?
Neglecting effects at the ends of the scale :
  P = 1/((256/T)^(N*3))
With N=16 and T=4 this results in 1/(2^288) or about 2e-87 .

But this calculation is much too optimistic. Images are not random,
and that is a very important reason why it works at all. If they were
random noise, then for mere statistical reasons the miniaturization of
the images would result in quite uniform samples. Therefore let us now
assume, quite roughly, that with the 16-pixel samples the tolerance
effectively distinguishes only about E/T different stages of color.

I found good reason in my test collection to believe that E is larger
than 32. Maybe I am wrong, statistics was not my favorite topic of
math. Actually, in 125000 samples I found the most crowded color notch
to be only 2 times as dense as the average. That would lead to E > 100 .
Aliasing of the histogram cannot be worse than a factor of three if
T>=1. More problematic seems to be a slight bump in the curves of green
and blue where red reaches its maximum. Looks like there is some color
correlation. Arghh, how to calculate that effect ?
( see doc/collection_histogram_*.gif : horizontal line = average density )

Recalculating the probability of a false duplicate with E=32 :
  1/((32/4)^(16*3)) = 1/(2^144) = 4.5e-44

Finding non-identical but similar images

One may wish to extend the search in two main directions :
- Images from the same source which show the same world scene and objects
- Images from different sources which show altered views of the same picture

The object recognition aspect is currently beyond the scope of similar.
Not surprisingly, quite a lot of very similar world scenes will show up
as altered versions of each other and may even be found during the
search for nearly identical images.

Another type of humanly perceived similarity is the recognition of
altered 2D pictures. This may be different scans of the same picture or
postprocessed versions of the same scan. Some types of alteration of
the same picture :
- Annotations and limited manipulations
- Global color changes
- Frames
- Different outtakes

Annotations are handled by the blind spot feature : a certain number of
mismatches is tolerated and regarded as the result of limited
manipulations. A rough adaptation to global color changes is provided
by match method color_double_diff, which shifts the colors so that the
samples' average colors become the same.
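
To make this concrete, here is a minimal sketch of the matching idea in
Python (chosen merely for illustration; this is not the actual
implementation of similar, and all names are invented for this sketch).
A sample is taken to be a list of 16 (r,g,b) tuples, one per pixel of
the 4x4 grid :

  # compare_sketch.py - minimal sketch of the matching idea described
  # above; NOT the actual implementation of similar.

  def average(sample):
      # per-channel average color of a sample
      n = len(sample)
      return [sum(px[c] for px in sample) / n for c in range(3)]

  def color_double_diff(a, b, T=4, B=4):
      # Shift b so that the average colors of both samples coincide
      # (the rough adaptation to global color changes), then tolerate
      # up to B mismatching pixels as blind spots.
      shift = [ma - mb for ma, mb in zip(average(a), average(b))]
      blind = 0
      for pa, pb in zip(a, b):
          if any(abs(pa[c] - (pb[c] + shift[c])) > T for c in range(3)):
              blind += 1
              if blind > B:
                  return False
      return True

  # Example : a uniform brightness change is absorbed by the shift.
  s1 = [(10 * i, 5 * i, 3 * i) for i in range(16)]
  s2 = [(r + 30, g + 30, b + 30) for (r, g, b) in s1]
  print(color_double_diff(s1, s2))      # -> True

The shift step is what distinguishes color_double_diff from plain
color_diff, which compares the raw pixel values directly.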
The tests with color_scaled_diff did not justify the additional
processing time. Maybe one should try to determine possible gamma
corrections. (how ?)

Frames are naughty because they influence 12 of the 16 sample pixels.
Maybe one should try to clip them, but how to decide in a fast and
suitable way ? Outtakes are even worse because of the lack of visible
borders. Both may need a kind of holographic (resp. correlative)
characterization or partial image comparison, which currently goes
beyond the scope of similar.

The influence of B blind spots further reduces selectivity :
  P = (N over B)/((E/T)^((N-B)*3))
where (N over B) is the binomial coefficient
  N!/(B!*(N-B)!) = N*(N-1)*...*(N-B+1) / (B*(B-1)*...*2)
With B=4 the calculation yields 1820/(2^108) = 5.5e-30 . Even if we
assume E to be as low as 16, it would still be 1820/(2^72) = 4e-19
(a lower E means a higher overall similarity of colors within an image
collection). So let us assume 1e-20 as an upper bound of the random
fault likelihood per comparison. With that builtin selectivity, the
likelihood of a completely false duplicate when checking one image
against a collection of 100000 images should be about 1e-15
(100000 comparisons at 1e-20 each).

This likelihood is valid only for faults that we would consider to be
without any reason. For a collection that concentrates on a certain
topic, there will be a lot of correlation of colors.

For black and white images the pessimistically estimated likelihood is
about 1e-4. That is dangerously weak selectivity if one has to deal
with a large collection. With T=1 and B=2 we get
  120/(16^14) = 1.7e-15
for black and white images.

Since method color_double_diff allows an arbitrary shift by a color
vector, the reduction of selectivity is roughly estimated as 1
additional blind spot (an arbitrary color vector too). That would be
  4368/(2^66) = 6e-17   for color parameters T=4, B=4
and
  560/(16^13) = 1.2e-13 for b/w parameters T=1, B=2 .
Remember that E=16 is quite a pessimistic value. These results should
be upper bounds of the undesired probability of uncorrelated false
matches.

To estimate the likelihood for the whole cross-checking of a
collection, one has to consider the birthday paradox. According to
hash lectures, one may expect a first collision in a hash table with S
slots after about sqrt(2*S) tries. Our sample space may be seen as a
hash table with at least 1/6e-17 slots (resp. 1/1.2e-13 for b/w), so
sqrt(2*S) is 1.8e8 (resp. 4e6). So, for the purpose of finding nearly
identical images without embarrassing false matches, the above
parameters T and B should be sufficient at least for a collection of a
million images.

>>> not finished yet <<<
>>> intended color permutation test
>>> intended correlative characterization
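
Addendum : the probability estimates of this appendix can be reproduced
with a small Python sketch (again only an illustration; the names are
made up, and the reading of the 1e-4 figure as the color defaults
applied to a single gray channel is my assumption, not stated above) :

  # selectivity_sketch.py - reproduces the estimates of this appendix.
  from math import comb, sqrt

  def p_false(N=16, B=0, T=4, E=256, channels=3):
      # P = (N over B) / (E/T)^((N-B)*channels)
      return comb(N, B) / (E / T) ** ((N - B) * channels)

  print(p_false())                  # random color images   : ~2e-87
  print(p_false(E=32))              # real images, E=32     : ~4.5e-44
  print(p_false(E=32, B=4))         # plus 4 blind spots    : ~5.5e-30
  print(p_false(E=16, B=4))         # pessimistic E=16      : ~4e-19
  # presumably the origin of the b/w 1e-4 figure : color defaults
  # T=4, B=4 applied to a single gray channel
  print(p_false(E=16, B=4, channels=1))                    # ~1.1e-4
  # color_double_diff : the arbitrary color shift costs roughly one
  # additional blind spot
  p_color = p_false(E=16, B=5)                             # ~6e-17
  p_bw    = p_false(E=16, B=3, T=1, channels=1)            # ~1.2e-13
  # birthday paradox : first false collision expected after about
  # sqrt(2*S) tries, where the sample space has S = 1/P slots
  print(sqrt(2 / p_color), sqrt(2 / p_bw))        # ~1.8e8 and ~4e6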