Similarity search in high-dimensional vector spaces (Thesis, ETH Zürich 2000) (Q2739548)

The PhD thesis ``Similarity search in high-dimensional vector spaces'' of Roger Weber appeared at the ETH Zürich in 2000. It addresses the following problem: By the improvement of computers and networks, it is now possible to collect, maintain and search large collections of multimedia documents. The main problem is \textit{how to retrieve the relevant information efficiently}. Suppose that we are given a picture (or a small collection of pictures) and we want to search in a huge archive for ``similar'' pictures. The method proposed in the literature for this task works as follows: All the pictures involved are automatically transformed by image processing methods into high-dimensional vectors of ``features''. The similarity search problem of pictures is thus reduced to a search for ``Nearest Neighbors'' (= NN-search, for short) in the feature space (with respect to some distance measure that has to be appropriately defined). Previous solutions for this problem do not lead to satisfactory response times as the dimension of the feature space grows (i.e., \(> 100\) or even \(> 1000\)). Weber investigates in this thesis why previous methods were not fast enough and presents a new and considerably more efficient method for the NN-search in high-dimensional vector spaces. NEWLINENEWLINENEWLINEWeber makes three major contributions in his PhD thesis: NEWLINENEWLINENEWLINE1. The naive solution for the NN-search is a ``linear scan'' of the database that determines all the distances between the feature vectors of the input image(s) and the ones in the database and searches for the images with the smallest distance from the input. In recent time, more sophisticated methods for the NN-search problem based on hierarchical indexes have been proposed in the literature. As a first theoretical contribution, it is formally proved (and illustrated by experiments) in this thesis, that for sufficiently high dimensions the brute force linear scan is better than hierarchical methods. NEWLINENEWLINENEWLINE2. After having shown that linear scans cannot be avoided, Weber presents a new method to speed up these scans. The idea is to introduce so-called ``Vector-Approximation files'' (= VA-files, for short) which are basically rough (and small!!) approximations of the original feature vectors. Then the NN-search proceeds in two steps: First, by making use of the approximations only, the whole set of database elements is restricted to a much smaller subset by computing certain upper and lower bounds on the distances. Then, the final result is determined by using the entire feature vector. However, this second step is now only applied to the subset of the database resulting from the first step. Note that the term ``vector-approximation files'' is slightly misleading: The approximations are used to exclude certain elements that cannot be part or the result. However, the result obtained is precise and not just an approximation. NEWLINENEWLINENEWLINE3. Finally, a further speed-up of the VA-file method is obtained in the following two ways: First, several approximate search methods are presented which lead to smaller response times at the expense of a small error in the result. Second, it is described how the work of the FA-file method can be distributed over parallel computers thus also reducing the response time. NEWLINENEWLINENEWLINEThis is an excellent PhD thesis. It tackles a highly relevant problem and comes up with new results that are very interesting both from a theoretical and a practical point of view. Theoretically, this thesis provides new insights into the pecularities of NN-search in higher dimensional vector spaces. Practically, the author constructs new algorithms and methods to overcome the difficulties of high dimensions. Moreover, all the theoretical considerations on the new algorithms are backed up by experiments with real implementations of these methods. Finally, one should also mention that the presentation of the results makes the reading of this thesis really enjoyable. On the one hand, all the results are formally stated and proved. On the other hand, the author provides a lot of explanation and motivation in order to convey the underlying ideas of the results. Every important theorem (or group of theorems) is surrounded by an overview and an informal introduction as well as an extensive discussion of the ideas. Moreover, the thesis contains a very good overview of related ideas presented in the literature and a comparison of different approaches. It thus becomes clear how the new results obtained in this thesis fit into the overall picture of research in this area.

0 references

reviewed by

Reinhard Pichler

0 references