Introduction to Clustering
‘Clustering’ of documents is the grouping of documents into distinct classes according to their intrinsic (usually statistical) properties. In ‘clustering,’ we seek features that will separate the documents into natural groups based entirely on the internal properties of the collection. Ideally, the groups will be completely separate and as far apart as possible in feature space. But sometimes, overlap of clusters is unavoidable. [van Rijsbergen, 1979] Since clustering depends on the statistical properties of the collection being clustered rather than on matching the documents against some external set of queries, it is normally (but not always – see below!) applied to a preexisting collection rather than an incoming stream of documents as in a routing application. Why should documents be clustered? The basic reason is that clustering can reveal the intrinsic structure of a collection, e.g., by topic, subtopic, etc., (assuming of course, that there is a significant internal structure). If a language-independent statistical method such as ‘n-grams’ is used, a collection may also be clustered by language or document type, by topic within language, etc. (See section 3–)Moreover, by the ‘cluster hypothesis,’ ‘closely associated documents tend to be relevant to the same requests.’ [van Rijsbergen, 1979]
Of course, there is no guarantee that the cluster hypothesis is widely satisfied. It can only be verified empirically in any given collection. (See section on cluster validation.) In general, it is possible that a given algorithm will not generate any clusters, or that the clusters will overlap too much to be useful, or that the clusters which are formed will not correspond to meaningful topics of interest to prospective users. In an earlier section on relevance feedback (see above), another possible complication was pointed out: the documents relating to a given topic may form not one, but two or more separate clusters. However, it should be noted that all statistical IR techniques assume that it is possible to separate a collection of documents into at least two classes with respect to any given query, i.e., relevant and non-relevant documents. Hierarchical clustering offers the potential for very fast retrieval search time since most of the searching involves cluster representatives rather than all the documents in each cluster. However, if documents are fully indexed, vector space or boolean retrieval without clustering may provide retrieval time as fast or better since the only documents that need to be searched are those whose terms match the terms in the given query. If the query is well-formulated, these documents may be only a very small fraction of the collection.
The great virtue of most proposed clustering methods is that they are automatic. Of course, manual assignment of useful categories to each document in a collection is more certain to be useful, ,but it is very time-consuming and requires substantial manpower for large collections. Automatic clustering offers the hope of eliminating much of this effort. In the preceding section, an intermediate approach was described: Subject categories were assigned manually to a training set; thereafter, categories were assigned to new documents automatically. Clustering requires some measure of the similarity between documents in document space. The widely used cosine similarity described earlier is an obvious choice. But other similarity measures are available. [van Rijsbergen, 1979] [Salton & McGill, 1983] [Korfhage, 1997] (See the discussion of document query similarity above.) Measures of dissimilarity can also be used, especially since the objective is to maximize the distance between clusters, e.g., between their centroids, in document space. A dissimilarity measure is, in essence, a distance measure, i.e., its value for any two documents D1 and D2 is greater the farther apart D1 and D2 are in document space. (Cosine similarity has the opposite property; it has its maximum value when two document vectors coincide, and has a value of zero when the document vectors are orthogonal. But 1 – cosine is a distance measure, increasing with angular distance.) Document clustering methods are generally distinguished from the descriptors used to characterize each document, and the similarity (or dissimilarity) function of those descriptors that is used by the clustering method. In general, the choice of inter-document similarity measure is independent of the choice of clustering method. And the choice of a similarity measure is (often) independent of the choice of the document descriptors that serve as independent variables of the similarity function. [Willett, IP&M, 1988] (But note that this is not true of the method of Zamir et al., described below.)
In general, research into clustering has focused on clustering methods, and algorithms for implementing these methods efficiently with respect to space and time. Willett concedes that the ‘similarity coefficient may affect the clustering that is obtained.’ The method of Zamir et al. [SIGIR 98], described below, provides at least preliminary evidence that a novel document descriptor, combined with novel inter-document and inter-cluster similarity functions, can produce dramatic improvement in cluster quality.
Two main strategies have been used for clustering. Both require a document-to-document similarity measure. The first strategy requires that the complete collection of documents to be clustered be available at the start of the clustering process. Hence, one may call this the ‘complete’ strategy. (One might also call it the ‘static’ strategy, since it assumes that the collection of documents is not going to change, i.e., documents are not going to be added or deleted, during the clustering process.) Most methods based on the complete/static strategy start by generating a list containing the similarity of every document pair in the collection. Hence, if the collection contains N documents, the list will contain N(N-1)/2 similarities. Methods using this ‘complete’ strategy are expensive because of the large number of similarities involved, and the large number of comparisons that must be performed as documents are combined into clusters, and clusters are combined into larger clusters.
A straightforward implementation requires that the inter document similarity matrix, containing O(N2) elements, must be searched O(N) times, once for each ‘fusion’ of two clusters. Hence, the time requirement is O(N3) and the space requirement is O(N2). More sophisticated implementations have reduced the time requirement to O(N2), but even this is prohibitive for document collections of realistic size such as the larger TREC collections. On the other hand , because these ‘complete’ cluster methods take advantage of the full set of inter-document similarities, they meet van Rijsbergen’s three criteria for theoretical soundness: [van Rijsbergen, 1979] [Salton, 1989] (1) The clustering is unlikely to be altered drastically as new documents are added, (2) Small errors in the description of documents lead to small errors in clustering, and (3) the method is independent of the initial order of document reference, e.g., the order of document pair similarities. Essentially, these methods ‘attempt to seek an underlying structure in the data [rather than] impose a suitable structure.