Survey of Text Mining II: Cluster-Preserving Dimension Reduction Methods (Chapter 1)
Some time ago, I promised a colleague a review of an excellent book, Survey of Text Mining II: Clustering, Classification, and Retrieval, edited by Michael W. Berry and Malu Castellanos. Overall, this book would serve well as the basis for a one-semester graduate course in specialized methods for (textual) data analytics. It presupposes an expert's (or at least a solid journeyman's) understanding of basic algorithms, along with the issues of textual data mining and analytics. Each chapter presents a new and interesting insight. Taken together, the chapters reasonably span the "space" of relevant algorithms: their usefulness, their limitations, and some means for honing them further.
In that context, this book is useful both as a primary text for a graduate-level course, and also as a valuable reference to those doing research and development in this area.
What is missing, though, is a sense of how these various algorithms play into the larger picture. By necessity, the book limits its scope to the specific subjects of clustering, classification, and retrieval, which means that some "bigger picture" issues are neither identified nor brought together. Thus, this book should be used in the context of a greater understanding that an astute professor would provide to students, or that an expert practitioner would bring to his or her own evaluation.
In this blog, over the next several days, I’ll be doing a detailed, chapter-by-chapter review. The review that I submit to my colleague will contain a summary of the most salient points.
This blog entry addresses Chapter 1: “Cluster-Preserving Dimension Reduction Methods for Document Classification,” by Peg Howland and Haesun Park.
Howland and Park state the challenge that they are addressing as: “To be useful, the lower dimensional representation must be a good approximation of the original document set given in its full space. Toward that end, we present mathematical models, based on optimization and a general matrix rank reduction formula, which incorporate a priori knowledge of the existing structure. From these models, we develop new methods for dimension reduction…”
They begin by identifying their initial dataspace representation as the classic m x n term-document matrix, where entry (i, j) is the weighted frequency of term i in document j. They note that cluster structure is already apparent in this term-document matrix, and they want a dimension reduction method that preserves it.
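As a purely illustrative reference point, here is a minimal sketch of building such a matrix, assuming tf-idf as the weighting scheme and scikit-learn as the toolkit; the chapter itself does not prescribe a particular weighting or library.

```python
# Minimal sketch of an m x n weighted term-document matrix,
# assuming tf-idf weighting (any weighted-frequency scheme would do).
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "myocardial infarction treated in the emergency department",
    "screening guidelines for colon cancer",
    "insulin therapy for type 2 diabetes",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # shape: (n documents, m terms)
A = X.T                             # transpose to the chapter's m x n (terms x documents) layout
print(A.shape)                      # (m terms, n documents)
```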
Their overall goal is to identify a computationally low-cost dimensionality reduction mechanism that preserves the cluster structure found in the original document collection.
To this end, they evolve a two-stage approach: linear discriminant analysis (LDA) combined with generalized singular value decomposition (GSVD), applied as a second stage after an initial application of either principal component analysis (PCA) or latent semantic indexing (LSI). (They establish the mathematical equivalence of the PCA and LSI stages using trace optimization.) Finally, they unify the two first-stage perspectives (PCA and LSI) under a factor analysis approximation.
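To make the two-stage idea concrete, here is a rough sketch under loose assumptions: stage one uses truncated SVD (the computational core of LSI), and stage two uses scikit-learn's classical linear discriminant analysis in place of the authors' GSVD-based formulation, which is not available off the shelf. The data and dimensions are toy stand-ins, not the chapter's experiment.

```python
# Rough two-stage sketch: LSI (truncated SVD) first, then a discriminant
# projection that uses the known class structure. scikit-learn's
# LinearDiscriminantAnalysis is the classical formulation, NOT the
# GSVD-based variant that Howland and Park develop.
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
A = rng.random((300, 2000))            # n documents x m terms (toy data)
labels = rng.integers(0, 5, size=300)  # five known categories

# Stage 1: reduce to an intermediate dimension with LSI / truncated SVD.
lsi = TruncatedSVD(n_components=100, random_state=0)
A_lsi = lsi.fit_transform(A)           # 300 x 100

# Stage 2: discriminant projection to (number of classes - 1) dimensions.
lda = LinearDiscriminantAnalysis(n_components=4)
A_reduced = lda.fit_transform(A_lsi, labels)  # 300 x 4
print(A_reduced.shape)
```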
Their mathematics is precise, and their trace of intellectual antecedents is excellent. Time permitting, I'll add a brief summary of their more important points.
But skipping now to their final result: they propose to overcome the computational expense of the LDA/GSVD method by applying it AFTER a factor analysis stage. They test their results on very carefully controlled sets of MEDLINE abstracts. The training and testing sets each comprise five medically very distinct categories (heart attack, colon cancer, diabetes, oral cancer, tooth decay), with 250 abstracts per category in each of the training and testing sets.
The term-document matrix was approximately 22,000 x 1,250, a ratio of terms to documents of roughly 18:1.
The two-stage (centroid followed by LDA/GSVD) classification accuracies for their training data range from 81.5 to 88.7%, depending on selection of parameters and centroid method. This is a slight improvement over centroid-alone methods (77.8 – 83.8%). When they apply their methods to “test corpora” of 200 abstracts selected randomly from the overall test corpus of 1250 abstracts, they achieve excellent results: 98.5 – 99%.
Clearly, if we desire to perform straightforward dimensionality reduction on an m x n data set where m >> n, they offer a useful approach.
Now, let us (using the editorial, if not the imperial, “we”) briefly touch on potential limitations.
There is absolutely nothing wrong with their mathematics or methodology. Their process is good, well-formed, and well-demonstrated. But let's first keep in mind that their intention is to address data corpora where m >> n, and that SVD-based methods are limited in exactly that regime. What happens when we move to very, very large data corpora?
Similarly, their corpora, for all that they unarguably need a clean training/test set, are very simplistic. At a glance, we would expect easy discrimination between medical abstracts dealing with "heart attacks" versus "colon cancer," and so on. The limitation of their application is that realistic classification will involve (a small illustrative sketch follows this list):
1) Multiple classifications (or concepts) per document;
2) A hierarchy of classifications;
3) (Desirably) a weighted set of classifications across a hierarchy, i.e., positioning a document at the right “levels” of a hierarchical classification set, where the distribution of classification weightings gives a sense of how the document addresses the generality / specificity of a topical area.
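Here is a hypothetical sketch of what such output might look like. The class names, weights, and hierarchy are invented for illustration; none of this comes from the chapter.

```python
# Hypothetical label structure for realistic document classification:
# multiple labels per document, arranged in a hierarchy, each with a weight.
from dataclasses import dataclass
from typing import Optional

@dataclass
class LabelNode:
    name: str
    parent: Optional[str] = None

# A tiny, invented label hierarchy (two levels under a root).
hierarchy = {
    "medicine":     LabelNode("medicine"),
    "cancer":       LabelNode("cancer", parent="medicine"),
    "oral cancer":  LabelNode("oral cancer", parent="cancer"),
    "colon cancer": LabelNode("colon cancer", parent="cancer"),
}

# One document, weighted across several nodes at different levels of the
# hierarchy, rather than assigned to a single flat class.
doc_labels = {
    "oral cancer": 0.70,   # most specific node, strongest weight
    "cancer":      0.25,   # general parent topic
    "medicine":    0.05,   # very general, weak signal
}
assert abs(sum(doc_labels.values()) - 1.0) < 1e-9
```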
Most clearly, the method they devise seems useful in well-defined domains, but it is not clear how well it would extend to "blurrier" domains: corpora in which the language is not as technically precise or well-structured, corpora with different reporting styles, and so on.
The overall question of clustering becomes much more pertinent when we look at complex knowledge structures, expressed either as taxonomies or ontologies.
Suppose that we have a medical ontology. (I’m specifying ontology rather than taxonomy, because I’ll jump into objects which inherit from multiple parent classes in a moment — in other words, object-oriented thinking, which can be used to model ontologies.) Suppose that we have multiple high-level ontological structures in this medical ontology. One of these is the “cancer” ontology, which includes both oral cancer and colon cancer. (Both of these are classes within their study.) Suppose that we also have an “oral diseases” ontology – within the same large medical ontology. Within this “oral diseases” ontology, we have both oral cancer and tooth decay.
A document on oral cancer should match not only the oral cancer "ontological node," but also (with perhaps less strength) the higher-level classes of "cancer" and "oral diseases." In other words, it should match with a weighted spread across a structure of potential classifications.
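A tiny illustrative sketch of that idea follows. The class names echo the chapter's categories, but the structure, decay factor, and weights are my own invention, not anything from the book.

```python
# Illustrative sketch: an ontology fragment in which "oral cancer" inherits
# from both "cancer" and "oral diseases", and a document match propagates
# upward with attenuated weight.
from collections import defaultdict

# child -> list of parent classes (multiple inheritance allowed).
parents = {
    "oral cancer":   ["cancer", "oral diseases"],
    "colon cancer":  ["cancer"],
    "tooth decay":   ["oral diseases"],
    "cancer":        [],
    "oral diseases": [],
}

def spread_weight(node, weight, decay=0.5, scores=None):
    """Assign 'weight' to the matched node and a decayed share to each ancestor."""
    if scores is None:
        scores = defaultdict(float)
    scores[node] += weight
    for parent in parents.get(node, []):
        spread_weight(parent, weight * decay, decay, scores)
    return scores

# A document that matches "oral cancer" strongly:
print(dict(spread_weight("oral cancer", 1.0)))
# -> {'oral cancer': 1.0, 'cancer': 0.5, 'oral diseases': 0.5}
```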
This does not diminish the value of the Howland and Park work, but does suggest that we need to think through the big picture requirements before selecting a particular mathematical method.
Second, we need to think through the fact that many high-value documents are characterized by participation in multiple classes (taxonomic or ontological) simultaneously. Such documents will likely defy "simple" classification, and they will also defy "simple" clustering. By their very nature, the value these documents bring lies in associating multiple concept classes, whether spread over a single ontological/taxonomic structure (e.g., comparing two related topics) or cutting across disparate ontologies/taxonomies.
THIS is really the challenge that we need to address.
We can go on forever finding improved means for simple classification, but that obscures the objective of real textual data mining / text analysis. We are looking for the associations between concepts, not simple classification.
The best way to do this (and the subject of my knowledge discovery patent; forgive me for tooting my own horn, but that patent addressed this issue precisely) is to go back to the very core of AI thinking: representation layers. (This is the basis of brain signal processing as well.) To achieve useful results, we need to transform data from lower to higher representation levels; we need to accomplish successive stages of data abstraction.
Dimensionality reduction is really not the same thing as data abstraction. It is close, and it is a useful method, but it is not the same.
This is not to disparage the method of Howland and Park. We simply need to hold it in context. Preferably, we use higher-order methods to identify desired “clusters” (actually, to identify known concepts and entities), and train classifiers / clustering methods to provide weighted maps of concepts / entities to documents as a first-level set of metadata.
The classification methods offered in the Howland & Park work (centroid-based and k-nearest-neighbor), along with the LDA/GSVD stage itself, do make use of labeled training categories, so they are supervised in a narrow sense. But the supervision is thin: a flat, fixed set of class labels. Such methods have their role in the universe of text analytics, but we should not ignore the value of more richly supervised methods, which bring in a control mechanism to guide processes towards desired outcomes.
Yes, the latter does impose a "world view" on results. Thus, in addition to supervised methods that produce weighted, hierarchical classifications, we also need novelty detection methods.
Thus, we hold the Howland & Park method in context. A good mathematical tool. Well-thought-out. Potentially useful, if we carefully control the conditions under which it is used. But let us also keep our eye on the big picture, and our attention tuned to the full set of necessary capabilities.
This set of comments, meant to place the reviewed works in appropriate context, will be similar for most of the chapters in this book. For that reason, once again: the book is useful. But it needs to be presented (or considered) with an appreciation for the overall challenges of the problem space. It is a set of (mostly) useful tools, each of which addresses a unique, particular, and relatively small piece of the big problem. "Big picture thinking" is still required.