Follow-on Thoughts: Clustering Algorithm Improvements for Text-based Data Mining
A good night’s sleep is excellent for clearing away mental cobwebs, and has given me more perspective on Chapter 1, “Cluster-Preserving Dimension Reduction Methods,” by Howland and Park in Survey of Text Mining II: Clsutering, Classification, and Retrieval (ed. by Berry & Castellanos).
If you will, please recall with me that the Howland & Park work proposed a two-step dimensionality reduction method. They successfully reduced over 20,000 “dimensions” (of words found in the overall corpus collection) to four dimensions, and maintained excellent classification results.
Note also that their work was applied to a very structured domain (medical abstracts), and confined to five very distinct classes (heart attach, colon cancer, diabetes, oral cancer, tooth decay). We would expect some – but still very limited – overlap between colon cancer and oral cancer, and oral cancer and tooth decay. (It would be interesting to have reports on the centroid distances of these areas.)
Using their two-stage dimensionality reduction and clustering, they achieved results in the 98-99% range, which was an improvement (typically 10% or more) over simple centroid or k-nearest-neighbor clustering.
When we back off a bit and think this through, these results are not at all surprising.
First, the key terms identifying the concept areas should indeed be very distinct. The five selected topics should be VERY separable in a well-defined concept-space.
We know from experience that overly large feature vectors do far more harm than good, especially when unweighted. All the oddball, one-time-only terms do nothing except weaken the results that really depend on relatively few keywords.
Just about any method can and should remove those term “dimensions” that:
1) Are very sparse, both throughout the entire training / testing corpus AND
2) Are either so unique that their frequency of occurrence is not a reliable indicator of concept presence OR
3) Their presence – although sparse – is distributed over multiple concept classes.
Yes, this is coming at the problem from a top-down perspective, which betrays my bias immediately — but experience as well as theory have shown me that a little top-down knowledge, astutely coupled with some algorithms, can cut down on processing time and improve results reliability.
Be that as it may; they get anticipated results.
Which brings me to identifying the key questions that I believe truly confront the text mining / knowledge discovery community. These are:
1) Correct apportionment of a concept / entity across an ontology / taxonomy, and
2) The “mixtures problem.”
I wrote sufficiently about the first area in the previous blog, and touched briefly on the second. But it is this second one that requires substantive attention by the entire community if we are to have useful, reliable tools.
The term “mixtures problem” comes from the area of physical and analytical chemistry. Again, I’m showing my bias, but bear with me – this is an important analogy.
When we want to determine the composition of a chemical mixture, say a beaker full of organic solvents, we do a variety of tests — not just one! (This in itself should be a good heads-up for designing comprehensive text-mining tool suites.) For example, we use NMR to identify the presence of various elements. We use mass spectroscopy to determine the presence of various “fragments” within the chemical compounds. And we use a huge RANGE of spectroscopic methods (e.g., gas chromotography, infrared, etc.) to identify characteristic base chemical units (e.g., benzene rings, methyl groups, etc.).
Taking this over to the text-mining area, this immediately suggests that a suite of algorithms is useful, especially at the front-end. Even the crudest of algorithms – e.g., simple string-matching – can alert us to certain entities, and this will further direct us to certain kinds of processes, or certain aspects of context that will modify the processes which we select.
But back to analytical chemistry.
Suppose that we do our analyses, and get a set of the various chemical “base units” that are present. We can even identify full compounds by matching spectral peak patterns.
Moreover, using some fancy software, we can approximate the relative fraction of each compound in the mixture.
At first glance, this would seem to be a linear optimization problem.
And to a very reasonable and realistic extent, it is.
The strongest nonlinear effects — the degree to which, say, one methyl group attached to a benzene ring affects the spectral signature of another attached group — are within a given compound. These are well-known, and they form part of the distinct “spectral signature” that allows us to characterize a compound uniquely.
There will, however, be some degree to which the presence of one compound can affect the “signature” of another compound. This introduces a second-order, or nonlinear effect.
So back now to textual data mining.
We know that any document of interest will typically contain multiple concepts and entities. It stands to reason that when we classify the document, we should accept a range of match strengths against multiple concept classes. I am talking about concepts from MULTIPLE, very DISTINCT taxonomies / ontologies, as well as the “vertical spread” that we anticipate for matching levels of abstraction within a taxonomy.
We need to think through our method for this matching.
We can do something very simplistic; e.g., find the “best match,” then reduce the feature vectors by the amount proportional to that match, then find the “next best,” etc.
We can do a linear optimization spread, which essentially is performing the above but using all the potential matches simultaneously.
And I’m certain that there are other methods, open to our ingenuity. (If you have good references; please send; I’ll attempt to read / review / cite, although with probably the same long time-delay that I’ve had in putting this review together.)
Where the real challenge comes now is not so much in defining the method, as in defining our test criteria.
If we are going to have a good method for matching a document across a set of concepts / entities, we need to have ground-truthed data that has an agreed-upon evaluation.
Yes, there are certain databases (e.g., TREK, etc.) that have multiple concepts, etc. However, as we approach this more rigorously, we need to look more closely at second-order effects as we set up our evaluation data.
I mentioned that second order effects could happen — but were not likely to be too severe — in chemical mixtures.
My suspicion is that some of these effects can be VERY severe in text data. That is because we are dealing with a many-to-many (word-to-concept) match situation which is not subject to the laws of nature, but to human preferences and predilictions in writing and reporting style.
Good quantification should be a starting point.
Again, if you know of good work in this area, please share.
I will say that — if memory serves me right — this was NOT an issue that I saw addressed in this book, lovely though each of the chapters are. Nor have I seen it much in the technical material that has crossed my path over the last few years.
Comments, anyone?