To that end, I used the Cluto Clustering Toolkit to cluster more than a year's worth of OpEdNews articles (about 14,000).
Here I present some preliminary results, in the hope of soliciting feedback and suggestions.

OpEdNews by Rob Kall (modified by Don Smith)
The table here shows, for the first 1000 articles, the keywords that the users supplied, as well as the cluster keywords generated by Cluto.
Clustering algorithms aim to find natural categories. A cluster is a group of objects that are similar. For articles, similarity means: they have many of the same words. The words that are shared by the articles in a given cluster are candidates for the tags.
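To make "similar" concrete, here is a minimal sketch in Python. It assumes cosine similarity over raw word counts, which is one common way to measure shared vocabulary; it is not necessarily the exact measure configured in Cluto for this experiment. The function names and the `min_fraction` parameter are my own illustrative choices.

```python
import math
from collections import Counter

def word_counts(text):
    """Lowercase the text and count its whitespace-separated words."""
    return Counter(text.lower().split())

def similarity(text_a, text_b):
    """Cosine similarity of two articles' word counts (0.0 = no shared words)."""
    a, b = word_counts(text_a), word_counts(text_b)
    dot = sum(n * b[w] for w, n in a.items() if w in b)
    norm_a = math.sqrt(sum(n * n for n in a.values()))
    norm_b = math.sqrt(sum(n * n for n in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def candidate_tags(cluster, min_fraction=1.0):
    """Words that occur in at least min_fraction of the articles in a cluster."""
    doc_freq = Counter()
    for text in cluster:
        doc_freq.update(set(text.lower().split()))  # count each word once per article
    needed = min_fraction * len(cluster)
    return sorted(w for w, n in doc_freq.items() if n >= needed)
```

With `min_fraction=1.0` only words present in every article of the cluster survive; lowering it admits words shared by most, but not all, of the articles.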
To run this experiment, I first needed to collect data. Since November 2010 I've been storing RSS feeds of articles and stories from OpEdNews and 70 other websites in a database. I wrote a program to extract from each article a list of its words and pairs of words, along with counts of their occurrences (ignoring common words). This yielded a matrix with one row per article and one column per word or word pair. Cluto then processed the matrix and found the clusters and the cluster keywords. I then wrote a program to generate the HTML file.
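The extraction step can be sketched roughly as follows. This is not my actual program: the stop-word list is a tiny placeholder, the tokenizing regular expression is an assumption, and word pairs are formed after common words are dropped, which is one of several reasonable choices.

```python
import re
from collections import Counter

# Tiny illustrative stop list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "for", "on"}

def extract_features(text):
    """Count single words and adjacent word pairs, ignoring common words."""
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]
    pairs = [f"{a} {b}" for a, b in zip(words, words[1:])]
    return Counter(words + pairs)

def build_matrix(articles):
    """Return (vocabulary, matrix): one row per article, one column per term."""
    rows = [extract_features(text) for text in articles]
    vocab = sorted(set().union(*rows))
    matrix = [[row.get(term, 0) for term in vocab] for row in rows]
    return vocab, matrix
```

For example, `build_matrix(["Health care reform", "The health care debate"])` produces a two-row matrix whose columns include both `health` and the pair `health care`. A dense list of lists is fine for a demonstration; with 14,000 articles most entries are zero, so a sparse representation is the practical choice.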
I'm hoping that the algorithm could eventually be used to automatically categorize articles.
The system still needs better handling of synonyms (e.g., "health care" and "medical care") and word stemming (e.g., "Afghanistan" and "Afghan" or "corporate" and "corporations").
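One simple way to approach this is a hand-built lookup table that maps variants to a single canonical term before counting. The sketch below is only that: the `CANONICAL` table is hypothetical and covers just the examples above, and a real system would more likely use a proper stemmer and a thesaurus.

```python
from collections import Counter

# Hypothetical hand-made table mapping variants to one canonical term.
CANONICAL = {
    "medical care": "health care",
    "afghan": "afghanistan",
    "corporation": "corporate",
    "corporations": "corporate",
}

def normalize(term):
    """Lowercase, collapse whitespace, then map known variants to their canonical form."""
    key = " ".join(term.lower().split())
    return CANONICAL.get(key, key)

def merge_counts(counts):
    """Combine the counts of terms that normalize to the same canonical term."""
    merged = Counter()
    for term, n in counts.items():
        merged[normalize(term)] += n
    return merged
```

Applied to an article's counts before the matrix is built, this makes "Afghan" and "Afghanistan" contribute to the same column, so two articles using different variants still register as sharing a word.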
Additionally, there are numerous parameters and algorithms to choose from when clustering with Cluto. Most importantly, how many clusters should there be? Too many clusters and you get weird clusters that reflect random similarities among articles (over-fitting). Too few clusters and there is insufficient resolution of important distinctions. The page you're now reading shows clustering with 30 clusters; this other page here shows clustering with 40 clusters.
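A common heuristic for picking the number of clusters is to run the clustering for several values and watch how tightly the clusters fit: the total scatter drops sharply until the "natural" number is reached, then levels off. The sketch below illustrates that idea with plain k-means on toy 2-D points. It is a stand-in, not Cluto: Cluto has its own methods and criterion functions, and the deterministic farthest-point initialization here is my own simplification.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(col) / len(points) for col in zip(*points))

def kmeans(points, k, iterations=20):
    """Plain k-means with deterministic farthest-point initialization."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next seed: the point farthest from its nearest existing centroid.
        centroids.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centroids)))
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return centroids, clusters

def scatter(points, k):
    """Total squared distance from each point to its cluster's centroid."""
    centroids, clusters = kmeans(points, k)
    return sum(sq_dist(p, c) for c, members in zip(centroids, clusters) for p in members)

# Toy data with three obvious groups.
POINTS = [(0, 0), (0, 1), (10, 10), (10, 11), (20, 0), (20, 1)]
SCATTER_BY_K = {k: scatter(POINTS, k) for k in range(1, 5)}
```

On this toy data the scatter falls steeply from one to three clusters and barely changes at four, which is the signature of three real groups. With real articles the curve is far less clean, which is why comparing the 30-cluster and 40-cluster pages by eye is still useful.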
Since authors have already supplied keywords for the articles, and since the supplied keywords are generally accurate, it might be possible to "train" a system from the labeled articles.
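As a rough illustration of what such training could look like, the sketch below builds a word profile for each author-supplied keyword and assigns a new article to the keyword whose profile it overlaps most. This is a deliberately naive scoring scheme of my own, not something I have built or tested on the OpEdNews data; the function names and the sample articles are invented for the example.

```python
from collections import Counter, defaultdict

def word_counts(text):
    """Lowercase the text and count its whitespace-separated words."""
    return Counter(text.lower().split())

def train(labeled_articles):
    """Build one aggregate word profile per tag from (text, tag) pairs."""
    profiles = defaultdict(Counter)
    for text, tag in labeled_articles:
        profiles[tag].update(word_counts(text))
    return profiles

def score(profile, text):
    """Overlap between an article and a tag profile, scaled by profile size
    so that tags with many training articles do not win automatically."""
    total = sum(profile.values())
    if total == 0:
        return 0.0
    return sum(n * profile[w] for w, n in word_counts(text).items()) / total

def predict(profiles, text):
    """Return the tag whose profile scores highest for this article."""
    return max(profiles, key=lambda tag: score(profiles[tag], text))

# Invented sample data, for illustration only.
LABELED = [
    ("troops leave afghanistan war", "war"),
    ("hospital insurance health care costs", "health"),
    ("war troops deployed", "war"),
]
PROFILES = train(LABELED)
```

Here `predict(PROFILES, "health insurance reform")` picks the `health` tag because only that profile shares any words with the article. A serious version would weight rare words more heavily and allow several tags per article.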
Click here for a list of tagged articles.



