Promoted to Headline (H4) on 1/24/11

Towards automatic tagging of OpEdNews articles

opednews.com

I present some preliminary results on automatically tagging OpEdNews articles via clustering (an AI technique).

::::::::

When authors submit articles to OpEdNews, they tag articles by choosing from a hierarchy of categories that has been built up over the years by editors and authors. It's often a challenge to choose the correct tag. It would be cool if software could automagically choose the correct tag for you, based on the content of the article.

To that end, I used the Cluto Clustering Toolkit to cluster about 14,000 OpEdNews articles, over a year's worth. Here I present some preliminary results, in the hope of soliciting feedback and suggestions.


OpEdNews by Rob Kall (modified by Don Smith)

The table here shows, for the first 1000 articles, the keywords that the users supplied, as well as the cluster keywords generated by Cluto.

Clustering algorithms aim to find natural categories. A cluster is a group of objects that are similar. For articles, similarity means: they have many of the same words. The words that are shared by the articles in a given cluster are candidates for the tags.
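
To make "similar" concrete, here is a minimal sketch of one common similarity measure, cosine similarity over word counts. Whether this matches the exact measure Cluto used in my runs is an assumption, and the three sample headlines are invented for illustration.

```python
import math
import re
from collections import Counter

def word_counts(text):
    """Lowercase the text and count how often each word occurs."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors.

    1.0 means identical word proportions, 0.0 means no shared words.
    """
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def shared_words(a, b):
    """Words present in both articles: the raw material for tag candidates."""
    return sorted(a.keys() & b.keys())

# Invented sample texts (not real OpEdNews articles).
war_1 = word_counts("troops leave afghanistan as war winds down")
war_2 = word_counts("afghanistan war costs rise as troops stay")
health = word_counts("senate debates health care reform bill")

print(cosine_similarity(war_1, war_2))   # the two war pieces overlap
print(cosine_similarity(war_1, health))  # no words in common
print(shared_words(war_1, war_2))
```

Note that "as" shows up among the shared words, which is why common words need to be filtered out before clustering.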

To run this experiment, I first needed to collect data. Since Nov 2010 I've been storing RSS feeds of articles and stories from OpEdNews and 70 other websites in a database. I wrote a program to extract from each article a list of its words and pairs of words, with counts of occurrences (ignoring common words). This yielded a matrix. Cluto then processed the matrix and found the clusters and the cluster keywords. Finally, I wrote a program to generate the HTML file.
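
The extraction step can be sketched roughly as follows. This is not my actual program: the stop-word list is a tiny illustrative one, the function names are hypothetical, and the layout (one row per article, one column per word or word pair) is an assumption about how such a matrix is usually arranged.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on"}

def extract_terms(text):
    """Count single words and adjacent word pairs, ignoring common words.

    Pairs are formed after stop words are dropped, so "cost of war"
    yields the pair "cost war".
    """
    words = [w for w in re.findall(r"[a-z]+", text.lower())
             if w not in STOP_WORDS]
    pairs = [f"{a} {b}" for a, b in zip(words, words[1:])]
    return Counter(words + pairs)

def build_matrix(articles):
    """Return (vocabulary, matrix) with one row of counts per article."""
    counts = [extract_terms(text) for text in articles]
    vocab = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocab] for c in counts]
    return vocab, matrix

# Invented sample texts (not real OpEdNews articles).
vocab, matrix = build_matrix(["Health care reform", "The war in Afghanistan"])
for term, column in zip(vocab, zip(*matrix)):
    print(f"{term:16} {column}")
```

A matrix like this is mostly zeros, so for thousands of articles it would normally be stored in a sparse format rather than as nested lists.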

I'm hoping that the algorithm could eventually be used to automatically categorize articles.

The system still needs better handling of synonyms (e.g., "health care" and "medical care") and word stemming (e.g., "Afghanistan" and "Afghan" or "corporate" and "corporations").
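
As a rough illustration of what that normalization might look like, here is a sketch that maps synonyms to one canonical phrase and strips a few suffixes. The tables and the suffix list are hand-made assumptions covering only the examples above; a real system would more likely use an established stemmer (such as the Porter stemmer) and a proper thesaurus.

```python
# Hand-made tables for illustration only.
SYNONYMS = {"medical care": "health care"}
IRREGULAR = {"afghanistan": "afghan"}
# Longest suffixes first, so "ations" is tried before "s".
SUFFIXES = ("ations", "ation", "ate", "es", "s")

def stem(word):
    """Crude stemmer: irregular-form lookup, then strip one known suffix."""
    word = word.lower()
    if word in IRREGULAR:
        return IRREGULAR[word]
    for suffix in SUFFIXES:
        # Keep at least four letters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word

def normalize(term):
    """Map a word or phrase to a canonical form before counting it."""
    term = term.lower()
    term = SYNONYMS.get(term, term)
    return " ".join(stem(w) for w in term.split())

print(normalize("corporate"), normalize("corporations"))
print(normalize("Afghanistan"), normalize("Afghan"))
print(normalize("medical care"), normalize("health care"))
```

Each pair prints the same form twice, so the two spellings would land in the same matrix column.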

Additionally, there are numerous parameters and algorithms to choose from when clustering with Cluto. Most importantly, how many clusters should there be? Too many clusters and you get weird clusters that reflect random similarities among articles (over-fitting). Too few clusters and there is insufficient resolution of important distinctions. This page you're now reading shows clustering with 30 clusters; this other page here shows clustering with 40 clusters.
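
One heuristic for picking the number of clusters (not something I have tried on the OpEdNews data) is to cluster with several values of k and watch how much the within-cluster scatter improves each time. The sketch below uses a toy k-means on invented 2-D points as a stand-in for article vectors; it is not Cluto's algorithm, and all names in it are hypothetical.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    """Coordinate-wise average of a non-empty list of points."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))

def kmeans(points, k, iterations=20):
    """Plain k-means with a deterministic start; returns the k centers."""
    # Start from points spread evenly through the input order.
    centers = [points[i * len(points) // k] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            groups[nearest].append(p)
        # An empty group keeps its old center.
        centers = [mean(g) if g else centers[i] for i, g in enumerate(groups)]
    return centers

def scatter(points, centers):
    """Total squared distance from each point to its nearest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)

# Invented data: three obvious groups of three points each.
POINTS = [(0, 0), (0, 1), (1, 0),
          (10, 10), (10, 11), (11, 10),
          (20, 0), (20, 1), (21, 0)]

scores = {k: scatter(POINTS, kmeans(POINTS, k)) for k in range(1, 5)}
for k, score in scores.items():
    print(k, round(score, 2))
```

Scatter always shrinks as k grows, so the smallest value is not the goal. In this toy data the improvement is large up to k = 3 and tiny afterwards, which suggests that a fourth cluster only splits a natural group.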

Since authors have already supplied keywords for the articles, and since the supplied keywords are generally accurate, it might be possible to "train" a system from the labeled articles.
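
A minimal sketch of that idea: add up the word counts of all articles carrying each tag, then suggest the tag whose word profile best matches a new article. The training examples and tag names are invented, the function names are hypothetical, and a real system would likely use a standard classifier such as naive Bayes.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z]+", text.lower())

def train(labeled_articles):
    """Build one word-count profile per tag from (text, tag) pairs."""
    profiles = defaultdict(Counter)
    for text, tag in labeled_articles:
        profiles[tag].update(tokenize(text))
    return dict(profiles)

def score(words, profile):
    """Overlap between an article and a tag profile, scaled by profile size."""
    size = math.sqrt(sum(v * v for v in profile.values()))
    if size == 0:
        return 0.0
    return sum(count * profile[w] for w, count in words.items()) / size

def suggest_tag(profiles, text):
    """Return the tag whose profile best matches the text."""
    words = Counter(tokenize(text))
    return max(profiles, key=lambda tag: score(words, profiles[tag]))

# Invented training data (not real OpEdNews articles or tags).
TRAINING = [
    ("troops withdraw from afghanistan war zone", "War"),
    ("war spending and troops in iraq", "War"),
    ("health care reform bill passes senate", "Health"),
    ("insurance costs and health care access", "Health"),
]

profiles = train(TRAINING)
print(suggest_tag(profiles, "new health insurance bill"))
print(suggest_tag(profiles, "more troops sent to war"))
```

Unlike clustering, this approach can only suggest tags that already exist in the hierarchy, so the two techniques complement each other.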


Click here for a list of tagged articles.

 

http://waliberals.org

DFA organizer, Democratic Precinct Committee Officer, writer, and programmer. My op-ed pieces have appeared in the Seattle Times, the Seattle Post-Intelligencer, and elsewhere. See http://WALiberals.org and http://TruthSite.org for my writing, my (more...)
 

Comments

My (human) thoughts by Scott Baker on Wednesday, Jan 26, 2011 at 5:10:24 AM