Dav Yaginuma;
Husband, Father, Hacker, Thinker, Maker;
San Francisco.

Upcoming


Comments

Came here via your link in the boingboing comments. Interesting stuff! From a quick look over the headmap document you linked, two things stand out:

1) Just as the propagation of significance through a word is scaled down by how common (document-diverse) the word is, propagation through a document should be scaled down by how word-diverse the document is. This would help maintain specificity.

2) Instead of dividing by the square root of word occurrences as a scaling factor, I'm recklessly guessing based on information theory that it should probably be something related to -log (probability of word occurrence). Same-but-reversed for documents: divide by -log (number of indexed words in document / total indexed words). I know, I know, math first, then post. Sorry.

Did I say divide by -log(P)? I meant multiply.
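To make the corrected suggestion concrete, here is a minimal sketch (in Python rather than Perl, with illustrative numbers that are not from the thread) of weighting a word by -log of its document probability, so rare words are amplified and ubiquitous words are damped toward zero:

```python
import math

def global_weight(doc_freq, total_docs):
    """-log of the probability that a random document contains the
    word; rare words get large weights, ubiquitous words get ~0."""
    return -math.log(doc_freq / total_docs)

# A word found in 10 of 1000 documents carries far more weight
# than one found in 500 of 1000.
rare = global_weight(10, 1000)     # ~4.61
common = global_weight(500, 1000)  # ~0.69
```

Multiplying a word's raw score by this weight is the "multiply, not divide" version of the scaling proposed above.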

You can check out my Search::ContextGraph module (http://www.cpan.org) for an example of how to add local and global term weighting into the model. It's Perl, but the weighting code will be analogous in Java or C#.
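For readers without Perl handy, a rough Python sketch of what combining a local and a global weight over a term-document count matrix looks like; the log-based formulas here are common choices for illustration, not necessarily the exact ones Search::ContextGraph uses:

```python
import math

def weight_matrix(counts):
    """Apply local and global term weighting to a term-document
    count matrix, given as {term: {doc_id: raw_count}}.
    Illustrative formulas: local = log(1 + count),
    global = -log(fraction of documents containing the term)."""
    num_docs = len({d for docs in counts.values() for d in docs})
    weighted = {}
    for term, docs in counts.items():
        # Global weight: 0 for a term in every document,
        # large for a term in few documents.
        gw = -math.log(len(docs) / num_docs)
        weighted[term] = {
            # Local weight damps raw counts logarithmically.
            d: math.log(1 + c) * gw for d, c in docs.items()
        }
    return weighted

counts = {"rare": {"d1": 2}, "common": {"d1": 1, "d2": 1}}
w = weight_matrix(counts)
```

A term appearing in every document ends up with weight zero everywhere, which is exactly the specificity-preserving behavior point 1 above argues for.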

Dan, thanks, I'll try this out...

And Maciej, I'll also take a look at your Perl module, thanks!