aku-aku: v. To move a tall, flat-bottomed object (such as a bookshelf) by swiveling it alternately on its corners in a "walking" fashion. [After the book by Thor Heyerdahl theorising that the statues of Easter Island were moved in this fashion.] Source: LangMaker.com. Aku Aku also has another meaning to the islanders: a spiritual guide.
Sapience: anti spambot human validation.
Posted by dav at 2003 October 23 11:46 AM
File under: Geek

Lately, I've been getting a lot of spam in my blog comments. Some people are getting a handful, but I've gotten at least a hundred, including a blitzkrieg a week ago.

I hate spam in my email inbox, but I really loathe spam in my blog comments, and now I've done something about it.

The October issue of the eminent hacker magazine Dr. Dobb's Journal had an article by Paul Tremblett which outlined a way to use the Java 2D API in a servlet to render dynamically created images of a sequence of letters and numbers for use in web form validation. This technique has been gaining popularity as a way to deter bots from submitting information to web forms.

Unfortunately, the magazine did not include the proper source code either in the article or in its online resource center, but Mr. Tremblett was kind enough to dig up some mostly working source code for his application and email it to me. I was able to use it to get a rough version of the application described in his article working. I then added an XMLRPC wrapper to it, obtained his permission to release it as open source under the BSD license, and dubbed it Sapience.
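
To give a feel for what an XMLRPC wrapper around a validation service involves, here is a minimal sketch. It is written in Python rather than the Java actually used, and it is not the Sapience code: the method names new_challenge and validate, the in-memory store, and the host and port are all assumptions made for illustration, and the image rendering is left out entirely.

```python
import secrets
import string
from xmlrpc.server import SimpleXMLRPCServer

ALPHABET = string.ascii_uppercase + string.digits


class ChallengeStore:
    """In-memory map from an opaque token to the code shown in the image."""

    def __init__(self):
        self._codes = {}

    def new_challenge(self, length=6):
        # Hypothetical method: hands back a token. A real service would also
        # render the code into an image and tell the caller where to find it.
        token = secrets.token_hex(8)
        code = "".join(secrets.choice(ALPHABET) for _ in range(length))
        self._codes[token] = code
        return token

    def code_for(self, token):
        # Server-side only: used by whatever draws the image.
        return self._codes.get(token, "")

    def validate(self, token, answer):
        # Each challenge is single-use: it is removed whether or not the
        # answer matches. str() guards against a client that sent an int.
        expected = self._codes.pop(token, None)
        return expected is not None and str(answer).strip().upper() == expected


def serve(host="localhost", port=8000):
    """Expose the two hypothetical methods over XML-RPC (blocks forever)."""
    store = ChallengeStore()
    server = SimpleXMLRPCServer((host, port))
    server.register_function(store.new_challenge, "new_challenge")
    server.register_function(store.validate, "validate")
    server.serve_forever()
```

Any web form scheme that can speak XML-RPC could then ask for a challenge when it renders a form and call validate when the form comes back.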

Next I started in on a hack for Movable Type comments to use Sapience for comment form validation. The hack I came up with works like this:

Place the Sapience.pm file in /extlib/Dav/MT/
In the MT admin interface, edit the comment templates to include <!--SAPIENCE--> within your comment form HTML.
Edit the file /lib/MT/App/Comments.pm to use the Sapience.pm methods to insert the Sapience image into the form and validate it on comment submission.
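
The steps above can be sketched as follows. This is Python, not the Perl in Sapience.pm or Comments.pm, and only the <!--SAPIENCE--> placeholder comes from the actual hack; the function names and the form field names sapience_token and sapience_answer are hypothetical.

```python
import html

PLACEHOLDER = "<!--SAPIENCE-->"


def insert_challenge(template, token, image_url):
    """Swap the placeholder for the image, a hidden token, and an answer box."""
    widget = (
        '<img src="{url}" alt="validation code image">'
        '<input type="hidden" name="sapience_token" value="{token}">'
        '<input type="text" name="sapience_answer">'
    ).format(
        url=html.escape(image_url, quote=True),
        token=html.escape(token, quote=True),
    )
    return template.replace(PLACEHOLDER, widget)


def accept_comment(form, validate):
    """Return True only if the submitted answer passes the validate callback.

    `form` is a dict of submitted fields; `validate` stands in for the
    remote call to the validation service.
    """
    token = form.get("sapience_token", "")
    answer = form.get("sapience_answer", "")
    if not token or not answer:
        return False
    return bool(validate(token, answer))
```

A comment that arrives without the token or the answer is rejected before the service is even contacted.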

I've set up a project home for Sapience on sourceforge.

Note that there are other methods for stopping spambots from posting to your Movable Type blog, such as MT Blacklist, which does not require so much hacking of your MT source.

There are a few things that need to be done in order for it to go 1.0. I need to figure out how to crop the image more exactly, and I need to institute a cleanup mechanism for the images on the server. I also need to tweak the random code string generation to ensure there is always a letter present, since Perl's XMLRPC::Lite will transport the code in <int></int> instead of <string></string> if the code is all numbers.
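
One way to guarantee a letter is to generate the code as usual and then overwrite one randomly chosen position with a letter. The sketch below is an illustration in Python, not the Sapience code, and make_code and wire_type are made-up names. Python's own marshaller picks the XML-RPC tag from the value's declared type rather than its content, so wire_type is only there to show the two tags in question.

```python
import secrets
import string
import xmlrpc.client

LETTERS = string.ascii_uppercase
DIGITS = string.digits


def make_code(length=6):
    """Random code of letters and digits that always contains a letter."""
    if length < 1:
        raise ValueError("length must be at least 1")
    chars = [secrets.choice(LETTERS + DIGITS) for _ in range(length)]
    # Overwrite one random position with a letter so the code can never
    # be all digits.
    chars[secrets.randbelow(length)] = secrets.choice(LETTERS)
    return "".join(chars)


def wire_type(value):
    """Report which XML-RPC tag Python's marshaller picks for a value."""
    payload = xmlrpc.client.dumps((value,))
    return "int" if "<int>" in payload else "string"
```

The forced letter skews the distribution of codes very slightly toward letters, which seems an acceptable trade for never sending an all-digit code.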

There are also a few things I'd like to do, such as add color and new validation methods to further thwart bot usage. There could be instructions such as 'Enter only the blue characters.' There could also be animation or audio validation methods, I suppose.
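
As a rough idea of how the colored variant might be generated, here is a sketch in Python. Nothing like this exists in Sapience yet; the color list, the function name, and the return shape are all assumptions, and the drawing of the image itself is omitted.

```python
import secrets
import string

COLORS = ("blue", "red", "green")


def make_color_challenge(length=8, target="blue"):
    """Return (colored_chars, instruction, expected_answer).

    colored_chars is a list of (character, color) pairs to be drawn;
    expected_answer is the characters in the target color, in order.
    """
    alphabet = string.ascii_uppercase + string.digits
    chars = [secrets.choice(alphabet) for _ in range(length)]
    colors = [secrets.choice(COLORS) for _ in range(length)]
    # Force at least one character into the target color so the expected
    # answer is never empty.
    colors[secrets.randbelow(length)] = target
    colored = list(zip(chars, colors))
    expected = "".join(ch for ch, color in colored if color == target)
    instruction = "Enter only the {} characters.".format(target)
    return colored, instruction, expected
```

A bot would then have to recover the colors as well as the characters from the image.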

I'm getting ready to leave on a 2.5 week business trip though, so I may not be able to finish it up quickly. At least in the meantime I can rest assured that the spambot will cease posting to my blog.

Comments:

How is this different from James Seng's captcha? These things are horrible solutions because they cut out a segment of our society from participation. I have a blind friend for whom I get very sad every time I see someone get excited about something that requires sight, but should not.

Even James Seng thinks these are bad ideas...

If you are going to implement something like this, you should also offer AT LEAST one other avenue of comment posting (i.e. aural code, a question for the user to answer, an email link and an apology)...

Posted by: Jay Allen on October 23, 2003 12:08 PM

Hi Jay,

It's essentially the same as captcha from an MT point of view. The main difference is that Sapience is an XMLRPC service allowing it to be used in any web form scheme. (Although I also think that Paul's image generation algorithm is better suited for bot prevention than James').

As for alternative methods of validation, that is a good point, and I have already considered it. In fact if you read my blog posting I specifically mention audio validation.

Thanks for your comments. I hope you don't think I'm trying to compete with your wonderful utility, especially as I felt it best to mention MT Blacklist in my posting since I believe it has merits over the Sapience solution. I'm just playing with alternative methods.

Posted by: Dav on October 23, 2003 12:29 PM

Hi Dav!
Yeah, essentially these kinds of "Turing test" type deals are not the best way to go, even though they may seem so. As we all know, it essentially closes the door on some people. To put it in a "high-level" way: we place the burden of the fight on folks who have nothing to do with it. Legit visitors/commenters (which are the 99% majority) are neither the propagators (the spammers) nor the victims (the blog "owners", us).

Since we cannot go to the source and fight the spammers themselves, it falls on us to deal with it.

Therefore, so far, and by far, the best solution is James Seng's Bayesian comment/ping filter. If we build this out in a distributed fashion and get the blog-makers to integrate it, it will be massively powerful and effective. Jay's Blacklist system is also good (both did magnificent jobs on their respective MT plugins, BTW), but is much more labor intensive, especially in the long run.

My 2 cents. ;)

(see, now if I were blind or drunk.. or blind drunk, I could not have posted this comment... ;)

Posted by: Boris on October 23, 2003 02:29 PM

Hi Boris :)

Hmm, actually I use Bayesian filtering for my email (via SpamAssassin in one case, and Mozilla in the other), and I'm quite pleased with it. I'll look into Seng's implementation. My gut feeling, though, is that the problem of accurately filtering blog comment spam via the Bayesian method is much harder than filtering email spam. That is, I would expect a lot more false positives.

I also don't think that asking a human to prove she is a human is onerous. I've come across a number of these tests on web forms and never once thought "What a horrible burden I'm being forced to bear!" I just quickly answered the little test and moved on. A suite of tests accessible by any user would be a fine solution, and the false positives would be effectively zero (where positive means marked as spam).

Posted by: Dav on October 23, 2003 07:30 PM

you know, sometimes I read your journal because you do so much social-wise (and I don't have a life so I enjoy it when others do) and other times I read it because crikey, it's better than reading a computing magazine. This is one of those times.

Posted by: Laura on October 23, 2003 07:35 PM

Ello again!

Well, the beauty of Seng's MT-Bayesian is that you don't actually delete potential spam unless you want to. To be more precise, you can use the special tags which control whether or not it gets displayed, or you don't. It also gives you a nice interface (fully integrated into MT's) where it shows you all comments and you can decide whether or not to delete them (as well as "train" the system).

Also, keep in mind there are currently various forms of Blog Comment Spam. The first distinction is "human-submitted" or "bot-submitted". Right now it still seems to be, respectively, a 70/30 split. So you've got an actual human sitting there submitting to your blog. (Crazy, no?)

For sure we will see more bots. Of course.

As for the burden, don't think just of yourself... Think of a potentially huge amount of people (if not today, then in the future...)... and accumulated "man-hours" of "extra effort" just to comment on a blog entry...

Just some thoughts :)

must say though I DO like the little graphic your thing produces.. looks like chainmaille... very Medieval, very "hark! who goes there?! what is the secret word!?" Which is what it is isn't it... Think about it... The normal "name, email, URL" is a totally open and hackable "who goes there?".. "Why I am your king of course!" No authentication. "What is teh secret phrase?" "yacksblood!" (i'm sure if the will is strong enough these folks will build image readers that can read the code... and with blogging's exponential growth, the will just might get there...).

Ideally? Web-based decentralised encrypted authentication. For now? Filters.

That's my story and I'm sticking to it. ;)

Posted by: Boris on October 23, 2003 11:44 PM

Holy Crap, Boris man.

When you're right you're right. Minutes before you posted your last comment I got ANOTHER FREAKING BLOG COMMENT SPAM which must have been entered by a scum sucking human. That wasn't you just trying to underline your point, was it ;) ?

Sigh.

I agree totally with your "ideally a decentralised encrypted authentication" comment. I've been ranting about something similar for months now actually. Maybe I already ranted about that in Tokyo?

OK, well I'm still planning to look into the Bayesian thing (just been busy) and I'll probably give it a shot next.

Posted by: Dav on October 24, 2003 08:14 AM

> As for alternative methods of validation, that is a
> good point, and I have already considered it.
> In fact if you read my blog posting I specifically
> mention audio validation.

Great to hear that, Dav.

> Thanks for your comments. I hope you don't think
> I'm trying to compete with your wonderful utility,
> especially as I felt it best to mention MT Blacklist
> in my posting since I believe it has merits over
> the Sapience solution. I'm just playing with
> alternative methods

As I've said many times in reference to James' Bayesian method, this is not a competition. The more solutions we have at our disposal, the more prepared we will be to meet the attack head on. I responded only out of social and accessibility concerns. I only badmouth bad ideas, not "competing" ones. :-)

> Jay's Blacklist system is also good (both did
> magnificent jobs on their respective MT plugins,
> BTW), but is much more labor intensive, especially
> in the long run.

Oh, Boris... I'm disappointed... You haven't been paying attention. See answer to question 4, Ben's reference to MT-Blacklist and the section entitled How to make this better.

In the long term, maintenance will be trivially easy...

Posted by: Jay Allen on October 24, 2003 09:43 AM

Dav: haha, no that wasn't me. As for the authentication, it's possible we discussed it at Gen's party. I've been ranting about it for months myself. ;)

Jay: busted! There is way too much on your blog to keep track of. My apologies. I will say this though: MT-Blacklist and MT-Bayesian, combined, into a single package, peer-to-peer enabled, would be formidable. :)

Posted by: Boris on October 24, 2003 02:58 PM

> Jay: busted! There is way too much on your blog to keep track of.

That is a compliment and also a valid criticism. You have just motivated me to finally implement search on my blog. One would think that, having developed it, it would already be there, but alas, the cobbler's children...

> I will say this though: MT-Blacklist and MT-Bayesian, combined, into a single package, peer-to-peer enabled, would be formidable. :)

I really should stop posting anywhere else on the web, shouldn't I? See my reply on Slashdot.

Thank God for Google... :-)

Posted by: Jay Allen on October 25, 2003 04:35 AM

Wow, that Jay Allen guy is an amazingly narcissitic jackass!

Posted by: Seth on October 25, 2003 07:43 AM

Dav,

It's always cool to check out your blog, and see what you're up to! Very inspirational, and you always seem to be doing such cool stuff! The anti-spam thing is great.... You guys are like the valiant defenders of the net, and those of us less tech savvy salute you! : )

Speaking of less tech savvy, I'll probably be emailing you at some point with questions about how to post pictures from your phone to your blog... my Sprint PCS phone finally bit the bullet, and it looks like I'm going to finally have to break down and buy one of these new camera phones (which of course means I have to give up my current "no contract" Sprint plan... that's how long I've had it!).

Sorry for the digression, but I had to write something, right! : )

t

Posted by: Thomas on October 25, 2003 10:05 AM

Hey Seth, blow me and get a dictionary... And you're wrong, but you aren't worth my time to prove it.

Posted by: Jay Allen on October 27, 2003 08:53 AM

Very cool.

Anti-spammism techno stuff, surfing and open source software and a blog all on one site ;-)

I actually happened on your page to check out stuff about surfing in Panama.

I think I'll try and buy a used board there, similar to your trip.

Posted by: patrick on October 30, 2003 10:36 AM
