A few people have emailed me recently disappointed by the level of spam postings on pastebin.com. I’ve never really understood why spammers bother, but as they are bothering in increasing numbers it was time to take some action.
Last night I built in some spam filtering which has caught hundreds of posts since going live. I also added a “report spam” link which has flagged over 500 posts in the past 20 hours. By iteratively tweaking the spam filter to identify the legitimately flagged posts, I’ve been able to quickly delete a lot of older spam posts.
Hopefully this will make pastebin look like a well tended garden rather than a run-down wasteland! Comments welcome…
Though we all love the spam… Oh come on.. you know you do.. :>
Anyhow, Paul, check out http://www.projecthoneypot.org.. It is quite effective, and you can redistribute it with your code as well (though I think an end user has to do a bit to get it going again, like getting their own account). It could complement the new methods you implemented yourself.
“SPAM SPAM SPAM SPAM…..” – Oh the wonderful Monty Python tune!
Well done that man!
Easy way to kill most spam on the spot:
– rename the name/link/etc. fields to something weird
– put this in your form:
(Leave these fields blank!)
– put this in the site stylesheet:
div.trap { display: none; visibility: hidden }
Then when processing the form, if there’s anything in the ‘name’ or ‘link’ fields, drop the post. This method alone has cut my spam by about 90-95%. I have no captcha, and no wonky heuristics that are eventually bound to flag legitimate data as spam at some point or another. Just a couple of innocent-looking fields that spambots fill in because they look important.
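The server-side half of this trick can be sketched roughly as follows. This is only an illustration in Python, not anyone's actual code: the decoy field names `name` and `link` come from the comment above, while the function name `is_bot_submission` and the renamed real fields (`x7_author`, `x7_body`) are made up for the example.

```python
# Decoy field names from the comment above. They sit inside the hidden
# div.trap, so a human using a CSS-capable browser never fills them in.
TRAP_FIELDS = ("name", "link")


def is_bot_submission(form: dict) -> bool:
    """Return True if any decoy field contains non-whitespace text."""
    return any(form.get(field, "").strip() for field in TRAP_FIELDS)


# A human leaves the hidden fields empty; a naive bot fills in everything.
# The "x7_" names stand in for the renamed real fields (hypothetical).
human_post = {"name": "", "link": "", "x7_author": "Paul", "x7_body": "print('hi')"}
bot_post = {"name": "Cheap Pills", "link": "http://spam.example", "x7_body": "buy now"}

print(is_bot_submission(human_post))  # False
print(is_bot_submission(bot_post))    # True
```

A post for which the check returns True would simply be dropped, as described above.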
… oh, ffs, it stripped out the html. pretend these parens are angle brackets. 😛
(div class=”trap”)
(Leave these fields blank!)
Goddamnit.
I give up.
Hah, it’s OK, I understood. I should try it, though I fear it would only work for a short period 🙁
In response to lordelph & Anonymous, it actually works for a very long period.. Still working for me. It even works with (input type=”hidden”) for some of the absolutely stone stupid bots.
Thanks for all the comments, I’ve continued to tweak it and the amount of spam (and spam reports) has fallen dramatically. Will keep an eye on it!
Ok, so please forgive the misplacement of this comment, but I can’t find anywhere else to put this. Anyway, I understand PasteBin is GeSHi based, so I thought I’d contribute a PowerShell syntax …
http://huddledmasses.org/jaykul/powershell-highlighting-for-geshi/
Implement Akismet? It *could* be buggy for pasting code, but could work as well 🙂
How about a picture? Maybe you should add some kind of engine that generates images with words or totally random letters to retype? 90% of the spam bots would be gone.
msg: I’m strongly against pictures, I really dislike filling in those things and it would at least drive me away from this site. The hidden form fields work with most bots, and if that’s not enough then a bit of javascript (perhaps in combination with ajax) is sufficient to kill most spambots.
Off-topic, but a request indeed. How about being able to add something to the command line to highlight specific lines? like http://pastebin.com/abcacbacb@5,16-32,79-150 or so (to highlight lines 5, 16-32, and 79-150 ;)). Would be awesome 🙂
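For what it's worth, parsing a suffix like that is not much work. A minimal sketch in Python, assuming the `@5,16-32,79-150` syntax proposed above (the function name `parse_highlight_spec` is invented here, and nothing like this is known to exist in pastebin):

```python
def parse_highlight_spec(spec: str) -> set:
    """Turn a spec such as '5,16-32,79-150' into a set of line numbers."""
    lines = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start, end = part.split("-", 1)
            first, last = int(start), int(end)
            if first > last:
                first, last = last, first  # accept reversed ranges like 32-16
            lines.update(range(first, last + 1))
        else:
            lines.add(int(part))
    return lines


# Split the paste id from the highlight spec at the '@' sign.
paste_id, _, spec = "abcacbacb@5,16-32,79-150".partition("@")
highlighted = parse_highlight_spec(spec)
print(paste_id, 5 in highlighted, 33 in highlighted)
```

Non-numeric input raises `ValueError` here; a real handler would probably ignore a malformed spec rather than fail the request.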
@msg: The anti-spam measures are working very well, but I’ve had one report of a legitimate post getting flagged as spam, so I will either relax things a little or add a CAPTCHA only for those posts with a spam smell to them!
@Noccy: interesting, but would anyone really use it? I’ve got an idea for making the existing line highlighting features easier to use though…
(Not reading the previous comments, sorry, it’s just a quick note). A few moments ago I boldly clicked on a Spam report link and seemingly reported a valid code. Let me explain:
You know, as a webmaster I have a slightly daring attitude when speaking of web forms. I often click on things just to see if they offer any confirmation or how they deal with invalid input. And, to my regret, your markup has not offered me any confirmation prompt, so that I could verify my decision.
Could you possibly add an … for your anti-spam links? Hope that helps the site. 🙂
Oops, an overactive code parser. Let’s try again: Could you possibly add an a onclick=confirm(“Sure?”) href=http:// … for your anti-spam links? Hope that helps the site. 😉
The “report spam” link is immediate by design, to encourage its use. False positives are relatively rare and are ignored.
I agree with the hidden fields system. Other people have used it in their comment systems, and it got rid of most of their spam. The spam bots don’t think to check the CSS or sometimes even the field type. I strongly suggest this over CAPTCHA, as a CAPTCHA can irritate users more than deterring the spam bots is worth.
BOO to spammers… great job 😀
To stop spam on a site. Just post a banner saying no spam allowed.
MrLight, if only it were that easy!
I don’t know if I’m the only person saying this, but what exactly DEFINES spam? It could be people just posting things to transfer them elsewhere…
Before I put the spam filter on, there were hundreds of posts which were just lists of links and keywords for typical spam enterprises, submitted multiple times.
If you didn’t see it before, trust me, it is pretty obvious when a post is spam!
For those posts flagged with the “Report spam” feature, I read them and think “could someone conceivably want to send this to another individual for comment or review?”. If I see a new pattern emerging, I tweak the automated filter appropriately.
I’ve only had one report of a false positive so far, but I’m happy to hear of such incidents…
Hey hey hey 🙂
I use pastebin with cURL scripts… To upload settings etc. from different computers. Please don’t restrict it too much 🙂 at least for private domains… The really simple (and therefore scriptable) page structure (form) is…delicious.
By the way, is there a way to retrieve posts that have fallen off the “Recent Posts” box?
Thank you for this service 8)
Props to pastebin.
Report spam link is totally wrong for a simple reason: it’s a link. A GET request shouldn’t cause an action. If the request has a side effect, use POST (so same goes for the delete link). I can easily see a crawling bot marking all posts as spam.
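To make the GET-versus-POST point concrete, here is a rough Python sketch of a handler that refuses to change state on anything but POST. It illustrates the suggestion in this comment only, not how pastebin.com behaves; `handle_report_spam` and the in-memory `flagged_posts` set are hypothetical stand-ins.

```python
from http import HTTPStatus

# Stand-in for real storage of flagged paste ids (hypothetical).
flagged_posts = set()


def handle_report_spam(method: str, paste_id: str) -> HTTPStatus:
    """Flag a paste only when the request method is POST."""
    if method.upper() != "POST":
        # GET and HEAD are "safe" methods: following a link must not
        # modify anything, so a crawler cannot flag posts by accident.
        return HTTPStatus.METHOD_NOT_ALLOWED
    flagged_posts.add(paste_id)
    return HTTPStatus.OK


print(handle_report_spam("GET", "abc123").value)   # 405
print(handle_report_spam("POST", "abc123").value)  # 200
```

On the page itself, that would mean a small form with a submit button instead of a plain link.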
Hmm actually no… Because it’s even worse: Why The Hell Javascript? What good reason do you have to block people from reporting spam when using lynx from a headless machine?
Totally wrong eh? Sorry about that. Thanks for the feedback though!
http://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol#Safe_methods
Thanks again. But I really did want it to work the way it does. Sorry it annoys you so much though.
When will the anti-spam markup be available for download?
any chance I could get a download of the new code so I don’t have to write it myself? My pastebin is overrun with spam : (
Why not add a CAPTCHA filter? That would stop most spammers…
Everybody hates CAPTCHAS. I think an agglomeration of inconspicuous methods (to the normal user) would be best.
For instance, setting a cookie and ensuring that the user sends it back will ensure that at least a somewhat functional browser is being used. (Not many bots will use cookies). If no cookie is sent back along with the paste form, you can display a kind red message to the user to enable cookies. (Cookies are necessary on an ever-increasingly dynamic Web!)
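A minimal sketch of that cookie round trip, using Python's standard `http.cookies` module. The cookie name `seen_form` and both function names are assumptions made up for the example; a fixed value like this is easy to forge, so it only filters out clients that ignore cookies altogether.

```python
from http.cookies import SimpleCookie

COOKIE_NAME = "seen_form"  # hypothetical cookie name


def make_set_cookie_value() -> str:
    """Build the Set-Cookie header value sent along with the paste form."""
    cookie = SimpleCookie()
    cookie[COOKIE_NAME] = "1"
    cookie[COOKIE_NAME]["path"] = "/"
    return cookie.output(header="").strip()


def sent_cookie_back(cookie_header: str) -> bool:
    """Check whether the form submission carried our cookie."""
    cookie = SimpleCookie()
    cookie.load(cookie_header or "")
    return COOKIE_NAME in cookie


print(make_set_cookie_value())
print(sent_cookie_back("seen_form=1; theme=dark"))  # True
print(sent_cookie_back(""))                         # False
```

When the check fails, the form would be shown again with the message asking the user to enable cookies, rather than silently dropping the post.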
Not forgetting the hidden input methods too, and slightly medieval methods such as blacklisting on multiple detections and heuristic confirmations.
As pastebin software becomes more popular, more spammers will tailor their bots to specifically target it and bypass your specific anti-spam techniques. This is where CAPTCHA must simply come in, until better methods of differentiating a human brain and electronic processor are developed.
I think the spammers post links in order to increase their Google rank?
I can’t post my logs any more, always flagged.
Can you post a few sample lines here, and I’ll tweak the spam detection…
“I can’t post my logs any more, always flagged.”
Same here, I can’t use my pastebin anymore. It says only: “Sorry, your post tripped our spam filter – let us know if you think this could be improved”
Thanks, now it’s normal again. 🙂
Have a check box that is “check this box if this is a spam post”. In the code make the checkbox look important and as though it must be clicked to post and bots will check that field and get flagged (or even better, banned).
Something like that already occurs. Most of the “spam” filtering is more about filtering posts I don’t want to be hosting.
I’ve found that when you put your comments on a separate page it seems to dramatically reduce the amount of comment spam, maybe because the comment page has a lower or zero page rank. It takes a bit away from the whole ‘flow’ of the blog, but you gotta weigh up the pros and cons.
Someone mentioned a honeypot earlier, as well as the problems with making the “report spam” and “delete” links GET requests instead of POST requests.
A simple solution is to make a “honeypot” anchor link in your invisible DIV. If the honeypot link gets followed by a “user” (of course, the text of the link should indicate that you shouldn’t click it, but this would only be visible to non-CSS browsers) then that’s an indication that you’re looking at a crawler that should be temporarily banned/ignored.
Meanwhile, to make sure that search engines work, make sure that the honeypot script (as well as the scripts that manage marking spam and deleting posts!) is listed in your robots.txt so that well-behaved crawlers know to ignore it.
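As a sketch of how the two pieces fit together, the Python below checks a candidate robots.txt with the standard `urllib.robotparser` module and temporarily bans any client that follows the honeypot link anyway. All paths (`/trap`, `/report-spam`, `/delete`), the host `pastebin.example`, and the names `is_allowed`, `handle_request` and `banned_ips` are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt listing the honeypot and the action scripts.
ROBOTS_TXT = """\
User-agent: *
Disallow: /trap
Disallow: /report-spam
Disallow: /delete
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(path: str) -> bool:
    """Would a well-behaved crawler be allowed to fetch this path?"""
    return parser.can_fetch("ExampleBot", "http://pastebin.example" + path)


banned_ips = set()  # stand-in for a real (expiring) ban list


def handle_request(ip: str, path: str) -> str:
    """Ban clients that follow the hidden honeypot link."""
    if path.startswith("/trap"):
        banned_ips.add(ip)
    return "ignored" if ip in banned_ips else "served"


print(is_allowed("/abc123"), is_allowed("/trap"))  # True False
print(handle_request("203.0.113.7", "/trap"))      # ignored
print(handle_request("203.0.113.7", "/abc123"))    # ignored
```

Crawlers that honour robots.txt never reach `/trap`, so only misbehaving bots (and curious users of non-CSS browsers) end up on the ban list.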
Yay I just spammed – Try and stop me now ! Muwhahaha
I like your pastebin. I use it often, but I hate the spam. Why don’t you use captcha? It’s easy to implement.
I like the pastebin the way it is. Mostly I deal with people asking for help in IRC while I’m at work. If I have a minute I help people out.
I wouldn’t take the time to deal with a captcha if it was implemented. If anything I would perhaps change the code so the captcha appears iff the first spam method triggers.
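That "CAPTCHA only iff the filter trips" idea might look something like the Python sketch below. The real filter's rules are not described anywhere in this thread, so `looks_spammy` is a deliberately crude placeholder (it just counts links), and the `Action` enum and `decide` function are likewise made up for the example.

```python
from enum import Enum


class Action(Enum):
    ACCEPT = "accept"
    CAPTCHA = "captcha"


def looks_spammy(text: str) -> bool:
    """Placeholder heuristic (an assumption): five or more links."""
    return text.lower().count("http://") >= 5


def decide(text: str, captcha_solved: bool = False) -> Action:
    """Ordinary posts go straight through; suspicious ones need a CAPTCHA."""
    if not looks_spammy(text):
        return Action.ACCEPT
    return Action.ACCEPT if captcha_solved else Action.CAPTCHA


log_paste = "Jan 01 12:00:00 host sshd[42]: session opened"
link_dump = "http://a.example " * 6

print(decide(log_paste).value)                       # accept
print(decide(link_dump).value)                       # captcha
print(decide(link_dump, captcha_solved=True).value)  # accept
```

Regular users would never see a CAPTCHA, and a false positive costs the poster one extra step instead of a rejected paste.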
Hi there,
Am I missing anything? I just downloaded and installed pastebin from http://pastebin.com/pastebin.tar.gz, only to realize that it hardly looks different from my older version.
I can’t see any means of spam detection or how spam could be avoided in pastebin-0.60.
Which version of pastebin are you talking about and if it’s not 0.60, where can I get the source?
Regards
Ralf
I haven’t packaged the latest release, will try to rectify that in the next few days….
I keep getting tripped as spam but my post is not spam. What’s the problem?
It would be great if you could upgrade the version, lordelph.
I don’t know why, but your crappy spam filter really is annoying sometimes. I wanted to post the contents of a Linux file and some console output, and your darn spam filter rejected my post as spam.