
Stop the Black Hatters

17th March 2011 03:00 by tom

Everyone who does even the smallest amount of SEO appreciates the importance of inbound links. Black hat SEO monkeys are no exception. But rather than sending one of those annoying 'will you be my friend' emails to request a link, they just hack away until they find a route into your site and place the links there themselves.

Clearly the safest approach would be to lock down your website to such a degree that you'd need a sledgehammer to get in, but in the read/write web of 2.0 and beyond, this would be like never leaving your house in case you fail to see a bus coming. We have to keep our sites open for conversation and participation.

But rather than sitting on your hands and hoping for the best, why not keep an eye on all the outbound links coming from your site? That's why I've written a neat little script which does just that. You can feed it any website you like, and it will crawl through the site, logging all the outbound links it finds along the way. Then once it's finished, it sends you a summary email with all the external links on your site.
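To give a feel for the core idea, here is a minimal sketch of how external links can be picked out of a single page. It assumes PHP's DOM extension is available, and the function name find_external_links is made up for this illustration rather than taken from the full script.

```php
<?php
// Hypothetical helper: return the external link targets found in an HTML string.
function find_external_links(string $html, string $base_host): array
{
    $dom = new DOMDocument();
    // Real-world markup is rarely perfect, so suppress parser warnings
    @$dom->loadHTML($html);

    $external = array();
    foreach ($dom->getElementsByTagName('a') as $node) {
        $href = $node->getAttribute('href');
        $host = parse_url($href, PHP_URL_HOST);

        // No host means a relative link, which stays on the same site
        if (empty($host)) {
            continue;
        }

        // A different host means the link leaves the site; record it once
        if (strcasecmp($host, $base_host) !== 0 && !in_array($href, $external)) {
            $external[] = $href;
        }
    }
    return $external;
}

$html = '<html><body>'
      . '<a href="/about">About</a> '
      . '<a href="http://www.sitetoscan.com/contact">Contact</a> '
      . '<a href="http://example.org/pills">Pills</a>'
      . '</body></html>';

print_r(find_external_links($html, 'www.sitetoscan.com'));
```

A crawler then only needs to repeat this for every internal page it discovers, which is roughly what the full script does.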

You can even add a 'black list' of sites you want to watch out for. Then if one of these sites is spotted, the script can send you a text message alerting you immediately. You'll have to buy a bulk load of SMS credits from Clickatell, but they're not expensive and the Clickatell API is very quick and easy to use - we use it for uptime monitoring alerts as well.
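The black list check itself can be very simple. The sketch below assumes a plain case-insensitive substring match against the link's host name, using stripos rather than a regular expression; the function name match_watch_list and the example host are invented for illustration.

```php
<?php
// Hypothetical helper: return the first watch-list entry that appears
// in the host of $url, or null when nothing matches.
function match_watch_list(string $url, array $watch_list): ?string
{
    $host = parse_url($url, PHP_URL_HOST);
    if (empty($host)) {
        return null; // relative or malformed link, nothing to check
    }

    foreach ($watch_list as $needle) {
        // stripos() does a case-insensitive substring search
        if (stripos($host, $needle) !== false) {
            return $needle;
        }
    }
    return null;
}

$watch_list = array('basicpills', 'getrxpills', 'rx-prices');

var_dump(match_watch_list('http://www.getrxpills.example/cheap', $watch_list));
var_dump(match_watch_list('http://www.18aproductions.co.uk/', $watch_list));
```

Matching on the host only, rather than the whole URL, avoids false alarms when a watched word happens to appear in a page path on a perfectly innocent site.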

So here's my script for you to play with. In its current form this is a command line script, so you'll need to run it via an SSH terminal or something similar, although it wouldn't be hard to make it accessible through a browser.

Please bear in mind this is only a very quick script that I threw together one evening, so it comes with no guarantee, support or anything else. Use it at your own risk (all the usual disclaimers ;). It's also rather basic and could do with a fair amount of tuning, but it does the job.

To execute it, make sure it's got exec privileges, i.e. chmod 755 or something similar, then type:

./spamcheck.php http://www.sitetoscan.com

#!/usr/bin/php
<?php
/*
 * Simple script to crawl a website and report on external links
 *
 * @category  security
 * @author    Tom Freeman
 * @copyright 2011 The Authors
 * @link      http://www.18aproductions.co.uk
 */

print 'Script started: '.date('jS M D H:i:s')."\n";
echo "----------------------------------------------------\n\n";

if ($argc<2) {
    echo "Error: Please supply a website to scan\n";
    exit(1); // non-zero exit status tells the shell something went wrong
}

// Define some basic settings for the script
$config['site_name'] = '18a Productions';
$config['report_email'] = array('support@18aproductions.co.uk');
$config['debug'] = false;

// SMS reporting is available if you open an API account with
// clickatell and buy an SMS bundle http://www.clickatell.com/
$config['send_sms'] = true;
$config['report_sms'] = array('447999999999'); // include the 44 bit at the front for UK mobiles
$config['sms_username'] = '';
$config['sms_password'] = '';
$config['sms_appid'] = '';

$config['max_crawls'] = 50; // The maximum number of pages on your site you want to crawl

// These are the urls to watch out for on your site
$config['uri_watch_list'] = array(
    'basicpills',
    'generic-ed-pharmacy',
    'getrxpills',
    'rx-prices',
    'antibiotics-shop',
);

$uri_to_visit = array($argv[1]);
$base_parts = parse_url($uri_to_visit[0]);
if (empty($base_parts['scheme']) or empty($base_parts['host'])) {
    echo "Error: Please supply a full URL including http://\n";
    exit(1);
}
echo ($config['debug']) ? 'Base host: '.$base_parts['host']."\n" : '';

$uri_visited = array(); // must exist before the first in_array() check below
$uri_external = array();
$i=0;
$sleep_count=0;

while (count($uri_to_visit)>0 and $i<$config['max_crawls']) {
    $sleep_count++;
    $i++;
    if ($sleep_count==20) {
        // Be nice to web server admins and sleep
        // for 10 seconds every 20 page requests
        sleep(10);
        $sleep_count=0;
    }

    $uri = array_pop($uri_to_visit);
    echo "Checking ".$uri.' ('.$i.")\n";
    $uri_visited[] = $uri;

    $html = get_web_page($uri);
    if (empty($html['content'])) {
        // Request failed or the page was empty, so there is nothing to parse
        continue;
    }

    $dom = new DOMDocument();
    @$dom->loadHTML($html['content']);
    $nodes = $dom->getElementsByTagName('a');

    foreach ($nodes as $node) {
        $href_value = $node->getAttribute('href');

        // Skip empty links, in-page anchors and non-web schemes (mailto:, javascript: etc.)
        if ($href_value=='' or $href_value[0]=='#' or preg_match('/^(?!https?:)[a-z][a-z0-9+.-]*:/i',$href_value)) {
            continue;
        }

        // Internal or external?
        $parts = parse_url($href_value);
        if (empty($parts['host'])) {
            // Relative link: resolve it against the site root (crude, but good enough here)
            $href_value = $base_parts['scheme'].'://'.$base_parts['host'].'/'.ltrim($href_value,'/');
            $parts = parse_url($href_value);
        }
        if (empty($parts['host'])) {
            continue; // malformed link that parse_url() cannot handle
        }

        if ($base_parts['host']==$parts['host']) {
            echo ($config['debug']) ? 'Internal link: '.$href_value : '';
            if ( !in_array($href_value,$uri_to_visit) and !in_array($href_value,$uri_visited) ) {
                echo ($config['debug']) ? " - added to visit list\n" : '';
                array_push($uri_to_visit,$href_value);
            } else {
                echo ($config['debug']) ? " - skipped\n" : '';
            }
        } else {
            echo ($config['debug']) ? 'External link: '.$href_value."\n" : '';
            if ( !in_array($href_value,$uri_external) ) {
                array_push($uri_external,$href_value);

                foreach ($config['uri_watch_list'] as $uri_to_watch) {
                    // Quote the delimiter too, and ignore case as host names are case-insensitive
                    $pattern = preg_quote($uri_to_watch,'/');
                    if (preg_match('/'.$pattern.'/i',$parts['host'],$matches)) {
                        echo 'Found banned site: '.$href_value."\n";
                        $msg = 'Found link to: '.$href_value.chr(13);
                        $msg .= 'Found on: '.$uri.chr(13);

                        foreach ($config['report_email'] as $email) {
                            mail($email,'Banned Site Found on '.$config['site_name'],$msg);
                        }

                        if ($config['send_sms']==true) {
                            foreach ($config['report_sms'] as $smsno) {
                                $sms_url = "http://api.clickatell.com/http/sendmsg?";
                                $sms_url .= "user=".urlencode($config['sms_username']);
                                $sms_url .= "&password=".urlencode($config['sms_password']);
                                $sms_url .= "&api_id=".$config['sms_appid'];
                                $sms_url .= "&to=".$smsno;
                                $sms_url .= "&text=".urlencode($msg);
                                $sms_result = get_web_page($sms_url);
                                if ($config['debug']) { print_r($sms_result); }
                            }
                        }
                    }
                }
            }
        }
    }

    echo ($config['debug']) ? "To visit:\n" : '';
    if ($config['debug']) { print_r($uri_to_visit); }
    echo ($config['debug']) ? "Visited:\n" : '';
    if ($config['debug']) { print_r($uri_visited); }
}

echo "\n----------------------------------------------------\n";
echo "External Links\n";
print_r($uri_external);

$msg = "The following external links were found on the ".$config['site_name']." website\n";
for ($i=0;$i<count($uri_external);$i++) {
    $msg .= $uri_external[$i]."\n";
}
foreach ($config['report_email'] as $email) {
    mail($email,$config['site_name'].' External Link Report',$msg);
}

echo 'Script finished: '.date('jS M D H:i:s')."\n";

/**
 * Get a web file (HTML, XHTML, XML, image, etc.) from a URL. Return an
 * array containing the HTTP server response header fields and content.
 */
function get_web_page( $url )
{
    $options = array(
        CURLOPT_RETURNTRANSFER => true,           // return web page
        CURLOPT_HEADER         => false,          // don't return headers
        CURLOPT_FOLLOWLOCATION => true,           // follow redirects
        CURLOPT_ENCODING       => "",             // handle all encodings
        CURLOPT_USERAGENT      => "Uptime Robot", // who am i
        CURLOPT_AUTOREFERER    => true,           // set referer on redirect
        CURLOPT_CONNECTTIMEOUT => 30,             // timeout on connect
        CURLOPT_TIMEOUT        => 30,             // timeout on response
        CURLOPT_MAXREDIRS      => 10,             // stop after 10 redirects
    );

    $ch = curl_init( $url );
    curl_setopt_array( $ch, $options );
    $content = curl_exec( $ch );
    $err     = curl_errno( $ch );
    $errmsg  = curl_error( $ch );
    $header  = curl_getinfo( $ch );
    curl_close( $ch );

    $header['errno']   = $err;
    $header['errmsg']  = $errmsg;
    $header['content'] = $content;
    return $header;
}
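If you fancy tidying up the SMS part of the script, one option is to let http_build_query assemble and encode the query string instead of concatenating it by hand. This is only a sketch: it assumes the same Clickatell endpoint and parameter names that the script above uses, and the function name build_sms_url is made up for the example.

```php
<?php
// Hypothetical helper: build the SMS request URL from the script's config values.
function build_sms_url(array $config, string $to, string $text): string
{
    $params = array(
        'user'     => $config['sms_username'],
        'password' => $config['sms_password'],
        'api_id'   => $config['sms_appid'],
        'to'       => $to,
        'text'     => $text,
    );

    // http_build_query() URL-encodes every value, so special characters
    // in the password or message cannot break the query string
    return 'http://api.clickatell.com/http/sendmsg?'.http_build_query($params, '', '&');
}

// Placeholder credentials, for illustration only
$config = array(
    'sms_username' => 'demo',
    'sms_password' => 'p&ss',
    'sms_appid'    => '123',
);

echo build_sms_url($config, '447999999999', 'Found link')."\n";
```

The resulting URL can then be passed to get_web_page() exactly as before.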


Comments

  • Wilfredo 23rd February 2012
    Wonderful site. Lots of useful info here. I'm sending it to several buddies and also sharing it on Delicious. And certainly, thanks for your effort!