Everyone who does even the smallest amount of SEO appreciates the importance of outbound links. Black hat SEO monkeys are no exception. But rather than sending one of those annoying ‘will you be my friend’ emails to request a link, they just hack away until they find a route into your site and place the links there themselves.

Clearly the best approach would be to lock down your website to such a degree that you’d need a sledgehammer to get in, but in the read/write web of 2.0 and beyond, that would be like never leaving your house in case you don’t see a bus coming the other way. We have to keep our sites open for conversation and participation.

But rather than sitting on your hands and hoping for the best, why not keep an eye on all the outbound links coming from your site? That’s exactly what this neat little script does. You can feed it any website you like and it will crawl through the site, logging all the outbound links it finds along the way. Once it’s finished, it sends you a summary email listing all the external links on your site.
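The core idea is simple: a link counts as external when its host differs from the host of the site being scanned, and relative links (with no host at all) are internal by definition. As a rough sketch of that test (is_external() is an illustrative helper, not part of the script itself):

```php
<?php
// Sketch of the internal/external check the crawler relies on.
// A link is external when its host component differs from the base host;
// relative links have no host component and are treated as internal.
function is_external($href, $base_host)
{
    $parts = parse_url($href);
    if (empty($parts['host'])) {
        return false; // relative link, stays on the same site
    }
    return strcasecmp($parts['host'], $base_host) !== 0;
}
```

The case-insensitive comparison is deliberate: hostnames are case-insensitive, so `WWW.MYSITE.COM` and `www.mysite.com` should count as the same site.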

You can even add a ‘black list’ of sites you want to watch out for. Then if one of these sites is spotted, the script can send you a text message alerting you immediately. You’ll have to buy a bulk load of SMS credits from Clickatell, but they’re not expensive and the Clickatell API is very quick and easy to use; we use it for uptime monitoring alerts as well.
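The Clickatell call is just an HTTP GET against their legacy http/sendmsg endpoint, with your credentials and the message in the query string. A minimal sketch of building that URL (the credentials below are placeholders, and build_sms_url() is an illustrative helper rather than part of the script):

```php
<?php
// Build the Clickatell legacy HTTP API URL used to send one SMS.
// Parameter names (user, password, api_id, to, text) match their
// http/sendmsg interface; http_build_query handles the URL encoding.
function build_sms_url($user, $password, $api_id, $to, $text)
{
    return 'http://api.clickatell.com/http/sendmsg?'
        . http_build_query(array(
            'user'     => $user,
            'password' => $password,
            'api_id'   => $api_id,
            'to'       => $to,
            'text'     => $text,
        ));
}

// Fetching the resulting URL (with cURL or file_get_contents) sends the message:
// $result = file_get_contents(build_sms_url('me', 'secret', '12345', '447999999999', 'Alert!'));
```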

So here’s my script for you to play with. It’s a command line script, so in its current form you’ll need to run it via an SSH terminal or something similar, although it wouldn’t be hard to make it accessible through a browser.

Please bear in mind this is only a very quick script that I threw together one evening, so it comes with no guarantee or support; use it at your own risk (all the usual disclaimers ;). It’s also rather basic and could do with a fair amount of tuning, but it does the job.

To execute it, make sure it’s got exec privileges (e.g. chmod 755), then type:

./spamcheck.php http://www.sitetoscan.com
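If you’d rather not run it by hand, a crontab entry will do the job on a schedule. The path, log file and timing below are just an example:

```
# Run the scan every night at 2am and append the output to a log file
0 2 * * * /path/to/spamcheck.php http://www.sitetoscan.com >> /var/log/spamcheck.log 2>&1
```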

#!/usr/bin/php
<?php
/*
* Simple script to crawl a website and report on external links
*
* @category security
* @author Tom Freeman
* @copyright 2011 The Authors
* @link https://www.18aproductions.co.uk
*/

print 'Script started: '.date('jS M D H:i:s')."\n";
echo "----------------------------------------------------\n\n";

if ($argc < 2)
{
    echo "Error: Please supply a website to scan\n";
    exit(1);
}

// Define some basic settings for the script
$config['site_name'] = '18a Productions';
$config['report_email'] = array('support@18aproductions.co.uk');
$config['debug'] = false;

// SMS reporting is available if you open an API account with
// Clickatell and buy an SMS bundle http://www.clickatell.com/
$config['send_sms'] = true;
$config['report_sms'] = array('447999999999'); // include the 44 bit at the front for UK mobiles
$config['sms_username'] = '';
$config['sms_password'] = '';
$config['sms_appid'] = '';
$config['max_crawls'] = 50; // The maximum number of pages on your site you want to crawl

// These are the URLs to watch out for on your site
$config['uri_watch_list'] = array(
    'basicpills',
    'generic-ed-pharmacy',
    'getrxpills',
    'rx-prices',
    'antibiotics-shop',
);

$uri_to_visit = array($argv[1]);
$uri_visited = array();
$base_parts = parse_url($uri_to_visit[0]);

echo ($config['debug']) ? 'Base host: '.$base_parts['host']."\n" : '';

$uri_external = array();
$i = 0;
$sleep_count = 0;

while ( count($uri_to_visit) > 0 and $i < $config['max_crawls'] )
{
    $sleep_count++;
    $i++;
    if ($sleep_count == 20)
    {
        // Be nice to web server admins and sleep
        // for 10 seconds every 20 page requests
        sleep(10);
        $sleep_count = 0;
    }

    $uri = array_pop($uri_to_visit);
    echo 'Checking '.$uri.' ('.$i.")\n";
    $uri_visited[] = $uri;

    $html = get_web_page($uri);

    $dom = new DOMDocument();
    @$dom->loadHTML($html['content']);

    $nodes = $dom->getElementsByTagName('a');

    foreach ($nodes as $node)
    {
        $href_value = $node->getAttribute('href');

        // Internal or external?
        $parts = parse_url($href_value);

        if (empty($parts['host']))
        {
            $href_value = $base_parts['scheme'].'://'.$base_parts['host'].$href_value;
            $parts = parse_url($href_value);
        }

        if ($base_parts['host'] == $parts['host'])
        {
            echo ($config['debug']) ? 'Internal link: '.$href_value : '';
            if ( !in_array($href_value, $uri_to_visit) and !in_array($href_value, $uri_visited) )
            {
                echo ($config['debug']) ? " - added to visit list\n" : '';
                array_push($uri_to_visit, $href_value);
            }
            else
            {
                echo ($config['debug']) ? " - skipped\n" : '';
            }
        }
        else
        {
            echo ($config['debug']) ? 'External link: '.$href_value."\n" : '';
            if ( !in_array($href_value, $uri_external) )
            {
                array_push($uri_external, $href_value);

                foreach ($config['uri_watch_list'] as $uri_to_watch)
                {
                    $pattern = preg_quote($uri_to_watch, '/');
                    if (preg_match('/'.$pattern.'/', $parts['host'], $matches))
                    {
                        echo 'Found banned site: '.$href_value."\n";
                        $msg = 'Found link to: '.$href_value.chr(13);
                        $msg .= 'Found on: '.$uri.chr(13);

                        foreach ($config['report_email'] as $email)
                        {
                            mail($email, 'Banned Site Found on '.$config['site_name'], $msg);
                        }

                        if ($config['send_sms'] == true)
                        {
                            foreach ($config['report_sms'] as $smsno)
                            {
                                $sms_url = "http://api.clickatell.com/http/sendmsg?";
                                $sms_url .= "user=".$config['sms_username'];
                                $sms_url .= "&password=".$config['sms_password'];
                                $sms_url .= "&api_id=".$config['sms_appid'];
                                $sms_url .= "&to=".$smsno;
                                $sms_url .= "&text=".urlencode($msg);

                                $sms_result = get_web_page($sms_url);
                                if ($config['debug']) { print_r($sms_result); }
                            }
                        }
                    }
                }
            }
        }
    }

    echo ($config['debug']) ? "To visit:\n" : '';
    if ($config['debug']) { print_r($uri_to_visit); }

    echo ($config['debug']) ? "Visited:\n" : '';
    if ($config['debug']) { print_r($uri_visited); }
}

echo "\n----------------------------------------------------\n";
echo "External Links\n";
print_r($uri_external);

$msg = "The following external links were found on the ".$config['site_name']." website\n";

for ($i = 0; $i < count($uri_external); $i++)
{
    $msg .= $uri_external[$i]."\n";
}

foreach ($config['report_email'] as $email)
{
    mail($email, $config['site_name'].' External Link Report', $msg);
}

echo 'Script finished: '.date('jS M D H:i:s')."\n";

/**
 * Get a web file (HTML, XHTML, XML, image, etc.) from a URL. Return an
 * array containing the HTTP server response header fields and content.
 */
function get_web_page( $url )
{
    $options = array(
        CURLOPT_RETURNTRANSFER => true,  // return web page
        CURLOPT_HEADER         => false, // don't return headers
        CURLOPT_FOLLOWLOCATION => true,  // follow redirects
        CURLOPT_ENCODING       => "",    // handle all encodings
        CURLOPT_USERAGENT      => "Uptime Robot", // who am i
        CURLOPT_AUTOREFERER    => true,  // set referer on redirect
        CURLOPT_CONNECTTIMEOUT => 30,    // timeout on connect
        CURLOPT_TIMEOUT        => 30,    // timeout on response
        CURLOPT_MAXREDIRS      => 10,    // stop after 10 redirects
    );

    $ch = curl_init( $url );
    curl_setopt_array( $ch, $options );
    $content = curl_exec( $ch );
    $err = curl_errno( $ch );
    $errmsg = curl_error( $ch );
    $header = curl_getinfo( $ch );
    curl_close( $ch );

    $header['errno'] = $err;
    $header['errmsg'] = $errmsg;
    $header['content'] = $content;
    return $header;
}