Writing a Simple Site Crawler
Introduction
Well, it’s been a long time since I’ve written a post on my own blog, in fact over 6 months! A lot has happened in my life since then, so if you know me and want to know more, drop me a tweet (http://twitter.com/mattstannard/) or e-mail matt@mattstannard.com.
The last thing I blogged about was writing some PHP to ping your Facebook Wall, read wall updates and reply to them automatically. I will continue with part two of that post, but since then I’ve written numerous other scripts to monitor Twitter and crawl websites, so I thought I’d share some of this.
So, what is this script for and why would I write it? Well, I wanted something that would find the internal links on a website and download the HTML content so I could run some processing on it at a later date. Obviously, I could do this in real time, but I wanted something that I could work with offline should the need arise, hence the download to MySQL.
The Crawler
The first part of the page I wrote was actually a form so I could enter any URL and crawl it. However, I am not going to include this – it’s easy enough to change and I don’t want to encourage people to randomly crawl other people’s sites.
Once I had a URL to crawl, I used the PHP DOMDocument class to pull the content from the website. This gave me the HTML and allowed me to use some built-in methods to select all of the links on a page (again, you can modify the script to parse JavaScript etc. if needed). Once I had my base page, there were several challenges I had to overcome:
- Ensure the crawler didn’t go off site – I didn’t really want it indexing Facebook, or indeed any other external content, as it would fill up my laptop
- Ensure the crawler didn’t get stuck
For the first, I created a method to extract the base domain from a URL. If a link was not relative to the root (i.e. it contained http://), I checked that its domain matched the current site’s domain. If it didn’t, I didn’t index it.
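As a rough illustration of this check, the sketch below pulls every href out of a piece of HTML with DOMDocument and keeps only the links that stay on the same host. It uses parse_url() rather than the string handling in the script further down, and the function name ExtractInternalLinks and the example.com addresses are made up for the example.

```php
<?php
// Hypothetical helper (not part of the original script): returns the absolute
// URLs of all links in $strHTML that stay on the host of $strPageURL.
function ExtractInternalLinks($strHTML, $strPageURL)
{
    $strHost = parse_url($strPageURL, PHP_URL_HOST);
    $arrLinks = array();

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so suppress the parser warnings
    @$doc->loadHTML($strHTML);

    foreach ($doc->getElementsByTagName('a') as $link)
    {
        $href = trim($link->getAttribute('href'));

        // Skip empty links, bookmarks and javascript: pseudo-links
        if ($href === '' || $href[0] === '#' || stripos($href, 'javascript:') === 0)
        {
            continue;
        }

        if ($href[0] === '/' && substr($href, 0, 2) !== '//')
        {
            // Root-relative link: prefix it with the current host
            $arrLinks[] = 'http://' . $strHost . $href;
        }
        elseif (parse_url($href, PHP_URL_HOST) === $strHost)
        {
            // Absolute link whose host matches the page being crawled
            $arrLinks[] = $href;
        }
    }
    return $arrLinks;
}

$strHTML = '<a href="/about">About</a>'
         . '<a href="#top">Top</a>'
         . '<a href="http://example.com/contact">Contact</a>'
         . '<a href="http://facebook.com/example">Facebook</a>';

print_r(ExtractInternalLinks($strHTML, 'http://example.com/index.html'));
```

Only the /about and /contact links survive here. Like the script itself, this sketch ignores links that are relative to the current folder (for example page.html), so treat it as a starting point rather than a complete URL resolver.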
For the second, I cheated slightly. I used MySQL and made the URL of the page being indexed a unique key. When a page was crawled, all of its links were put into MySQL with a crawled flag of “0” (i.e. they need to be crawled). If I had already “seen” the URL then it wouldn’t be added again and thus the list of pages to crawl would get smaller.
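The same idea can be tried without a database. In the sketch below a PHP array stands in for the MySQL table: the URL is the array key, so it can only appear once, and the value plays the part of the crawled flag. AddToCrawlList is a hypothetical name used only for this illustration.

```php
<?php
// In-memory stand-in for the crawl table: the URL is the array key, so it can
// only ever appear once, and the value is the crawled flag (0 = still to do).
function AddToCrawlList(array &$arrCrawlList, $strURL)
{
    if (isset($arrCrawlList[$strURL]))
    {
        // Already seen: the equivalent of the unique key rejecting the insert
        return false;
    }
    $arrCrawlList[$strURL] = 0;
    return true;
}

$arrCrawlList = array();
AddToCrawlList($arrCrawlList, 'http://example.com/');
AddToCrawlList($arrCrawlList, 'http://example.com/about');
AddToCrawlList($arrCrawlList, 'http://example.com/');   // duplicate, ignored

// Mark one page as crawled, then list what is still outstanding
$arrCrawlList['http://example.com/'] = 1;
$arrPending = array_keys($arrCrawlList, 0, true);
print_r($arrPending);
```

With a real table the database does this work for you, which is why the script can simply fire off its INSERT statements and let the unique key throw away the repeats.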
Finally, I added a MAX_DEPTH constant to ensure that the crawler only went to a depth of 10 pages. This would prevent it getting stuck in the “depths” of a site.
This meant that the logic of my recursive function was a lot simpler than I thought it would be:
- Take a URL to crawl and my current depth
- Load the page and get all of the hyperlinks
- For each hyperlink on the page
  - Check it’s not a bookmark, JavaScript or a link off site
  - Add it to the list of pages to index
  - Next link
- Select all uncrawled links in the database and for each link
  - Set the crawl state to crawled
  - Increment the depth and call the function recursively
- Finish
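The steps above can be tried out without touching the network. The sketch below is only an illustration: a hard-coded array plays the part of the website, another array plays the part of the MySQL table, and the names CrawlSketch, $arrSite and $iMaxDepth are made up for the example (the real script uses the MAX_DEPTH constant).

```php
<?php
// Hypothetical in-memory "site": each page maps to the pages it links to.
// The loop / -> /a -> / would trap a crawler that did not remember its pages.
$arrSite = array(
    '/'  => array('/a', '/b'),
    '/a' => array('/b', '/'),
    '/b' => array('/a', '/c'),
    '/c' => array(),
);

function CrawlSketch(array $arrSite, $strPage, $iDepth, $iMaxDepth, array &$arrCrawled)
{
    // Give up once we are deeper than the limit
    if ($iDepth > $iMaxDepth)
    {
        return;
    }

    // Add every link not seen before to the list, flagged as uncrawled (0)
    $arrFound = array();
    $arrLinks = isset($arrSite[$strPage]) ? $arrSite[$strPage] : array();
    foreach ($arrLinks as $strLink)
    {
        if (!isset($arrCrawled[$strLink]))
        {
            $arrCrawled[$strLink] = 0;
            $arrFound[] = $strLink;
        }
    }

    // Flag each new link as crawled (1) and recurse one level deeper
    foreach ($arrFound as $strLink)
    {
        $arrCrawled[$strLink] = 1;
        CrawlSketch($arrSite, $strLink, $iDepth + 1, $iMaxDepth, $arrCrawled);
    }
}

$arrCrawled = array('/' => 1);   // the start page counts as already seen
CrawlSketch($arrSite, '/', 0, 10, $arrCrawled);
print_r($arrCrawled);
```

Every page ends up in the list exactly once even though the pages link back to each other, and lowering the depth limit cuts the crawl short on a long chain of pages.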
This seemed to run perfectly, which was great. I will be tweaking and reviewing its performance, perhaps multi-threading it in the future, and as always any feedback or questions are welcomed.
Code
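set_time_limit(0);
// Maximum link depth the crawler will follow (10, as described above)
if (!defined('MAX_DEPTH'))
{
    define('MAX_DEPTH', 10);
}

// Return the domain part of a URL, e.g. "http://example.com/page" gives "example.com"
function GetDomainFromURL($URL)
{
    $strURL = $URL;
    // We don't want http://
    $strURL = str_replace("http://", "", $strURL);
    // Then we want everything to the left of the first /
    $intSlash = strpos($strURL, "/");
    if ($intSlash !== false)
    {
        // The position of the slash is also the length of the domain,
        // so subtracting 1 here would chop off its last character
        $strURL = substr($strURL, 0, $intSlash);
    }
    $strURL = trim($strURL);
    return($strURL);
}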
// Crawl one page: store its internal links, then recurse into each new one.
// Note: the mysql_* functions were removed in PHP 7.0, so as written this needs PHP 5.
function CrawlPage($BaseURL, $iDepth, $resMySQL)
{
    $strURL = $BaseURL;
    $strBase = "";
    // Stop once we have gone deeper than MAX_DEPTH
    if ($iDepth > MAX_DEPTH)
    {
        return(0);
    }
    echo str_repeat("  ", $iDepth);
    echo "Entered CrawlPage\n";
    // Get the base domain
    $strBase = GetDomainFromURL($BaseURL);
    // Define an XML Document and load the page into it
    $doc = new DOMDocument();
    @$doc->loadHTMLFile($strURL);
    // The page HTML is the same for every link found, so escape it once
    $strHTML = mysql_real_escape_string($doc->saveHTML());
    // Lets build up a list of pages that are linked
    foreach ($doc->getElementsByTagName('a') as $link)
    {
        $href = $link->getAttribute('href');
        // echo "Found: " . $href . " (" . strpos($href, "http://" . $strBase) . ")\n";
        // We don't want to crawl anything which is a bookmark (i.e. a #) or JavaScript
        if (substr($href, 0, 1) != "#" && (strpos($href, "javascript:") === false))
        {
            $blnInternal = false;
            if (substr($href, 0, 1) == "/")
            {
                // If the link starts with a / we want to add the top level domain
                $href = "http://" . $strBase . $href;
                $blnInternal = true;
            }
            elseif (!(strpos($href, "http://" . $strBase) === false))
            {
                // An absolute link that points at this domain
                $blnInternal = true;
            }
            if ($blnInternal)
            {
                // Add this to the list of found URLs for this domain
                // (URLs already in the table are rejected by the unique key)
                // echo "Link:" . $href . "\n";
                $strSQL = "INSERT INTO tblcrawllist (SessionID,CrawlURL,FirstParent,PageContent,CrawlLevel,Crawled) VALUES ('" . session_id() . "','" . mysql_real_escape_string($href) . "','" . mysql_real_escape_string($strURL) . "','" . $strHTML . "',$iDepth,0)";
                mysql_query($strSQL, $resMySQL);
            }
        } // End Validate HREF is valid
    } // End ForEach Link
    // Now let's go and crawl the pages we found
    $strSQL = "SELECT SessionID,CrawlURL,CrawlLevel FROM tblcrawllist WHERE FirstParent='" . mysql_real_escape_string($strURL) . "' AND Crawled='0'";
    $resSQL = mysql_query($strSQL, $resMySQL);
    while ($row = mysql_fetch_array($resSQL))
    {
        // $row[1] is the CrawlURL column
        $strSQL = "UPDATE tblcrawllist SET Crawled='1' WHERE CrawlURL='" . mysql_real_escape_string($row[1]) . "'";
        mysql_query($strSQL, $resMySQL);
        $newDepth = ($iDepth + 1);
        echo str_repeat("  ", $iDepth);
        // Print the URL about to be crawled ($href would only hold the last link found above)
        echo $row[1] . " -> (" . $newDepth . ")\n";
        CrawlPage($row[1], $newDepth, $resMySQL);
    }
    return(1);
} // End CrawlPage
Automatically Replying to a Facebook Wall Post (Part 1)
So, I’ve recently found Skyscanner, where you post on the wall and it will parse your wall post and then reply with a cheap flight and a link. So how is this possible? Well, I’ve set myself the challenge of creating something similar, and tonight I began looking into how I would do this.
Right, so where to start? Well, a good place is the Facebook Developers area. There is a rich JavaScript API, however from what I can see the FB.Event.subscribe feature only seems to work on pages where you include the API (i.e. an iFrame), which of course is different to your own wall.
Given this is the case, where next? Well, I thought about looking at the Open Graph itself and maybe running a PHP cron job or scheduled task. Before I could look at the Graph I needed a page, so I made a new Facebook Page to house the wall. This was called Lets Try Something New - https://www.facebook.com/pages/Lets-Try-Something-New/182052465217566?sk=wall
After I had my page, I should now be able to query it using the Facebook Open Graph at the URL https://graph.facebook.com/182052465217566 (where the number is my ID, no vanity URL as it doesn’t have any fans!).
So this is great, I can find out information about my page. However, what about the Wall itself? Well, the Facebook Graph object which handles a wall is the feed. The URL for this on my page is https://graph.facebook.com/182052465217566/feed and you will notice that if you click the link you receive the error:
{
"error": {
"message": "An access token is required to request this resource.",
"type": "OAuthException"
}
}
This is because accessing the feed requires a valid access token. To see what happens and to test querying the feed I used the Facebook Graph API Explorer, an amazing tool. I grabbed a basic access token and could then query the feed:
{
"data": [
{
"id": "182052465217566_182055751883904",
"from": {
"name": "Lets Try Something New",
"category": "Community",
"id": "182052465217566"
},
"message": "Wow this is my page",
"actions": [
{
"name": "Comment",
"link": "https://www.facebook.com/182052465217566/posts/182055751883904"
},
{
"name": "Like",
"link": "https://www.facebook.com/182052465217566/posts/182055751883904"
}
],
"privacy": {
"description": "Public",
"value": "EVERYONE"
},
"type": "status",
"created_time": "2011-11-22T23:32:00+0000",
"updated_time": "2011-11-22T23:32:00+0000",
"comments": {
"count": 0
}
}
],
"paging": {
"previous": "https://graph.facebook.com/182052465217566/feed?access_token=AAACEdEose0cBANP0ZCpL6XD50yYBZAsGc69ZBgHj9LE6kcOqdySFkLNOgGGLvCT8K4IJATTJlgZCPFUklIdjrH2zsDiNcLjBW3mbnP9QrAZDZD&limit=25&since=1322004720",
"next": "https://graph.facebook.com/182052465217566/feed?access_token=AAACEdEose0cBANP0ZCpL6XD50yYBZAsGc69ZBgHj9LE6kcOqdySFkLNOgGGLvCT8K4IJATTJlgZCPFUklIdjrH2zsDiNcLjBW3mbnP9QrAZDZD&limit=25&until=1322004719"
}
}
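For anyone wanting to try this from PHP, here is a minimal sketch of the two pieces involved: building the feed URL with a token, and picking the posts out of a response like the one above. BuildFeedURL and ParseFeed are hypothetical names, the token is a placeholder you would have to supply, and the sketch deliberately decodes a sample string rather than calling Facebook.

```php
<?php
// Hypothetical helper: builds the Graph API feed URL for a page.
// The access token must be supplied by the caller; none is hard-coded here.
function BuildFeedURL($strPageID, $strAccessToken)
{
    return 'https://graph.facebook.com/' . rawurlencode($strPageID) . '/feed?'
         . http_build_query(array('access_token' => $strAccessToken));
}

// Hypothetical helper: turns a feed response (JSON text) into id => message pairs.
function ParseFeed($strJSON)
{
    $arrPosts = array();
    $arrFeed = json_decode($strJSON, true);
    if (!is_array($arrFeed) || !isset($arrFeed['data']))
    {
        // Covers invalid JSON and the OAuthException error response
        return $arrPosts;
    }
    foreach ($arrFeed['data'] as $arrPost)
    {
        $arrPosts[$arrPost['id']] = isset($arrPost['message']) ? $arrPost['message'] : '';
    }
    return $arrPosts;
}

// Cut-down copy of the response shown above
$strSample = '{"data":[{"id":"182052465217566_182055751883904","message":"Wow this is my page"}]}';

echo BuildFeedURL('182052465217566', 'YOUR_ACCESS_TOKEN') . "\n";
print_r(ParseFeed($strSample));
```

In a real script the JSON would come from requesting the built URL, and the error response shown earlier would simply produce an empty list.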
So given I can read my page’s feed, what I now need is something to act on it. There are a few things to consider here:
- I need to know what I’ve responded to so I need a MySQL or other type of database to store this information
- I need to be able to post back to the Graph API
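The first of these points boils down to comparing the post IDs in the feed with the IDs already answered. A minimal sketch of that comparison is below; PostsNeedingReply is a hypothetical name, the post IDs are made up, and a plain array stands in for whatever database ends up holding the answered IDs. Posting the reply back to the Graph API is not shown.

```php
<?php
// Hypothetical helper: given id => message pairs from the feed and the ids
// already replied to, return only the posts that still need a reply.
function PostsNeedingReply(array $arrPosts, array $arrRepliedIDs)
{
    return array_diff_key($arrPosts, array_flip($arrRepliedIDs));
}

$arrPosts = array(
    '182052465217566_1' => 'First post',    // made-up ids for illustration
    '182052465217566_2' => 'Second post',
);
$arrRepliedIDs = array('182052465217566_1');   // would come from the database

foreach (PostsNeedingReply($arrPosts, $arrRepliedIDs) as $strID => $strMessage)
{
    echo "Would reply to " . $strID . ": " . $strMessage . "\n";
    $arrRepliedIDs[] = $strID;   // record it so it is not answered twice
}
```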
Coming soon: Step 2 – The PHP!
A day in the life of a Matt Stannard – CTO
People often ask me what I do for a living and when I say I am a CTO (Chief Technology Officer) they usually look at me blankly and say “What’s one of those?” To be honest, it can sometimes be quite difficult to explain, so I thought I’d describe a typical day in the life of me, Matt Stannard, CTO at SEO Agency 4Ps Marketing.
My day usually starts at around 5:15am. I don’t think that this is a requirement of being a CTO, it’s just that I have to catch public transport and I prefer not travelling in rush hour. The only exception to this is if I cycle in, then it’s a 4:00am start. After the usual shower, getting changed and walking to the station, I usually spend a few minutes tweeting about my experiences and then check out BBC News, catching up on the latest Health, Science, Business and Technology news. I also tend to read SEOMoz, Google Blogspot, TechCrunch and Mashable if there is something of particular interest.
After arriving at the office, usually at around 7:30am, I try to check through my e-mails and formulate a to-do list for the day. The tasks I can undertake during a day include:
- Reviewing Facebook and Twitter API developments and Insights for our applications.
- Development of Facebook Applications or creating proof of concepts.
- Brainstorming Sessions.
- Technical Training.
- Website or Analytic Development.
- Client and Internal Meetings.
- Writing technology updates.
- Planning for Research and Development sessions.
Cycling to London – 40 miles in under 4 hours!
For those who don’t know, I work for SEO Agency 4Ps who often take on challenges, be they in the form of Matt Phelan’s 1 Man 1 Mission, our Dodgeball team, Movember or the Men’s Health Challenge. I am not always the best at sports, so a while ago I decided to challenge running machine Jack Mclaren to a race: he would run to work, I would ride. I got absolutely thrashed, taking just over 4 hours and getting lost on the way!
Today, I decided to use my bike (a Giant Trance 4) rather than a Scott Genius MC Pro (Concept) as, despite the Scott having a carbon fibre frame, the huge tyres and amazing suspension made it quite a hard ride!
I left my house this morning at 4:30am (yes it was dark and yes it is early!) and I think I arrived at the office at around 8:20am, which is better than last time. However, I did stop twice so was only moving for 3 hours 27 minutes! I am over the moon with that as the route is 40 miles!
So what have I learnt from my experience? Well firstly, cycling on the road, despite being quite scary, is much more effective than using many of London’s cycle paths. Why? Well, they are usually shared with footpaths, they are bumpy and you have to stop every few hundred yards for side roads. I’ve also decided the traffic management in Slough is wrong, in that I sat waiting for 5 minutes as I wasn’t heavy enough to trigger its rotation! Also it seems to prioritise minor routes rather than the trunk!
All in all though I love the ride and am looking forward to doing it again (no I am not mad!).
Google and SSL (Secure) Search – What does it mean?
Working for an SEO Agency, developments involving Google and other search engines are obviously very important. Early this week Google announced that it was going to stop passing keywords when users used Secure Search, and that it was going to make Secure Search the default for logged in users.
This has caused quite a stir in the SEO Industry as we rely on keyword data passed to Analytics to give us an indication as to the amount of traffic particular keywords drive to a site. Furthermore, other analytics tools such as Hitwise or Omniture may read this information for their analytics reports, so what is passed now?
With the team at 4Ps Marketing, I’ve run a few tests looking at the referrer to show the difference between SSL (and signed in) and non SSL search. The parameters passed when a user clicks the link are identical bar the “q” (query) parameter. Google Analytics will see this keyword as not provided.
So can we still access keyword data? Well “Yes”, but not in quite the same form. Keywords are shown by linking your account to Webmaster Tools and viewing the keywords by Impressions and Click Through. In theory this should give similar results, however many SEOs have questioned the reliability and accuracy of Webmaster Tools. Google have also said these will be limited to the top 1,000 keywords, so it will be interesting to see how they compare.
Some people are not worried about the impact. Certainly in the UK at the moment there is no version and the % of logged in users is relatively low, but personally I feel it is worth familiarising ourselves with the changes. At any time Google could switch all users to an SSL version of its site.
I welcome your comments, for now good bye!
Headers for Non SSL Results (q= is the query)
sa=t
rct=j
q=matt%20stannard
source=web
cd=1
sqi=2
ved=0CCcQFjAA
url=http%3A%2F%2Fmattstannard.com%2F
ei=qsOnTpaiOIOmhAe8z_iqDg
usg=AFQjCNFUHj6ax_omapnK2n2jl10j5Yjbmw
Headers for SSL Results (q= is the query – notice it’s empty)
sa=t
rct=j
q=
esrc=s
source=web
cd=2
sqi=2
ved=0CCsQFjAB
url=http%3A%2F%2Fmattstannard.com%2F
ei=isSnTsPMOcmnhAeNqOGaDg
usg=AFQjCNFUHj6ax_omapnK2n2jl10j5Yjbmw
sig2=XpAJZmif1NK3_pOT0DCpFw
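If you want to check your own logs for this, the difference between the two parameter lists above is easy to test for in PHP. The sketch below is only an illustration: KeywordFromReferrer is a made-up name, and the referrer strings are rebuilt from a few of the parameters listed above with an assumed http://www.google.co.uk/url prefix.

```php
<?php
// Hypothetical helper: returns the search keyword carried in a Google referrer,
// or null when the q parameter is missing or empty (as with SSL search).
function KeywordFromReferrer($strReferrer)
{
    $strQuery = parse_url($strReferrer, PHP_URL_QUERY);
    if (!is_string($strQuery))
    {
        return null;
    }
    // parse_str also URL-decodes the values, so %20 becomes a space
    parse_str($strQuery, $arrParams);
    if (!isset($arrParams['q']) || $arrParams['q'] === '')
    {
        return null;
    }
    return $arrParams['q'];
}

// Assumed referrer prefix; the parameters are taken from the lists above
$strNonSSL = 'http://www.google.co.uk/url?sa=t&rct=j&q=matt%20stannard&source=web&cd=1';
$strSSL    = 'http://www.google.co.uk/url?sa=t&rct=j&q=&esrc=s&source=web&cd=2';

var_dump(KeywordFromReferrer($strNonSSL));   // the keyword is present
var_dump(KeywordFromReferrer($strSSL));      // no keyword available
```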
The Value of Forums
With the popularity of Social Media sites like Facebook, Twitter and LinkedIn it is very easy to overlook more traditional, “old fashioned” channels like Forums. However, in some instances these communities are as valuable as, if not more valuable than, their Social counterparts.
On Friday and over the weekend I had the task of upgrading a Windows SBS 2003 machine to Windows SBS 2011. I had planned the majority of this and Microsoft do give you some very good tools, however there were some other curve balls in there such as having machines joined to the existing SBS 2003 domain, BES 5.0.2 being installed and versions of Acronis.
After installing SBS 2011 I found that BES 5.0.3 had issues and Acronis didn’t seem to work at all. I also had a few problems with the way Exchange 2010 handled SSL, namely it insisted on using remote.domain.com rather than mail.domain.com (which I wanted). This is where Forums really come into play. With a few carefully worded questions in Experts Exchange (a forum I am a member of) I was able to resolve all of these issues. It’s also really good to bounce ideas off other professionals.
So, my advice is don’t write off the value of older channels. The method of interaction may not be as instantaneous as Twitter and Facebook but the value they add can be just as great!
Ipswich vs Cardiff – Dave on the Road
Today’s post isn’t really about IT but it is related in that it’s about Dave, the loveable Stress Toy given to me by my colleague Luke. On this site, you may have noticed that within the Fun Stuff section there is a part called “WheresDave”. This was a Social Media experiment Luke and I set up where I would take Dave and post pictures with the hash tag #WheresDave for Luke to view and then either find Dave, or tell me where he thought Dave was.
I introduced Dave to the Ipswich Town fans who were very keen to become involved, taking photos or wanting their photo taken with Dave for uploading to Twitter or my Blog!
This weekend, Ipswich played Cardiff and Dave joined us on the train for the trip to Wales. You can see the images below, or search Twitter for the hash tag.
Matt Chat : An update on this weeks technology
The second week of Matt Chat – if you’ve got any questions then tweet or e-mail me:
DART – Google’s new JavaScript alternative
Last night on the train home I was checking out my Twitter feed and noticed that Ashley, a colleague in the SEO team, had tweeted that Google had released a new scripting language, DART (formerly DASH).
Apparently, the language is designed to overcome some of the shortcomings within JavaScript and runs within a Virtual Machine. DART can also be converted into JavaScript, allowing it to achieve 100% backwards and cross compatibility. So is this just another language or is this something the Web really needs, especially as many are still questioning the true benefits of HTML 5?
I’ve had a look at the overview and I have to say I am impressed with what I’ve seen. One of my frustrations with JavaScript was that it was largely un-typed. DART is also untyped but allows you to then firm up on data types at a later stage. This will be especially helpful if writing libraries for the use of others. DART also includes, and promises to include, a large number of additional libraries which can be included in a structure familiar to those developing in C, C++, Java or .NET.
Over the weekend I hope to use DART to manipulate the HTML 5 Canvas in some way shape or form so check back to see how I get on and for a bit more of an opinion on the language!!
Art Clokey
Some of you may wonder why I would have a post about Art Clokey, a pioneer in stop motion video with clay (similar to Nick Park and Aardman Animations, famous for Wallace and Gromit).
The reason is really one of curiosity. Every so often Google change their logo as a mark of respect to an individual or for a specific celebration or date of historical significance. What I was curious to find out was whether, if I tried to make a page relevant to that event or individual, my page would rank.
Today marks what would have been the 90th Birthday of Art Clokey, who was born October 12, 1921 and died January 8, 2010.
Art Clokey made many short experimental clay animation films between 1955 and 1980 including Gumbasia, Mandala and The Clay Peacock. The films were clay shape animation usually featuring a Jazz sound track.
Perhaps more famous was Art Clokey’s team up with Dallas McKennon for the feature film Gumby: The Movie. Although the movie was not a success, Nickelodeon aired every episode of Gumby during prime viewing times at 8am and 2pm.
Art Clokey died in his sleep on January 8, 2010, at the age of 88, at his home in Los Osos, California.





