Using PHP, I'll show you how to create an extendable web crawler in under 5 minutes that collects images and links.
The Crawler Framework
First you need to create the crawler class as follows:
Code:
<?php
class Crawler {
}
?>
Next, create methods to fetch a web page's markup and to parse it for the data you want to collect. The only public methods will be getMarkup() and get(). The parsing methods are meant for the crawler's internal use, but their visibility is set to protected rather than private, since you never know who will want to extend the class.
Code:
<?php
class Crawler {
    // Raw markup of the fetched page
    protected $markup = '';

    public function __construct($uri) {
    }

    public function getMarkup($uri) {
    }

    public function get($type) {
    }

    protected function _get_images() {
    }

    protected function _get_links() {
    }
}
?>
Fetching Site Markup
The constructor accepts a URI, so you can instantiate the class with new Crawler('https://iholyelement.org/');. It then sets the $markup property using PHP's file_get_contents() function, which fetches the site's markup. Note that file_get_contents() returns FALSE on failure and can only fetch URLs when allow_url_fopen is enabled.
Code:
<?php
    public function __construct($uri) {
        $this->markup = $this->getMarkup($uri);
    }

    public function getMarkup($uri) {
        return file_get_contents($uri);
    }
?>
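Since file_get_contents() returns FALSE when a page cannot be fetched, it can help to make the fetch step a bit more defensive. The following is only a minimal sketch, not part of the class above: the function name fetch_markup(), the 10 second timeout and the ExampleCrawler/1.0 user agent are all assumptions made up for illustration.

```php
<?php
// Hypothetical helper (not part of the original Crawler class):
// fetch a URI with a timeout and report failure as FALSE.
function fetch_markup($uri, $timeout = 10) {
    // The 'http' options only apply to http:// and https:// URIs;
    // other stream wrappers ignore them.
    $context = stream_context_create(array(
        'http' => array(
            'timeout'    => $timeout,             // seconds, assumed value
            'user_agent' => 'ExampleCrawler/1.0', // made-up user agent
        ),
    ));

    // @ hides the warning; the failure is still visible in the return value.
    $markup = @file_get_contents($uri, false, $context);

    // Treat an empty response the same way as a failed request.
    return ($markup === false || $markup === '') ? false : $markup;
}
```

If you want to use something like this in the crawler, getMarkup() could call it instead of calling file_get_contents() directly.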
Crawling The Markup For Data
Our get() method accepts a $type string, which is used to invoke the method that actually does the processing. As you can see below, you construct the method name as a string and then make sure that method exists, so developers can use the crawler simply by invoking $crawl->get('images');
The visibility of _get_images() and _get_links() is set to protected so that developers use the public get() method rather than getting confused and trying to invoke them directly.
Each protected data collection method uses the PCRE (Perl Compatible Regular Expressions) function preg_match_all() to return the attribute strings of all tags in the markup that match our patterns, /<img([^>]+)\/>/i and /<a([^>]+)\>(.*?)\<\/a\>/i. Note that the image pattern only matches self-closing tags (<img ... />), and the link pattern will not match a link whose text spans several lines. For more information on regular expressions, see the Wikipedia article "Regular expression".
Code:
<?php
    public function get($type) {
        // Build the method name, e.g. 'images' becomes '_get_images'
        $method = "_get_{$type}";
        if (method_exists($this, $method)){
            // call_user_method() was removed in PHP 7, so call the method directly
            return $this->$method();
        }
    }

    protected function _get_images() {
        if (!empty($this->markup)){
            preg_match_all('/<img([^>]+)\/>/i', $this->markup, $images);
            // $images[1] holds the attribute string of every matched tag
            return !empty($images[1]) ? $images[1] : FALSE;
        }
    }

    protected function _get_links() {
        if (!empty($this->markup)){
            preg_match_all('/<a([^>]+)\>(.*?)\<\/a\>/i', $this->markup, $links);
            // $links[1] holds the attribute string of every matched tag
            return !empty($links[1]) ? $links[1] : FALSE;
        }
    }
?>
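The methods above return the raw attribute string of each tag (for example ' src="logo.png" alt="Logo" '), not the URL itself. Here is a rough sketch of how the URLs could be pulled out of those strings. The helper name extract_attribute() and the sample markup are made up for this example, and the image pattern used here is a looser variant of the tutorial's pattern that also accepts tags without the closing slash. It only handles quoted attribute values.

```php
<?php
// Hypothetical helper: pull one attribute's value out of a captured
// attribute string. Only quoted values (single or double) are handled.
function extract_attribute($attributes, $name) {
    // The lookbehind stops 'src' from matching inside names like 'data-src'.
    $pattern = '/(?<![\w-])' . preg_quote($name, '/') . '\s*=\s*(["\'])(.*?)\1/i';
    if (preg_match($pattern, $attributes, $match)) {
        return $match[2];
    }
    return false;
}

// Sample markup, made up for this example.
$markup = '<img src="a.png"><img src="b.png" alt="B" />';

// Looser variant of the image pattern: the trailing slash is optional.
preg_match_all('/<img\s([^>]*?)\s*\/?>/i', $markup, $images);

$sources = array();
foreach ($images[1] as $attributes) {
    $src = extract_attribute($attributes, 'src');
    if ($src !== false) {
        $sources[] = $src;
    }
}
// $sources is now array('a.png', 'b.png')
```

The same helper can be used with 'href' on the strings returned for links.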
Final PHP Web Crawler Code And Usage
Code:
<?php
class Crawler {
    // Raw markup of the fetched page
    protected $markup = '';

    public function __construct($uri) {
        $this->markup = $this->getMarkup($uri);
    }

    public function getMarkup($uri) {
        return file_get_contents($uri);
    }

    public function get($type) {
        // Build the method name, e.g. 'images' becomes '_get_images'
        $method = "_get_{$type}";
        if (method_exists($this, $method)){
            // call_user_method() was removed in PHP 7, so call the method directly
            return $this->$method();
        }
    }

    protected function _get_images() {
        if (!empty($this->markup)){
            preg_match_all('/<img([^>]+)\/>/i', $this->markup, $images);
            return !empty($images[1]) ? $images[1] : FALSE;
        }
    }

    protected function _get_links() {
        if (!empty($this->markup)){
            preg_match_all('/<a([^>]+)\>(.*?)\<\/a\>/i', $this->markup, $links);
            return !empty($links[1]) ? $links[1] : FALSE;
        }
    }
}

$crawl = new Crawler('https://iholyelement.org/');
$images = $crawl->get('images');
$links = $crawl->get('links');
?>
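Because get() looks up its worker method by name, a new data type can be added by extending the class with another protected _get_*() method. The sketch below shows one way that might look. The ScriptCrawler subclass and its _get_scripts() method are made up for this example, the Crawler class is a condensed copy so the snippet runs on its own, and a data: URI stands in for a real page so no network access is needed.

```php
<?php
// Condensed copy of the tutorial's class so this sketch is self-contained.
class Crawler {
    protected $markup = '';

    public function __construct($uri) {
        $this->markup = $this->getMarkup($uri);
    }

    public function getMarkup($uri) {
        return file_get_contents($uri);
    }

    public function get($type) {
        $method = "_get_{$type}";
        if (method_exists($this, $method)) {
            return $this->$method();
        }
        return false;
    }
}

// Hypothetical subclass: adds a 'scripts' type without touching get().
class ScriptCrawler extends Crawler {
    protected function _get_scripts() {
        if (!empty($this->markup)) {
            // Capture the attribute string of every opening <script> tag.
            preg_match_all('/<script([^>]*)>/i', $this->markup, $scripts);
            return !empty($scripts[1]) ? $scripts[1] : false;
        }
        return false;
    }
}

// The data: URI below is a stand-in for a real page.
$page = 'data://text/plain;base64,' . base64_encode('<script src="app.js"></script>');
$crawl = new ScriptCrawler($page);
$scripts = $crawl->get('scripts');
```

Asking for a type that has no matching method, such as $crawl->get('videos'), simply returns FALSE in this sketch.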
Say thanks if this helped. I did use some resources when writing this, but it's an oldish tutorial and I forgot to note some of the links. It works, as I use it for my own website(s).