  1. #1
    iHolyElement's Avatar

    Create a PHP web crawler/scraper in 5 minutes.

    Using PHP, I'll show you how to create an easily extensible web crawler in under 5 minutes that collects images and links from a page.

    The Crawler Framework

    First you need to create the crawler class as follows:

    Code:
    <?php
    class Crawler {
    
    }
    ?>
    You will then create methods to fetch a web page's markup and to parse it for the data you want to collect. The only public methods will be getMarkup() and get(). The parsing methods are meant for the crawler's internal use, but their visibility is set to protected rather than private so that subclasses can extend the crawler's functionality.

    Code:
    <?php
    class Crawler {
     
      protected $markup = '';
     
      public function __construct($uri) {
    
      }
     
      public function getMarkup($uri) {
    
      }
    
      public function get($type) {
    
      }
     
      protected function _get_images() {
    
      }
     
      protected function _get_links() {
    
      }
    }
    ?>
    Fetching Site Markup

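    The constructor accepts a URI, so you can instantiate the class with new Crawler('https://iholyelement.org/'). It then sets the $markup property using PHP's file_get_contents() function, which fetches the site's markup. Note that file_get_contents() returns FALSE on failure and can only open URLs when the allow_url_fopen setting is enabled.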

    Code:
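    <?php
      // Both methods belong inside the Crawler class.
      public function __construct($uri) {
        // Fetch the page once and keep its markup for the parsing methods.
        $this->markup = $this->getMarkup($uri);
      }
     
      public function getMarkup($uri) {
        // Returns the page source as a string, or FALSE on failure.
        return file_get_contents($uri);
      }
    ?>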
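As a side note on the fetching step above, here is a minimal sketch of a more defensive fetch. The function name fetchMarkup(), the timeout default and the user agent string are assumptions made for this example and are not part of the original class. It only uses file_get_contents() and stream_context_create() from core PHP.

```php
<?php
// Hypothetical helper (not part of the original Crawler class).
// Returns the markup as a string, or '' when the fetch fails.
function fetchMarkup($uri, $timeoutSeconds = 10) {
    // These options apply to http(s) URIs; they are ignored for local paths.
    $context = stream_context_create([
        'http' => [
            'timeout'    => $timeoutSeconds,
            'user_agent' => 'ExampleCrawler/0.1', // assumed identifier
        ],
    ]);

    // @ hides the warning file_get_contents() raises on failure;
    // the FALSE return value is checked explicitly instead.
    $markup = @file_get_contents($uri, false, $context);

    return $markup === false ? '' : $markup;
}
```

Inside the class, getMarkup() could simply return fetchMarkup($uri). Returning an empty string on failure fits the !empty($this->markup) checks used by the parsing methods.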
    Crawling The Markup For Data

    Our get() method accepts a $type string, which is used to invoke the method that does the actual processing. As you can see below, you build the method name as a string and check that it exists before calling it, so developers can simply invoke $crawl->get('images');

    The visibility of _get_images() and _get_links() is set to protected so that developers use the public get() method instead of invoking them directly.

    Each protected data collection method uses the PCRE (Perl Compatible Regular Expressions) function preg_match_all() to return the attribute string of every tag in the markup that matches our patterns: /<img([^>]+)\/>/i for images and /<a([^>]+)\>(.*?)\<\/a\>/i for links. Note that the image pattern only matches self-closing tags (ending in />), and the link pattern does not match link text that spans multiple lines. For more information on regular expressions, see the Wikipedia article "Regular expression".

    Code:
    <?php
      public function get($type) {
        $method = "_get_{$type}";
        if (method_exists($this, $method)){
          // call_user_method() was removed in PHP 7; call the method directly.
          return $this->$method();
        }
      }
     
      protected function _get_images() {
        if (!empty($this->markup)){
          preg_match_all('/<img([^>]+)\/>/i', $this->markup, $images);        
          return !empty($images[1]) ? $images[1] : FALSE;
        }
      }
     
      protected function _get_links() {
        if (!empty($this->markup)){
          preg_match_all('/<a([^>]+)\>(.*?)\<\/a\>/i', $this->markup, $links); 
          return !empty($links[1]) ? $links[1] : FALSE;
        }
      }
    ?>
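To make it clearer what the two patterns above actually capture, here is a small sketch that runs them against a made-up snippet of markup. The sample markup and the extractAttribute() helper are assumptions added for illustration only. The helper is a rough way to pull a URL out of the captured attribute string, not a full HTML parser.

```php
<?php
// Hypothetical helper: reads one quoted attribute value out of a raw
// attribute string such as ' href="/home" class="nav"'.
function extractAttribute($attributes, $name) {
    $pattern = '/\b' . preg_quote($name, '/') . '\s*=\s*["\']([^"\']*)["\']/i';
    return preg_match($pattern, $attributes, $m) ? $m[1] : null;
}

// Sample markup invented for this illustration.
$markup = '<p><img src="a.png" alt="A" /> <img src="b.png"> '
        . '<a href="/home">Home</a> <a href="/about" class="nav">About</a></p>';

// The same patterns the tutorial uses.
preg_match_all('/<img([^>]+)\/>/i', $markup, $images);
preg_match_all('/<a([^>]+)\>(.*?)\<\/a\>/i', $markup, $links);

// $images[1] holds one entry: only the self-closing <img ... /> matches,
// the plain <img src="b.png"> is skipped by this pattern.
// $links[1] holds the attribute strings, $links[2] the link text.
$firstImage = extractAttribute($images[1][0], 'src');  // "a.png"
$secondLink = extractAttribute($links[1][1], 'href');  // "/about"
```

So what get('images') and get('links') hand back are raw attribute strings, and a second step like the one sketched here is still needed to get usable URLs.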
    Final PHP Web Crawler Code And Usage

    Code:
    <?php
    class Crawler {
     
      protected $markup = '';
     
      public function __construct($uri) {
        $this->markup = $this->getMarkup($uri); 
      }
     
      public function getMarkup($uri) {
        return file_get_contents($uri);  
      }
    
      public function get($type) {
        $method = "_get_{$type}";
        if (method_exists($this, $method)){
          // call_user_method() was removed in PHP 7; call the method directly.
          return $this->$method();
        }
      }
     
      protected function _get_images() {
        if (!empty($this->markup)){
          preg_match_all('/<img([^>]+)\/>/i', $this->markup, $images);        
          return !empty($images[1]) ? $images[1] : FALSE;
        }
      }
     
      protected function _get_links() {
        if (!empty($this->markup)){
          preg_match_all('/<a([^>]+)\>(.*?)\<\/a\>/i', $this->markup, $links); 
          return !empty($links[1]) ? $links[1] : FALSE;
        }
      }
    }
    
    $crawl = new Crawler('https://iholyelement.org/');
    $images = $crawl->get('images');
    $links = $crawl->get('links');
    ?>
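Since the parsing methods are protected and get() dispatches on the method name, adding a new data type should only take a subclass with another _get_* method. Below is a rough sketch of that idea. The TitleCrawler class and its _get_title() method are hypothetical, and the Crawler here is a cut-down stand-in that takes markup directly (an assumption, so the example runs without a network call). It also uses a direct variable method call, since call_user_method() was removed in PHP 7.0.

```php
<?php
// Minimal stand-in for the tutorial's Crawler so this sketch runs on its own.
// It receives markup directly instead of fetching a URI.
class Crawler {
    protected $markup = '';

    public function __construct($markup) {
        $this->markup = $markup;
    }

    public function get($type) {
        $method = "_get_{$type}";
        // Variable method call; returns false for unknown types.
        return method_exists($this, $method) ? $this->$method() : false;
    }
}

// Hypothetical subclass: adds a new data type without touching get().
class TitleCrawler extends Crawler {
    protected function _get_title() {
        // The s flag lets the title span several lines.
        if (preg_match('/<title[^>]*>(.*?)<\/title>/is', $this->markup, $m)) {
            return trim($m[1]);
        }
        return false;
    }
}

$crawl   = new TitleCrawler('<html><head><title> Example page </title></head></html>');
$title   = $crawl->get('title');   // "Example page"
$missing = $crawl->get('videos');  // false, there is no _get_videos() method
```

The same approach would work with the full class from the tutorial: extend it, add a protected _get_something() method, and call get('something').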
    Say thanks if this helped. I used some resources when writing this, but it's an oldish tutorial and I forgot to note some of the links. It works; I use it on my own website(s).
    Last edited by iHolyElement; 09-22-2009 at 01:09 AM.

  3. #2
    InTheEnd's Avatar
    No copy-pasted tutorials in this section.

    Also, this topic is already covered in the MPGH PHP PDF book.

  4. #3
    iHolyElement's Avatar
    I don't remember copying anything when I wrote the tutorial ages ago...
    Freak.

  5. #4
    Marthz's Avatar
    I don't get it.... So would this be how I make a crawler for emails on craigslist?


    Code:
    <?php
    class Crawler {
     
      protected $markup = '';
     
      public function __construct($uri) {
        $this->markup = $this->getMarkup($uri); 
      }
     
      public function getMarkup($uri) {
        return file_get_contents($uri);  
      }
    
      public function get($type) {
        $method = "_get_{$type}";
        if (method_exists($this, $method)){
          return call_user_method($method, $this);
        }
    
      }
     
      protected function _get_email() {
        if (!empty($this->markup)){
          preg_match_all('/b[A-Z0-9._%-]+@[A-Z0-9.-]+/.[A-Z]{2,4}/b', $this->markup, $email); 
          return !empty($email[1]) ? $email[1] : FALSE;
        }
      }
    }
    
    $crawl = new Crawler('https://craigslist.org/');
    $email = $crawl->get('email');
    ?>
    Ah yes. Outstanding.

  6. #5
    VvITylerIvV's Avatar
    Quote Originally Posted by iHolyElement View Post
    i dont remember copying anything when i did the tut ages ago...
    freak.
    Ikr? He also has 10 posts and thinks he is bigger than life.
