Web Page Scraping Using PHP Script and Simple HTML DOM Parser

by Jon on September 18th, 2013

Clients have asked me for scripts that systematically copy information from websites so they don’t have to pay someone to sit in front of a computer and copy the data manually. Writing such a script was easier than I expected thanks to the Simple HTML DOM Parser library someone had already written. The library gives a programmer powerful tools for reading the HTML on a webpage. It could be used to grab tables of information and cycle through pages of results while dumping the data into a database. Below are a few examples of how to search through pages.
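To make the "grab tables of information" idea concrete, here is a minimal sketch of turning a table into rows of plain text. It is not the Simple HTML DOM approach used in the examples that follow: to keep it runnable without any download, it swaps in PHP's built-in DOM extension (DOMDocument and DOMXPath). The markup, the "results" table id, and the variable names are all made up for illustration.

```php
<?php
// Hypothetical results table standing in for a fetched page.
$sample_html = '<table id="results">'
    . '<tr><th>Name</th><th>Price</th></tr>'
    . '<tr><td>Widget</td><td> 9.99 </td></tr>'
    . '<tr><td>Gadget</td><td>19.50</td></tr>'
    . '</table>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($sample_html);
$xpath = new DOMXPath($dom);

$rows = array();
// Visit every row that contains data cells (this skips the header row).
foreach ($xpath->query('//table[@id="results"]//tr[td]') as $tr) {
    $cells = array();
    // Query the cells relative to the current row.
    foreach ($xpath->query('td', $tr) as $td) {
        $cells[] = trim($td->textContent);
    }
    $rows[] = $cells;
}

print_r($rows);
```

Each entry in $rows is one table row as an array of cell strings, which is a convenient shape to hand to a database insert. Repeating the same steps for each page of results would cover the "cycle through pages" part.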

1. Search the page for span elements that exactly match span[class=results-test], then loop through the results and print the plain text inside each span.

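include_once('simple_html_dom.php');
// $base_url and $subpage_url are assumed to be set earlier in the script.
$webpage_url = $base_url . "/" . $subpage_url;
$html = file_get_html($webpage_url);
// file_get_html() returns false when the page could not be loaded.
if ($html === false) { exit("Could not load " . $webpage_url); }
// Loop over every span whose class matches "results-test".
foreach ($html->find('span[class="results-test"]') as $single_result) {
    // plaintext is the element's text with the HTML tags stripped.
    $full_text = trim($single_result->plaintext);
    echo $full_text;
}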

2. Putting a caret in front of the = sign matches on the beginning of the attribute value, so span[class^=bold-class] would match span[class=bold-class-test], or any other class value that starts with “bold-class”.

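include_once('simple_html_dom.php');
// $base_url and $subpage_url are assumed to be set earlier in the script.
$webpage_url = $base_url . "/" . $subpage_url;
$html = file_get_html($webpage_url);
// file_get_html() returns false when the page could not be loaded.
if ($html === false) { exit("Could not load " . $webpage_url); }
// Loop over every span whose class starts with "bold-class".
foreach ($html->find('span[class^=bold-class]') as $single_result) {
    // plaintext is the element's text with the HTML tags stripped.
    $full_text = trim($single_result->plaintext);
    echo $full_text;
}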
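To see the difference between the exact match and the prefix match without fetching a live page, the sketch below runs both kinds of query against a small inline HTML string. It uses PHP's built-in DOMDocument and DOMXPath rather than Simple HTML DOM, so the XPath expressions are only rough stand-ins for the two selectors above. The sample markup and the collect_text() helper are made up for illustration, and elements carrying several classes may behave differently between the two libraries.

```php
<?php
// Hypothetical markup standing in for a fetched page.
$sample_html = '<div>'
    . '<span class="results-test"> First result </span>'
    . '<span class="bold-class-test">Bold one</span>'
    . '<span class="bold-class-other">Bold two</span>'
    . '<span class="unrelated">Skipped</span>'
    . '</div>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($sample_html);
$xpath = new DOMXPath($dom);

// Return the trimmed text of every node matched by an XPath query.
function collect_text(DOMXPath $xpath, $query)
{
    $texts = array();
    foreach ($xpath->query($query) as $node) {
        $texts[] = trim($node->textContent);
    }
    return $texts;
}

// Rough equivalent of span[class="results-test"]: the whole attribute must match.
$exact_matches = collect_text($xpath, '//span[@class="results-test"]');

// Rough equivalent of span[class^=bold-class]: the attribute only has to start with it.
$prefix_matches = collect_text($xpath, '//span[starts-with(@class, "bold-class")]');

print_r($exact_matches);
print_r($prefix_matches);
```

With this sample, the exact query picks up only the "results-test" span, while the prefix query picks up both "bold-class-test" and "bold-class-other" and ignores the unrelated span.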

http://simplehtmldom.sourceforge.net/
