PHP Web Page Scraping Tutorial
==============================
Web scraping, also known as web harvesting or web data extraction, is the process of extracting data from a given web site or web page. In this tutorial I will go over a way for you to extract the title of a page, as well as the meta keywords, meta description, and links. With some basic knowledge of PHP and regular expressions you can accomplish this process with ease.
First, let's go over the regular expression metacharacters we will be using in this tutorial.
(.*)
The dot (.) stands for any single character except a line break, while the asterisk (*) means that the preceding element may repeat 0 or more times. Combined as .* they match any run of characters with a length of 0 or more, and the surrounding parentheses capture whatever was matched so that we can read it back later.
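To see how (.*) behaves in practice, here is a small sketch. The sample strings are made up for illustration and are not from the tutorial's HTML file. It shows that the capture is greedy (it runs to the last possible match on the line) and that the dot stops at a line break unless the s modifier is added.

```php
<?php
// Hypothetical sample input, not taken from the tutorial's HTML file.
$sample = '<b>one</b> and <b>two</b>';

// Greedy: (.*) runs as far as it can, so it swallows the middle tags.
preg_match('/<b>(.*)<\/b>/', $sample, $greedy);
echo $greedy[1], "\n"; // one</b> and <b>two

// Lazy: (.*?) stops at the first closing tag.
preg_match('/<b>(.*?)<\/b>/', $sample, $lazy);
echo $lazy[1], "\n"; // one

// By default the dot does not match a newline; the s modifier changes that.
$multiline = "<b>line one\nline two</b>";
$without_s = preg_match('/<b>(.*)<\/b>/', $multiline, $m1);  // 0, no match
$with_s    = preg_match('/<b>(.*)<\/b>/s', $multiline, $m2); // 1, match
```

If a pattern in this tutorial ever captures more than expected, swapping (.*) for (.*?) is usually the first thing to try.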
As for our PHP, we will be using three functions to extract our data. The first is file_get_contents(), which reads the desired page and returns its entire contents, HTML included, as a string. The second is preg_match(), which searches that string with a regular expression and stops at the first match. The last is preg_match_all(), which works the same way except that it collects every match rather than only the first.
For this tutorial I have included one HTML page that contains our title tag, meta description, meta keywords and some links. We will be using that file for our scraping purposes.
Let's start by setting up the variable that will contain our string of HTML from the external file.
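The difference between the two matching functions is easiest to see side by side. The following is a minimal sketch on a made-up string; the variable names are chosen for illustration only. It uses (.*?), the lazy form of (.*), so that each capture stops at the first closing tag.

```php
<?php
// Hypothetical sample string, used only to compare the two functions.
$html = '<li>Apple</li><li>Banana</li><li>Cherry</li>';

// preg_match() stops at the first match and returns 1 (or 0 if none).
// $first[0] is the full match, $first[1] is the captured group.
$found_one = preg_match('/<li>(.*?)<\/li>/', $html, $first);

// preg_match_all() keeps going and returns the number of matches.
// $all[1] is the list of every captured group.
$found_all = preg_match_all('/<li>(.*?)<\/li>/', $html, $all);

echo $found_one, ': ', $first[1], "\n";
echo $found_all, ': ', implode(', ', $all[1]), "\n";
```

Both functions report their count through the return value and deliver the matched text through the third argument, which is passed by reference.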
<?php
$file_string = file_get_contents('page_to_scrape.html');
?>
What we did above is simply read all of the contents of our file page_to_scrape.html and store them in a string. Now that we have our string we can proceed to the extraction itself.
* Hint: You can replace page_to_scrape.html with any page or link you may want to scrape. Some sites may have terms against scraping, so be sure to read the terms of use before you decide to scrape a site.
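When the source is a remote link rather than a local file, the read is more likely to fail, and file_get_contents() signals failure by returning false. Below is one possible way to guard against that. The helper name load_page() is hypothetical and not part of PHP, and fetching a URL this way assumes the allow_url_fopen setting is enabled.

```php
<?php
// load_page() is a hypothetical helper name, not part of PHP or the tutorial.
function load_page(string $source): string
{
    // @ suppresses the warning so failure can be handled via the return value.
    $contents = @file_get_contents($source);
    if ($contents === false) {
        throw new RuntimeException("Could not read: $source");
    }
    return $contents;
}

try {
    $file_string = load_page('page_to_scrape.html');
} catch (RuntimeException $e) {
    // Fall back to an empty string so later preg_match() calls still run.
    $file_string = '';
    echo $e->getMessage(), "\n";
}
```

Checking with === false matters here, because an empty page is a valid result and would otherwise be mistaken for a failure.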
Let's start by extracting the text within our <title></title> tags. To accomplish this we use our preg_match() function with three parameters. The first parameter is our regular expression, the second is our variable containing the HTML content, and the third is our output array, which preg_match() fills with our results. The function itself returns 1 if the pattern matched and 0 if it did not.
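<?php
$file_string = file_get_contents('page_to_scrape.html');
// $title[0] receives the full match, $title[1] the text captured by (.*).
preg_match('/<title>(.*)<\/title>/i', $file_string, $title);
// Fall back to an empty string if the page has no title tag.
$title_out = isset($title[1]) ? $title[1] : '';
?>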
Let me explain what I did in the above code. First, we know that we want the text from within the title tags <title></title>, so we insert (.*) in between the title tags to capture any characters that we may have within them. When using a regular expression in the preg_match() function we need to enclose it within a pair of delimiters. Here we use two forward slashes, although other characters such as # or a pair of braces {} work as well. I append a lowercase i after the closing delimiter to make the search case-insensitive. We also need to escape the forward slash in the closing title tag so that it is not mistaken for the end of the regular expression. Keep in mind that (.*) is greedy and does not cross line breaks, so this pattern assumes the title sits on a single line and appears only once.
For our second parameter I passed our variable $file_string, which we defined earlier to contain our HTML content. Lastly we pass our third parameter, which receives an array of results. Now that we have an array, I assign the element that we want to the variable $title_out for later usage.
Next we need to get the meta description and meta keywords. We will do the same as above and just change the HTML and output names as follows.
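As a small illustration of the delimiter choice, the sketch below runs the title pattern twice, once with forward slashes and once with # as the delimiter. The one-line sample string is made up for the example; in the tutorial the input comes from page_to_scrape.html.

```php
<?php
// Hypothetical one-line sample; the tutorial's real input is page_to_scrape.html.
$file_string = '<html><head><TITLE>My Sample Page</TITLE></head></html>';

// Same pattern as in the tutorial, delimited by forward slashes.
preg_match('/<title>(.*)<\/title>/i', $file_string, $with_slashes);

// With # as the delimiter, the slash in </title> needs no escaping.
preg_match('#<title>(.*)</title>#i', $file_string, $with_hashes);

echo $with_slashes[1], "\n";
echo $with_hashes[1], "\n";
```

Both calls capture the same text. The upper-case TITLE tag still matches because of the i modifier.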
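<?php
// The stray space after \/> was removed so the tag also matches at the end of a line.
preg_match('/<meta name="keywords" content="(.*)" \/>/i', $file_string, $keywords);
$keywords_out = isset($keywords[1]) ? $keywords[1] : '';
preg_match('/<meta name="description" content="(.*)" \/>/i', $file_string, $description);
$description_out = isset($description[1]) ? $description[1] : '';
?>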
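The two meta patterns differ only in the name they look for, so they can be folded into one function. The following is a sketch rather than the tutorial's own method: get_meta_content() is a hypothetical helper, the sample markup is made up, and the pattern still assumes that the name attribute comes before content.

```php
<?php
// get_meta_content() is a hypothetical helper, introduced here for illustration.
function get_meta_content(string $html, string $name): string
{
    // preg_quote() escapes any regex metacharacters in $name.
    // [^"]* captures up to the next double quote, so it cannot overrun the attribute.
    $pattern = '/<meta\s+name="' . preg_quote($name, '/')
             . '"\s+content="([^"]*)"\s*\/?>/i';
    return preg_match($pattern, $html, $match) ? $match[1] : '';
}

// Made-up sample markup.
$html = '<meta name="keywords" content="php, scraping" />' . "\n"
      . '<meta name="description" content="A sample page">';

$keywords_out    = get_meta_content($html, 'keywords');
$description_out = get_meta_content($html, 'description');
$author_out      = get_meta_content($html, 'author'); // '' because the tag is absent
```

Using [^"]* in place of (.*) and making the trailing slash optional lets the same pattern handle both the <meta ... /> and <meta ...> spellings.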
Finally we need to retrieve our list of links on the page. In my sample HTML document I have my links enclosed within <li></li> tags. I will use this in conjunction with the <a></a> tags to extract my data. For this we will need our preg_match_all() function so that we can get back more than one result. We pass our parameters to this function just as we did with the preg_match() function.
With the above code we now have an array assigned to $links with all of the results. Notice that I used the metacharacters (.*) more than once this time. Since the data differs from one link to the next, we need to let the script know that any set of characters may appear in those places. Our $links array contains the data within each href="" in $links[1] and the text between the <a></a> tags in $links[2].
Now that we have all of the data that we want to collect, we can simply print it out as follows:
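The snippet for this step does not appear in the text. Based on the description (links inside <li></li> tags, two (.*) groups, the href first and the link text second), it would look roughly like the sketch below. The exact pattern and the sample markup are assumptions; in the tutorial's script, $file_string already holds the contents of page_to_scrape.html.

```php
<?php
// Assumed sample markup; the real page_to_scrape.html is not shown in this tutorial.
$file_string = '<ul>' . "\n"
    . '<li><a href="http://example.com/one">First Link</a></li>' . "\n"
    . '<li><a href="http://example.com/two">Second Link</a></li>' . "\n"
    . '</ul>';

// Assumed pattern: group 1 is the href value, group 2 is the link text.
preg_match_all('/<li><a href="(.*)">(.*)<\/a><\/li>/i', $file_string, $links);

print_r($links[1]); // the href values
print_r($links[2]); // the link texts
```

Because (.*) is greedy and does not cross line breaks, this pattern relies on each list item sitting on its own line.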
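By default preg_match_all() groups its results by capture group, which is why the hrefs and the link texts end up in separate sub-arrays. If one entry per link is easier to work with, the PREG_SET_ORDER flag changes the layout. The sketch below compares the two on made-up markup; the link pattern is an assumption based on the description in this tutorial.

```php
<?php
// Made-up sample markup for illustration.
$html = "<li><a href=\"/a.html\">Page A</a></li>\n<li><a href=\"/b.html\">Page B</a></li>";
// Assumed link pattern: group 1 is the href, group 2 is the link text.
$pattern = '/<li><a href="(.*)">(.*)<\/a><\/li>/i';

// Default order (PREG_PATTERN_ORDER): one sub-array per capture group.
// $by_group[1] holds all hrefs, $by_group[2] all link texts.
preg_match_all($pattern, $html, $by_group);

// PREG_SET_ORDER: one sub-array per match.
preg_match_all($pattern, $html, $by_match, PREG_SET_ORDER);
foreach ($by_match as $match) {
    echo $match[2], ' - ', $match[1], "\n";
}
```

Either layout carries the same data, so the choice mostly depends on how the results are looped over afterwards.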
<p><strong>Title:</strong> <?php echo htmlspecialchars($title_out); ?></p>
<p><strong>Keywords:</strong> <?php echo htmlspecialchars($keywords_out); ?></p>
<p><strong>Description:</strong> <?php echo htmlspecialchars($description_out); ?></p>
<p><strong>Links:</strong> <em>(Name - Link)</em></p>
<?php
// Escape scraped values before output; $links[1] holds hrefs, $links[2] link text.
echo '<ol>';
foreach ($links[1] as $i => $href) {
    echo '<li>' . htmlspecialchars($links[2][$i]) . ' - ' . htmlspecialchars($href) . '</li>';
}
echo '</ol>';
?>