Introduction

Web scraping and data mining are popular subjects in today’s web world. Since scraping is a touchy legal subject, it’s always best to get permission from a site’s owners before you scrape its data. We’re using our own site for this tutorial just to be safe from a legal perspective. Today we’re going to discuss how to make a basic web scraper with just a few lines of PHP, using regular expressions (regex) to find the specific data we want. We’ll then display the collected data in the browser.

Diving Into Code

Since there’s no real set-up or user interface for this tutorial, we’re just going to jump straight into the PHP code. Let’s get to it!

First, we want to get the contents of the page we’re scraping. We can do this with the file_get_contents() function:
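A minimal sketch of that call might look like the following. The URL is a placeholder standing in for our home page, and the five-second timeout and the failure message are assumptions added for illustration.

```php
<?php
// Placeholder address (an assumption): use a page you have permission to scrape.
$url = 'https://example.com/';

// Assumption: give up after five seconds instead of the default sixty.
ini_set('default_socket_timeout', '5');

// file_get_contents() returns the page body as a string, or false on failure.
// The @ suppresses the warning so we can report the problem ourselves.
$html = @file_get_contents($url);

if ($html === false) {
    echo 'Could not fetch ' . $url;
}
```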

We’re saving the entire HTML of our home page as a string in the variable $html.

Next, we need to decide which data we want to extract. In this case, we just want the titles of the articles and their corresponding links. We’re going to use PHP’s preg_match_all() function to extract our data with regex.

Here’s the code for grabbing the titles and links:
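The exact pattern depends on the page’s markup. The sketch below assumes each article title sits in an h2 heading that wraps a link, which may not match your site. The short $html string is a stand-in for the fetched page so the snippet runs on its own.

```php
<?php
// Stand-in for the fetched page (an assumption about the site's markup).
$html = '<h2 class="entry-title"><a href="https://example.com/first-post">First Post</a></h2>'
      . '<h2 class="entry-title"><a href="https://example.com/second-post">Second Post</a></h2>';

// Group 1 captures the link URL, group 2 captures the title text.
// The i flag ignores case; the s flag lets . match newlines inside a title.
$pattern = '#<h2[^>]*>\s*<a\s+href="([^"]+)"[^>]*>(.*?)</a>\s*</h2>#is';

// PREG_SET_ORDER makes each element of $posts one complete match:
// $posts[n][0] is the full match, [1] the URL, [2] the title.
preg_match_all($pattern, $html, $posts, PREG_SET_ORDER);
```

With this layout, $posts[0][1] holds the first link and $posts[0][2] holds the first title.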

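As you can see, we’re doing a little more than just searching for the data with regex. We’re also storing the matches in an array, and the PREG_SET_ORDER flag groups them so that each element of the array holds one post’s full match along with its captured title and link.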

Now we want to do something with our new array full of data, $posts. Let’s echo out each title as a link to its corresponding URL:
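One way to write that loop is sketched below. The sample $posts array is a stand-in shaped like a PREG_SET_ORDER result, with the URL in position 1 and the title in position 2 (an assumption about the capture order). The escaping calls are an added precaution rather than a requirement.

```php
<?php
// Stand-in for the array produced by preg_match_all() with PREG_SET_ORDER.
$posts = [
    ['<h2>...</h2>', 'https://example.com/first-post', 'First Post'],
    ['<h2>...</h2>', 'https://example.com/second-post', 'Second Post'],
];

foreach ($posts as $post) {
    // The final false argument avoids double-encoding entities
    // that are already present in the scraped markup.
    $link  = htmlspecialchars($post[1], ENT_QUOTES, 'UTF-8', false);
    $title = htmlspecialchars(strip_tags($post[2]), ENT_QUOTES, 'UTF-8', false);

    // Each post becomes one paragraph containing a clickable title.
    $line = '<p><a href="' . $link . '">' . $title . '</a></p>';
    echo $line . "\n";
}
```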

Next, we’ll add a counter to see how many posts we’re returning. After the foreach statement, type the following:
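A simple counter can lean on PHP’s count() function, as in this sketch. The sample $posts array and the "Total posts" label are assumptions for illustration.

```php
<?php
// Stand-in for the array of matches built earlier.
$posts = [
    ['<h2>...</h2>', 'https://example.com/first-post', 'First Post'],
    ['<h2>...</h2>', 'https://example.com/second-post', 'Second Post'],
];

// count() returns the number of elements, which is one per matched post.
$count = count($posts);

echo '<p>Total posts: ' . $count . '</p>';
```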

Our whole file should look something like this:
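Putting the steps together gives something like the sketch below. It rests on the same assumptions as before: a placeholder URL, a five-second timeout, h2 headings that wrap each article link, and a "Total posts" label. Adjust the pattern to the markup of the page you actually scrape.

```php
<?php
// Placeholder address (an assumption): use a page you have permission to scrape.
$url = 'https://example.com/';

// Assumption: give up after five seconds instead of the default sixty.
ini_set('default_socket_timeout', '5');

// Fetch the page; on failure, fall back to an empty string so nothing matches.
$html = @file_get_contents($url);
if ($html === false) {
    $html = '';
}

// Assumed markup: <h2 ...><a href="URL">Title</a></h2> for every article.
$pattern = '#<h2[^>]*>\s*<a\s+href="([^"]+)"[^>]*>(.*?)</a>\s*</h2>#is';
preg_match_all($pattern, $html, $posts, PREG_SET_ORDER);

// Print each title as a link inside its own paragraph.
foreach ($posts as $post) {
    $link  = htmlspecialchars($post[1], ENT_QUOTES, 'UTF-8', false);
    $title = htmlspecialchars(strip_tags($post[2]), ENT_QUOTES, 'UTF-8', false);
    echo '<p><a href="' . $link . '">' . $title . '</a></p>' . "\n";
}

// Report how many posts were found.
$count = count($posts);
echo '<p>Total posts: ' . $count . '</p>';
```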

That’s really just a few lines of code. When we run this script in our browser, it returns an HTML page with a “list” (not actually a ul or ol, but paragraphs) of our titles, each a clickable link to its full article page.

Conclusion

We’ve learned some of the basics of web scraping with PHP and regex. Remember to use your knowledge responsibly, and always ask for permission before you scrape web pages.