How to scrape using Guzzle, Simple HTML Dom and anyIP.io?
In this quick tutorial, we will show you how to start scraping any website using Guzzle (a PHP HTTP client library) and rotating proxies from anyIP.io.
The recommended way to install Guzzle is through Composer.
composer require guzzlehttp/guzzle
How to use Guzzle
Following the documentation, opening a page using Guzzle is pretty simple:
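// Load Composer's autoloader (path assumes the script sits next to vendor/).
require 'vendor/autoload.php';

$client = new \GuzzleHttp\Client();
$res = $client->request('GET', 'https://www.example.com');
echo $res->getBody();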
To use a proxy, you have to add a proxy parameter:
// PROXY_HOST and PORT are placeholders: use the proxy endpoint from your anyIP.io account.
$res = $client->request("GET", "https://www.example.com", [
    "proxy" => "https://username:password@PROXY_HOST:PORT",
]);
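Proxy credentials often contain characters such as "@" or ":" that break the proxy URL if they are pasted in as-is. Below is a minimal sketch of a helper that builds the request options safely. The function name `buildProxyOptions`, the timeout values, and the default `http` scheme are assumptions for illustration; `proxy`, `connect_timeout`, `timeout` and `http_errors` are standard Guzzle request options.

```php
<?php
// Hypothetical helper (not part of Guzzle): builds the options array for a proxied request.
function buildProxyOptions(
    string $user,
    string $password,
    string $host,
    int $port,
    string $scheme = 'http' // assumed default; use whatever your proxy provider documents
): array {
    // URL-encode the credentials so "@" or ":" inside them cannot corrupt the proxy URL.
    $proxyUrl = sprintf(
        '%s://%s:%s@%s:%d',
        $scheme,
        rawurlencode($user),
        rawurlencode($password),
        $host,
        $port
    );

    return [
        'proxy'           => $proxyUrl, // route the request through the proxy
        'connect_timeout' => 10,        // seconds allowed to establish the connection (assumed value)
        'timeout'         => 30,        // seconds allowed for the whole response (assumed value)
        'http_errors'     => false,     // return 4xx/5xx responses instead of throwing exceptions
    ];
}

// Usage with placeholder values:
// $options = buildProxyOptions('username', 'p@ss:word', 'proxy.example.com', 8080);
// $res = $client->request('GET', 'https://www.example.com', $options);
```

Setting `http_errors` to `false` is a design choice: a blocked or failed request then comes back as a normal response object, so the scraper can inspect the status code and retry instead of catching an exception.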
How to parse the page?
The content of the page is in $res->getBody(). After checking that you actually got the correct result (the status code is 200, the Content-Type header is text/html or similar, etc.), you can start to parse the page. There are many options for this:
- Use a regex
- Use the DOM library from PHP
- Use Simple HTML Dom
- Use Ultimate Web Scraper
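Whichever parser you pick, the check mentioned above (status code and content type) comes first. Here is a minimal sketch of that check. The function name `isParsableHtml` and the exact list of accepted media types are assumptions; `getStatusCode()` and `getHeaderLine()` are the standard PSR-7 response methods that Guzzle responses expose.

```php
<?php
// Hypothetical helper: decides whether a response looks like a page worth parsing.
function isParsableHtml(int $statusCode, string $contentType): bool
{
    // Only a plain 200 is treated as success in this sketch.
    if ($statusCode !== 200) {
        return false;
    }

    // The header may carry parameters, e.g. "text/html; charset=UTF-8": keep the media type only.
    $mediaType = strtolower(trim(explode(';', $contentType)[0]));

    // Accept any text/* type plus XHTML (assumed whitelist; adjust to your targets).
    return strncmp($mediaType, 'text/', 5) === 0
        || $mediaType === 'application/xhtml+xml';
}

// Usage with a Guzzle response:
// if (isParsableHtml($res->getStatusCode(), $res->getHeaderLine('Content-Type'))) {
//     $html = (string) $res->getBody();
// }
```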
As a quick introduction to the scraping world, we will use Simple HTML Dom. After installing and loading it, you can use a CSS selector to retrieve the content of your choice:
// Cast the body stream to a string before handing it to the parser.
$simpleHTMLDom = str_get_html((string) $res->getBody());
$links = $simpleHTMLDom->find('a');
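For comparison, the same link extraction can be done with the DOM library that ships with PHP, the second option in the list above. This is a minimal sketch, assuming the `dom` extension is enabled; the function name `extractLinks` is hypothetical.

```php
<?php
// Hypothetical helper: returns the href of every <a> element found in an HTML string.
function extractLinks(string $html): array
{
    // DOMDocument::loadHTML() rejects an empty string, so handle that case up front.
    if (trim($html) === '') {
        return [];
    }

    $doc = new DOMDocument();

    // Real-world pages are rarely valid HTML: collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $hrefs = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        // Skip anchors without an href (e.g. named anchors).
        if ($anchor->hasAttribute('href')) {
            $hrefs[] = $anchor->getAttribute('href');
        }
    }

    return $hrefs;
}

// Usage with a Guzzle response:
// $links = extractLinks((string) $res->getBody());
```

The built-in extension avoids an extra dependency, at the cost of working with tag names or XPath queries rather than CSS selectors.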