Search Engine Scraper source code
Project offered by compunect [scraping@compunect.com] last successful test run: 28 Jan 2016This advanced PHP source code is developed to power scraping based projects. While the code can already be used from console (or browser) this source is mainly a base for customization. You can either customize this project by yourself or hire us to do what we can do best. compunect is an IT services and development company founded in Germany and now situated in Czech Republic focused at professional customers.
This free Search Engine Scraper already includes:
This scraper can operate 24 hours a day 7 days a week without getting blocked
Full support for the Google search engine
Scraping a list of keywords
Detection of organic results
Iterating through multiple result pages (configureable)
Scraping accurate global but targets also local results (by country) when using highest quality US IP addresses.
Supports Google filters (configureable)
Proper IP management, it can use our IP service API and automatically acquire IP addresses.
It will use proper delays between requests and prevents getting banned.
Data cache and history to prevent unrequired and overusage of IP addresses.
Accessible source code design to make customization easier
Perfectly suitable as background process in Linux environments
search-engine-scraper.phpScraping search engines became a serious business in the past years and it remains a very challenging task. We know how difficult it can be to find an experienced developer in this area and it is hardly possible at all to find detailed information online. We took quite a step by providing this source code for free as it contains rare knowledge and there is nothing else comparable available. We still release this for free, you may use this source code in your commercial project without paying us a cent. However, if you require customization or additional features we offer such services, after all who else could do it better ? If you require a professionally managed Linux server to run your projects on: we can help you to get this accomplished at a fair rate. You definitely will require high quality, dedicated IP addresses to power your project. We offer these services as well and would be glad to find a solution for you. If you are interested in scraping projects, check out the Google Suggest Scraping Spider as well. The Suggest Scraper can generate thousands of organic search relevant terms to be scraped.
More to know about scraping
It took us months of testing and developing to get accurate results from Google when using automated scripts. This source code already includes most of this work. We even included the possibility to gather local search results, so you can scrape results from any country without using IP addresses from that country. However, to receive correct results you will also need exceptional good IP addresses. We can provide this for you if you struggle to do it on your own. Extending the source to work for Bing, Yahoo or another search engine should not be a big leap as many of the core functions will stay similar.
What to do with this tool?
There are countless very interesting activities where this scraper comes in handy. Do you invest in Google adwords to have your websites ranked for competitive search terms ? Then you likely struggle with all those thousands of keywords Google wants you to invest money in, which ones to choose and which ones are a waste of money? Imagine being able to check your website rank for thousands of keywords and key phrases and only pay for those where your website is not ranked good enoughy. You can even automate the whole process using the adwords API to pay according to your organic rank per keyword and update this monthly. And on top of Googles own suggestions, maybe there are hundreds of oragnic relevant key phrases you do not even know about ? Use the Google Suggest Scraping Spider to find what people are really looking for, then use this Google Search Scraper to find out if you are ranked already. Are you optimizing your websites for Google or are you in the SEO business optimizing for your customers ? Track thousands of websites and keywords to see where you have to invest work. That way you can also track the efficiency of your various methods to improve the rank. Or go one step further and offer your customers a graph for all their websites and keywords which shows how well your work has influenced the ranks. Or go even one more step further and analyze the ranks of hundreds of thousands worldwide companies. You can use our Google Finance Scraping Spider to get all the companies out of Google Finance. You may also make the whole project interactive for users, let them get ranks or charts according to their keywords and websites. Of course this project can also be used to just brute force get massive amounts of URLs, titles according to a set of keywords. By doing regular scrape runs and putting the results into a database with timestamp you can unleash the real power of this project, if you need help to develop such extensions I am ready for hire.
IP/Proxy management
When scraping it is most essential to avoid detection. Google would ban any user who tries to automatically scrape their search engine results. In the worst case they can throw out a ban which blocks ten thousands of IP addresses permanently. This is usually all that happens, it threatens the project but not the legal entity behind it. However there is also a legal threat. If you do not accept the search engine TOS you should not have legal threats with passively scraping it. To make sure about that you need to consult your local lawyer. In any case it is possible to avoid getting detected, the free Search Engine Scraper on this website can be used longterm without detection. a) It will send Google requests at a rate of 10 requests per hour per IP address. b) It will calculate a proper delay between each request. c) It will not accept any tracking offered by Google. d) It will rotate the IP address at the correct moments. e) It will keep a local data cache and IP history.
Google captcha blocks automated access
If following these guidelines a block by captcha due to your own actions are very unlikely. When using a different IP/Proxy service the reason most likely come from shared IP usage or previous abuse. The Google Search Scraper from here already contains code to detect, detection and abort in that case. There are different typical error messages Google issues when it decided to block or slow down activity. Here are two examples:
We're sorry... ... but your query looks similar to automated requests from a computer virus or spyware application. To protect our users, we can't process your request right now. We'll restore your access as quickly as possible, so try again soon. In the meantime, if you suspect that your computer or network has been infected, you might want to run a virus checker or spyware remover to make sure that your systems are free of viruses and other spurious software. We apologize for the inconvenience, and hope we'll see you again on Google.
We're sorry... ... but your query looks similar to automated requests from a computer virus or spyware application. To protect our users, we can't process your request right now. We'll restore your access as quickly as possible, so try again soon. In the meantime, if you suspect that your computer or network has been infected, you might want to run a virus checker or spyware remover to make sure that your systems are free of viruses and other spurious software. If you're continually receiving this error, you may be able to resolve the problem by deleting your Google cookie and revisiting Google. For browser-specific instructions, please consult your browser's online support center. If your entire network is affected, more information is available in the Google Web Search Help Center. We apologize for the inconvenience, and hope we'll see you again on Google. To continue searching, please type the characters you see below:
Often a captcha is offered to continue searching, in the worst case Google completely blocks all access to one or all services for one or multiple IPs. This is a worst case scenario, if you stick to the peak rates and use IPs from us-proxies.com it is unlikely you will run into this problem.
US-Proxy support
This project runs through a US Proxy service, powered through the supplied API it is possible to scrape millions of results without getting blocked. The benefit of using us-proxies.com is an easily extendable IP service providing the best IP quality in the industry at a fair price aimed toward professionals. However, the code is not limited to this particular service. You are free to adapt the source to suit your needs.
Google Search Scraper PHP code
The source code is written in PHP and is ready to be used immediately. You can either make an agreement with us-proxies for IP addresses or replace the relevant parts and use your own IP solution. Before using the source code please read the license agreement.
Example output
Here is an example result-set from a test-run:Keyword: Scraping PHP
!Ranking information for keyword "Scraping PHP" ! !Rank [Type] - Website - Title! 1 [organic] - http://stackoverflow.com/questions/34120/html-scraping-in-php - HTML Scraping in Php - Stack Overflow 2 [organic] - http://www.oooff.com/php-scripts/basic-php-scrape-tutorial/basic-php-scraping.php - Basic PHP Web Scraping Script Tutorial - Oooff.com 3 [organic] - http://anchetawern.github.io/blog/2013/08/07/getting-started-with-web-scraping-in-php - Getting Started with Web Scraping in PHP - Wern Ancheta 4 [organic] - http://simplehtmldom.sourceforge.net/ - PHP Simple HTML DOM Parser 5 [organic] - http://www.jacobward.co.uk/web-scraping-with-php-curl-part-1/ - Web Scraping With PHP & CURL [Part 1] | Jacob WardJacob Ward 6 [organic] - https://github.com/fabpot/goutte - fabpot/Goutte · GitHub 7 [organic] - http://www.instructables.com/id/Beginning-web-page-scraping-with-php/ - Beginning web page scraping with php. - Instructables 8 [organic] - https://www.phparch.com/books/phparchitects-guide-to-web-scraping-with-php/ - php|architect's Guide to Web Scraping with PHP « php[architect ... 9 [organic] - http://scraping.pro/scraping-in-php-with-curl/ - Scraping in PHP with cURL - Web Scraping 10 [organic] - http://www.ymc.ch/en/webscraping-in-php-with-guzzle-http-and-symfony-domcrawler - Webscraping in PHP with Guzzle HTTP and Symfony DomCrawler ... 11 [organic] - https://code.tutsplus.com/tutorials/html-parsing-and-screen-scraping-with-the-simple-html-dom-library--net-11856 - HTML Parsing and Screen Scraping with the Simple HTML DOM ... 12 [organic] - http://www.eppie.net/simple-php-scraper-class/ - Simple PHP Scraper Class | - Eppie.net 13 [organic] - http://jacerdass.wordpress.com/2013/07/17/web-scrapping-done-right-using-php/ - Web scraping done right using PHP | Jacer Omri's Blog 14 [organic] - http://www.youtube.com/watch?v=632ql93H90g - Scraping Websites with PHP using DOMDocument and DOMXpath ... 15 [organic] - http://www.youtube.com/watch?v=Uv4eASStpas - PHP web scraping tutorial 1 : Automated Registration Form - YouTube 16 [organic] - http://www.devhour.net/filling-out-forms-with-php-and-curl/ - Scraping data with PHP and cURL Devhour 17 [organic] - http://www.thefutureoftheweb.com/blog/web-scrape-with-php-tutorial - Easy web scraping with PHP - The Future of the Web » Articles » 18 [organic] - http://www.devblog.co/php-web-page-scraping-tutorial/ - PHP Web Page Scraping Tutorial | DevBlog.co 19 [organic] - http://www.notprovided.eu/six-tools-web-scraping-use-data-journalism-creating-insightful-content/ - 6 tools for scraping - Use for datajournalism & insightful content 20 [organic] - http://www.packtpub.com/web-scraping-with-php/book - Instant PHP Web Scraping [Instant] | Packt Publishing 21 [organic] - http://showmethecode.es/php/php-goutte-una-libreria-para-hacer-web-scraping/ - PHP: Goutte una librerÃa para hacer web scraping - Show me the code 22 [organic] - http://www.sitepoint.com/image-scraping-symfonys-domcrawler/ - Image Scraping with Symfony's DomCrawler - SitePoint 23 [organic] - http://www.amazon.com/Instant-PHP-Scraping-Jacob-Ward-ebook/dp/B00E7NC9CS - Amazon.com: Instant PHP Web Scraping eBook: Jacob Ward: Kindle ... 24 [organic] - http://code.google.com/p/universal-web-scraper/ - Universal Web Scraper - Google Code 25 [organic] - http://hackaday.com/2012/12/10/web-scraping-tutorial/ - Web scraping tutorial - Hack a Day 26 [organic] - http://www.tdbowman.com/?p=426 - Web Scraping Using PHP and jQuery | Managing My Impression 27 [organic] - https://barebonescms.com/documentation/ultimate_web_scraper_toolkit/ - Ultimate Web Scraper Toolkit Documentation - Barebones CMS 28 [organic] - http://www.phpclasses.org/package/1754-PHP-Extract-structured-data-from-remote-HTML-pages.html - PHP Scraper: Extract structured data from remote HTML pages ... 29 [organic] - http://imbuzu.wordpress.com/2013/06/26/web-scraping-with-php/ - Web Scraping with PHP | Buzu's Oficial Blog 30 [organic] - http://developer.yahoo.com/yql/guide/yql-code-examples.html - YQL Code Examples - YDN 31 [organic] - http://blog.wlindley.com/2013/07/easy-screen-scraping-in-php/ - Easy screen scraping in PHP | A journal of my take on this wacky world 32 [organic] - http://www.mozenda.com/php-screen-scrape - PHP Screen Scrape Software Program Tool set - Mozenda 33 [organic] - http://acrl.ala.org/techconnect/?p=3850 - Web Scraping: Creating APIs Where There Were None ACRL ... 34 [organic] - http://www.matthewwatts.net/tutorials/php-tutorial-2-advanced-data-scraping-using-curl-and-xpath/ - PHP Tutorial 2: Advanced Data Scraping Using cURL And XPATH ... 35 [organic] - http://www.martinhurford.com/screen-scraping-with-php-querypath.html - Screen Scraping with PHP and QueryPath - Martin Hurford 36 [organic] - http://www.webmasterworld.com/php/4652704.htm - Web scraping PHP Server Side Scripting forum at WebmasterWorld 37 [organic] - https://leanpub.com/web-scraping - Web Scraping for PHP… by sameer borate [Leanpub PDF/iPad/Kindle] 38 [organic] - http://www.akshitsethi.me/parsing-web-pages-in-php/ - Parsing web pages in PHP | Akshit Sethi 39 [organic] - http://google-scraper.squabbel.com/ - Scraping Google for Fun and Profit 40 [organic] - http://blog.cnizz.com/2012/10/12/scrape-faster-with-php-domdocument-and-safely-with-tor/ - Scrape Faster with PHP DomDocument and Safely with Tor | Chris ... 41 [organic] - https://classic.scraperwiki.com/docs/php/php_intro_tutorial/ - Documentation / First scraper tutorial | ScraperWiki 42 [organic] - http://snipplr.com/view/22188/ - Easy scraping and HTML parsing with PHP5 and XPath - PHP ... 43 [organic] - http://lab.abhinayrathore.com/imdb/ - Free PHP ASP.net C# VB.net IMDb Scraper API and Web Service ... 44 [organic] - http://www.lie-nielsen.com/scraping-planes/large-scraping-plane/ - Large Scraping Plane - Lie-Nielsen Toolworks 45 [organic] - http://saturnboy.com/2010/03/scraping-google-groups/ - Scraping Google Groups « Saturnboy 46 [organic] - http://www.amazon.co.uk/Instant-PHP-Scraping-Jacob-Ward/dp/1782164766 - Instant PHP Web Scraping: Amazon.co.uk: Jacob Ward: Books 47 [organic] - http://sledgedev.com/build-a-scraper-with-php/ - Sledge Dev – Build a scraper with php 48 [organic] - http://www.maxprog.com/forum/viewtopic.php?f=11 - Maxprog Forum • View topic - scraping php sites? 49 [organic] - http://books.google.com/books?id=Q-cEMrCWckkC - Instant PHP Web Scraping - Google Books Result 50 [organic] - https://www.odesk.com/o/profiles/users/_~01d067ffb7cb06ee0e/ - Sandip Debnath - Proxy&Login-Bots/Scraping/Php/Regex/Ai/Ajax ... 51 [organic] - http://saf33r.com/web-scraping-101-with-php-and-goutte - Web Scraping 101 with PHP and Goutte | Safeer 52 [organic] - http://www.redscraper.com/blog/basic-of-web-scraping-using-php/ - Basic of Web Scraping Using PHP | Redscraper Blog 53 [organic] - http://skybluesofa.com/blog/how-use-phps-domdocument-scrape-web-page/ - How to Use PHP's DOMDocument to Scrape a Web Page - Sky Blue ... 54 [organic] - http://webdata-scraping.com/data-scraping-pdf-files-using-php/ - How to do data scraping from PDF files using PHP? | WebData ... 55 [organic] - http://www.russellbeattie.com/blog/using-php-to-scrape-web-sites-as-feeds - Using PHP to scrape web sites as feeds - Russell Beattie 56 [organic] - http://www.slideshare.net/tobias382/web-scraping-with-php-presentation - Web Scraping with PHP - SlideShare 57 [organic] - http://www.hochmanconsultants.com/articles/stop-email-spam.shtml - Code to Prevent Email Address Scraping and Form Spam via PHP ... 58 [organic] - http://www.developertutorials.com/tutorials/php/easy-screen-scraping-in-php-simple-html-dom-library-simplehtmldom-398/ - Easy Screen Scraping in PHP with the Simple HTML DOM Library 59 [organic] - http://thinkdiff.net/php/php-for-web-scraping-and-bot-development/ - PHP for Web scraping and bot development | Thinkdiff.net 60 [organic] - http://rojan.com.np/scraping-nodejs-vs-php/ - Rojan's blog | Scraping – Nodejs Vs Php 61 [organic] - http://www.barattalo.it/2013/12/08/php-jquery-dom-navigating-scrape-spider/ - Scraping content with PHP as if was jQuery, PHP jQuery like methods 62 [organic] - http://www.webdeveloper.com/forum/showthread.php?230985-Blocking-php-curl-from-scraping-website-content - Blocking php curl from scraping website content - WebDeveloper.com 63 [organic] - http://www.reddit.com/r/PHP/comments/1xiygj/what_is_the_best_php_library_for_scraping/ - What is the best php library for scraping websites, and filling out ... 64 [organic] - http://neerajpro.wordpress.com/2013/09/16/web-scraping-and-bot-development-using-php/ - web scraping and bot development using PHP | OPEN LEARNING 65 [organic] - http://wiki.vuze.com/w/Scrape - Scrape - VuzeWiki 66 [organic] - http://papermashup.com/use-jquery-and-php-to-scrape-page-content/ - Use jQuery and PHP to scrape page content | Papermashup.com 67 [organic] - http://ctrlq.org/code/19064-web-scraping-amazon - Web Scraping Amazon with PHP | The Programmer's Library 68 [organic] - https://www.facebook.com/apps/site_scraping_tos_terms.php - Automated Data Collection Terms - Facebook 69 [organic] - http://tyler.io/2008/05/scraping-imdb-with-php/ - Scraping IMDB With PHP | tyler.io 70 [organic] - http://www.coderanch.com/t/549196/PHP/Solved-Regular-Expressions-Scraping - [Solved] Help With Regular Expressions/Scraping (PHP forum at ... 71 [organic] - http://web3o.blogspot.com/2010/10/php-imdb-scraper-for-new-imdb-template.html - FREE! PHP IMDb Scraper/API for new IMDb Template 72 [organic] - http://superuser.my/web-scraping-ganon-php/ - Web Scraping Using Ganon PHP Library | superuser.my 73 [organic] - http://www.hmp.is.it/scraping-a-site-with-php/ - Simple way of scraping a website using PHP - hmp.is.it 74 [organic] - http://www.scriptrr.com/ - Website Scraper | Forum Crawler | Screen Scrapping | Data Mining ... 75 [organic] - http://scraperblog.blogspot.com/2013/07/php-scrape-website-with-rotating-proxies.html - ScraperBlog: Php - scrape website with rotating proxies 76 [organic] - http://wiki.xbmc.org/index.php?title=Naming_video_files/TV_shows - Naming video files/TV shows - XBMC 77 [organic] - https://packagist.org/search/?tags=scraper - Scraper - Packagist 78 [organic] - http://codedit.com/php/web-scraping-with-php-curl - Codedit.com | Web Scraping with PHP & CURL 79 [organic] - http://www.screen-scraper.com/products/all.php - Web scraping software | screen-scraper.com 80 [organic] - http://www.warriorforum.com/programming-talk/530802-scraping-websites-use-php-regexp-something-else.html - Scraping websites - use PHP and Regexp or something else ... 81 [organic] - http://codeatomic.com/services/web-scraping/ - Code Atomic Web scraping php (web harvesting or web data ... 82 [organic] - http://jon.netdork.net/2011/02/21/nagios-web-scraping-and-php-as-an-agent - Nagios, web scraping, and PHP as an agent - TheGeekery 83 [organic] - http://www.scrapegoat.com/faqs.php - FAQs Page - Data Mining and Screen Scraping from ScrapeGoat.com 84 [organic] - http://www.indeed.com/q-PHP-Scraping-jobs.html - PHP Scraping Jobs, Employment | Indeed.com 85 [organic] - https://forums.digitalpoint.com/threads/php-screen-scraping-specific-data.2680501/ - PHP screen scraping specific data - Digital Point Forums 86 [organic] - http://codereview.stackexchange.com/questions/40538/why-is-my-web-scraping-script-so-slow - php - Why is my web scraping script so slow? - Code Review Stack ... 87 [organic] - http://www.freelancer.com/jobs/Web-Scraping/ - Web Scraping Jobs and Contests | Freelancer.com 88 [organic] - http://www.fiverr.com/systemexpert/code-a-php-scraper-that-will-scrape-5-items-from-a-website-of-your-choice - code a php scraper that will scrape 5 items from a website of your choi 89 [organic] - http://forums.macrumors.com/showthread.php?t=1689584 - Setting up a web scraping system - MacRumors Forums 90 [organic] - http://www.4shared.com/office/CC-9NLJn/php_architects_guide_to_web_sc.html - php architect's guide to web scraping with php - Download - 4shared 91 [organic] - http://raphaelstolt.blogspot.com/2008/10/scraping-websites-with-zenddomquery.html -Requirements: * PHP 5.2 or higher, PHP libCURL and PHP DOM * user permissions to write at the local directory (caching) * us proxies API support (professional IP provider) Download the source code here: search-engine-scraper.php functions-ses.php simple_html_dom.php: Scraping websites with Zend_Dom_Query 92 [organic] - http://www.codefire.org/blogs/item/data-scraping-using-curl-in-php.html - Data scraping using cURL in PHP - CodeFire 93 [organic] - http://matthewturland.com/2010/04/20/web-scraping-with-php-now-available/ - Matthew Turland » Blog Archive » “Web Scraping with PHP†Now ... 94 [organic] - http://www.xmarks.com/site/www.bradino.com/php/screen-scraping/ - PHP Screen Scraping Tutorial - Xmarks 95 [organic] - http://www.dmxzone.com/go/4402/page-scraping/ - Page Scraping - Articles - DMXzone.COM 96 [organic] - http://blog.makewebsmart.com/scraping-library-for-codeigniter-framework/136 - Scraping library for CodeIgniter Framework | MakeWebSmart 97 [organic] - http://www.phpninja.info/blog/2013/08/crawling-scraping-app-store-andor-android-market/ - Crawling and Scraping App Store and/or Android Market - Php Ninja 98 [organic] - http://www.phpdeveloper.org/tag/scraping - scraping - PHPDeveloper: PHP News, Views and Community 99 [organic] - http://www.phpbuilder.com/columns/marc_plotz011410.php3 - PHPBuilder - Build a PHP Link Scraper with cURL 100 [organic] - http://www.archiveteam.org/index.php?title=URLTeam - URLTeam - Archiveteam 101 [organic] - http://devzone.zend.com/1087/php-abstract-episode-22-screen-scraping/ - PHP Abstract Episode 22: Screen Scraping | Zend Developer Zone 102 [organic] - http://www.ngo-hung.com/blog/2012/11/03/list-of-open-source-screen-scraping-tools - List of open source screen scraping tools - Ngo The Hung's blog 103 [organic] - http://entropytc.com/screen-scraping-with-php/ - Screen scraping with PHP - Entropy Technical Consulting 104 [organic] - http://www.fromzerotoseo.com/scraping-websites-php-curl-proxy/ - Scraping websites with PHP cURL under proxy | From Zero To SEO 105 [organic] - http://www.yiiframework.com/extension/yiiscrapermodule/ - yiiscrapermodule | Extension | Yii PHP Framework 106 [organic] - https://docs.google.com/document/d/18Q2THQvYCG2_n6nKVsZRHlaPG9iJ9NvLezOOQbEuAJs/edit?hl=en - Tipsheet: Web Scraping for Non-Programmers - Google Drive 107 [organic] - http://www.digeratimarketing.co.uk/2008/12/16/curl-page-scraping-script/ - CURL Page Scraping Script - Digerati Marketing 108 [organic] - http://www.shekhargovindarajan.com/scripts/web-scraping-with-firefox-and-php-using-xpath/ - Web Scraping with Firefox and PHP, using XPath | Shekhar ... 109 [organic] - http://www.quickscrape.com/ - QuickScrape | Quick php html scraper and crawler for scraping and ... 110 [organic] - http://www.linkedin.com/groups/Php-Web-Html-Content-Scraping-4818098 - Php Web Html Content Scraping Help | LinkedIn 111 [organic] - http://forum.codecall.net/topic/77005-scraping-charts-from-this-website/ - Scraping charts from this website? - PHP - Codecall 112 [organic] - https://www.elance.com/r/contractors/q-PHP%20cURL%20Data%20Scraping - Find PHP cURL Data Scraping Freelancers & Contractors 113 [organic] - http://php.dzone.com/news/gotcha-scraping-net - Gotcha on Scraping .NET Applications with PHP and cURL | PHP ... 114 [organic] - https://itunes.apple.com/us/book/instant-php-web-scraping/id680880119?mt=11 - iTunes - Books - Instant PHP Web Scraping by Jacob Ward 115 [organic] - http://www.zacharydavidbiles.com/2012/05/scraping-pinterest-with-php/ - Scraping Pinterest with PHP | Zach Biles – Cartersville, GA Web ... 116 [organic] - http://www.ebook3000.com/php-architect-s-Guide-to-Web-Scraping-with-PHP_113893.html - php|architect's Guide to Web Scraping with PHP - Free eBooks ... 117 [organic] - http://www.weblee.co.uk/2009/06/18/simple-dom-helper-for-codeigniter/ - Simple Dom Helper codeigniter | Screen Scraping | PHP ... - Web Lee 118 [organic] - http://www.nicolasmarin.com/web-scraper-con-php/ - Web scraper con PHP | Nicolás MarÃn 119 [organic] - http://www.quora.com/Web-Scraping/How-do-you-scrape-asp-or-php-pages - Web Scraping: How do you scrape .asp or .php pages? - Quora 120 [organic] - http://www.urbandictionary.com/define.php?term=scraper - Urban Dictionary: scraper 121 [organic] - http://forums.phpfreaks.com/topic/276972-scraping-the-data-from-website/ - scraping the data from website - PHP Coding Help - PHP Freaks 122 [organic] - http://www.h-net.org/reviews/showrev.php?id=37101 - H-Net Reviews 123 [organic] - http://www.connotate.com/technology/product - Automated Web Data Collection | Intelligent Web Scraping | Hosted ... 124 [organic] - http://phptrends.com/dig_in/scraping - scraping - PHP Trends, libraries and frameworks 125 [organic] - http://www.tonido.com/blog/index.php/2013/12/28/web-scraping-and-legal-issues/ - Web Scraping and Legal Issues - Tonido 126 [organic] - http://elanmarikit.me/2011/03/scraping-aspnet-page-in-php-curl.html - Scraping ASP.NET page in PHP Curl | PHP/Web Development 127 [organic] - http://www.r-bloggers.com/scraping-table-from-any-web-page-with-r-or-cloudstat/ - Scraping table from any web page with R or CloudStat | (R news ... 128 [organic] - http://www.peopleperhour.com/freelance/web+scraping+php+curl - Web scraping php curl - PeoplePerHour.com 129 [organic] - http://dayat.net/introduction-to-scraping-techniques/ - Introduction To Scraping Techniques | Dayat Technologies 130 [organic] - http://robertbasic.com/blog/book-review-guide-to-web-scraping-with-php - Book review - Guide to Web Scraping with PHP ~ Robert Basic ~ the ... 131 [organic] - http://forums.whirlpool.net.au/archive/1983474 - Running a PHP scraping script - Programming - Whirlpool Forums 132 [organic] - http://www.adminspoint.com/programming/296-easy-screen-scraping-php-server-side-scripting-language-simple-html-dom-library.html - Easy Screen Scraping in PHP with the Simple HTML DOM Library 133 [organic] - http://www.hotscripts.com/forums/php/114448-data-scraping-question.html - Data Scraping Question - Hot Scripts Forums 134 [organic] - http://www.pearltrees.com/mic100/php-scraping/id4775553 - Php scraping | Pearltrees 135 [organic] - http://hublog.hubmed.org/archives/001558.html - HubLog: Scraping web pages with PHP 5 136 [organic] - http://blog.hartleybrody.com/web-scraping/ - I Don't Need No Stinking API: Web Scraping For Fun and Profit 137 [organic] - http://www.blackhatworld.com/blackhat-seo/black-hat-seo/565471-dev-php-crawler-scraping-video-sites.html - [DEV] PHP crawler for scraping video sites - Black Hat World 138 [organic] - http://deepinthecode.com/2014/02/28/scraping-div-element-web-page-php/ - Scraping a DIV Element from a Web Page with PHP – Deep in the ... 139 [organic] - http://ao2.it/en/blog/2013/07/07/tweeper-twitter-rss-web-scraper - Tweeper: a Twitter to RSS web scraper | en hacking | ao2.it 140 [organic] - http://bz9.com/index.php/youtube-scraper/ - YouTuber :: YouTube Scraper - BZ9.com 141 [organic] - https://phpacademy.org/topics/html-web-scraping-with-php/33032 - HTML Web Scraping with PHP | phpacademy 142 [organic] - http://blogoscoped.com/archive/2004_06_23_index.html - Screen-scraping With PHP5 | Googlebot Alert | Gmail Hype Ending ... 143 [organic] - http://superuser.com/questions/179253/how-legal-is-site-scraping-using-curl - php - How "legal" is site-scraping using cURL? - Super User 144 [organic] - http://osdir.com/ml/org.user-groups.php.uphpu/2008-09/msg00075.html - org.user-groups.php.uphpu - Web site scraping - msg#00075 ... 145 [organic] - http://my.safaribooksonline.com/book/programming/php/9781782164760/1dot-instant-php-web-scraping/ch01s09_html - Instant PHP Web Scraping > 1. Instant PHP Web Scraping ... 146 [organic] - https://discussion.dreamhost.com/thread-125593.html - php curl screen scraping program needs an if fork - DreamHost Forum 147 [organic] - http://www.daniweb.com/web-development/php/threads/289020/blocking-php-curl-from-scraping-website-content - Blocking php curl from scraping website content | DaniWeb 148 [organic] - http://leandroarts.com/how-to-scrape-google-search-results-for-query-popularity-with-php/ - How to scrape Google search results for query popularity with PHP ... 149 [organic] - http://jimblackler.net/blog/?p=13 - Jim Blackler · Scraping text from Wikipedia using PHP 150 [organic] - http://www.mishainthecloud.com/2009/12/screen-scraping-aspnet-application-in.html - Misha in the Cloud: Screen-scraping an ASP.NET application in PHP 151 [organic] - http://ehelion.net/projects/htmlscrape/scrape.html - Collecting data using HTML scraping - ehelion.com 152 [organic] - http://www.wellho.net/resources/ex.php4?item=h307/scraper.php - Scraping a remote URL content - PHP example 153 [organic] - http://horusss2.wordpress.com/2009/12/05/use-php-dom-parser-for-more-robust-screen-scraping/ - Use PHP DOM Parser for more robust screen scraping | THIS BLOG ... 154 [organic] - http://www.amitsamtani.com/2010/03/30/web-scraping-using-php-and-xpath/ - Web Scraping using PHP and XPath - amitsamtani.com 155 [organic] - http://99webtools.com/extract-website-data.php - Extract website data using php - Web tools 156 [organic] - http://www.iwebscraping.com/Web_Scraping_Service.php - Web Scraping Service | Web Data Scraping | Website Scraping 157 [organic] - http://www.windbusinessfactor.it/storage/video/1309/-php-architects--guide-to-web-scraping-with-php.pdf - php|architect's Guide to Web Scraping with PHP - Wind Business ... 158 [organic] - http://www.computerhope.com/forum/index.php?topic=129466.0 - PHP cURL (Scraping a website) - Computer Hope 159 [organic] - http://scrapedefender.com/education/web-scraping-job-listings/ - Data and Web Scraping Job Listings | Scrape Defender 160 [organic] - http://wordpress.org/plugins/wp-web-scrapper/other_notes/ - WordPress › WP Web Scraper « WordPress Plugins 161 [organic] - http://phpcircle.net/content/website-scraping-advantages-php - Website Scraping Advantages With PHP !! | PHPCircle 162 [organic] - http://devtrench.com/posts/screen-scrape-with-php-curl - Screen Scraping: How to Screen Scrape a Website with PHP and ... 163 [organic] - http://forums.devshed.com/php-development-5/scraping-aspx-site-php-799426.html - Scraping an aspx site with php - Dev Shed Forums 164 [organic] - http://www.internetnews.com/ec-news/article.php/3334651 - Google Moves to Block RSS Scraping - InternetNews. 165 [organic] - http://softadvice.informer.com/Php_Email_Scraper.html - Php Email Scraper - free download suggestions - Software Advice 166 [organic] - http://sourabhjainblog.wordpress.com/2013/11/13/scraping-websites-with-php-curl-under-proxy/ - Scraping websites with PHP cURL under proxy | Sourabh Jain - php ... 167 [organic] - http://nbviewer.ipython.org/url/www.unc.edu/~ncaren/Lax-1.ipynb.json - Web scraping in Python - IPython Notebook Viewer 168 [organic] - http://scrollingtext.org/using-curl-and-user-agent-string-web-scraping-pt-2-now-php - Using curl and a user agent string for web scraping pt 2; Now with PHP 169 [organic] - http://blog.amhill.net/2010/09/17/scraping-twitpics-with-php-coding/ - Scraping Twitpics with PHP [Coding] | Blog.amhill 170 [organic] - http://corgitoergosum.net/2011/01/17/replicating-flipboard-part-i-site-scraping/ - Replicating Flipboard Part I – Site Scraping | Cogito Ergo Sum 171 [organic] - http://www.earthinfo.org/xpaths-with-php-by-example/ - XPaths with PHP by example « Earth Info 172 [organic] - https://trac.transmissionbt.com/ticket/4158 - (scraping trackers of form "announce.php?key ... - Transmission 173 [organic] - http://harmssite.com/2012/01/scraping-a-page-with-php - Scraping a page with php - HarmsSite 174 [organic] - http://bytes.com/topic/php/answers/889713-blocking-php-curl-scraping-website-content - Blocking php curl from scraping website content - PHP - Bytes 175 [organic] - http://blog.digitalmethods.net/2010/asimpletwitterscraper/ - A simple Twitter scraper - Digital Methods Initiative 176 [organic] - http://www.satya-weblog.com/2010/11/play-with-yql-html-scraping-using-yql-and-php.html - Play with YQL: HTML Scraping using YQL and PHP - Satya's Weblog 177 [organic] - https://www.e-education.psu.edu/geog863/l6_p6.html - Web Scraping | GEOG 863: Mashups - e-Education Institute 178 [organic] - http://php.find-info.ru/php/010/phphks-CHP-5-SECT-12.html - PHP: Hack 44. Scrape Web Pages for Data 179 [organic] - https://support.startpage.com/index.php?/Knowledgebase/Article/View/188/23/how-does-startpage-prevent-scraping-and-abuse-without-recording-ip-addresses - How does StartPage prevent scraping and abuse without recording ... 180 [organic] - http://www.seerinteractive.com/blog/scraping-for-dummies-with-outwit-a-marketers-best-friend - Scraping for Dummies with Outwit (a Marketer's Best Friend) | SEER ... 181 [organic] - http://health.mo.gov/lab/scabies.php - Skin Scraping Exam | State Public Health Laboratory | Health ... 182 [organic] - http://junseewebdesigner.wordpress.com/2013/08/05/php-scrape-a-wordpress-feed/ - PHP Scrape a WordPress Feed | Junsee 183 [organic] - http://blog.matthewdfuller.com/2012/07/defeating-x-frame-options-with-scraping.html - Matthew D Fuller - Blog: Defeating X-Frame-Options with Scraping 184 [organic] - http://www.garysieling.com/blog/scraping-google-maps-search-results-with-javascript-and-php - Scraping Google Maps Search Results With Javascript And PHP ... 185 [organic] - http://tellini.info/2011/05/scraping-mac-app-store-reviews/ - Scraping Mac App Store reviews | Simone Tellini 186 [organic] - http://forums.thedailywtf.com/forums/p/8578/162940.aspx - Lame PHP Screen Scraping - TDWTF Forums 187 [organic] - http://www.tutorialized.com/tutorial/Wikipedia-Content-Scraper-in-PHP/81662 - PHP Web Fetching Wikipedia Content Scraper in PHP Tutorial 188 [organic] - http://www.dreamincode.net/forums/topic/9687-programatically-logging-in-and-page-scraping/ - Programatically Logging In And Page Scraping - PHP | Dream.In.Code 189 [organic] - http://alexdglover.com/web-scraping-php-and-wheel-of-fortune/ - Alex D Glover Web Scraping, PHP, and Wheel of Fortune - Fun ... 190 [organic] - http://itsrj.com/2010/12/24/scraping-sites-using-curl-xpath/ - Scraping Sites Using cURL & XPath | it's rj 191 [organic] - http://scraperlab.com/ - ScraperLab | Web Scrapers Generator 192 [organic] - http://www.gamegecko.com/game/204/scrape - Scrape - GameGecko.com 193 [organic] - http://ask.metafilter.com/98518/Web-scraping-for-dummies - Web scraping for dummies - php mysql programming | Ask MetaFilter 194 [organic] - http://kbeezie.com/scraping-google-results/ - Scraping Google Front Page Results » KBeezie 195 [organic] - http://forums.thetvdb.com/viewtopic.php?f=4 - TheTVDB.com • View topic - 503 errors using the API / Errors ... 196 [organic] - http://forums.devnetwork.net/viewtopic.php?f=1 - screen scraping a site which uses AJAX • PHP Developers Network 197 [organic] - http://technoloid.blogspot.com/2012/03/screen-scraping.html - Screen Scraping Tumblr Using Curl | Technoloid 198 [organic] - http://themanwhosoldtheweb.com/craigslist-email-scraper.php?tol - Craigslist Email Scraper - TheManWhoSoldtheWeb.com 199 [organic] - http://forums.digitizedesign.com/topic/1604-beginner-scraping-script-with-php-and-curl/ - Beginner scraping script with PHP and cURL - PHP - Digitize Design 200 [organic] - http://readwrite.com/2012/02/24/data-scraping-comes-of-age-wit - Data Scraping Comes of Age With ScraperWiki.com – ReadWrite 201 [organic] - http://www.binaryspark.com/classes/Art-of-the-scrape.pdf - Art of the scrape!!!! - BinarySpark.com 202 [organic] - http://rhodesmill.org/brandon/chapters/screen-scraping/ - Chapter 10: Screen Scraping by Brandon Rhodes - Rhodes Mill 203 [organic] - http://php.bigresource.com/Scraping-a-Secure-Site-3QvPycau.html - PHP :: Scraping A Secure Site 204 [organic] - http://www.nickycakes.com/scraping-websites-for-fun-and-profit-part-2/ - Scraping Websites for Fun and Profit Part 2 | NickyCakes.com 205 [organic] - http://books.google.com/books/about/PHP_Architect_s_Guide_to_Web_Scraping.html?id=H6O9cQAACAAJ - PHP-Architect's Guide to Web Scraping - Matthew Turland - Google ... 206 [organic] - https://community.x10hosting.com/threads/php-xpath-scraping-data-from-a-page.101059/ - PHP - XPATH - Scraping Data From A Page | x10Hosting Community 207 [organic] - http://nicklewis.org/node/962 - Stupid Simple Web Scraping with SimpleXML | Nick Lewis: The Blog 208 [organic] - http://www.newthinktank.com/2010/11/python-2-7-tutorial-pt-13-website-scraping/ - Python 2.7 Tutorial Pt 13 Website Scraping - New Think Tank 209 [organic] - http://byronwhitlock.com/FastCrawl/ - Whitlock Web Development - Fast Crawl PHP Web crawl framework 210 [organic] - http://gablaxian.com/2013/06/18/scraping-twitter-feeds-with-nodejs.html - Scraping Twitter Feeds with NodeJS | gablaxian.com 211 [organic] - http://programming.textures-tones.com/2012/01/30/basic-screen-scraping-part-1-basic-xml-parsing/ - Basic Screen Scraping – Part 1, Basic XML Parsing | programming ... 212 [organic] - http://www.nmdnet.org/2011/09/01/best-web-host-for-web-scraping-application/ - Best Web host for Web scraping application? » UMaine NMDNet 213 [organic] - http://thewebscraping.com/web-scraper-open-source-3/ - Web scraper open source | The Web Scraping 214 [organic] - http://skookum.com/blog/scraping-poorly-formatted-data-with-curl-and-phpquery/ - Scraping Poorly Formatted Data with cURL and phpQuery ... 215 [organic] - http://www.customwebscraping.com/php-web-scraping - PHP Web Scraping | Andrade Global 216 [organic] - http://www.lightspeedretail.com/blog/ - Retail Industry Blog – LightSpeed Retail POS 217 [organic] - https://www.distilled.net/blog/seo/building-your-own-scraper-for-link-analysis/ - Building Your Own Scraper for Link Analysis | Distilled 218 [organic] - http://datajournalismhandbook.org/1.0/en/getting_data_3.html - Getting Data from the Web - The Data Journalism Handbook 219 [organic] - http://jafty.com/blog/scraping-with-curl-using-cookies/ - Scraping with Curl using Cookies | Jafty Interactive Web Development 220 [organic] - http://zrashwani.com/simple-web-spider-php-goutte/ - Simple web spider with PHP Goutte | Z.Rashwani Blog 221 [organic] - http://blog.redbranch.net/2011/10/28/php-web-scraping-for-munin/ - PHP Web Scraping for Munin » Red Branch 222 [organic] - http://answers.google.com/answers/threadview/id/785059.html - Google Answers: Webscraping and WebMacros software 223 [organic] - http://www.topprojectshub.com/ - Outsourcing Data Entry, Data Scraping, Document Scanning, PHP ... 224 [organic] - http://opensourcebridge.org/sessions/97 - Web Scraping with PHP / Open Source Bridge: The conference for ... 225 [organic] - http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/6183/pdf/imm6183.pdf - Algorithms for Web Scraping 226 [organic] - http://www.armstrong-chemtec.com/rm/index.php?option=com_content - Scraped Surface Crystallizers 227 [organic] - https://joind.in/435 - Talk: Web Scraping with PHP - Joind.in 228 [organic] - https://thomashunter.name/blog/open-sourcing-my-php-web-scraper/ - Open Sourcing my PHP Web Scraper - Thomas Hunter II 229 [organic] - http://ubuntuforums.org/showthread.php?t=1259548 - [other] PHP scraping Help - Ubuntu Forums 230 [organic] - http://www.sunilb.com/php/writing-website-scrapers-in-php - Writing Website Scrapers in PHP | Geek Files 231 [organic] - http://blog.ericlamb.net/2009/01/a-journey-into-php-cli-and-scraping/ - A journey into php-cli and scraping | Made of Everything You're Not 232 [organic] - http://www.troywolf.com/articles/php/class_http/ - PHP class_http from Troy Wolf 233 [organic] - http://tutorialzine.com/2013/02/24-cool-php-libraries-you-should-know-about/ - 24 Cool PHP Libraries You Should Know About | Tutorialzine 234 [organic] - http://www.logaholic.de/2009/06/01/elegant-oop-html-scraping-with-domdocument/ - Elegant OOP HTML scraping with DOMDocument - Logaholic.de 235 [organic] - http://capelinks.net/about/internet/spamdexing/ - Spamdexing: Scrape-O-Rama ~ CapeLinks Internet Services 236 [organic] - http://www.proscraper.com/ - Professional Scraper - Website Scraping, Crawling, Data Mining ... 237 [organic] - http://www.mrwebmaster.it/php/web-scraping-php_7568.html - Il Web Scraping in PHP | PHP | Mr.Webmaster 238 [organic] - http://adamyoung.net/Quickstart-to-PHP-Screen-Scraping - Quickstart to PHP Screen Scraping | Adam Young 239 [organic] - http://blog.svnlabs.com/craigslist-scraper-tool/ - Craigslist Scraper Tool | S V N Labs Softwares 240 [organic] - http://www.codediesel.com/php/web-scraping-in-php-tutorial/ - Web scraping tutorial - CodeDiesel 241 [organic] - http://curl.phptrack.com/forum/viewtopic.php?f=1 - CURL PHP Examples • View topic - Problem scraping url - PHP CURL ... 242 [organic] - http://pp19dd.com/2009/11/php-algorithm-for-scraping-and-converting-a-twitter-list-into-rss-format-with-super-fancy-xpath-queries-in-six-awesomely-easy-steps/ - PHP algorithm for scraping and converting a twitter list into RSS ... 243 [organic] - http://forum.phux.org/viewtopic.php?f=12 - phux Development • View topic - Data Scraping - MetaCritic.com 244 [organic] - http://www.net-security.org/malware_news.php?id=1641 - Malware-driven pervasive memory scraping - Help Net Security 245 [organic] - http://www.php-forum.com/phpforum/viewtopic.php?f=2 - www.php-forum.com • View topic - Site Scraping with PHP, HTML ... 246 [organic] - http://lamp-dev.com/php-website-scraping-using-chrome-web-driver/635 - PHP Website Scraping using Chrome Web Driver | LAMPDev ... 247 [organic] - http://www.ebookgoogle.com/633701-phparchitects-guide-web-scraping-php-repost - php|architect's Guide to Web Scraping with PHP (Repost) - Study ... 248 [organic] - http://www.techsupportforum.com/forums/f49/php-screen-scraping-596493.html - PHP screen scraping - Tech Support Forum 249 [organic] - http://board.issociate.de/thread/495564/Static-andor-Dynamic-site-scraping-using-PHP.html - Static and/or Dynamic site scraping using PHP 250 [organic] - http://www.simplyhired.com/k-scraping-php-jobs.html - Scraping Php Jobs | Job Search with Simply Hired 251 [organic] - http://www.script-home.com/php-multithreaded-scraping-of-the-page-implementation-code.html - PHP multithreaded scraping of the page implementation code ... 252 [organic] - http://rottentomatoesdatascraping.blogspot.com/2013/05/managing-online-data-by-php-web-scraping.html - Managing Online Data by PHP Web Scraping - Rottentomatoes.com ... 253 [organic] - http://www.freelancer.co.uk/projects/PHP-Software-Architecture/web-scraping-php-script.html - web scraping php script | PHP | Software Architecture 254 [organic] - http://www.b.shuttle.de/hayek/Hayek/Jochen/wp/blog-en/2011/11/17/book-guide-to-web-scraping-with-php/ - book: Guide to Web Scraping with PHP | Jochen Hayek's Blog in ... 255 [organic] - http://www.solveerrors.com/forums/scraping-an-aspx-site-with-php-33513.asp - Scraping an aspx site with php - SolveErrors.com 256 [organic] - http://umuwa.com/php-web-scraping-script-download - php web scraping script download - at Umuwa 257 [organic] - http://avaxsearch.com/?q=Web%20Scraping%20PHP - Web Scraping PHP - Data on AvaxHome 258 [organic] - http://www.filestube.to/p2/php+architect+s+guide+to+web+scraping+with+php - Php architect s guide to web scraping with php download - FilesTube 259 [organic] - http://efreedom.net/Question/1-34120/HTML-Scraping-Php - HTML Scraping in Php - efreedom 260 [organic] - http://efreedom.net/Question/1-1332590/HTML-Comment-Scraping-PHP - HTML comment scraping in PHP - efreedom 261 [organic] - http://www.getacoder.com/projects/view.php?id=144412 - Scraping PHP To Mysql Database (MySQL, PHP, PHP/IIS/MS SQL) 262 [organic] - http://www.donanza.com/jobs/p3057980-php_scraping_php_mysql_scraping - Php Scraping - Php Mysql Scraping for Max. $500 - DoNanza 263 [organic] - http://www.freelancer.is/projects/PHP-MySQL/Scraping-PHP-cURL-REGEX-Experts.html - Scraping, PHP, cURL, REGEX Experts | Data Mining ... - Freelancer.is 264 [organic] - http://www.freelancer.in/job-search/web-scraping-php-simplexml-script/ - web scraping php simplexml script Freelancers and Jobs ... 265 [organic] - http://www.freelancer.com.au/projects/PHP-Software-Architecture/Scraping-site-asp-php.html - Scraping site asp - php | PHP | Software Architecture 266 [organic] - http://www.freelancer.co.za/projects/Perl/Scraping-site-asp-php-repost.html - Scraping site asp - php - repost | Perl - Freelancer.co.za 267 [organic] - http://www.freelancer.com.bd/projects/PHP-Website-Design/PHP-script-for-data-scraping.html - PHP script for data scraping - Freelancer.com.bd 268 [organic] - http://www.freelancer.ph/projects/PHP-MySQL/Web-Scraping-PHP-Preferred.html - Web Scraping (PHP Preferred) | Anything Goes | MySQL | PHP ... 269 [organic] - http://www.freelancer.pk/projects/PHP-Web-Scraping/web-scraping-bot-submit-form.html - web scraping and bot to submit form iMacros or PHP | Data Mining ... 270 [organic] - http://www.freelancer.com.jm/projects/PHP-Software-Architecture/Webpage-scraping-php-mysql-script.html - Webpage scraping php+mysql script - Freelancer.com.jm 271 [organic] - http://coding.derkeiler.com/Archive/PHP/php.general/2005-11/msg00154.html - Re: Web Screen Scraping PHP Help 272 [organic] - http://www.workingbase.com/project/PHP-login-to-a-website-programatically.2785673.html - PHP login to a website programatically (Javascript, PHP, Web ... 273 [organic] - http://www.filestube.com/p/php+architect+s+guide+to+web+scraping - Php architect s guide to web scraping download - FilesTube 274 [organic] - http://savedhistory.org/k/web-scraping-ebook-php - Web Scraping Ebook Php - savedwebhistory.org 275 [organic] - http://hostcabi.net/websites/web-scraping-php - Web Scraping Php Websites - HostCabi.net 276 [organic] - http://books.google.com/books?id=dqI-AQAAMAAJ - The Iron Age - Google Books Result 277 [organic] - http://books.google.com/books?id=64I4AQAAMAAJ - The Literary Digest - Google Books Result 278 [organic] - http://books.google.com/books?id=P54zAQAAMAAJ - Annual Report of the Pennsylvania Agricultural Experiment Station - Google Books Result 279 [organic] - http://alaskagulfcoastexpeditions.com/tf/index.php?hl=lint+traps+for+dryers - Lint traps for dryers - Alaska Gulf Coast Expeditions 280 [organic] - http://books.google.com/books?id=7W0-AQAAMAAJ - Harper's New Monthly Magazine - Google Books Result 281 [organic] - http://www.trapperman.com/forum/ubbthreads.php/topics/4403841/all/First_Time_Fleshing_Beaver - First Time Fleshing Beaver | Trapper Talk | Trapperman.com Forums 282 [organic] - http://books.google.com/books?id=nTYxAQAAMAAJ - Engineering - Google Books Result 283 [organic] - http://books.google.com/books?id=pl8vAAAAYAAJ - The country - Google Books Result 284 [organic] - http://en.wikipedia.org/wiki/Scrap - Scrap - Wikipedia, the free encyclopedia 285 [organic] - http://forum.gamesports.net/dota/showthread.php?84583-Add-metadata-to-website - Add metadata to website 286 [organic] - http://forum.the-west.net/showthread.php?p=716823 - The Tiran Wars: Liberty, at all Costs - Page 83 - Forum The West 287 [organic] - http://www.horseandhound.co.uk/forums/showthread.php?659234-Following-on-from-the-weaving-thread - Following on from the weaving thread - Horse and Hound 288 [organic] - http://forums.digitalspy.co.uk/showthread.php?p=71966376 - Why do people still buy watches? - Page 15 - General Discussion ... 289 [organic] - http://washingtondc.craigslist.org/doc/cps/4391028736.html - Database and application development asp.net php - Craigslist 290 [organic] - http://forum.bodybuilding.com/index.php - Bodybuilding.com Forums - Bodybuilding And Fitness Board 291 [organic] - http://www.disboards.com/showthread.php?p=51078875 - David's DVC rental and MDE?? - The DIS Discussion Forums ... 292 [organic] - http://worldoftanks.mmmos.com/?page=view - Side scraping, a good example - World of Tanks - MMMOs 293 [organic] - http://www.redpowermagazine.com/forums/index.php?showtopic=85925 - Finally made something out of myself. - Page 2 - Coffee Shop - Red ... 294 [organic] - http://www.redpowermagazine.com/forums/index.php?showtopic=85956 - mudslide in Washington state - Page 2 - Coffee Shop - Red Power ... 295 [organic] - http://www.dice.com/job/result/10531322/517235?src=19 - PHP Developer - Aqua Systems Inc - Roslyn, NY | dice.com - 3-28 ... 296 [organic] - http://forums.winamp.com/showthread.php?p=2988914 - Are skins lost? - Winamp Forums 297 [organic] - http://kumb.com/forum/viewtopic.php?f=2 - Knees Up Mother Brown - West Ham United FC Online: Forum • View ... 298 [organic] - http://www.wbaunofficial.org.uk/forum/showthread.php?tid=24834 - Fulham and Cardiff gone for me 299 [organic] - http://abierta.cl/index.php/abierta-act/areas/itemlist/user/706-joomlayldo - joomlayldo - Comunidad Abierta Arte, Ciencia y TecnologÃa 300 [organic] - http://forums.probetalk.com/showthread.php?s=5365733991fb268c77b6d46da2f40edb - Detailing KLG4. How deep to I go? - ProbeTalk.com Forums
#!/usr/bin/php
<?php
/* License:
Open source for private and commercial use but this comment needs to stay untouched on top.
URL of original source code: http://scraping.compunect.com
Author of original source code: http://www.compunect.com
IP rotation API code from here: http://www.us-proxies.com/automate
Under no circumstances and under no legal theory, whether in tort (including negligence), contract, or otherwise, shall the Licensor be liable to anyone for any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or the use of the Original Work including, without limitation, damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses. This limitation of liability shall not apply to the extent applicable law prohibits such limitation.
Usage exceptions:
Public redistributing modifications of this source code project is not allowed without written agreement.
Using this work for private and commercial projects is allowed, redistributing it is not allowed without our written agreement.
*/
ini_set("memory_limit","64M"); // For scraping 100 results pages 32MB memory expected, for scraping the default 10 results pages 4MB are expected. 64MB is selected just in case.
ini_set("xdebug.max_nesting_level","2000"); // precaution, might not be required. our parser will require a deep nesting level but I did not check how deep a 100 result page actually is.
error_reporting(E_ALL & ~E_NOTICE);
// ************************* Configuration variables *************************
// Your api credentials, you need a plan at us-proxies.com
// It's optional, you can remove the proxy related parts and just use it as a single-IP tool. Just make sure to implement a request delay of around 3-5 minutes in that case.
$pwd = your-key;
$uid = your-account-id;
// General configuration
$test_website_url = "website.com"; // The URL, or a sub-string of it, of the indexed website.
$test_keywords = "keyword,another keyword,more keywords"; // comma separated keywords to test the rank for
$test_max_pages = 3; // The number of result pages to test until giving up per keyword.
$test_100_resultpage = 0; // Warning: Google ranking results may become inaccurate
/* Local result configuration. Enter 'help' to receive a list of possible choices. use global and en for the default worldwide results in english
* You need to define a country as well as the language. Visit the Google domain of the specific country to see the available languages.
* Only a correct combination of country and language will return the correct search engine result pages. */
$test_country = "global"; // Country code. "global" is default. Use "help" to receive a list of available codes. [com,us,uk,fr,de,...]
$test_language = "en"; // Language code. "EN" is default Use "help" to receive a list. Visit the local Google domain to find available langauges of that domain. [en,fr,de,...]
$filter = 1; // 0 for no filter (recommended for maximizing content), 1 for normal filter (recommended for accuracy)
$force_cache = 0; // set this to 1 if you wish to force the loading of cache files, even if the files are older than 24 hours. Set to -1 if you wish to force a new scrape.
$load_all_ranks = 1; /* set this to 0 if you wish to stop scraping once the $test_website_url has been found in the search engine results,
* if set to 1 all $test_max_pages will be downloaded. This might be useful for more detailed ranking analysis.*/
$show_html = 0; // 1 means: output formated with HTML tags. 0 means output for console (recommended script usage)
$show_all_ranks = 1; // set to 1 to display a complete list of all ranks per keyword, set to 0 to only display the ranks for the specified website
// ***************************************************************************
$working_dir = "./local_cache"; // local directory. This script needs permissions to write into it
require "functions-ses.php";
$page = 0;
$PROXY = array(); // after the rotate api call this variable contains these elements: [address](proxy host),[port](proxy port),[external_ip](the external IP),[ready](0/1)
$PLAN = array();
$results = array();
if ($show_html) $NL = "<br>\n"; else $NL = "\n";
if ($show_html) $HR = "<hr>\n"; else $HR = "_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_\n";
if ($show_html) $B = "<b>"; else $B = "!";
if ($show_html) $B_ = "</b>"; else $B_ = "!";
/*
* Start of main()
*/
if ($show_html)
{
echo "<html><body>";
}
$keywords = explode(",", $test_keywords);
if (!count($keywords)) die ("Error: no keywords defined.$NL");
if (!rmkdir($working_dir)) die("Failed to create/open $working_dir$NL");
$country_data = get_google_cc($test_country, $test_language);
if (!$country_data) die("Invalid country/language code specified.$NL");
$ready = get_license();
if (!$ready) die("The specified API key account for user $uid is not active or invalid. $NL");
if ($PLAN['protocol'] != "http") die("Wrong proxy protocol configured, switch to HTTP. $NL");
echo "$NL$B Search Engine Scraper for $test_website_url initated $B_ $NL$NL";
/*
* This loop iterates through all keyword combinations
*/
$ch = NULL;
$rotate_ip = 0; // variable that triggers an IP rotation (normally only during keyword changes)
$max_errors_total = 3; // abort script if there are 3 keywords that can not be scraped (something is going wrong and needs to be checked)
$rank_data = array();
$siterank_data = array();
$break=0; // variable used to cancel loop without losing ranking data
foreach ($keywords as $keyword)
{
$rank = 0;
$max_errors_page = 5; // abort script if there are 5 errors in a row, that should not happen
if ($test_max_pages <= 0) break;
$search_string = urlencode($keyword);
$rotate_ip = 1; // IP rotation for each new keyword
/*
* This loop iterates through all result pages for the given keyword
*/
for ($page = 0; $page < $test_max_pages; $page++)
{
$serp_data = load_cache($search_string, $page, $country_data, $force_cache); // load results from local cache if available for today
$maxpages = 0;
if (!$serp_data)
{
$ip_ready = check_ip_usage(); // test if ip has not been used within the critical time
while (!$ip_ready || $rotate_ip)
{
$ok = rotate_proxy(); // start/rotate to the IP that has not been started for the longest time, also tests if proxy connection is working
if ($ok != 1)
{
die ("Fatal error: proxy rotation failed:$NL $ok$NL");
}
$ip_ready = check_ip_usage(); // test if ip has not been used within the critical time
if (!$ip_ready)
{
die("ERROR: No fresh IPs left, try again later. $NL");
} else
{
$rotate_ip = 0; // ip rotated
break; // continue
}
}
delay_time(); // stop scraping based on the license size to spread scrapes best possible and avoid detection
global $scrape_result; // contains metainformation from the scrape_serp_google() function
$raw_data = scrape_google($search_string, $page, $country_data); // scrape html from search engine
if ($scrape_result != "SCRAPE_SUCCESS")
{
if ($max_errors_page--)
{
echo "There was an error scraping (Code: $scrape_result), trying again .. $NL";
$page--;
continue;
} else
{
$page--;
if ($max_errors_total--)
{
echo "Too many errors scraping keyword $search_string (at page $page). Skipping remaining pages of keyword $search_string .. $NL";
break;
} else
{
die ("ERROR: Max keyword errors reached, something is going wrong. $NL");
}
break;
}
}
mark_ip_usage(); // store IP usage, this is very important to avoid detection and gray/blacklistings
global $process_result; // contains metainformation from the process_raw() function
$serp_data = process_raw_v2($raw_data, $page); // process the html and put results into $serp_data
if (($process_result == "PROCESS_SUCCESS_MORE") || ($process_result == "PROCESS_SUCCESS_LAST"))
{
$result_count = count($serp_data);
$serp_data['page'] = $page;
if ($process_result != "PROCESS_SUCCESS_LAST")
{
$serp_data['lastpage'] = 1;
} else
{
$serp_data['lastpage'] = 0;
}
$serp_data['keyword'] = $keyword;
$serp_data['cc'] = $country_data['cc'];
$serp_data['lc'] = $country_data['lc'];
$serp_data['result_count'] = $result_count;
store_cache($serp_data, $search_string, $page, $country_data); // store results into local cache
}
if ($process_result != "PROCESS_SUCCESS_MORE")
{
$break=1;
//break;
} // last page
if (!$load_all_ranks)
{
for ($n = 0; $n < $result_count; $n++)
if (strstr($results[$n]['url'], $test_website_url))
{
verbose("Located $test_website_url within search results.$NL");
$break=1;
//break;
}
}
} // scrape clause
$result_count = $serp_data['result_count'];
for ($ref = 0; $ref < $result_count; $ref++)
{
$rank++;
$rank_data[$keyword][$rank]['title'] = $serp_data[$ref]['title'];
$rank_data[$keyword][$rank]['url'] = $serp_data[$ref]['url'];
$rank_data[$keyword][$rank]['host'] = $serp_data[$ref]['host'];
$rank_data[$keyword][$rank]['desc'] = $serp_data[$ref]['desc'];
$rank_data[$keyword][$rank]['type'] = $serp_data[$ref]['type'];
//$rank_data[$keyword][$rank]['desc']=$serp_data['desc'']; // not really required
if (strstr($rank_data[$keyword][$rank]['url'], $test_website_url))
{
$info = array();
$info['rank'] = $rank;
$info['url'] = $rank_data[$keyword][$rank]['url'];
$siterank_data[$keyword][] = $info;
}
}
if ($break == 1) break;
} // page loop
} // keyword loop
if ($show_all_ranks)
{
foreach ($rank_data as $keyword => $ranks)
{
echo "$NL$NL$B" . "Ranking information for keyword \"$keyword\" $B_$NL";
echo "$B" . "Rank [Type] - Website - Title$B_$NL";
$pos = 0;
foreach ($ranks as $rank)
{
$pos++;
if (strstr($rank['url'], $test_website_url))
{
echo "$B$pos [$rank[type]] - $rank[url] - $rank[title] $B_$NL";
// echo $rank['desc']."\n";
} else
{
echo "$pos [$rank[type]] - $rank[url] - $rank[title] $NL";
// echo $rank['desc']."\n";
}
}
}
}
foreach ($keywords as $keyword)
{
if (!isset($siterank_data[$keyword]))
{
echo "$NL$B" . "The specified site was not found in the search results for keyword \"$keyword\". $B_$NL";
} else
{
$siteranks = $siterank_data[$keyword];
echo "$NL$NL$B" . "Ranking information for keyword \"$keyword\" and website \"$test_website_url\" [$test_country / $test_language] $B_$NL";
foreach ($siteranks as $siterank)
echo "Rank $siterank[rank] for URL $siterank[url]$NL";
}
}
//var_dump($siterank_data);
if ($show_html)
{
echo "</body></html>";
}
?>
<?PHP
/* License:
Open source for private and commercial use but this comment needs to stay untouched on top.
URL of original source code: http://scraping.compunect.com
Author of original source code: http://www.compunect.com
IP rotation API code from here: http://www.us-proxies.com/automate
Under no circumstances and under no legal theory, whether in tort (including negligence), contract, or otherwise, shall the Licensor be liable to anyone for any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or the use of the Original Work including, without limitation, damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses. This limitation of liability shall not apply to the extent applicable law prohibits such limitation.
Usage exceptions:
Public redistributing modifications of this source code project is not allowed without written agreement.
Using this work for private and commercial projects is allowed, redistributing it is not allowed without our written agreement.
*/
function verbose($text)
{
echo $text;
}
/*
* By default (no force) the function will load cached data within 24 hours otherwise reject the cache.
* Google does not change its ranking too frequently, that's why 24 hours has been chosen.
*
* Multithreading: When multithreading you need to work on a proper locking mechanism
*/
function load_cache($search_string, $page, $country_data, $force_cache)
{
global $working_dir;
global $NL;
global $test_100_resultpage;
if ($force_cache < 0) return NULL;
$lc = $country_data['lc'];
$cc = $country_data['cc'];
if ($test_100_resultpage)
{
$hash = md5($search_string . "_" . $lc . "_" . $cc . "." . $page . ".100p");
} else
{
$hash = md5($search_string . "_" . $lc . "_" . $cc . "." . $page);
}
$file = "$working_dir/$hash.cache";
$now = time();
if (file_exists($file))
{
$ut = filemtime($file);
$dif = $now - $ut;
$hour = (int)($dif / (60 * 60));
if ($force_cache || ($dif < (60 * 60 * 24)))
{
$serdata = file_get_contents($file);
$serp_data = unserialize($serdata);
verbose("Cache: loaded file $file for $search_string and page $page. File age: $hour hours$NL");
return $serp_data;
}
return NULL;
} else
{
return NULL;
}
}
/*
* Multithreading: When multithreading you need to work on a proper locking mechanism
*/
function store_cache($serp_data, $search_string, $page, $country_data)
{
global $working_dir;
global $NL;
global $test_100_resultpage;
$lc = $country_data['lc'];
$cc = $country_data['cc'];
if ($test_100_resultpage)
{
$hash = md5($search_string . "_" . $lc . "_" . $cc . "." . $page . ".100p");
} else
{
$hash = md5($search_string . "_" . $lc . "_" . $cc . "." . $page);
}
$file = "$working_dir/$hash.cache";
$now = time();
if (file_exists($file))
{
$ut = filemtime($file);
$dif = $now - $ut;
if ($dif < (60 * 60 * 24)) echo "Warning: cache storage initated for $search_string page $page which was already cached within the past 24 hours!$NL";
}
$serdata = serialize($serp_data);
file_put_contents($file, $serdata, LOCK_EX);
verbose("Cache: stored file $file for $search_string and page $page.$NL");
}
// check_ip_usage() must be called before first use of mark_ip_usage()
function check_ip_usage()
{
global $PROXY;
global $working_dir;
global $NL;
global $ip_usage_data; // usage data object as array
if (!isset($PROXY['ready'])) return 0; // proxy not ready/started
if (!$PROXY['ready']) return 0; // proxy not ready/started
if (!isset($ip_usage_data))
{
if (!file_exists($working_dir . "/ipdata.obj")) // usage data object as file
{
echo "Warning!$NL" . "The ipdata.obj file was not found, if this is the first usage of the rank checker everything is alright.$NL" . "Otherwise removal or failure to access the ip usage data will lead to damage of the IP quality.$NL$NL";
sleep(5);
$ip_usage_data = array();
} else
{
$ser_data = file_get_contents($working_dir . "/ipdata.obj");
$ip_usage_data = unserialize($ser_data);
}
}
if (!isset($ip_usage_data[$PROXY['external_ip']]))
{
verbose("IP $PROXY[external_ip] is ready for use $NL");
return 1; // the IP was not used yet
}
if (!isset($ip_usage_data[$PROXY['external_ip']]['requests'][20]['ut_google']))
{
verbose("IP $PROXY[external_ip] is ready for use $NL");
return 1; // the IP has not been used 20+ times yet, return true
}
$ut_last = (int)$ip_usage_data[$PROXY['external_ip']]['ut_last-usage']; // last time this IP was used
$req_total = (int)$ip_usage_data[$PROXY['external_ip']]['request-total']; // total number of requests made by this IP
$req_20 = (int)$ip_usage_data[$PROXY['external_ip']]['requests'][10]['ut_google']; // the 20th request (if IP was used 20+ times) unixtime stamp [changed to 10 due to Google issues]
$now = time();
if (($now - $req_20) > (60 * 60))
{
verbose("IP $PROXY[external_ip] is ready for use $NL");
return 1; // more than an hour passed since 20th usage of this IP [changed to 10]
} else
{
$cd_sec = (60 * 60) - ($now - $req_20);
verbose("IP $PROXY[external_ip] needs $cd_sec seconds cooldown, not ready for use yet $NL");
return 0; // the IP is overused, it can not be used for scraping without being detected by the search engine yet
}
}
// return 1 if license is ready, otherwise 0
function get_license()
{
global $uid;
global $pwd;
global $PLAN;
global $NL;
$res = ip_service("plan");
$ip = "";
if ($res <= 0)
{
verbose("API error: Proxy API connection failed (Error $res). trying again later..$NL$NL");
return 0;
} else
{
($PLAN['active'] == 1) ? $ready = "active" : $ready = "not active";
verbose("API success: Account is $ready.$NL");
if ($PLAN['active'] == 1) return 1;
return 0;
}
return $PLAN;
}
/* Delay (sleep) based on the license size to allow optimal scraping
*
* Warning!
* Do NOT change the delay to be shorter than the specified delay.
* When scraping Google you should never do more than 20 requests per hour per IP address
* The recommended value is 10, if you must go higher you can go up to 20 but I'd stay lower
* This function will create a delay based on your total IP addresses.
*
* Together with the IP management functions this will ensure that your IPs stay healthy (no wrong rankings) and undetected (no virus warnings, blacklists, captchas)
*
* Multithreading:
* When multithreading you need to multiply the delay time ($d) by the number of threads
*
* Due to Google getting stricter and stricter you might even have to lower the rate.
*/
function delay_time()
{
global $NL;
global $PLAN;
$d = (3600 * 1000000 / (((float)$PLAN['total_ips']) * 10));
verbose("Delay based on plan size.. $NL");
usleep($d);
}
/*
* Updates and stores the ip usage data object
* Marks an IP as used and re-sorts the access array
*/
function mark_ip_usage()
{
global $PROXY;
global $working_dir;
global $NL;
global $ip_usage_data; // usage data object as array
if (!isset($ip_usage_data)) die("ERROR: Incorrect usage. check_ip_usage() needs to be called once before mark_ip_usage()!$NL");
$now = time();
$ip_usage_data[$PROXY['external_ip']]['ut_last-usage'] = $now; // last time this IP was used
if (!isset($ip_usage_data[$PROXY['external_ip']]['request-total'])) $ip_usage_data[$PROXY['external_ip']]['request-total'] = 0;
$ip_usage_data[$PROXY['external_ip']]['request-total']++; // total number of requests made by this IP
// shift fifo queue
for ($req = 19; $req >= 1; $req--)
{
if (isset($ip_usage_data[$PROXY['external_ip']]['requests'][$req]['ut_google']))
{
$ip_usage_data[$PROXY['external_ip']]['requests'][$req + 1]['ut_google'] = $ip_usage_data[$PROXY['external_ip']]['requests'][$req]['ut_google'];
}
}
$ip_usage_data[$PROXY['external_ip']]['requests'][1]['ut_google'] = $now;
$serdata = serialize($ip_usage_data);
file_put_contents($working_dir . "/ipdata.obj", $serdata, LOCK_EX);
}
// access google based on parameters and return raw html or "0" in case of an error
function scrape_google($search_string, $page, $local_data)
{
global $ch;
global $NL;
global $PROXY;
global $PLAN;
global $scrape_result;
global $test_100_resultpage;
global $filter;
$scrape_result = "";
$google_ip = $local_data['domain'];
$hl = $local_data['lc'];
if ($page == 0)
{
if ($test_100_resultpage)
{
$url = "http://$google_ip/search?q=$search_string&hl=$hl&ie=utf-8&as_qdr=all&aq=t&rls=org:mozilla:us:official&client=firefox&num=100&filter=$filter";
} else
{
$url = "http://$google_ip/search?q=$search_string&hl=$hl&ie=utf-8&as_qdr=all&aq=t&rls=org:mozilla:us:official&client=firefox&num=10&filter=$filter";
}
} else
{
if ($test_100_resultpage)
{
$num = $page * 100;
$url = "http://$google_ip/search?q=$search_string&hl=$hl&ie=utf-8&as_qdr=all&aq=t&rls=org:mozilla:us:official&client=firefox&start=$num&num=100&filter=$filter";
} else
{
$num = $page * 10;
$url = "http://$google_ip/search?q=$search_string&hl=$hl&ie=utf-8&as_qdr=all&aq=t&rls=org:mozilla:us:official&client=firefox&start=$num&num=10&filter=$filter";
}
}
//verbose("Debug, Search URL: $url$NL");
curl_setopt($ch, CURLOPT_URL, $url);
$htmdata = curl_exec($ch);
if (!$htmdata)
{
$error = curl_error($ch);
$info = curl_getinfo($ch);
echo "\tError scraping: $error [ $error ]$NL";
$scrape_result = "SCRAPE_ERROR";
sleep(3);
return "";
} else
{
if (strlen($htmdata) < 20)
{
$scrape_result = "SCRAPE_EMPTY_SERP";
sleep(3);
return "";
}
}
if (strstr($htmdata, "computer virus or spyware application"))
{
echo("Google blocked us, we need more proxies ! Make sure you did not damage the IP management functions. Consider changing keywords and lowering request rates. $NL");
$scrape_result = "SCRAPE_DETECTED";
die();
}
if (strstr($htmdata, "entire network is affected"))
{
echo("Google blocked us, we need more proxies ! Make sure you did not damage the IP management functions. Consider changing keywords and lowering request rates. $NL");
$scrape_result = "SCRAPE_DETECTED";
die();
}
if (strstr($htmdata, "http://www.download.com/Antivirus"))
{
echo("Google blocked us, we need more proxies ! Make sure you did not damage the IP management functions. Consider changing keywords and lowering request rates. $NL");
$scrape_result = "SCRAPE_DETECTED";
die();
}
if (strstr($htmdata, "/images/yellow_warning.gif"))
{
echo("Google blocked us, we need more proxies ! Make sure you did not damage the IP management functions. Consider changing keywords and lowering request rates. $NL");
$scrape_result = "SCRAPE_DETECTED";
die();
}
if (strstr($htmdata, "This page appears when Google automatically detects requests coming from your computer network"))
{
echo("Google blocked us, we need more proxies ! Make sure you did not damage the IP management functions. Consider changing keywords and lowering request rates. $NL");
$scrape_result = "SCRAPE_DETECTED";
die();
}
$scrape_result = "SCRAPE_SUCCESS";
return $htmdata;
}
require_once "simple_html_dom.php";
function process_raw_v2($data, $page)
{
global $process_result; // contains metainformation from the process_raw() function
global $test_100_resultpage;
global $NL;
global $B;
global $B_;
$results=array();
$html = new simple_html_dom();
$html->load($data);
/** @var $interest simple_html_dom_node */
$interest = $html->find('div#ires ol div.g');
echo "found interesting elements: ".count($interest)."\n";
$interest_num=0;
foreach ($interest as $li)
{
$result = array('title'=>'undefined','host'=>'undefined','url'=>'undefined','desc'=>'undefined','type'=>'organic');
$interest_num ++;
$h3 = $li->find('h3.r',0);
if (!$h3)
{
continue;
}
$a = $h3->find('a',0);
if (!$a) continue;
$result['title'] = html_entity_decode($a->plaintext);
$lnk = urldecode($a->href);
if ($lnk)
{
preg_match('/(ht[^&]*)/', $lnk, $m);
if ($m && $m[1])
{
$result['url']=$m[1];
$tmp=parse_url($m[1]);
$result['host']=$tmp['host'];
} else
{
if (strstr($result['title'],'News')) $result['type']='news';
if (strstr($result['title'],'Images')) $result['type']='images';
}
}
if ($result['type']=='organic')
{
$sp = $li->find('span.st',0);
if ($sp)
{
$result['desc']=html_entity_decode($sp->plaintext);
$sp->clear();
}
}
$h3->clear();
$a->clear();
$li->clear();
$results[]=$result;
}
$html->clear;
// Analyze if more results are available (next page)
$next = 0;
if (strstr($data, "Next</a>"))
{
$next = 1;
} else
{
if ($test_100_resultpage)
{
$needstart = ($page + 1) * 100;
} else
{
$needstart = ($page + 1) * 10;
}
$findstr = "start=$needstart";
if (strstr($data, $findstr)) $next = 1;
}
$page++;
if ($next)
{
$process_result = "PROCESS_SUCCESS_MORE"; // more data available
} else
{
$process_result = "PROCESS_SUCCESS_LAST";
} // last page reached
return $results;
}
function rotate_proxy()
{
global $PROXY;
global $ch;
global $NL;
$max_errors = 3;
$success = 0;
while ($max_errors--)
{
$res = ip_service("rotate"); // will fill $PROXY
$ip = "";
if ($res <= 0)
{
verbose("API error: Proxy API connection failed (Error $res). trying again soon..$NL$NL");
sleep(21); // retry after a while
} else
{
verbose("API success: Received proxy IP $PROXY[external_ip] on port $PROXY[port]$NL");
$success = 1;
break;
}
}
if ($success)
{
$ch = new_curl_session($ch);
return 1;
} else
{
return "API rotation failed. Check license, firewall and API credentials.$NL";
}
}
function extractBody($response_str)
{
$parts = preg_split('|(?:\r?\n){2}|m', $response_str, 2);
if (isset($parts[1])) return $parts[1];
return '';
}
/*
* This is the API function to retrieve US IP addresses
* On success this function will define the global $PROXY variable, adding the elements ready,address,port,external_ip and return 1
* On failure the return is 0 or smaller and the PROXY variable ready element is set to "0"
* To obtain a plan please check out us-proxies.com, this can often be handled within a day
*/
function ip_service($cmd, $x = "")
{
global $pwd;
global $uid;
global $PROXY;
global $PLAN;
global $NL;
$fp = fsockopen("us-proxies.com", 80);
if (!$fp)
{
echo "Unable to connect to API $NL";
return -1; // connection not possible
} else
{
if ($cmd == "plan")
{
fwrite($fp, "GET /api.php?api=1&uid=$uid&pwd=$pwd&cmd=plan&extended=1 HTTP/1.0\r\nHost: us-proxies.com\r\nAccept: text/html, text/plain, text/*, */*;q=0.01\r\nAccept-Encoding: plain\r\nAccept-Language: en\r\n\r\n");
stream_set_timeout($fp, 8);
$res = "";
$n = 0;
while (!feof($fp))
{
if ($n++ > 4) break;
$res .= fread($fp, 8192);
}
$info = stream_get_meta_data($fp);
fclose($fp);
if ($info['timed_out'])
{
echo 'API: Connection timed out! $NL';
$PLAN['active'] = 0;
return -2; // api timeout
} else
{
if (strlen($res) > 1000) return -3; // invalid api response (check the API website for possible problems)
$data = extractBody($res);
$ar = explode(":", $data);
if (count($ar) < 4) return -100; // invalid api response
switch ($ar[0])
{
case "ERROR":
echo "API Error: $res $NL";
$PLAN['active'] = 0;
return 0; // Error received
break;
case "PLAN":
$PLAN['max_ips'] = $ar[1]; // number of IPs licensed
$PLAN['total_ips'] = $ar[2]; // number of IPs assigned
$PLAN['protocol'] = $ar[3]; // current proxy protocol (http, socks, ..)
$PLAN['processes'] = $ar[4]; // number of available proxy processes
if ($PLAN['total_ips'] > 0) $PLAN['active'] = 1; else $PLAN['active'] = 0;
return 1;
break;
default:
echo "API Error: Received answer $ar[0], expected \"PLAN\"";
$PLAN['active'] = 0;
return -101; // unknown API response
}
}
} // cmd==plan
if ($cmd == "rotate")
{
$PROXY['ready'] = 0;
fwrite($fp, "GET /api.php?api=1&uid=$uid&pwd=$pwd&cmd=rotate&randomness=0&offset=0 HTTP/1.0\r\nHost: us-proxies.com\r\nAccept: text/html, text/plain, text/*, */*;q=0.01\r\nAccept-Encoding: plain\r\nAccept-Language: en\r\n\r\n");
stream_set_timeout($fp, 8);
$res = "";
$n = 0;
while (!feof($fp))
{
if ($n++ > 4) break;
$res .= fread($fp, 8192);
}
$info = stream_get_meta_data($fp);
fclose($fp);
if ($info['timed_out'])
{
echo 'API: Connection timed out! $NL';
return -2; // api timeout
} else
{
if (strlen($res) > 1000) return -3; // invalid api response (check the API website for possible problems)
$data = extractBody($res);
$ar = explode(":", $data);
if (count($ar) < 4) return -100; // invalid api response
switch ($ar[0])
{
case "ERROR":
echo "API Error: $res $NL";
return 0; // Error received
break;
case "ROTATE":
$PROXY['address'] = $ar[1];
$PROXY['port'] = $ar[2];
$PROXY['external_ip'] = $ar[3];
$PROXY['ready'] = 1;
usleep(230000); // additional time to avoid connecting during proxy bootup phase, removing this can cause random connection failures but will increase overall performance for large IP licenses
return 1;
break;
default:
echo "API Error: Received answer $ar[0], expected \"ROTATE\"";
return -101; // unknown API response
}
}
} // cmd==rotate
}
}
function getip()
{
global $PROXY;
if (!$PROXY['ready']) return -1; // proxy not ready
$curl_handle = curl_init();
curl_setopt($curl_handle, CURLOPT_URL, 'http://ipcheck.ipnetic.com/remote_ip.php'); // returns the real IP
curl_setopt($curl_handle, CURLOPT_CONNECTTIMEOUT, 10);
curl_setopt($curl_handle, CURLOPT_TIMEOUT, 10);
curl_setopt($curl_handle, CURLOPT_RETURNTRANSFER, 1);
$curl_proxy = "$PROXY[address]:$PROXY[port]";
curl_setopt($curl_handle, CURLOPT_PROXY, $curl_proxy);
$tested_ip = curl_exec($curl_handle);
if (preg_match("^([1-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(\.([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3}^", $tested_ip))
{
curl_close($curl_handle);
return $tested_ip;
} else
{
$info = curl_getinfo($curl_handle);
curl_close($curl_handle);
return 0; // possible error would be a wrong authentication IP or a firewall
}
}
function new_curl_session($ch = NULL)
{
global $PROXY;
if ((!isset($PROXY['ready'])) || (!$PROXY['ready'])) return $ch; // proxy not ready
if (isset($ch) && ($ch != NULL))
{
curl_close($ch);
}
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$curl_proxy = "$PROXY[address]:$PROXY[port]";
curl_setopt($ch, CURLOPT_PROXY, $curl_proxy);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 20);
curl_setopt($ch, CURLOPT_TIMEOUT, 20);
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.0; en; rv:1.9.0.4) Gecko/2009011913 Firefox/3.0.6");
return $ch;
}
function rmkdir($path, $mode = 0755)
{
if (file_exists($path)) return 1;
return @mkdir($path, $mode);
}
/*
* For country&language specific searches
* The identifier codes require an active plan at us-proxies.com
* If you plan to omit the IP service just replace that part too or do not use language specifications at all
*/
function get_google_cc($cc, $lc)
{
global $pwd;
global $uid;
global $PROXY;
global $PLAN;
global $NL;
$fp = fsockopen("us-proxies.com", 80);
if (!$fp)
{
echo "Unable to connect to google_cc API of us-proxies.com $NL";
return NULL; // connection not possible
} else
{
// echo("GET /g_api.php?api=1&uid=$uid&pwd=$pwd&cmd=google_cc&cc=$cc&lc=$lc HTTP/1.0\r\nHost: us-proxies.com\r\nAccept: text/html, text/plain, text/*, */*;q=0.01\r\nAccept-Encoding: plain\r\nAccept-Language: en\r\n\r\n");
fwrite($fp, "GET /g_api.php?api=1&uid=$uid&pwd=$pwd&cmd=google_cc&cc=$cc&lc=$lc HTTP/1.0\r\nHost: us-proxies.com\r\nAccept: text/html, text/plain, text/*, */*;q=0.01\r\nAccept-Encoding: plain\r\nAccept-Language: en\r\n\r\n");
stream_set_timeout($fp, 8);
$res = "";
$n = 0;
while (!feof($fp))
{
if ($n++ > 4) break;
$res .= fread($fp, 8192);
}
$info = stream_get_meta_data($fp);
fclose($fp);
if ($info['timed_out'])
{
echo 'API: Connection timed out! $NL';
return NULL; // api timeout
} else
{
$data = extractBody($res);
$obj = unserialize($data);
if (isset($obj['error'])) echo $obj['error'] . "$NL";
if (isset($obj['info'])) echo $obj['info'] . "$NL";
return $obj['data'];
if (strlen($data) < 4) return NULL; // invalid api response
}
}
}
?>
<?php
/**
* Website: http://sourceforge.net/projects/simplehtmldom/
* Acknowledge: Jose Solorzano (https://sourceforge.net/projects/php-html/)
* Contributions by:
* Yousuke Kumakura (Attribute filters)
* Vadim Voituk (Negative indexes supports of "find" method)
* Antcs (Constructor with automatically load contents either text or file/url)
*
* all affected sections have comments starting with "PaperG"
*
* Paperg - Added case insensitive testing of the value of the selector.
* Paperg - Added tag_start for the starting index of tags - NOTE: This works but not accurately.
* This tag_start gets counted AFTER \r\n have been crushed out, and after the remove_noice calls so it will not reflect the REAL position of the tag in the source,
* it will almost always be smaller by some amount.
* We use this to determine how far into the file the tag in question is. This "percentage will never be accurate as the $dom->size is the "real" number of bytes the dom was created from.
* but for most purposes, it's a really good estimation.
* Paperg - Added the forceTagsClosed to the dom constructor. Forcing tags closed is great for malformed html, but it CAN lead to parsing errors.
* Allow the user to tell us how much they trust the html.
* Paperg add the text and plaintext to the selectors for the find syntax. plaintext implies text in the innertext of a node. text implies that the tag is a text node.
* This allows for us to find tags based on the text they contain.
* Create find_ancestor_tag to see if a tag is - at any level - inside of another specific tag.
* Paperg: added parse_charset so that we know about the character set of the source document.
* NOTE: If the user's system has a routine called get_last_retrieve_url_contents_content_type availalbe, we will assume it's returning the content-type header from the
* last transfer or curl_exec, and we will parse that and use it in preference to any other method of charset detection.
*
* Found infinite loop in the case of broken html in restore_noise. Rewrote to protect from that.
* PaperG (John Schlick) Added get_display_size for "IMG" tags.
*
* Licensed under The MIT License
* Redistributions of files must retain the above copyright notice.
*
* @author S.C. Chen <me578022@gmail.com>
* @author John Schlick
* @author Rus Carroll
* @version 1.5 ($Rev: 196 $)
* @package PlaceLocalInclude
* @subpackage simple_html_dom
*/
/**
* All of the Defines for the classes below.
* @author S.C. Chen <me578022@gmail.com>
*/
define('HDOM_TYPE_ELEMENT', 1);
define('HDOM_TYPE_COMMENT', 2);
define('HDOM_TYPE_TEXT', 3);
define('HDOM_TYPE_ENDTAG', 4);
define('HDOM_TYPE_ROOT', 5);
define('HDOM_TYPE_UNKNOWN', 6);
define('HDOM_QUOTE_DOUBLE', 0);
define('HDOM_QUOTE_SINGLE', 1);
define('HDOM_QUOTE_NO', 3);
define('HDOM_INFO_BEGIN', 0);
define('HDOM_INFO_END', 1);
define('HDOM_INFO_QUOTE', 2);
define('HDOM_INFO_SPACE', 3);
define('HDOM_INFO_TEXT', 4);
define('HDOM_INFO_INNER', 5);
define('HDOM_INFO_OUTER', 6);
define('HDOM_INFO_ENDSPACE',7);
define('DEFAULT_TARGET_CHARSET', 'UTF-8');
define('DEFAULT_BR_TEXT', "\r\n");
define('DEFAULT_SPAN_TEXT', " ");
define('MAX_FILE_SIZE', 600000);
// helper functions
// -----------------------------------------------------------------------------
// get html dom from file
// $maxlen is defined in the code as PHP_STREAM_COPY_ALL which is defined as -1.
function file_get_html($url, $use_include_path = false, $context=null, $offset = -1, $maxLen=-1, $lowercase = true, $forceTagsClosed=true, $target_charset = DEFAULT_TARGET_CHARSET, $stripRN=true, $defaultBRText=DEFAULT_BR_TEXT, $defaultSpanText=DEFAULT_SPAN_TEXT)
{
// We DO force the tags to be terminated.
$dom = new simple_html_dom(null, $lowercase, $forceTagsClosed, $target_charset, $stripRN, $defaultBRText, $defaultSpanText);
// For sourceforge users: uncomment the next line and comment the retreive_url_contents line 2 lines down if it is not already done.
$contents = file_get_contents($url, $use_include_path, $context, $offset);
// Paperg - use our own mechanism for getting the contents as we want to control the timeout.
//$contents = retrieve_url_contents($url);
if (empty($contents) || strlen($contents) > MAX_FILE_SIZE)
{
return false;
}
// The second parameter can force the selectors to all be lowercase.
$dom->load($contents, $lowercase, $stripRN);
return $dom;
}
// get html dom from string
function str_get_html($str, $lowercase=true, $forceTagsClosed=true, $target_charset = DEFAULT_TARGET_CHARSET, $stripRN=true, $defaultBRText=DEFAULT_BR_TEXT, $defaultSpanText=DEFAULT_SPAN_TEXT)
{
$dom = new simple_html_dom(null, $lowercase, $forceTagsClosed, $target_charset, $stripRN, $defaultBRText, $defaultSpanText);
if (empty($str) || strlen($str) > MAX_FILE_SIZE)
{
$dom->clear();
return false;
}
$dom->load($str, $lowercase, $stripRN);
return $dom;
}
// dump html dom tree
function dump_html_tree($node, $show_attr=true, $deep=0)
{
$node->dump($node);
}
/**
* simple html dom node
* PaperG - added ability for "find" routine to lowercase the value of the selector.
* PaperG - added $tag_start to track the start position of the tag in the total byte index
*
* @package PlaceLocalInclude
*/
class simple_html_dom_node
{
public $nodetype = HDOM_TYPE_TEXT;
public $tag = 'text';
public $attr = array();
public $children = array();
public $nodes = array();
public $parent = null;
// The "info" array - see HDOM_INFO_... for what each element contains.
public $_ = array();
public $tag_start = 0;
private $dom = null;
function __construct($dom)
{
$this->dom = $dom;
$dom->nodes[] = $this;
}
function __destruct()
{
$this->clear();
}
function __toString()
{
return $this->outertext();
}
// clean up memory due to php5 circular references memory leak...
function clear()
{
$this->dom = null;
$this->nodes = null;
$this->parent = null;
$this->children = null;
}
// dump node's tree
function dump($show_attr=true, $deep=0)
{
$lead = str_repeat(' ', $deep);
echo $lead.$this->tag;
if ($show_attr && count($this->attr)>0)
{
echo '(';
foreach ($this->attr as $k=>$v)
echo "[$k]=>\"".$this->$k.'", ';
echo ')';
}
echo "\n";
if ($this->nodes)
{
foreach ($this->nodes as $c)
{
$c->dump($show_attr, $deep+1);
}
}
}
// Debugging function to dump a single dom node with a bunch of information about it.
function dump_node($echo=true)
{
$string = $this->tag;
if (count($this->attr)>0)
{
$string .= '(';
foreach ($this->attr as $k=>$v)
{
$string .= "[$k]=>\"".$this->$k.'", ';
}
$string .= ')';
}
if (count($this->_)>0)
{
$string .= ' $_ (';
foreach ($this->_ as $k=>$v)
{
if (is_array($v))
{
$string .= "[$k]=>(";
foreach ($v as $k2=>$v2)
{
$string .= "[$k2]=>\"".$v2.'", ';
}
$string .= ")";
} else {
$string .= "[$k]=>\"".$v.'", ';
}
}
$string .= ")";
}
if (isset($this->text))
{
$string .= " text: (" . $this->text . ")";
}
$string .= " HDOM_INNER_INFO: '";
if (isset($node->_[HDOM_INFO_INNER]))
{
$string .= $node->_[HDOM_INFO_INNER] . "'";
}
else
{
$string .= ' NULL ';
}
$string .= " children: " . count($this->children);
$string .= " nodes: " . count($this->nodes);
$string .= " tag_start: " . $this->tag_start;
$string .= "\n";
if ($echo)
{
echo $string;
return;
}
else
{
return $string;
}
}
// returns the parent of node
// If a node is passed in, it will reset the parent of the current node to that one.
function parent($parent=null)
{
// I am SURE that this doesn't work properly.
// It fails to unset the current node from it's current parents nodes or children list first.
if ($parent !== null)
{
$this->parent = $parent;
$this->parent->nodes[] = $this;
$this->parent->children[] = $this;
}
return $this->parent;
}
// verify that node has children
function has_child()
{
return !empty($this->children);
}
// returns children of node
function children($idx=-1)
{
if ($idx===-1)
{
return $this->children;
}
if (isset($this->children[$idx])) return $this->children[$idx];
return null;
}
// returns the first child of node
function first_child()
{
if (count($this->children)>0)
{
return $this->children[0];
}
return null;
}
// returns the last child of node
function last_child()
{
if (($count=count($this->children))>0)
{
return $this->children[$count-1];
}
return null;
}
// returns the next sibling of node
function next_sibling()
{
if ($this->parent===null)
{
return null;
}
$idx = 0;
$count = count($this->parent->children);
while ($idx<$count && $this!==$this->parent->children[$idx])
{
++$idx;
}
if (++$idx>=$count)
{
return null;
}
return $this->parent->children[$idx];
}
// returns the previous sibling of node
function prev_sibling()
{
if ($this->parent===null) return null;
$idx = 0;
$count = count($this->parent->children);
while ($idx<$count && $this!==$this->parent->children[$idx])
++$idx;
if (--$idx<0) return null;
return $this->parent->children[$idx];
}
// function to locate a specific ancestor tag in the path to the root.
function find_ancestor_tag($tag)
{
global $debugObject;
if (is_object($debugObject)) { $debugObject->debugLogEntry(1); }
// Start by including ourselves in the comparison.
$returnDom = $this;
while (!is_null($returnDom))
{
if (is_object($debugObject)) { $debugObject->debugLog(2, "Current tag is: " . $returnDom->tag); }
if ($returnDom->tag == $tag)
{
break;
}
$returnDom = $returnDom->parent;
}
return $returnDom;
}
// get dom node's inner html
function innertext()
{
if (isset($this->_[HDOM_INFO_INNER])) return $this->_[HDOM_INFO_INNER];
if (isset($this->_[HDOM_INFO_TEXT])) return $this->dom->restore_noise($this->_[HDOM_INFO_TEXT]);
$ret = '';
foreach ($this->nodes as $n)
$ret .= $n->outertext();
return $ret;
}
// get dom node's outer text (with tag)
function outertext()
{
global $debugObject;
if (is_object($debugObject))
{
$text = '';
if ($this->tag == 'text')
{
if (!empty($this->text))
{
$text = " with text: " . $this->text;
}
}
$debugObject->debugLog(1, 'Innertext of tag: ' . $this->tag . $text);
}
if ($this->tag==='root') return $this->innertext();
// trigger callback
if ($this->dom && $this->dom->callback!==null)
{
call_user_func_array($this->dom->callback, array($this));
}
if (isset($this->_[HDOM_INFO_OUTER])) return $this->_[HDOM_INFO_OUTER];
if (isset($this->_[HDOM_INFO_TEXT])) return $this->dom->restore_noise($this->_[HDOM_INFO_TEXT]);
// render begin tag
if ($this->dom && $this->dom->nodes[$this->_[HDOM_INFO_BEGIN]])
{
$ret = $this->dom->nodes[$this->_[HDOM_INFO_BEGIN]]->makeup();
} else {
$ret = "";
}
// render inner text
if (isset($this->_[HDOM_INFO_INNER]))
{
// If it's a br tag... don't return the HDOM_INNER_INFO that we may or may not have added.
if ($this->tag != "br")
{
$ret .= $this->_[HDOM_INFO_INNER];
}
} else {
if ($this->nodes)
{
foreach ($this->nodes as $n)
{
$ret .= $this->convert_text($n->outertext());
}
}
}
// render end tag
if (isset($this->_[HDOM_INFO_END]) && $this->_[HDOM_INFO_END]!=0)
$ret .= '</'.$this->tag.'>';
return $ret;
}
// get dom node's plain text
function text()
{
if (isset($this->_[HDOM_INFO_INNER])) return $this->_[HDOM_INFO_INNER];
switch ($this->nodetype)
{
case HDOM_TYPE_TEXT: return $this->dom->restore_noise($this->_[HDOM_INFO_TEXT]);
case HDOM_TYPE_COMMENT: return '';
case HDOM_TYPE_UNKNOWN: return '';
}
if (strcasecmp($this->tag, 'script')===0) return '';
if (strcasecmp($this->tag, 'style')===0) return '';
$ret = '';
// In rare cases, (always node type 1 or HDOM_TYPE_ELEMENT - observed for some span tags, and some p tags) $this->nodes is set to NULL.
// NOTE: This indicates that there is a problem where it's set to NULL without a clear happening.
// WHY is this happening?
if (!is_null($this->nodes))
{
foreach ($this->nodes as $n)
{
$ret .= $this->convert_text($n->text());
}
// If this node is a span... add a space at the end of it so multiple spans don't run into each other. This is plaintext after all.
if ($this->tag == "span")
{
$ret .= $this->dom->default_span_text;
}
}
return $ret;
}
function xmltext()
{
$ret = $this->innertext();
$ret = str_ireplace('<![CDATA[', '', $ret);
$ret = str_replace(']]>', '', $ret);
return $ret;
}
// build node's text with tag
function makeup()
{
// text, comment, unknown
if (isset($this->_[HDOM_INFO_TEXT])) return $this->dom->restore_noise($this->_[HDOM_INFO_TEXT]);
$ret = '<'.$this->tag;
$i = -1;
foreach ($this->attr as $key=>$val)
{
++$i;
// skip removed attribute
if ($val===null || $val===false)
continue;
$ret .= $this->_[HDOM_INFO_SPACE][$i][0];
//no value attr: nowrap, checked selected...
if ($val===true)
$ret .= $key;
else {
switch ($this->_[HDOM_INFO_QUOTE][$i])
{
case HDOM_QUOTE_DOUBLE: $quote = '"'; break;
case HDOM_QUOTE_SINGLE: $quote = '\''; break;
default: $quote = '';
}
$ret .= $key.$this->_[HDOM_INFO_SPACE][$i][1].'='.$this->_[HDOM_INFO_SPACE][$i][2].$quote.$val.$quote;
}
}
$ret = $this->dom->restore_noise($ret);
return $ret . $this->_[HDOM_INFO_ENDSPACE] . '>';
}
// find elements by css selector
//PaperG - added ability for find to lowercase the value of the selector.
function find($selector, $idx=null, $lowercase=false)
{
$selectors = $this->parse_selector($selector);
if (($count=count($selectors))===0) return array();
$found_keys = array();
// find each selector
for ($c=0; $c<$count; ++$c)
{
// The change on the below line was documented on the sourceforge code tracker id 2788009
// used to be: if (($levle=count($selectors[0]))===0) return array();
if (($levle=count($selectors[$c]))===0) return array();
if (!isset($this->_[HDOM_INFO_BEGIN])) return array();
$head = array($this->_[HDOM_INFO_BEGIN]=>1);
// handle descendant selectors, no recursive!
for ($l=0; $l<$levle; ++$l)
{
$ret = array();
foreach ($head as $k=>$v)
{
$n = ($k===-1) ? $this->dom->root : $this->dom->nodes[$k];
//PaperG - Pass this optional parameter on to the seek function.
$n->seek($selectors[$c][$l], $ret, $lowercase);
}
$head = $ret;
}
foreach ($head as $k=>$v)
{
if (!isset($found_keys[$k]))
$found_keys[$k] = 1;
}
}
// sort keys
ksort($found_keys);
$found = array();
foreach ($found_keys as $k=>$v)
$found[] = $this->dom->nodes[$k];
// return nth-element or array
if (is_null($idx)) return $found;
else if ($idx<0) $idx = count($found) + $idx;
return (isset($found[$idx])) ? $found[$idx] : null;
}
// seek for given conditions
// PaperG - added parameter to allow for case insensitive testing of the value of a selector.
protected function seek($selector, &$ret, $lowercase=false)
{
global $debugObject;
if (is_object($debugObject)) { $debugObject->debugLogEntry(1); }
list($tag, $key, $val, $exp, $no_key) = $selector;
// xpath index
if ($tag && $key && is_numeric($key))
{
$count = 0;
foreach ($this->children as $c)
{
if ($tag==='*' || $tag===$c->tag) {
if (++$count==$key) {
$ret[$c->_[HDOM_INFO_BEGIN]] = 1;
return;
}
}
}
return;
}
$end = (!empty($this->_[HDOM_INFO_END])) ? $this->_[HDOM_INFO_END] : 0;
if ($end==0) {
$parent = $this->parent;
while (!isset($parent->_[HDOM_INFO_END]) && $parent!==null) {
$end -= 1;
$parent = $parent->parent;
}
$end += $parent->_[HDOM_INFO_END];
}
for ($i=$this->_[HDOM_INFO_BEGIN]+1; $i<$end; ++$i) {
$node = $this->dom->nodes[$i];
$pass = true;
if ($tag==='*' && !$key) {
if (in_array($node, $this->children, true))
$ret[$i] = 1;
continue;
}
// compare tag
if ($tag && $tag!=$node->tag && $tag!=='*') {$pass=false;}
// compare key
if ($pass && $key) {
if ($no_key) {
if (isset($node->attr[$key])) $pass=false;
} else {
if (($key != "plaintext") && !isset($node->attr[$key])) $pass=false;
}
}
// compare value
if ($pass && $key && $val && $val!=='*') {
// If they have told us that this is a "plaintext" search then we want the plaintext of the node - right?
if ($key == "plaintext") {
// $node->plaintext actually returns $node->text();
$nodeKeyValue = $node->text();
} else {
// this is a normal search, we want the value of that attribute of the tag.
$nodeKeyValue = $node->attr[$key];
}
if (is_object($debugObject)) {$debugObject->debugLog(2, "testing node: " . $node->tag . " for attribute: " . $key . $exp . $val . " where nodes value is: " . $nodeKeyValue);}
//PaperG - If lowercase is set, do a case insensitive test of the value of the selector.
if ($lowercase) {
$check = $this->match($exp, strtolower($val), strtolower($nodeKeyValue));
} else {
$check = $this->match($exp, $val, $nodeKeyValue);
}
if (is_object($debugObject)) {$debugObject->debugLog(2, "after match: " . ($check ? "true" : "false"));}
// handle multiple class
if (!$check && strcasecmp($key, 'class')===0) {
foreach (explode(' ',$node->attr[$key]) as $k) {
// Without this, there were cases where leading, trailing, or double spaces lead to our comparing blanks - bad form.
if (!empty($k)) {
if ($lowercase) {
$check = $this->match($exp, strtolower($val), strtolower($k));
} else {
$check = $this->match($exp, $val, $k);
}
if ($check) break;
}
}
}
if (!$check) $pass = false;
}
if ($pass) $ret[$i] = 1;
unset($node);
}
// It's passed by reference so this is actually what this function returns.
if (is_object($debugObject)) {$debugObject->debugLog(1, "EXIT - ret: ", $ret);}
}
protected function match($exp, $pattern, $value) {
global $debugObject;
if (is_object($debugObject)) {$debugObject->debugLogEntry(1);}
switch ($exp) {
case '=':
return ($value===$pattern);
case '!=':
return ($value!==$pattern);
case '^=':
return preg_match("/^".preg_quote($pattern,'/')."/", $value);
case '$=':
return preg_match("/".preg_quote($pattern,'/')."$/", $value);
case '*=':
if ($pattern[0]=='/') {
return preg_match($pattern, $value);
}
return preg_match("/".$pattern."/i", $value);
}
return false;
}
protected function parse_selector($selector_string) {
global $debugObject;
if (is_object($debugObject)) {$debugObject->debugLogEntry(1);}
// pattern of CSS selectors, modified from mootools
// Paperg: Add the colon to the attrbute, so that it properly finds <tag attr:ibute="something" > like google does.
// Note: if you try to look at this attribute, yo MUST use getAttribute since $dom->x:y will fail the php syntax check.
// Notice the \[ starting the attbute? and the @? following? This implies that an attribute can begin with an @ sign that is not captured.
// This implies that an html attribute specifier may start with an @ sign that is NOT captured by the expression.
// farther study is required to determine of this should be documented or removed.
// $pattern = "/([\w-:\*]*)(?:\#([\w-]+)|\.([\w-]+))?(?:\[@?(!?[\w-]+)(?:([!*^$]?=)[\"']?(.*?)[\"']?)?\])?([\/, ]+)/is";
$pattern = "/([\w-:\*]*)(?:\#([\w-]+)|\.([\w-]+))?(?:\[@?(!?[\w-:]+)(?:([!*^$]?=)[\"']?(.*?)[\"']?)?\])?([\/, ]+)/is";
preg_match_all($pattern, trim($selector_string).' ', $matches, PREG_SET_ORDER);
if (is_object($debugObject)) {$debugObject->debugLog(2, "Matches Array: ", $matches);}
$selectors = array();
$result = array();
//print_r($matches);
foreach ($matches as $m) {
$m[0] = trim($m[0]);
if ($m[0]==='' || $m[0]==='/' || $m[0]==='//') continue;
// for browser generated xpath
if ($m[1]==='tbody') continue;
list($tag, $key, $val, $exp, $no_key) = array($m[1], null, null, '=', false);
if (!empty($m[2])) {$key='id'; $val=$m[2];}
if (!empty($m[3])) {$key='class'; $val=$m[3];}
if (!empty($m[4])) {$key=$m[4];}
if (!empty($m[5])) {$exp=$m[5];}
if (!empty($m[6])) {$val=$m[6];}
// convert to lowercase
if ($this->dom->lowercase) {$tag=strtolower($tag); $key=strtolower($key);}
//elements that do NOT have the specified attribute
if (isset($key[0]) && $key[0]==='!') {$key=substr($key, 1); $no_key=true;}
$result[] = array($tag, $key, $val, $exp, $no_key);
if (trim($m[7])===',') {
$selectors[] = $result;
$result = array();
}
}
if (count($result)>0)
$selectors[] = $result;
return $selectors;
}
function __get($name) {
if (isset($this->attr[$name]))
{
return $this->convert_text($this->attr[$name]);
}
switch ($name) {
case 'outertext': return $this->outertext();
case 'innertext': return $this->innertext();
case 'plaintext': return $this->text();
case 'xmltext': return $this->xmltext();
default: return array_key_exists($name, $this->attr);
}
}
function __set($name, $value) {
switch ($name) {
case 'outertext': return $this->_[HDOM_INFO_OUTER] = $value;
case 'innertext':
if (isset($this->_[HDOM_INFO_TEXT])) return $this->_[HDOM_INFO_TEXT] = $value;
return $this->_[HDOM_INFO_INNER] = $value;
}
if (!isset($this->attr[$name])) {
$this->_[HDOM_INFO_SPACE][] = array(' ', '', '');
$this->_[HDOM_INFO_QUOTE][] = HDOM_QUOTE_DOUBLE;
}
$this->attr[$name] = $value;
}
function __isset($name) {
switch ($name) {
case 'outertext': return true;
case 'innertext': return true;
case 'plaintext': return true;
}
//no value attr: nowrap, checked selected...
return (array_key_exists($name, $this->attr)) ? true : isset($this->attr[$name]);
}
function __unset($name) {
if (isset($this->attr[$name]))
unset($this->attr[$name]);
}
// PaperG - Function to convert the text from one character set to another if the two sets are not the same.
function convert_text($text)
{
global $debugObject;
if (is_object($debugObject)) {$debugObject->debugLogEntry(1);}
$converted_text = $text;
$sourceCharset = "";
$targetCharset = "";
if ($this->dom)
{
$sourceCharset = strtoupper($this->dom->_charset);
$targetCharset = strtoupper($this->dom->_target_charset);
}
if (is_object($debugObject)) {$debugObject->debugLog(3, "source charset: " . $sourceCharset . " target charaset: " . $targetCharset);}
if (!empty($sourceCharset) && !empty($targetCharset) && (strcasecmp($sourceCharset, $targetCharset) != 0))
{
// Check if the reported encoding could have been incorrect and the text is actually already UTF-8
if ((strcasecmp($targetCharset, 'UTF-8') == 0) && ($this->is_utf8($text)))
{
$converted_text = $text;
}
else
{
$converted_text = iconv($sourceCharset, $targetCharset, $text);
}
}
// Lets make sure that we don't have that silly BOM issue with any of the utf-8 text we output.
if ($targetCharset == 'UTF-8')
{
if (substr($converted_text, 0, 3) == "\xef\xbb\xbf")
{
$converted_text = substr($converted_text, 3);
}
if (substr($converted_text, -3) == "\xef\xbb\xbf")
{
$converted_text = substr($converted_text, 0, -3);
}
}
return $converted_text;
}
/**
* Returns true if $string is valid UTF-8 and false otherwise.
*
* @param mixed $str String to be tested
* @return boolean
*/
static function is_utf8($str)
{
$c=0; $b=0;
$bits=0;
$len=strlen($str);
for($i=0; $i<$len; $i++)
{
$c=ord($str[$i]);
if($c > 128)
{
if(($c >= 254)) return false;
elseif($c >= 252) $bits=6;
elseif($c >= 248) $bits=5;
elseif($c >= 240) $bits=4;
elseif($c >= 224) $bits=3;
elseif($c >= 192) $bits=2;
else return false;
if(($i+$bits) > $len) return false;
while($bits > 1)
{
$i++;
$b=ord($str[$i]);
if($b < 128 || $b > 191) return false;
$bits--;
}
}
}
return true;
}
/*
function is_utf8($string)
{
//this is buggy
return (utf8_encode(utf8_decode($string)) == $string);
}
*/
/**
* Function to try a few tricks to determine the displayed size of an img on the page.
* NOTE: This will ONLY work on an IMG tag. Returns FALSE on all other tag types.
*
* @author John Schlick
* @version April 19 2012
* @return array an array containing the 'height' and 'width' of the image on the page or -1 if we can't figure it out.
*/
function get_display_size()
{
global $debugObject;
$width = -1;
$height = -1;
if ($this->tag !== 'img')
{
return false;
}
// See if there is aheight or width attribute in the tag itself.
if (isset($this->attr['width']))
{
$width = $this->attr['width'];
}
if (isset($this->attr['height']))
{
$height = $this->attr['height'];
}
// Now look for an inline style.
if (isset($this->attr['style']))
{
// Thanks to user gnarf from stackoverflow for this regular expression.
$attributes = array();
preg_match_all("/([\w-]+)\s*:\s*([^;]+)\s*;?/", $this->attr['style'], $matches, PREG_SET_ORDER);
foreach ($matches as $match) {
$attributes[$match[1]] = $match[2];
}
// If there is a width in the style attributes:
if (isset($attributes['width']) && $width == -1)
{
// check that the last two characters are px (pixels)
if (strtolower(substr($attributes['width'], -2)) == 'px')
{
$proposed_width = substr($attributes['width'], 0, -2);
// Now make sure that it's an integer and not something stupid.
if (filter_var($proposed_width, FILTER_VALIDATE_INT))
{
$width = $proposed_width;
}
}
}
// If there is a width in the style attributes:
if (isset($attributes['height']) && $height == -1)
{
// check that the last two characters are px (pixels)
if (strtolower(substr($attributes['height'], -2)) == 'px')
{
$proposed_height = substr($attributes['height'], 0, -2);
// Now make sure that it's an integer and not something stupid.
if (filter_var($proposed_height, FILTER_VALIDATE_INT))
{
$height = $proposed_height;
}
}
}
}
// Future enhancement:
// Look in the tag to see if there is a class or id specified that has a height or width attribute to it.
// Far future enhancement
// Look at all the parent tags of this image to see if they specify a class or id that has an img selector that specifies a height or width
// Note that in this case, the class or id will have the img subselector for it to apply to the image.
// ridiculously far future development
// If the class or id is specified in a SEPARATE css file thats not on the page, go get it and do what we were just doing for the ones on the page.
$result = array('height' => $height,
'width' => $width);
return $result;
}
// camel naming conventions
function getAllAttributes() {return $this->attr;}
function getAttribute($name) {return $this->__get($name);}
function setAttribute($name, $value) {$this->__set($name, $value);}
function hasAttribute($name) {return $this->__isset($name);}
function removeAttribute($name) {$this->__set($name, null);}
function getElementById($id) {return $this->find("#$id", 0);}
function getElementsById($id, $idx=null) {return $this->find("#$id", $idx);}
function getElementByTagName($name) {return $this->find($name, 0);}
function getElementsByTagName($name, $idx=null) {return $this->find($name, $idx);}
function parentNode() {return $this->parent();}
function childNodes($idx=-1) {return $this->children($idx);}
function firstChild() {return $this->first_child();}
function lastChild() {return $this->last_child();}
function nextSibling() {return $this->next_sibling();}
function previousSibling() {return $this->prev_sibling();}
function hasChildNodes() {return $this->has_child();}
function nodeName() {return $this->tag;}
function appendChild($node) {$node->parent($this); return $node;}
}
/**
* simple html dom parser
* Paperg - in the find routine: allow us to specify that we want case insensitive testing of the value of the selector.
* Paperg - change $size from protected to public so we can easily access it
* Paperg - added ForceTagsClosed in the constructor which tells us whether we trust the html or not. Default is to NOT trust it.
*
* @package PlaceLocalInclude
*/
class simple_html_dom
{
public $root = null;
public $nodes = array();
public $callback = null;
public $lowercase = false;
// Used to keep track of how large the text was when we started.
public $original_size;
public $size;
protected $pos;
protected $doc;
protected $char;
protected $cursor;
protected $parent;
protected $noise = array();
protected $token_blank = " \t\r\n";
protected $token_equal = ' =/>';
protected $token_slash = " />\r\n\t";
protected $token_attr = ' >';
// Note that this is referenced by a child node, and so it needs to be public for that node to see this information.
public $_charset = '';
public $_target_charset = '';
protected $default_br_text = "";
public $default_span_text = "";
// use isset instead of in_array, performance boost about 30%...
protected $self_closing_tags = array('img'=>1, 'br'=>1, 'input'=>1, 'meta'=>1, 'link'=>1, 'hr'=>1, 'base'=>1, 'embed'=>1, 'spacer'=>1);
protected $block_tags = array('root'=>1, 'body'=>1, 'form'=>1, 'div'=>1, 'span'=>1, 'table'=>1);
// Known sourceforge issue #2977341
// B tags that are not closed cause us to return everything to the end of the document.
protected $optional_closing_tags = array(
'tr'=>array('tr'=>1, 'td'=>1, 'th'=>1),
'th'=>array('th'=>1),
'td'=>array('td'=>1),
'li'=>array('li'=>1),
'dt'=>array('dt'=>1, 'dd'=>1),
'dd'=>array('dd'=>1, 'dt'=>1),
'dl'=>array('dd'=>1, 'dt'=>1),
'p'=>array('p'=>1),
'nobr'=>array('nobr'=>1),
'b'=>array('b'=>1),
'option'=>array('option'=>1),
);
function __construct($str=null, $lowercase=true, $forceTagsClosed=true, $target_charset=DEFAULT_TARGET_CHARSET, $stripRN=true, $defaultBRText=DEFAULT_BR_TEXT, $defaultSpanText=DEFAULT_SPAN_TEXT)
{
if ($str)
{
if (preg_match("/^http:\/\//i",$str) || is_file($str))
{
$this->load_file($str);
}
else
{
$this->load($str, $lowercase, $stripRN, $defaultBRText, $defaultSpanText);
}
}
// Forcing tags to be closed implies that we don't trust the html, but it can lead to parsing errors if we SHOULD trust the html.
if (!$forceTagsClosed) {
$this->optional_closing_array=array();
}
$this->_target_charset = $target_charset;
}
function __destruct()
{
$this->clear();
}
// load html from string
function load($str, $lowercase=true, $stripRN=true, $defaultBRText=DEFAULT_BR_TEXT, $defaultSpanText=DEFAULT_SPAN_TEXT)
{
global $debugObject;
// prepare
$this->prepare($str, $lowercase, $stripRN, $defaultBRText, $defaultSpanText);
// strip out comments
$this->remove_noise("'<!--(.*?)-->'is");
// strip out cdata
$this->remove_noise("'<!\[CDATA\[(.*?)\]\]>'is", true);
// Per sourceforge http://sourceforge.net/tracker/?func=detail&aid=2949097&group_id=218559&atid=1044037
// Script tags removal now preceeds style tag removal.
// strip out <script> tags
$this->remove_noise("'<\s*script[^>]*[^/]>(.*?)<\s*/\s*script\s*>'is");
$this->remove_noise("'<\s*script\s*>(.*?)<\s*/\s*script\s*>'is");
// strip out <style> tags
$this->remove_noise("'<\s*style[^>]*[^/]>(.*?)<\s*/\s*style\s*>'is");
$this->remove_noise("'<\s*style\s*>(.*?)<\s*/\s*style\s*>'is");
// strip out preformatted tags
$this->remove_noise("'<\s*(?:code)[^>]*>(.*?)<\s*/\s*(?:code)\s*>'is");
// strip out server side scripts
$this->remove_noise("'(<\?)(.*?)(\?>)'s", true);
// strip smarty scripts
$this->remove_noise("'(\{\w)(.*?)(\})'s", true);
// parsing
while ($this->parse());
// end
$this->root->_[HDOM_INFO_END] = $this->cursor;
$this->parse_charset();
// make load function chainable
return $this;
}
// load html from file
function load_file()
{
$args = func_get_args();
$this->load(call_user_func_array('file_get_contents', $args), true);
// Throw an error if we can't properly load the dom.
if (($error=error_get_last())!==null) {
$this->clear();
return false;
}
}
// set callback function
function set_callback($function_name)
{
$this->callback = $function_name;
}
// remove callback function
function remove_callback()
{
$this->callback = null;
}
// save dom as string
function save($filepath='')
{
$ret = $this->root->innertext();
if ($filepath!=='') file_put_contents($filepath, $ret, LOCK_EX);
return $ret;
}
// find dom node by css selector
// Paperg - allow us to specify that we want case insensitive testing of the value of the selector.
function find($selector, $idx=null, $lowercase=false)
{
return $this->root->find($selector, $idx, $lowercase);
}
// clean up memory due to php5 circular references memory leak...
function clear()
{
foreach ($this->nodes as $n) {$n->clear(); $n = null;}
// This add next line is documented in the sourceforge repository. 2977248 as a fix for ongoing memory leaks that occur even with the use of clear.
if (isset($this->children)) foreach ($this->children as $n) {$n->clear(); $n = null;}
if (isset($this->parent)) {$this->parent->clear(); unset($this->parent);}
if (isset($this->root)) {$this->root->clear(); unset($this->root);}
unset($this->doc);
unset($this->noise);
}
function dump($show_attr=true)
{
$this->root->dump($show_attr);
}
// prepare HTML data and init everything
protected function prepare($str, $lowercase=true, $stripRN=true, $defaultBRText=DEFAULT_BR_TEXT, $defaultSpanText=DEFAULT_SPAN_TEXT)
{
$this->clear();
// set the length of content before we do anything to it.
$this->size = strlen($str);
// Save the original size of the html that we got in. It might be useful to someone.
$this->original_size = $this->size;
//before we save the string as the doc... strip out the \r \n's if we are told to.
if ($stripRN) {
$str = str_replace("\r", " ", $str);
$str = str_replace("\n", " ", $str);
// set the length of content since we have changed it.
$this->size = strlen($str);
}
$this->doc = $str;
$this->pos = 0;
$this->cursor = 1;
$this->noise = array();
$this->nodes = array();
$this->lowercase = $lowercase;
$this->default_br_text = $defaultBRText;
$this->default_span_text = $defaultSpanText;
$this->root = new simple_html_dom_node($this);
$this->root->tag = 'root';
$this->root->_[HDOM_INFO_BEGIN] = -1;
$this->root->nodetype = HDOM_TYPE_ROOT;
$this->parent = $this->root;
if ($this->size>0) $this->char = $this->doc[0];
}
// parse html content
protected function parse()
{
if (($s = $this->copy_until_char('<'))==='')
{
return $this->read_tag();
}
// text
$node = new simple_html_dom_node($this);
++$this->cursor;
$node->_[HDOM_INFO_TEXT] = $s;
$this->link_nodes($node, false);
return true;
}
// PAPERG - dkchou - added this to try to identify the character set of the page we have just parsed so we know better how to spit it out later.
// NOTE: IF you provide a routine called get_last_retrieve_url_contents_content_type which returns the CURLINFO_CONTENT_TYPE from the last curl_exec
// (or the content_type header from the last transfer), we will parse THAT, and if a charset is specified, we will use it over any other mechanism.
protected function parse_charset()
{
global $debugObject;
$charset = null;
if (function_exists('get_last_retrieve_url_contents_content_type'))
{
$contentTypeHeader = get_last_retrieve_url_contents_content_type();
$success = preg_match('/charset=(.+)/', $contentTypeHeader, $matches);
if ($success)
{
$charset = $matches[1];
if (is_object($debugObject)) {$debugObject->debugLog(2, 'header content-type found charset of: ' . $charset);}
}
}
if (empty($charset))
{
$el = $this->root->find('meta[http-equiv=Content-Type]',0);
if (!empty($el))
{
$fullvalue = $el->content;
if (is_object($debugObject)) {$debugObject->debugLog(2, 'meta content-type tag found' . $fullvalue);}
if (!empty($fullvalue))
{
$success = preg_match('/charset=(.+)/', $fullvalue, $matches);
if ($success)
{
$charset = $matches[1];
}
else
{
// If there is a meta tag, and they don't specify the character set, research says that it's typically ISO-8859-1
if (is_object($debugObject)) {$debugObject->debugLog(2, 'meta content-type tag couldn\'t be parsed. using iso-8859 default.');}
$charset = 'ISO-8859-1';
}
}
}
}
// If we couldn't find a charset above, then lets try to detect one based on the text we got...
if (empty($charset))
{
// Have php try to detect the encoding from the text given to us.
$charset = mb_detect_encoding($this->root->plaintext . "ascii", $encoding_list = array( "UTF-8", "CP1252" ) );
if (is_object($debugObject)) {$debugObject->debugLog(2, 'mb_detect found: ' . $charset);}
// and if this doesn't work... then we need to just wrongheadedly assume it's UTF-8 so that we can move on - cause this will usually give us most of what we need...
if ($charset === false)
{
if (is_object($debugObject)) {$debugObject->debugLog(2, 'since mb_detect failed - using default of utf-8');}
$charset = 'UTF-8';
}
}
// Since CP1252 is a superset, if we get one of it's subsets, we want it instead.
if ((strtolower($charset) == strtolower('ISO-8859-1')) || (strtolower($charset) == strtolower('Latin1')) || (strtolower($charset) == strtolower('Latin-1')))
{
if (is_object($debugObject)) {$debugObject->debugLog(2, 'replacing ' . $charset . ' with CP1252 as its a superset');}
$charset = 'CP1252';
}
if (is_object($debugObject)) {$debugObject->debugLog(1, 'EXIT - ' . $charset);}
return $this->_charset = $charset;
}
// read tag info
protected function read_tag()
{
if ($this->char!=='<')
{
$this->root->_[HDOM_INFO_END] = $this->cursor;
return false;
}
$begin_tag_pos = $this->pos;
$this->char = (++$this->pos<$this->size) ? $this->doc[$this->pos] : null; // next
// end tag
if ($this->char==='/')
{
$this->char = (++$this->pos<$this->size) ? $this->doc[$this->pos] : null; // next
// This represents the change in the simple_html_dom trunk from revision 180 to 181.
// $this->skip($this->token_blank_t);
$this->skip($this->token_blank);
$tag = $this->copy_until_char('>');
// skip attributes in end tag
if (($pos = strpos($tag, ' '))!==false)
$tag = substr($tag, 0, $pos);
$parent_lower = strtolower($this->parent->tag);
$tag_lower = strtolower($tag);
if ($parent_lower!==$tag_lower)
{
if (isset($this->optional_closing_tags[$parent_lower]) && isset($this->block_tags[$tag_lower]))
{
$this->parent->_[HDOM_INFO_END] = 0;
$org_parent = $this->parent;
while (($this->parent->parent) && strtolower($this->parent->tag)!==$tag_lower)
$this->parent = $this->parent->parent;
if (strtolower($this->parent->tag)!==$tag_lower) {
$this->parent = $org_parent; // restore origonal parent
if ($this->parent->parent) $this->parent = $this->parent->parent;
$this->parent->_[HDOM_INFO_END] = $this->cursor;
return $this->as_text_node($tag);
}
}
else if (($this->parent->parent) && isset($this->block_tags[$tag_lower]))
{
$this->parent->_[HDOM_INFO_END] = 0;
$org_parent = $this->parent;
while (($this->parent->parent) && strtolower($this->parent->tag)!==$tag_lower)
$this->parent = $this->parent->parent;
if (strtolower($this->parent->tag)!==$tag_lower)
{
$this->parent = $org_parent; // restore origonal parent
$this->parent->_[HDOM_INFO_END] = $this->cursor;
return $this->as_text_node($tag);
}
}
else if (($this->parent->parent) && strtolower($this->parent->parent->tag)===$tag_lower)
{
$this->parent->_[HDOM_INFO_END] = 0;
$this->parent = $this->parent->parent;
}
else
return $this->as_text_node($tag);
}
$this->parent->_[HDOM_INFO_END] = $this->cursor;
if ($this->parent->parent) $this->parent = $this->parent->parent;
$this->char = (++$this->pos<$this->size) ? $this->doc[$this->pos] : null; // next
return true;
}
$node = new simple_html_dom_node($this);
$node->_[HDOM_INFO_BEGIN] = $this->cursor;
++$this->cursor;
$tag = $this->copy_until($this->token_slash);
$node->tag_start = $begin_tag_pos;
// doctype, cdata & comments...
if (isset($tag[0]) && $tag[0]==='!') {
$node->_[HDOM_INFO_TEXT] = '<' . $tag . $this->copy_until_char('>');
if (isset($tag[2]) && $tag[1]==='-' && $tag[2]==='-') {
$node->nodetype = HDOM_TYPE_COMMENT;
$node->tag = 'comment';
} else {
$node->nodetype = HDOM_TYPE_UNKNOWN;
$node->tag = 'unknown';
}
if ($this->char==='>') $node->_[HDOM_INFO_TEXT].='>';
$this->link_nodes($node, true);
$this->char = (++$this->pos<$this->size) ? $this->doc[$this->pos] : null; // next
return true;
}
// text
if ($pos=strpos($tag, '<')!==false) {
$tag = '<' . substr($tag, 0, -1);
$node->_[HDOM_INFO_TEXT] = $tag;
$this->link_nodes($node, false);
$this->char = $this->doc[--$this->pos]; // prev
return true;
}
if (!preg_match("/^[\w-:]+$/", $tag)) {
$node->_[HDOM_INFO_TEXT] = '<' . $tag . $this->copy_until('<>');
if ($this->char==='<') {
$this->link_nodes($node, false);
return true;
}
if ($this->char==='>') $node->_[HDOM_INFO_TEXT].='>';
$this->link_nodes($node, false);
$this->char = (++$this->pos<$this->size) ? $this->doc[$this->pos] : null; // next
return true;
}
// begin tag
$node->nodetype = HDOM_TYPE_ELEMENT;
$tag_lower = strtolower($tag);
$node->tag = ($this->lowercase) ? $tag_lower : $tag;
// handle optional closing tags
if (isset($this->optional_closing_tags[$tag_lower]) )
{
while (isset($this->optional_closing_tags[$tag_lower][strtolower($this->parent->tag)]))
{
$this->parent->_[HDOM_INFO_END] = 0;
$this->parent = $this->parent->parent;
}
$node->parent = $this->parent;
}
$guard = 0; // prevent infinity loop
$space = array($this->copy_skip($this->token_blank), '', '');
// attributes
do
{
if ($this->char!==null && $space[0]==='')
{
break;
}
$name = $this->copy_until($this->token_equal);
if ($guard===$this->pos)
{
$this->char = (++$this->pos<$this->size) ? $this->doc[$this->pos] : null; // next
continue;
}
$guard = $this->pos;
// handle endless '<'
if ($this->pos>=$this->size-1 && $this->char!=='>') {
$node->nodetype = HDOM_TYPE_TEXT;
$node->_[HDOM_INFO_END] = 0;
$node->_[HDOM_INFO_TEXT] = '<'.$tag . $space[0] . $name;
$node->tag = 'text';
$this->link_nodes($node, false);
return true;
}
// handle mismatch '<'
if ($this->doc[$this->pos-1]=='<') {
$node->nodetype = HDOM_TYPE_TEXT;
$node->tag = 'text';
$node->attr = array();
$node->_[HDOM_INFO_END] = 0;
$node->_[HDOM_INFO_TEXT] = substr($this->doc, $begin_tag_pos, $this->pos-$begin_tag_pos-1);
$this->pos -= 2;
$this->char = (++$this->pos<$this->size) ? $this->doc[$this->pos] : null; // next
$this->link_nodes($node, false);
return true;
}
if ($name!=='/' && $name!=='') {
$space[1] = $this->copy_skip($this->token_blank);
$name = $this->restore_noise($name);
if ($this->lowercase) $name = strtolower($name);
if ($this->char==='=') {
$this->char = (++$this->pos<$this->size) ? $this->doc[$this->pos] : null; // next
$this->parse_attr($node, $name, $space);
}
else {
//no value attr: nowrap, checked selected...
$node->_[HDOM_INFO_QUOTE][] = HDOM_QUOTE_NO;
$node->attr[$name] = true;
if ($this->char!='>') $this->char = $this->doc[--$this->pos]; // prev
}
$node->_[HDOM_INFO_SPACE][] = $space;
$space = array($this->copy_skip($this->token_blank), '', '');
}
else
break;
} while ($this->char!=='>' && $this->char!=='/');
$this->link_nodes($node, true);
$node->_[HDOM_INFO_ENDSPACE] = $space[0];
// check self closing
if ($this->copy_until_char_escape('>')==='/')
{
$node->_[HDOM_INFO_ENDSPACE] .= '/';
$node->_[HDOM_INFO_END] = 0;
}
else
{
// reset parent
if (!isset($this->self_closing_tags[strtolower($node->tag)])) $this->parent = $node;
}
$this->char = (++$this->pos<$this->size) ? $this->doc[$this->pos] : null; // next
// If it's a BR tag, we need to set it's text to the default text.
// This way when we see it in plaintext, we can generate formatting that the user wants.
// since a br tag never has sub nodes, this works well.
if ($node->tag == "br")
{
$node->_[HDOM_INFO_INNER] = $this->default_br_text;
}
return true;
}
// parse attributes
protected function parse_attr($node, $name, &$space)
{
// Per sourceforge: http://sourceforge.net/tracker/?func=detail&aid=3061408&group_id=218559&atid=1044037
// If the attribute is already defined inside a tag, only pay atetntion to the first one as opposed to the last one.
if (isset($node->attr[$name]))
{
return;
}
$space[2] = $this->copy_skip($this->token_blank);
switch ($this->char) {
case '"':
$node->_[HDOM_INFO_QUOTE][] = HDOM_QUOTE_DOUBLE;
$this->char = (++$this->pos<$this->size) ? $this->doc[$this->pos] : null; // next
$node->attr[$name] = $this->restore_noise($this->copy_until_char_escape('"'));
$this->char = (++$this->pos<$this->size) ? $this->doc[$this->pos] : null; // next
break;
case '\'':
$node->_[HDOM_INFO_QUOTE][] = HDOM_QUOTE_SINGLE;
$this->char = (++$this->pos<$this->size) ? $this->doc[$this->pos] : null; // next
$node->attr[$name] = $this->restore_noise($this->copy_until_char_escape('\''));
$this->char = (++$this->pos<$this->size) ? $this->doc[$this->pos] : null; // next
break;
default:
$node->_[HDOM_INFO_QUOTE][] = HDOM_QUOTE_NO;
$node->attr[$name] = $this->restore_noise($this->copy_until($this->token_attr));
}
// PaperG: Attributes should not have \r or \n in them, that counts as html whitespace.
$node->attr[$name] = str_replace("\r", "", $node->attr[$name]);
$node->attr[$name] = str_replace("\n", "", $node->attr[$name]);
// PaperG: If this is a "class" selector, lets get rid of the preceeding and trailing space since some people leave it in the multi class case.
if ($name == "class") {
$node->attr[$name] = trim($node->attr[$name]);
}
}
// link node's parent
protected function link_nodes(&$node, $is_child)
{
$node->parent = $this->parent;
$this->parent->nodes[] = $node;
if ($is_child)
{
$this->parent->children[] = $node;
}
}
// as a text node
protected function as_text_node($tag)
{
$node = new simple_html_dom_node($this);
++$this->cursor;
$node->_[HDOM_INFO_TEXT] = '</' . $tag . '>';
$this->link_nodes($node, false);
$this->char = (++$this->pos<$this->size) ? $this->doc[$this->pos] : null; // next
return true;
}
protected function skip($chars)
{
$this->pos += strspn($this->doc, $chars, $this->pos);
$this->char = ($this->pos<$this->size) ? $this->doc[$this->pos] : null; // next
}
protected function copy_skip($chars)
{
$pos = $this->pos;
$len = strspn($this->doc, $chars, $pos);
$this->pos += $len;
$this->char = ($this->pos<$this->size) ? $this->doc[$this->pos] : null; // next
if ($len===0) return '';
return substr($this->doc, $pos, $len);
}
protected function copy_until($chars)
{
$pos = $this->pos;
$len = strcspn($this->doc, $chars, $pos);
$this->pos += $len;
$this->char = ($this->pos<$this->size) ? $this->doc[$this->pos] : null; // next
return substr($this->doc, $pos, $len);
}
protected function copy_until_char($char)
{
if ($this->char===null) return '';
if (($pos = strpos($this->doc, $char, $this->pos))===false) {
$ret = substr($this->doc, $this->pos, $this->size-$this->pos);
$this->char = null;
$this->pos = $this->size;
return $ret;
}
if ($pos===$this->pos) return '';
$pos_old = $this->pos;
$this->char = $this->doc[$pos];
$this->pos = $pos;
return substr($this->doc, $pos_old, $pos-$pos_old);
}
protected function copy_until_char_escape($char)
{
if ($this->char===null) return '';
$start = $this->pos;
while (1)
{
if (($pos = strpos($this->doc, $char, $start))===false)
{
$ret = substr($this->doc, $this->pos, $this->size-$this->pos);
$this->char = null;
$this->pos = $this->size;
return $ret;
}
if ($pos===$this->pos) return '';
if ($this->doc[$pos-1]==='\\') {
$start = $pos+1;
continue;
}
$pos_old = $this->pos;
$this->char = $this->doc[$pos];
$this->pos = $pos;
return substr($this->doc, $pos_old, $pos-$pos_old);
}
}
// remove noise from html content
// save the noise in the $this->noise array.
protected function remove_noise($pattern, $remove_tag=false)
{
global $debugObject;
if (is_object($debugObject)) { $debugObject->debugLogEntry(1); }
$count = preg_match_all($pattern, $this->doc, $matches, PREG_SET_ORDER|PREG_OFFSET_CAPTURE);
for ($i=$count-1; $i>-1; --$i)
{
$key = '___noise___'.sprintf('% 5d', count($this->noise)+1000);
if (is_object($debugObject)) { $debugObject->debugLog(2, 'key is: ' . $key); }
$idx = ($remove_tag) ? 0 : 1;
$this->noise[$key] = $matches[$i][$idx][0];
$this->doc = substr_replace($this->doc, $key, $matches[$i][$idx][1], strlen($matches[$i][$idx][0]));
}
// reset the length of content
$this->size = strlen($this->doc);
if ($this->size>0)
{
$this->char = $this->doc[0];
}
}
// restore noise to html content
function restore_noise($text)
{
global $debugObject;
if (is_object($debugObject)) { $debugObject->debugLogEntry(1); }
while (($pos=strpos($text, '___noise___'))!==false)
{
// Sometimes there is a broken piece of markup, and we don't GET the pos+11 etc... token which indicates a problem outside of us...
if (strlen($text) > $pos+15)
{
$key = '___noise___'.$text[$pos+11].$text[$pos+12].$text[$pos+13].$text[$pos+14].$text[$pos+15];
if (is_object($debugObject)) { $debugObject->debugLog(2, 'located key of: ' . $key); }
if (isset($this->noise[$key]))
{
$text = substr($text, 0, $pos).$this->noise[$key].substr($text, $pos+16);
}
else
{
// do this to prevent an infinite loop.
$text = substr($text, 0, $pos).'UNDEFINED NOISE FOR KEY: '.$key . substr($text, $pos+16);
}
}
else
{
// There is no valid key being given back to us... We must get rid of the ___noise___ or we will have a problem.
$text = substr($text, 0, $pos).'NO NUMERIC NOISE KEY' . substr($text, $pos+11);
}
}
return $text;
}
// Sometimes we NEED one of the noise elements.
function search_noise($text)
{
global $debugObject;
if (is_object($debugObject)) { $debugObject->debugLogEntry(1); }
foreach($this->noise as $noiseElement)
{
if (strpos($noiseElement, $text)!==false)
{
return $noiseElement;
}
}
}
function __toString()
{
return $this->root->innertext();
}
function __get($name)
{
switch ($name)
{
case 'outertext':
return $this->root->innertext();
case 'innertext':
return $this->root->innertext();
case 'plaintext':
return $this->root->text();
case 'charset':
return $this->_charset;
case 'target_charset':
return $this->_target_charset;
}
}
// camel naming conventions
function childNodes($idx=-1) {return $this->root->childNodes($idx);}
function firstChild() {return $this->root->first_child();}
function lastChild() {return $this->root->last_child();}
function createElement($name, $value=null) {return @str_get_html("<$name>$value</$name>")->first_child();}
function createTextNode($value) {return @end(str_get_html($value)->nodes);}
function getElementById($id) {return $this->find("#$id", 0);}
function getElementsById($id, $idx=null) {return $this->find("#$id", $idx);}
function getElementByTagName($name) {return $this->find($name, 0);}
function getElementsByTagName($name, $idx=-1) {return $this->find($name, $idx);}
function loadFile() {$args = func_get_args();$this->load_file($args);}
}
?>