I have written many scrapers but I am not really sure how to handle infinite scrollers. These days most website etc, Facebook, Pinterest has infinite scrollers.
You can use selenium to scrap the infinite scrolling website like twitter or facebook.
Step 1 : Install Selenium using pip
pip install selenium
Step 2 : use the code below to automate infinite scroll and extract the source code
from selenium import webdriver from selenium.webdriver.common.by import By from selenium.webdriver.common.keys import Keys from selenium.webdriver.support.ui import Select from selenium.webdriver.support.ui import WebDriverWait from selenium.common.exceptions import TimeoutException from selenium.webdriver.support import expected_conditions as EC from selenium.common.exceptions import NoSuchElementException from selenium.common.exceptions import NoAlertPresentException import sys import unittest, time, re class Sel(unittest.TestCase): def setUp(self): self.driver = webdriver.Firefox() self.driver.implicitly_wait(30) self.base_url = "https://twitter.com" self.verificationErrors =  self.accept_next_alert = True def test_sel(self): driver = self.driver delay = 3 driver.get(self.base_url + "/search?q=stckoverflow&src=typd") driver.find_element_by_link_text("All").click() for i in range(1,100): self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);") time.sleep(4) html_source = driver.page_source data = html_source.encode('utf-8') if __name__ == "__main__": unittest.main()
Step 3 : Print the data if required.
Most sites that have infinite scrolling do (as Lattyware notes) have a proper API as well, and you will likely be better served by using this rather than scraping.
But if you must scrape…
For example, open the Firefox Web Console, turn off all the filter buttons except Net, and load the site you wish to scrape. You’ll see all the files as they are loaded. Scroll the page while watching the Web Console and you’ll see the URLs being used for the additional requests. Then you can request that URL yourself and see what format the data is in (probably JSON) and get it into your Python script.
Finding the url of the ajax source will be the best option but it can be cumbersome for certain sites. Alternatively you could use a headless browser like
PyQt and send keyboard events while reading the data from the DOM tree.
QWebKit has a nice and simple api.