Scraping dynamic content in a website

Question:

I need to scrape news announcements from this website, Link.
The announcements seem to be generated dynamically: they don’t appear in the page source. I usually use mechanize, but I assume it wouldn’t work here. What can I do about this? I’m OK with Python or Perl.

Asked By: Aks


Answer #1:

The polite option would be to ask the owners of the site if they have an API which allows you access to their news stories.

The less polite option would be to trace the HTTP transactions that take place while the page is loading and work out which one is the AJAX call which pulls in the data.

Looks like it’s this one. But it looks like it might contain session data, so I don’t know how long it will continue to work for.
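
As a rough sketch of the second option: once the browser's network tab shows which request returns the announcements, that request can often be replayed directly from a script. The endpoint URL, the X-Requested-With header, and the JSON field names (items, title, date) below are all hypothetical placeholders, since the real call is not reproduced here, and the real response may not be JSON at all.

```python
import json
from urllib.request import Request, urlopen

# Hypothetical endpoint: substitute the URL seen in the browser's network tab.
AJAX_URL = "https://example.com/news/list.json"


def fetch_raw(url, timeout=10):
    """Replay the AJAX request and return the response body as text."""
    request = Request(
        url,
        headers={
            "User-Agent": "Mozilla/5.0",
            # Some sites check this header on AJAX endpoints (an assumption here).
            "X-Requested-With": "XMLHttpRequest",
        },
    )
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)


def parse_announcements(payload):
    """Extract (date, title) pairs from a JSON payload.

    Assumes a made-up shape: {"items": [{"title": ..., "date": ...}, ...]}.
    """
    data = json.loads(payload)
    return [(item.get("date"), item.get("title")) for item in data.get("items", [])]


if __name__ == "__main__":
    for date, title in parse_announcements(fetch_raw(AJAX_URL)):
        print(date, title)
```

If the request depends on session data, as suspected above, the cookies or tokens from a fresh page load would probably have to be sent along as well.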

Answered By: Dave Cross

Answer #2:

If the content is generated dynamically, you can use Windmill or Selenium to drive the browser and get the data once it’s been rendered.

You can find an example here.
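
For illustration, driving a browser this way might look roughly like the following with Selenium's Python bindings. This is a minimal sketch, not the linked example: the function names, the URL, and the CSS selector are hypothetical, and it assumes the selenium package plus Firefox (and possibly a matching geckodriver) are installed.

```python
def clean_titles(texts):
    """Strip whitespace and drop empty strings from scraped element texts."""
    return [t.strip() for t in texts if t and t.strip()]


def scrape_rendered(url, css_selector, timeout=10):
    """Load `url` in a real browser, wait for the dynamic content, return its text."""
    # Imported here so the helper above is usable without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        # Block until the page's scripts have added at least one matching element.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
        )
        elements = driver.find_elements(By.CSS_SELECTOR, css_selector)
        return clean_titles(e.text for e in elements)
    finally:
        # Always close the browser, even if the wait times out.
        driver.quit()
```

A call such as scrape_rendered("https://example.com/news", "div.announcement") would then return the visible text of each matching element; both arguments are placeholders to be replaced with the real page and selector.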

Answered By: jcollado

Answer #3:

There’s also WWW::Scripter (“For scripting web sites that have scripts”), a Perl module. I’ve never used it.

Answered By: Øyvind Skaar

Answer #4:

In Python you can use urllib and urllib2 (merged into urllib.request in Python 3) to connect to a website and collect data. For example:

from urllib2 import urlopen  # Python 3: from urllib.request import urlopen

myUrl = "!News/List"
inStream = urlopen(myUrl)  # etc., in a while loop
# all your fun page-parsing code goes here (perhaps: from xml.dom.minidom import parse)

Answered By: Adam Morris
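
The page-parsing step mentioned in Answer #4 could be sketched as follows with xml.dom.minidom, assuming the fetched response is well-formed XML. The sample document and its element names (news, item, title) are invented for illustration; minidom will generally reject ordinary real-world HTML, so a more forgiving HTML parser may be needed in practice.

```python
from xml.dom.minidom import parseString

# Made-up stand-in for a fetched response body; the real markup will differ.
SAMPLE = """<news>
  <item><title>First announcement</title></item>
  <item><title>Second announcement</title></item>
</news>"""


def extract_titles(xml_text):
    """Return the text of every <title> element in a well-formed XML string."""
    document = parseString(xml_text)
    titles = []
    for node in document.getElementsByTagName("title"):
        # Join the element's direct text nodes, ignoring any child elements.
        text = "".join(
            child.data for child in node.childNodes
            if child.nodeType == child.TEXT_NODE
        )
        titles.append(text.strip())
    return titles


print(extract_titles(SAMPLE))
# prints ['First announcement', 'Second announcement']
```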
