I’ve been googling this all day without finding the answer, so apologies in advance if it has already been answered.
I’m trying to get all visible text from a large number of different websites. The reason is that I want to process the text to eventually categorize the websites.
After a couple of days of research, I decided that Selenium was my best bet. I’ve found a way to grab all the text with Selenium; unfortunately, the same text is grabbed multiple times:
```python
from selenium import webdriver
import codecs

filen = codecs.open('output.txt', encoding='utf-8', mode='w+')
driver = webdriver.Firefox()
driver.get("http://www.examplepage.com")
allelements = driver.find_elements_by_xpath("//*")
ferdigtxt = []
for i in allelements:
    if i.text in ferdigtxt:
        pass
    else:
        ferdigtxt.append(i.text)
        filen.writelines(i.text)
filen.close()
driver.quit()
```
The if condition inside the for loop is an attempt to eliminate the problem of fetching the same text multiple times – however, it only works as planned on some webpages. (It also makes the script a lot slower.)
I’m guessing the reason for my problem is that when I ask for the inner text of an element, I also get the inner text of all the elements nested inside it.
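That guess is easy to confirm. Here is a minimal illustration using the standard library’s ElementTree as a stand-in (Selenium’s `.text` behaves the same way on a live page):

```python
import xml.etree.ElementTree as ET

# A nested snippet: the <div> contains a <p>, which contains a <b>.
root = ET.fromstring('<div><p>Hello <b>world</b></p></div>')

# Asking every element for its full inner text (the equivalent of the
# //* XPath plus .text) repeats nested text once per ancestor level.
texts = [''.join(elt.itertext()) for elt in root.iter()]
print(texts)  # ['Hello world', 'Hello world', 'world']
```

The word "world" is emitted three times – once for each element that contains it – which is exactly the duplication described above.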
Is there any way around this? Is there some sort of master element whose inner text I could grab? Or a completely different approach that would let me reach my goal? Any help would be greatly appreciated, as I’m out of ideas.
Using lxml, you might try something like this:
```python
import contextlib
import selenium.webdriver as webdriver
import lxml.html as LH
import lxml.html.clean as clean

url = "http://www.yahoo.com"
ignore_tags = ('script', 'noscript', 'style')

with contextlib.closing(webdriver.Firefox()) as browser:
    browser.get(url)  # Load page
    content = browser.page_source
    cleaner = clean.Cleaner()
    content = cleaner.clean_html(content)
    with open('/tmp/source.html', 'w') as f:
        f.write(content.encode('utf-8'))
    doc = LH.fromstring(content)
    with open('/tmp/result.txt', 'w') as f:
        for elt in doc.iterdescendants():
            if elt.tag in ignore_tags:
                continue
            text = elt.text or ''
            tail = elt.tail or ''
            words = ' '.join((text, tail)).strip()
            if words:
                words = words.encode('utf-8')
                f.write(words + '\n')
```
Here’s a variation on @unutbu’s answer:
I’ve split your task into two parts:

- fetch pages
- extract text
The two pieces of code are connected only through the cache. You can fetch pages in one process and extract the text in another, or defer extraction and do it later with a different algorithm.
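A minimal sketch of that decoupling, with a `shelve`-backed cache (the function names and the stubbed-out fetcher are my own illustration, not the exact code from the answer; in practice `get_source` would call Selenium):

```python
import os
import shelve
import tempfile

cache_path = os.path.join(tempfile.mkdtemp(), 'pages')

def fetch(url, get_source):
    """Fetch a page only if it is not already cached; store the source on disk."""
    with shelve.open(cache_path) as cache:
        if url not in cache:
            cache[url] = get_source(url)
        return cache[url]

# Process 1: crawling. A stub stands in for the real Selenium call.
calls = []
def get_source(url):
    calls.append(url)
    return '<html><body>page for %s</body></html>' % url

fetch('http://example.com', get_source)
fetch('http://example.com', get_source)  # second call is served from the cache

# Process 2 (possibly much later): extraction reads only the cache,
# so it needs no browser at all.
with shelve.open(cache_path) as cache:
    html = cache['http://example.com']

print(len(calls))  # the real fetch ran only once
```

Because the extraction step touches only the cache, you can re-run it with a different text-extraction algorithm without re-crawling anything.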