I’m trying to scrape all the inner HTML from the <p> elements in a web page using BeautifulSoup. There are internal tags, but I don’t care about them; I just want the inner text.
For example, for:
    <p>Red</p>
    <p><i>Blue</i></p>
    <p>Yellow</p>
    <p>Light <b>green</b></p>
How can I extract:
    Red
    Blue
    Yellow
    Light green
.contents doesn’t do what I need. Nor does .extract(), because I don’t want to have to specify the internal tags in advance; I want to deal with any that may occur.
Is there a ‘just get the visible HTML’ type of method in BeautifulSoup?
On advice, trying:
    soup = BeautifulSoup(open("test.html"))
    p_tags = soup.findAll('p',text=True)
    for i, p_tag in enumerate(p_tags):
        print str(i) + p_tag
But that doesn’t help – it prints out:
0Red 1 2Blue 3 4Yellow 5 6Light 7green 8
To clarify, a working piece of code:
    txt = """
    <p>Red</p>
    <p><i>Blue</i></p>
    <p>Yellow</p>
    <p>Light <b>green</b></p>
    """
    import BeautifulSoup
    BeautifulSoup.__version__  # '3.0.7a'
    soup = BeautifulSoup.BeautifulSoup(txt)
    for node in soup.findAll('p'):
        # findAll(text=True) returns every text node under the tag
        print ''.join(node.findAll(text=True))
    # prints:
    # Red
    # Blue
    # Yellow
    # Light green
The accepted answer is great but it is 6 years old now, so here’s the current Beautiful Soup 4 version of this answer:
    txt = """
    <p>Red</p>
    <p><i>Blue</i></p>
    <p>Yellow</p>
    <p>Light <b>green</b></p>
    """
    from bs4 import BeautifulSoup, __version__
    __version__  # '4.5.1'
    soup = BeautifulSoup(txt, "html.parser")
    # soup.strings yields every text node in the document
    print("".join(soup.strings))
    # prints (surrounding blank lines omitted):
    # Red
    # Blue
    # Yellow
    # Light green
I have stumbled upon this very same problem and wanted to share the 2019 version of this solution. Maybe it helps somebody out.
    # importing the modules
    from bs4 import BeautifulSoup
    from urllib.request import urlopen

    # setting up your BeautifulSoup object
    webpage = urlopen("https://insertyourwebpage.com")
    soup = BeautifulSoup(webpage.read(), features="lxml")
    p_tags = soup.find_all('p')
    for each in p_tags:
        print(str(each.get_text()))
Notice that we first loop over the list of <p> tags one by one and THEN call the get_text() method on each, which strips the tags so that only the text is printed.
- It is better to use the updated find_all() in bs4 than the older findAll().
- In Python 3, urllib2 was replaced by urllib.request and urllib.error.
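For a page containing the example HTML from the question, this prints the text of each <p> element on its own line, with the inner tags removed.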
Hope this helps someone looking for an updated solution.
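To try the get_text() approach without fetching a page, here is a minimal sketch that parses the question's sample HTML from a string. The variable names html and texts are only illustrative, and the built-in "html.parser" is used so that lxml does not have to be installed.

```python
from bs4 import BeautifulSoup

# Sample markup from the question, kept in a string instead of fetched from a URL.
html = """
<p>Red</p>
<p><i>Blue</i></p>
<p>Yellow</p>
<p>Light <b>green</b></p>
"""

soup = BeautifulSoup(html, "html.parser")

# get_text() joins every text node inside the tag, whatever tags are nested in it.
texts = [p.get_text() for p in soup.find_all("p")]

for text in texts:
    print(text)

# Join the paragraphs with spaces to get the one-line form asked for in the question.
joined = " ".join(texts)
print(joined)
```

Because get_text() is called on each <p> separately, "Light " and "green" stay together as one item even though they sit in different text nodes.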
Normally the data scraped from a website will contain tags. To drop those tags and show only the text content, you can use the text attribute.
    from BeautifulSoup import BeautifulSoup
    import urllib2

    url = urllib2.urlopen("https://www.python.org")
    content = url.read()
    soup = BeautifulSoup(content)
    title = soup.findAll("title")
    paragraphs = soup.findAll("p")
    # findAll() returns a list, so index into it before using .text
    print paragraphs[1]       # Second paragraph with tags
    print paragraphs[1].text  # Second paragraph without tags
In this example, I collect all paragraphs from the Python site and display the second one with tags and without tags.
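The same with-tags and without-tags comparison can be sketched in Beautiful Soup 4 on the question's sample HTML, so it runs without a network connection. The names html, with_tags and without_tags are hypothetical; .text is the bs4 property that returns the same string as get_text().

```python
from bs4 import BeautifulSoup

# Sample markup from the question, used here in place of a downloaded page.
html = "<p>Red</p> <p><i>Blue</i></p> <p>Yellow</p> <p>Light <b>green</b></p>"

soup = BeautifulSoup(html, "html.parser")
paragraphs = soup.find_all("p")

# find_all() returns a list-like ResultSet, so pick one element before reading .text.
with_tags = str(paragraphs[1])     # the second paragraph, markup included
without_tags = paragraphs[1].text  # the second paragraph, text only

print(with_tags)
print(without_tags)
```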
First, convert the HTML to a string using str. Then use the following code in your program:
    import re
    x = str(soup.find_all('p'))
    content = str(re.sub("<.*?>", "", x))
This is called a regex (regular expression). This one removes every HTML tag, that is, anything from a < to the next > inclusive, and leaves the text between the tags in place.
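As a rough illustration of what that pattern does, here is a standard-library-only sketch applied to the question's sample HTML. The names TAG_RE, paragraphs and texts are made up for this example, and the second pattern, which captures the body of each <p>, is an assumption that only holds for simple markup like this (no attributes on <p>, no nesting of <p> tags).

```python
import re

# Sample markup from the question, stored as a plain string.
html = "<p>Red</p> <p><i>Blue</i></p> <p>Yellow</p> <p>Light <b>green</b></p>"

# Non-greedy match: a "<", then as few characters as possible, then the next ">".
TAG_RE = re.compile(r"<.*?>")

# Capture the body of each <p> element, then strip any tags nested inside it.
paragraphs = re.findall(r"<p>(.*?)</p>", html, flags=re.DOTALL)
texts = [TAG_RE.sub("", p) for p in paragraphs]

print(texts)
```

Note that applying the regex to str(soup.find_all('p')) as a whole also keeps the list's brackets and commas in the result, which is why this sketch strips each paragraph separately. Regexes are fragile on real-world HTML (comments, scripts, attributes containing ">"), so a parser-based method such as get_text() is usually the safer choice.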