Attempting A Nested Scrape Using BeautifulSoup
Solution 1:
If you were having problems with nextSibling, it's because your HTML actually looks like this:
<h1><a name="hello">Hello</a></h1>\n #<---newline
<div class="colmask">
See the newline after the </h1>? Even though a newline is invisible, it is still considered text, so it becomes a BeautifulSoup element (a NavigableString), and it is the nextSibling of the <h1> tag.
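To see this for yourself, here is a minimal sketch. It assumes BeautifulSoup 4 (the bs4 package) with the built-in html.parser, where nextSibling is spelled next_sibling; the inline HTML string is just the two lines above with an empty <div> added so the snippet is complete.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# The two lines from above, with a real newline between the tags
html = '<h1><a name="hello">Hello</a></h1>\n<div class="colmask"></div>'
soup = BeautifulSoup(html, "html.parser")

h1 = soup.find("h1")
sibling = h1.next_sibling  # bs4 spelling of nextSibling

# The first sibling is the newline text node, not the <div>
print(repr(sibling))
print(isinstance(sibling, NavigableString))

# The <div> only shows up one hop further along
print(sibling.next_sibling.name)
```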
Newlines can also cause problems when trying to get, say, the third child of the following <div>:
<div>
    <div>hello</div>
    <div>world</div>
    <div>goodbye</div>
</div>
Here is the numbering of the children:
<div>\n                   #<---newline plus spaces at start of next line = child 0
    <div>hello</div>\n    #<---newline plus spaces at start of next line = child 2
    <div>world</div>\n    #<---newline plus spaces at start of next line = child 4
    <div>goodbye</div>\n  #<---newline = child 6
</div>
The divs are actually children numbers 1, 3, and 5. If you are having trouble parsing HTML, the newlines at the end of each line are very often what is tripping you up. They always have to be accounted for when you work out where things are located.
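You can check the numbering by listing the children of the outer <div>. This is a rough sketch assuming BeautifulSoup 4 with html.parser; the variable names (outer, tag_positions, third) are only for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = """<div>
    <div>hello</div>
    <div>world</div>
    <div>goodbye</div>
</div>"""
soup = BeautifulSoup(html, "html.parser")
outer = soup.find("div")

# Every child, whitespace text nodes included
for index, child in enumerate(outer.contents):
    print(index, repr(child))

# Positions of the children that are real tags
tag_positions = [i for i, c in enumerate(outer.contents) if isinstance(c, Tag)]
print(tag_positions)

# Asking only for direct <div> children sidesteps the whitespace
third = outer.find_all("div", recursive=False)[2]
print(third.get_text())
```

Passing recursive=False keeps the search to direct children, so the index into the result counts tags only.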
To get the <div> tag here:
<h1><a name="hello">Hello</a></h1>\n #<---newline
<div class="colmask">
...you could write:
h1.nextSibling.nextSibling
But to skip ALL the whitespace between tags, it's easier to use findNextSibling(), which allows you to specify the tag name of the next sibling you want to locate:
findNextSibling('div')
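A small comparison of the two approaches, assuming BeautifulSoup 4 with html.parser (where the method is spelled find_next_sibling). The HTML comment between the tags is a made-up addition, there only to show how counting siblings by hand can go wrong when something other than a single whitespace node sits in between.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical variant of the markup: a comment sits between <h1> and <div>
html = (
    '<h1><a name="hello">Hello</a></h1>\n'
    '<!-- section start -->\n'
    '<div class="colmask">content</div>'
)
soup = BeautifulSoup(html, "html.parser")
h1 = soup.find("h1")

# Two hops land on the comment here, not on the <div>
two_hops = h1.next_sibling.next_sibling
print(type(two_hops).__name__)

# Searching by tag name ignores text and comment nodes along the way
by_name = h1.find_next_sibling("div")
print(by_name["class"])
```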
Here is an example:
# BeautifulSoup 3 import; with BeautifulSoup 4 use: from bs4 import BeautifulSoup
from BeautifulSoup import BeautifulSoup

with open('data2.txt') as f:
    html = f.read()

soup = BeautifulSoup(html)

for h1 in soup.findAll('h1'):
    # Jump over the whitespace text node straight to the following <div>
    colmask_div = h1.findNextSibling('div')

    for box_div in colmask_div.findAll('div'):
        h4 = box_div.find('h4')

        for ul in box_div.findAll('ul'):
            print('{} : {} : {}'.format(h1.text, h4.text, ul.li.a.text))
--output:--
Hello : My Favorite Number is : 1
Hello : My Favorite Number is : 2
Hello : My Favorite Number is : 3
Hello : My Favorite Number is : 4
Hello : Your Favorite Number is : 1
Hello : Your Favorite Number is : 2
Hello : Your Favorite Number is : 3
Hello : Your Favorite Number is : 4
Goodbye : Their Favorite Number is : 1
Goodbye : Their Favorite Number is : 2
Goodbye : Their Favorite Number is : 3
Goodbye : Their Favorite Number is : 4
Goodbye : Our Favorite Number is : 1
Goodbye : Our Favorite Number is : 2
Goodbye : Our Favorite Number is : 3
Goodbye : Our Favorite Number is : 4
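If you don't have data2.txt handy, the same loop can be tried against an inline string. This is only a sketch: it assumes BeautifulSoup 4 with html.parser, and the HTML below is a cut-down, hypothetical stand-in whose structure is guessed from the tags the loop visits (the "box" class name in particular is invented).

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for data2.txt, much shorter than the real file
html = """
<h1><a name="hello">Hello</a></h1>
<div class="colmask">
  <div class="box">
    <h4>My Favorite Number is</h4>
    <ul><li><a href="#">1</a></li></ul>
    <ul><li><a href="#">2</a></li></ul>
  </div>
</div>
<h1><a name="goodbye">Goodbye</a></h1>
<div class="colmask">
  <div class="box">
    <h4>Their Favorite Number is</h4>
    <ul><li><a href="#">1</a></li></ul>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

rows = []
for h1 in soup.find_all("h1"):
    # The newline after </h1> is skipped because we search by tag name
    colmask_div = h1.find_next_sibling("div")
    for box_div in colmask_div.find_all("div"):
        h4 = box_div.find("h4")
        for ul in box_div.find_all("ul"):
            rows.append("{} : {} : {}".format(
                h1.get_text(), h4.get_text(), ul.li.a.get_text()))

print("\n".join(rows))
```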