Get BeautifulSoup To Correctly Parse PHP Tags Or Ignore Them
Solution 1:
Your best bet is to remove all of the PHP elements from the document before giving it to BeautifulSoup to parse. This can be done using a regular expression to spot all PHP sections and replace them with safe placeholder text.
After carrying out all of your modifications using BeautifulSoup, the original PHP expressions can then be restored in place of the placeholders.
As the PHP can be anywhere, including inside a quoted string, it is best to use a simple, unique placeholder string rather than trying to wrap it in an HTML comment (see php_sig).
re.sub() can be given a function as its replacement argument. Each time a substitution is made, the original PHP code is stored in a list (php_elements). The reverse is done afterwards: each instance of php_sig is replaced with the next element from php_elements. If all goes well, php_elements should be empty at the end; if it is not, your modifications have removed a placeholder.
from bs4 import BeautifulSoup
import re
html = """<html><body><?php $stars = $this->getData('sideBarCoStars', []);
if (!$stars) return;
$sideBarCoStarsCount = $this->getData('sideBarCoStarsCount');
$title = $this->getData('sideBarCoStarsTitle');
$viewAllUrl = $this->getData('sideBarCoStarsViewAllUrl');
$isDomain = $this->getData('isDomain');
$lazy_load = $lazy_load ?? 0;
$imageSrc = $this->getData('emptyImageData');
?><header><h3><a href="<?php echo $viewAllUrl; ?>" class="noContentLink white"><?php echo "{$title} ({$sideBarCoStarsCount})"; ?></a></h3></body>"""
php_sig = '!!!PHP!!!'
php_elements = []
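def php_remove(m):
    # Store the matched PHP block and leave the placeholder in its place
    php_elements.append(m.group())
    return php_sig

def php_add(m):
    # Restore the PHP blocks in the order in which they were removed
    return php_elements.pop(0)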
# Pre-parse HTML to remove all PHP elements
html = re.sub(r'<\?php.*?\?>', php_remove, html, flags=re.S)
soup = BeautifulSoup(html, "html.parser")
# Make modifications to the soup
# Do not remove any elements containing PHP elements
# Post-parse HTML to replace the PHP elements
html = re.sub(re.escape(php_sig), php_add, soup.prettify())
print(html)
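The check described above (php_elements should be empty at the end) can be made explicit. The following is a minimal sketch rather than part of the original answer: the helper names mask_php and unmask_php, the constant names, and the choice to raise ValueError are assumptions made for illustration. It uses only the re module, so the BeautifulSoup step is left as a comment.

```python
import re

PHP_SIG = "!!!PHP!!!"  # placeholder; assumed never to occur in the real markup
PHP_BLOCK = re.compile(r"<\?php.*?\?>", flags=re.S)


def mask_php(html):
    """Swap every PHP block for PHP_SIG and return (masked_html, php_elements)."""
    php_elements = []

    def php_remove(m):
        php_elements.append(m.group())
        return PHP_SIG

    return PHP_BLOCK.sub(php_remove, html), php_elements


def unmask_php(html, php_elements):
    """Put the stored PHP blocks back in order; fail loudly on a count mismatch."""
    remaining = list(php_elements)  # work on a copy so the caller's list is kept

    def php_add(m):
        if not remaining:
            raise ValueError("more placeholders than stored PHP blocks")
        return remaining.pop(0)

    restored = re.sub(re.escape(PHP_SIG), php_add, html)
    if remaining:
        # A placeholder was lost while the document was being modified
        raise ValueError(f"{len(remaining)} PHP block(s) were not restored")
    return restored


source = '<a href="<?php echo $url; ?>"><?php echo $title; ?></a>'
masked, blocks = mask_php(source)
# ... parse `masked` with BeautifulSoup, modify it, and serialise it here ...
assert unmask_php(masked, blocks) == source
```

Because the placeholders are all identical, this only works if the modifications keep them in their original order. If elements may be reordered, a numbered placeholder per block would be a safer variant.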