python - Requesting a robust regex that checks if an img tag contains the alt attribute in an HTML document
I'm writing a Python script to check the img tags in HTML documents. It should check whether alt="" is present inside the img tag, then print out the line number.
The regex has to factor in the different orders the contents can appear in, e.g.:
<img class="" alt="" src=""> <img class="" src=""> <img src="" class=""> <img src="">
So, to summarise: a regex that checks which attributes of an img tag are present must account for the range of possible arrangements.
Thank you.
Using regexes to evaluate HTML is a bit risky, but if you're willing to accept the shortcomings*, this will work using positive lookahead assertions:
import re
regex = re.compile(r'<img (?=[^>]*\balt=")(?=[^>]*\bsrc=")(?=[^>]*\bclass=")')
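This will match if the current string contains <img followed (within the same tag) by alt=", src=" and class=", in any order. Each lookahead starts from the same position, so the order in which the attributes appear does not matter.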
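For the goal stated in the question (reporting the line numbers of img tags that lack alt), the same idea can be turned around with a negative lookahead. What follows is only a minimal sketch under some assumptions: the function name find_imgs_missing_alt and the sample string are made up for illustration, and the pattern has the same blind spots as any regex applied to HTML (for example, it would be fooled by an attribute such as data-alt).

```python
import re

# Negative lookahead: match '<img' only if no 'alt=' follows within the same tag.
# This pattern is an illustrative assumption, not part of the original answer.
IMG_WITHOUT_ALT = re.compile(r'<img\b(?![^>]*\balt\s*=)[^>]*>', re.IGNORECASE)

def find_imgs_missing_alt(html):
    """Return (line_number, tag_text) for each img tag with no alt attribute."""
    results = []
    for match in IMG_WITHOUT_ALT.finditer(html):
        # Line number = number of newlines before the match, plus one.
        line_number = html.count("\n", 0, match.start()) + 1
        results.append((line_number, match.group(0)))
    return results

# The four example tags from the question, one per line.
sample = (
    '<img class="" alt="" src="">\n'
    '<img class="" src="">\n'
    '<img src="" class="">\n'
    '<img src="">'
)

for line_number, tag in find_imgs_missing_alt(sample):
    print(line_number, tag)
```

Counting newlines up to match.start() lets the whole document be searched in one pass, so a tag that is split across several lines is still found and reported at the line where it begins.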
Explanation:
<img                # match '<img '
(?=                 # assert that it's possible to match the following from this position:
 [^>]*              # any number of characters except >
 \b                 # word boundary (here: start of a word)
 alt="              # the literal text 'alt="'
)                   # end of lookahead
(?=[^>]*\bsrc=")    # same for `src`, from the same position as before
(?=[^>]*\bclass=")  # same for `class`, from the same position as before
*Of course, a regex is ignorant of whether the tag it's matching is within a comment, interrupted by a comment, malformed, surrounded by <pre> tags, or in any other situation that might change its meaning to an actual HTML parser.
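If those shortcomings matter, one alternative is to let a real parser find the tags. The sketch below uses html.parser from the Python standard library; the class name ImgAltChecker and the helper lines_missing_alt are hypothetical names chosen for this example, and it is meant as an illustration rather than a drop-in replacement for the regex.

```python
from html.parser import HTMLParser

class ImgAltChecker(HTMLParser):
    """Collect the line numbers of img tags that have no alt attribute."""

    def __init__(self):
        super().__init__()
        self.missing = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag and names arrive lowercased.
        if tag == "img" and "alt" not in dict(attrs):
            line_number, _offset = self.getpos()
            self.missing.append(line_number)

def lines_missing_alt(html):
    """Parse the document and return line numbers of img tags lacking alt."""
    checker = ImgAltChecker()
    checker.feed(html)
    checker.close()
    return checker.missing

# The img inside the comment is ignored because the parser treats it as a comment.
document = '<img alt="" src="">\n<!-- <img src=""> -->\n<img src="">'
for line_number in lines_missing_alt(document):
    print("img without alt on line", line_number)
```

Because the parser hands over attributes as name/value pairs, their order and quoting style no longer need to be encoded in a pattern.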