python - Requesting a robust regex that checks if an img tag contains the alt attribute in an HTML document


I'm writing a Python script to check the img tags in HTML documents. It should check whether alt="" is present inside each img tag, and then print out the line number.

The regex has to factor in the different possible orders of the attributes, e.g.:

<img class="" alt="" src="">
<img class="" src="">
<img src="" class="">
<img src="">

So, to summarise: a regex that checks which attributes of an img tag are present must account for the range of possible arrangements.

Thank you.

Using regexes to evaluate HTML is a bit risky, but if you're willing to accept the shortcomings*, you can make it work using positive lookahead assertions:

regex = re.compile(r'<img (?=[^>]*\balt=")(?=[^>]*\bsrc=")(?=[^>]*\bclass=")') 

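This will match if the current string contains <img that's followed (within the same tag) by alt=", src=" and class=", in any order. Each lookahead scans forward from the same position without consuming any characters, so the three checks are independent of one another.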

Explanation:

<img                # Match '<img '
(?=                 # Assert that it's possible to match the following from the current position:
 [^>]*              #  any number of characters except >
 \b                 #  a word boundary (here: the start of a word)
 alt="              #  the literal text 'alt="'
)                   # End of lookahead
(?=[^>]*\bsrc=")    # Same for `src`, from the same position as before
(?=[^>]*\bclass=")  # Same for `class`, from the same position as before
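To connect this back to the question (report the line number of each img tag that lacks alt), here is a minimal sketch that applies the lookahead idea to the alt check only. The names IMG_TAG, HAS_ALT and img_tags_missing_alt are hypothetical, and the pattern is loosened slightly from the one above (it uses \b after <img instead of a literal space, and ignores case); both are assumptions, not part of the original answer. It inherits the same limitations: it only recognises alt=" with double quotes, and it would also accept something like data-alt=".

```python
import re

# Hypothetical helper pattern: any complete <img ...> tag.
IMG_TAG = re.compile(r'<img\b[^>]*>', re.IGNORECASE)
# The lookahead from the answer, reduced to the alt check.
HAS_ALT = re.compile(r'<img\b(?=[^>]*\balt=")', re.IGNORECASE)


def img_tags_missing_alt(html):
    """Return (line_number, tag_text) for each <img> tag without alt="."""
    missing = []
    for match in IMG_TAG.finditer(html):
        tag_text = match.group(0)
        if not HAS_ALT.match(tag_text):
            # Line number = newlines before the tag start, plus one.
            line_number = html.count("\n", 0, match.start()) + 1
            missing.append((line_number, tag_text))
    return missing


if __name__ == "__main__":
    sample = (
        '<p>\n'
        '<img class="" alt="" src="">\n'
        '<img class="" src="">\n'
        '<img src="" class="">\n'
        '<img src="">\n'
        '</p>'
    )
    for line_number, tag_text in img_tags_missing_alt(sample):
        print(f"line {line_number}: missing alt in {tag_text}")
```

Counting newlines up to match.start() keeps the sketch working when the whole document is read into one string, which also lets a tag span several lines.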

*Of course, a regex is ignorant of whether the tag it's matching is within a comment, interrupted by a comment, malformed, surrounded by <pre> tags, or in any other situation that might change its meaning to an actual HTML parser.
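If those shortcomings matter, one alternative is to let a real parser find the tags. The following is only a sketch, using html.parser from the Python standard library; it is not part of the original answer, and the names ImgAltChecker and lines_of_imgs_missing_alt are hypothetical. It relies on HTMLParser.getpos(), which returns the current (line, offset) position while a tag is being handled.

```python
from html.parser import HTMLParser


class ImgAltChecker(HTMLParser):
    """Collect the line numbers of <img> tags that have no alt attribute."""

    def __init__(self):
        super().__init__()
        self.missing = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lower-cased; attrs is a list of
        # (name, value) pairs in whatever order they appear in the source.
        if tag == "img" and "alt" not in dict(attrs):
            line_number, _offset = self.getpos()
            self.missing.append(line_number)


def lines_of_imgs_missing_alt(html):
    checker = ImgAltChecker()
    checker.feed(html)
    checker.close()
    return checker.missing


if __name__ == "__main__":
    sample = (
        '<!-- <img src="commented-out"> -->\n'
        '<img src="a">\n'
        '<img src="b" alt="">\n'
    )
    for line_number in lines_of_imgs_missing_alt(sample):
        print(f"line {line_number}: img tag without alt")
```

Because the parser treats the comment as a comment, the commented-out img tag is not reported, which is the kind of case the regex cannot distinguish.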

