Python regex slow when whitespace in string


I would like to match strings using Python's regex module.

In my case I want to verify that strings start with, end with, and consist of upper case letters (or digits) joined by "_". As an example, the following string is valid: "MY_HERO2". The following strings are not valid: "_MY_HREO2", "MY HERO2", "MY_HERO2_".

To validate a string I use this code:

    import re

    my_string = "MY_HERO"
    p = re.compile("^([A-Z,0-9]+_??)+[A-Z,0-9]$")
    if p.match(my_string):
        print("validated")

So what is the problem? Validating a long string containing whitespace is very, very slow. How can I avoid this? Is my pattern wrong? What is the reason for this behavior?

Here are some numbers:

    MY_HERO2 --> 53 ms
    MY_SUPER_GREAT_UNBELIEVABLE_HERO --> 69 microseconds
    MY_SUPER_GREAT_UNBELIEVABLE HERO --> 223576 microseconds
    MY_SUPER_GREAT_UNBELIEVABLE_STRONG_HERO --> 15 microseconds
    MY_SUPER_GREAT_UNBELIEVABLE_STRONG HERO --> 979429 microseconds

Thanks for your answers and responses in advance. :-) Paul

So what is the problem?

The problem is catastrophic backtracking. The regex engine tries a whole lot of variations, and that takes a lot of time.

Let's try this with a pretty simple example: A_B D.

The engine first matches A with [A-Z,0-9]+ and then tries _??. Since that part is optional and lazy, it skips it for now, and with that it has finished one pass of ([A-Z,0-9]+_??)+.

Now the engine tries to match [A-Z,0-9], but the next character in the string is _, so it fails and needs to backtrack. It re-enters ([A-Z,0-9]+_??)+ where it failed last time, tries _??, and this time succeeds.

Now the engine exits ([A-Z,0-9]+_??)+ again and tries to match [A-Z,0-9], which succeeds on B. Then it tries to match the end of string $, which fails because a space follows, so it backtracks and enters ([A-Z,0-9]+_??)+ yet again.

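I hope you see where this is going, as I am tired of writing it out and we haven't even reached the space character yet. (Actually, any character your regex does not accept, such as # or %, will cause this, not just whitespace.) And this is a tiny example. With your long strings the engine has to do this an enormous number of times, because the number of variations can grow exponentially with the length of the input, until it either matches the entire string or fails; hence the great amount of time.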
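The growth described above can be observed directly. The following is a minimal sketch, assuming the uppercase character classes the question describes; the helper name time_failure and the chosen input lengths are made up for this illustration, and the absolute timings will differ from machine to machine.

```python
import re
import time

# The pattern from the question (uppercase classes assumed).
SLOW = re.compile(r"^([A-Z,0-9]+_??)+[A-Z,0-9]$")


def time_failure(n):
    """Time one failed match on n letters followed by a space.

    Hypothetical helper for this illustration only.
    """
    text = "A" * n + " "          # the trailing space makes the match fail
    start = time.perf_counter()
    result = SLOW.match(text)     # the engine tries many splits before giving up
    elapsed = time.perf_counter() - start
    return result, elapsed


# Keep n small: each extra letter gives the engine more ways to split the run.
for n in (10, 12, 14, 16, 18):
    result, elapsed = time_failure(n)
    print(n, result, "%.6f s" % elapsed)
```

Every call returns None, and the elapsed time should climb steeply as n grows even though the input gets only a couple of characters longer each time.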

Validating a long string containing whitespace is very, very slow.

Again, this is due to backtracking and the hell of variations.

How can I avoid this?

You can use this regex instead:

    ^([A-Z0-9]_?)*[A-Z0-9]$

This ensures the string starts with an uppercase letter or digit followed by an optional _, repeats that zero or more times, and makes sure there is an uppercase letter or digit at the end.
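As a quick check, here is a small sketch that runs the suggested pattern against the example strings from the question. It assumes the uppercase character classes the question asks for; the function name is_valid is made up for this example.

```python
import re

# Suggested pattern, written with uppercase classes (an assumption based on
# the question's "upper case letters" requirement).
FAST = re.compile(r"^([A-Z0-9]_?)*[A-Z0-9]$")


def is_valid(name):
    """Return True if name passes the suggested pattern (hypothetical helper)."""
    return FAST.match(name) is not None


# The four example strings from the question, plus one long string with a space.
candidates = (
    "MY_HERO2",
    "_MY_HREO2",
    "MY HERO2",
    "MY_HERO2_",
    "MY_SUPER_GREAT_UNBELIEVABLE_STRONG HERO",
)
for candidate in candidates:
    print(repr(candidate), is_valid(candidate))
```

Only "MY_HERO2" is accepted. Each pass through the group consumes exactly one letter or digit, so there is only one way to walk through the string and the long input with a space is rejected without the blow-up.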

Is my pattern wrong? What is the reason for this behavior?

Your expression is not wrong, but it is highly inefficient.

    ([A-Z,0-9]+_??)+[A-Z,0-9]$
              ^ ^^ ^

See the marked spots: they are a lot of trouble. The two ?? together with the two +, one inside the group and one outside, are hell :)
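To get a feel for why a + inside the group combined with a + outside it hurts, the sketch below counts the ways a run of n letters can be divided among repetitions of the group. It is only an illustration of the growth rate, not a measurement of the engine's actual step count, and the function name count_splits is made up for this example.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_splits(n):
    """Count the ways to cut a run of n characters into non-empty pieces.

    Each piece stands for one repetition of the group, with the inner +
    consuming that many characters. Hypothetical helper for illustration.
    """
    if n == 0:
        return 1  # nothing left to split: exactly one (empty) way
    # The first repetition takes `first` characters; the rest is split recursively.
    return sum(count_splits(n - first) for first in range(1, n + 1))


# For n >= 1 the count equals 2 ** (n - 1), so it doubles with every letter.
for n in (1, 2, 4, 8, 16):
    print(n, count_splits(n))
```

When the overall match is doomed to fail, a backtracking engine may end up trying every one of these splits before giving up, which is why one unexpected character at the end of a long string is so expensive.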
