Python regex slow when whitespace in string
I match strings using Python's re module.
In my case I want to verify that the strings start with, end with, and consist of lowercase letters and digits combined with "_". For example, the following string is valid: "my_hero2". The following strings are not valid: "_my_hero2", "my hero2", "my_hero2_".
To validate a string I use this code:
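    import re

    my_string = "my_hero"
    # Pattern under test: one or more groups of characters, each optionally
    # followed by "_", then one final character.
    p = re.compile(r"^([a-z,0-9]+_??)+[a-z,0-9]$")
    if p.match(my_string):
        print("validated")

So what is the problem? Validating a long string that contains whitespace is very, very slow. How can I avoid this? Is my pattern wrong? What is the reason for this behavior?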
Here are some numbers:

    my_hero2                                 --> 53 ms
    my_super_great_unbelievable_hero         --> 69 microseconds
    my_super_great_unbelievable hero         --> 223576 microseconds
    my_super_great_unbelievable_strong_hero  --> 15 microseconds
    my_super_great_unbelievable_strong hero  --> 979429 microseconds

Thanks for any answers and responses in advance. :-)
Paul
So what is the problem?
The problem is catastrophic backtracking. The regex engine tries a whole lot of variations, and that takes a lot of time.
Let's walk through a pretty simple example: a_b d.
The engine first matches a with [a-z,0-9]+ and then tries _??. Since that part is optional and lazy, the engine skips it for now, and with that it has finished one pass of ([a-z,0-9]+_??)+.
Now the engine tries to match [a-z,0-9], but the next character in the string is _, so it fails and needs to backtrack. It re-enters ([a-z,0-9]+_??)+ at the point where it failed last time, tries _??, and this time succeeds.
Now the engine exits ([a-z,0-9]+_??)+ again and tries to match [a-z,0-9], which succeeds. It then tries to match the end-of-string $, which fails, so it backtracks and enters ([a-z,0-9]+_??)+ again.
I hope you see where this is going. I am tired of writing and we haven't even reached the space character yet. (Actually, any character that is not accepted by your regex, such as # or %, will cause this, not only whitespace.) And this is a small, tiny example. With your long strings, the engine has to do this hundreds and hundreds of times until it either matches the entire string or fails, hence the great amount of time.
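To make the "whole lot of variations" concrete, here is a minimal sketch that does not use the regex engine at all. Because `_??` may match nothing, one pass of the group can stop after any character in a run of letters, so the outer `+` can cover a run in every possible way of cutting it into chunks. The helper `splits` is a hypothetical name introduced only for this illustration; it simply enumerates those cuts.

```python
def splits(run):
    """Yield every way to cut `run` into consecutive non-empty chunks.

    Hypothetical helper for illustration: each chunk stands for one pass
    of the group ([a-z,0-9]+_??) when _?? matches nothing.
    """
    if not run:
        yield []
        return
    for i in range(1, len(run) + 1):
        for rest in splits(run[i:]):
            yield [run[:i]] + rest


for n in (3, 5, 10):
    count = sum(1 for _ in splits("a" * n))
    print(n, count)  # prints: 3 4, then 5 16, then 10 512
```

The count doubles with every extra character (2 to the power n-1 for a run of n characters), which is why a failing match on a long string can take so long: the engine may have to rule out each of these variations before it gives up.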
Validating a long string containing whitespace is very, very slow.
Again, this is due to backtracking and the hell of variations.
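If you want to see the effect yourself, a rough timing sketch along these lines should do. It compares the pattern from the question with the replacement pattern `^([a-z0-9]_?)*[a-z0-9]$` on one valid and one invalid string taken from the question. The names `SLOW`, `FAST` and `time_match` are made up for this example, and a single `time.perf_counter()` measurement is only a crude indication, not a proper benchmark.

```python
import re
import time

# Pattern from the question and the replacement pattern (names are illustrative).
SLOW = re.compile(r"^([a-z,0-9]+_??)+[a-z,0-9]$")
FAST = re.compile(r"^([a-z0-9]_?)*[a-z0-9]$")


def time_match(pattern, text):
    """Return (matched, elapsed_seconds) for one match attempt."""
    start = time.perf_counter()
    matched = pattern.match(text) is not None
    return matched, time.perf_counter() - start


for text in ("my_super_great_unbelievable_hero",
             "my_super_great_unbelievable hero"):
    for name, pattern in (("slow", SLOW), ("fast", FAST)):
        matched, elapsed = time_match(pattern, text)
        print(f"{name} {text!r}: matched={matched}, "
              f"{elapsed * 1e6:.0f} microseconds")
```

The absolute numbers depend on the machine and the Python version, but the failing string should stand out clearly for the original pattern and not for the replacement.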
How can I avoid this?
You can use this regex instead:
    ^([a-z0-9]_?)*[a-z0-9]$

This ensures that the string starts with a lowercase letter or digit followed by an optional _, repeats that zero or more times, and makes sure there is a lowercase letter or digit at the end.
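As a quick check, the sketch below runs this pattern against strings like the ones in the question. `FIXED` and `is_valid` are names chosen for this example only.

```python
import re

# The replacement pattern: each pass of the group consumes exactly one
# letter or digit plus an optional "_", so there is only one way to match.
FIXED = re.compile(r"^([a-z0-9]_?)*[a-z0-9]$")


def is_valid(text):
    """Return True if `text` passes the replacement pattern."""
    return FIXED.match(text) is not None


for text in ("my_hero2", "_my_hero2", "my hero2", "my_hero2_"):
    print(repr(text), is_valid(text))
# 'my_hero2' True
# '_my_hero2' False
# 'my hero2' False
# 'my_hero2_' False
```

One thing to keep in mind: in Python, `$` also matches just before a trailing newline, so "my_hero2\n" would pass as well. If that matters, `re.fullmatch` or `\Z` instead of `$` would be the stricter choice.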
Is my pattern wrong? What is the reason for this behavior?
Your expression is not wrong, but it is highly inefficient.
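    ([a-z,0-9]+_??)+[a-z,0-9]$
              ^  ^ ^
              |  | +-- the outer +
              |  +---- the lazy ??
              +------- the inner +

See those two +? They are a lot of trouble. Put the ?? together with the two +, one on the inside and one on the outside, and you get hell :)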