Python regex slow when whitespace in string
I would like to match strings using Python's regex module (re).
In my case I want to verify that strings start with, end with, and consist only of letters and digits combined with "_". For example, the following string is valid: "my_hero2". The following strings are not valid: "_my_hreo2", "my hero2", "my_hero2_".
To validate a string I use this code:
    import re

    my_string = "my_hero"
    p = re.compile("^([a-z,0-9]+_??)+[a-z,0-9]$")
    if p.match(my_string):
        print("validated")
So what is the problem? Validating a long string containing whitespace is very, very slow. How can I avoid this? Is my pattern wrong? What is the reason for this behavior?
Here are some numbers:
    my_hero2 --> 53 ms
    my_super_great_unbelievable_hero --> 69 microseconds
    my_super_great_unbelievable hero --> 223576 microseconds
    my_super_great_unbelievable_strong_hero --> 15 microseconds
    my_super_great_unbelievable_strong hero --> 979429 microseconds
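Measurements like these can be reproduced with the standard timeit module. The following is only a minimal sketch: the helper name time_match and the repeat count are assumptions made for illustration, not taken from the post, and absolute numbers will vary with the machine and the Python version.

```python
import re
import timeit

# Pattern copied from the question.
SLOW = re.compile(r"^([a-z,0-9]+_??)+[a-z,0-9]$")


def time_match(pattern, text, repeat=3):
    """Best time in microseconds for a single pattern.match(text) call.

    Hypothetical helper for illustration; not part of the original post.
    """
    best = min(timeit.repeat(lambda: pattern.match(text), number=1, repeat=repeat))
    return best * 1e6


if __name__ == "__main__":
    # Short inputs only, so the demo stays quick even with the slow pattern.
    for s in ["my_hero2", "my_super_hero", "my_super hero"]:
        print("%-16s --> %.0f microseconds" % (s, time_match(SLOW, s)))
```

Lengthening the part of the string before the space should make the growth in the failing case visible.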
Thanks in advance for any answers and responses. :-) Paul
So what is the problem?
The problem is catastrophic backtracking. The regex engine tries a huge number of variations, and that takes a lot of time.
Let's try this with a pretty simple example: a_b d.
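The engine first matches a with [a-z,0-9]+ and then tries _??. Since that part is optional and lazy, the engine skips it for now, and with that it has finished one pass through ([a-z,0-9]+_??)+.
Now the engine tries to match [a-z,0-9], but there is a _ in the string, so it fails and needs to backtrack. It re-enters ([a-z,0-9]+_??)+ at the point where it skipped the _ last time, tries _?? again, and this time succeeds.
Now the engine exits ([a-z,0-9]+_??)+ again and tries to match [a-z,0-9], which succeeds. Then it tries to match the end-of-string $, which fails, so it backtracks and enters ([a-z,0-9]+_??)+ again.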
I hope you can see where this is going, as I am tired of writing it out and we haven't even reached the space character yet. Actually, any character not accepted by your regex, such as # or %, will cause this, not just whitespace. And this is a tiny example. With your long strings the engine has to do this hundreds and hundreds of times before it either matches the entire string or fails, hence the great amount of time.
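To put a rough number on "a whole lot of variations": inside a run of n accepted characters, the inner + and the outer + can divide the run between group iterations in 2^(n-1) ways, and in the worst case the engine works through a number of attempts on that order before giving up. The sketch below counts those divisions directly. It is a simplified model rather than a trace of the real engine (it ignores the extra choice at each _??), and count_splits is a hypothetical helper name.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_splits(n):
    """Count the ways to cut a run of n characters into consecutive non-empty pieces.

    Each piece stands for one iteration of the group ([a-z,0-9]+_??)
    consuming that many characters. Simplified model, for illustration only.
    """
    if n == 0:
        return 1  # nothing left to split: exactly one (empty) way
    # The first piece takes k characters; the remainder is split recursively.
    return sum(count_splits(n - k) for k in range(1, n + 1))


if __name__ == "__main__":
    for n in (1, 2, 5, 10, 12):
        print(n, count_splits(n))
```

The count doubles with every extra character, which is why a slightly longer failing string can take so much longer.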
Validating a long string containing whitespace is very, very slow.
Again, this is due to backtracking and the sheer number of variations.
How can I avoid this?
You can use this regex instead:
^([a-z0-9]_?)*[a-z0-9]$
This matches a letter or digit followed by an optional _, repeated zero or more times, and then requires a letter or digit at the end. The string therefore has to start and end with a letter or digit.
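A quick check of the suggested pattern against the examples from the question, as a sketch. The names FAST and is_valid are assumptions made for illustration; the last test string is one of the slow inputs from the question.

```python
import re

# Suggested replacement: every group iteration consumes exactly one letter
# or digit, so there is only one way to divide a run among iterations.
FAST = re.compile(r"^([a-z0-9]_?)*[a-z0-9]$")


def is_valid(s):
    """Return True if s passes the suggested pattern (hypothetical helper)."""
    return FAST.match(s) is not None


if __name__ == "__main__":
    for s in ["my_hero2", "_my_hreo2", "my hero2", "my_hero2_",
              "my_super_great_unbelievable_strong hero"]:
        print(repr(s), is_valid(s))
```

Only the first string should be reported as valid, and the long string with the space should be rejected without a noticeable delay.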
Is my pattern wrong? What is the reason for this behavior?
Your expression is not wrong, but it is highly inefficient.
    ([a-z,0-9]+_??)+[a-z,0-9]$
              ^ ^  ^
See those? They are a lot of trouble: the lazy ?? together with the two +, one inside the group and one outside. Hell :)
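Another way to remove the ambiguity, offered here as an alternative sketch and not as part of the answer above, is to "unroll" the pattern so that a mandatory _ is the only thing that can separate two runs of letters and digits. The names SUGGESTED, UNROLLED, SAMPLES, and agree are assumptions made for illustration. The check below only compares the two patterns on a handful of samples; it is not a proof that they are equivalent.

```python
import re

# Pattern suggested in the answer.
SUGGESTED = re.compile(r"^([a-z0-9]_?)*[a-z0-9]$")
# Alternative formulation (not from the answer): runs of letters/digits
# separated by single underscores. The mandatory "_" between runs means
# a run cannot be split between iterations in more than one way.
UNROLLED = re.compile(r"^[a-z0-9]+(?:_[a-z0-9]+)*$")

SAMPLES = ["my_hero2", "_my_hreo2", "my hero2", "my_hero2_", "my__hero",
           "a", "", "my_super_great_unbelievable_strong hero"]


def agree(samples):
    """Return True if both patterns accept or reject every sample alike."""
    return all(bool(SUGGESTED.match(s)) == bool(UNROLLED.match(s)) for s in samples)


if __name__ == "__main__":
    print(agree(SAMPLES))
```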