regex - What do we need Lookahead/Lookbehind Zero Width Assertions for?

I've just learned about these two concepts in more detail. I've been using regex for a while, and it seems I've never seen the need for these two zero-width assertions.

I'm pretty sure I'm wrong, but I do not see why these constructs are needed. Consider this example:

Match a 'q' which is not followed by a 'u'.

Two strings as input:

iraq quit

With a negative lookahead, the regex looks like this:

q(?!u)

Without it, it looks like this:

q[^u]

For the given input, both of these regexes give the same results (i.e. matching iraq but not quit) (tested with Perl). The same idea applies to lookbehinds.

Am I missing a crucial feature that makes these assertions more valuable than the classic syntax?

Why your test worked (and why it shouldn't)

The reason you were able to match iraq in your test might be that your string contained a \n at the end (for instance, if you read it from the shell). If you have a string that ends in q, then q[^u] cannot match, as others have said, because [^u] matches a non-u character - but the point is that there has to be a character.
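To make the difference concrete, here is a small sketch using Python's re module (the question used Perl; the behaviour of these two patterns should be the same there). The helper name found is only for illustration.

```python
import re

with_lookahead = re.compile(r"q(?!u)")  # q, then assert "no u here"; consumes nothing after q
with_class = re.compile(r"q[^u]")       # q, then one character that is not u

def found(pattern, text):
    """Return True if the pattern matches anywhere in text."""
    return pattern.search(text) is not None

print(found(with_lookahead, "iraq"))   # True: nothing follows the q, and that is fine
print(found(with_class, "iraq"))       # False: [^u] needs a character to consume
print(found(with_class, "iraq\n"))     # True: the trailing newline is a non-u character

# Even when both match, they match different text:
print(repr(with_lookahead.search("iraq quit").group()))  # 'q'
print(repr(with_class.search("iraq quit").group()))      # 'q '
```

The last two lines show the other half of the story: the character class consumes the character after the q, while the lookahead leaves it alone.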

What do we need lookarounds for?

Obviously in the above case, lookaheads are not vital. You could work around this by using q(?:[^u]|$). So we match only if q is followed by a non-u character or the end of the string. There are much more sophisticated uses for lookaheads though, which become a real pain if you do them without lookaheads.
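A quick sketch of that workaround in Python, mainly to show that it accepts a q at the end of the string but still consumes the following character when there is one:

```python
import re

# q followed by either one non-u character or the end of the string
workaround = re.compile(r"q(?:[^u]|$)")

print(repr(workaround.search("iraq").group()))       # 'q'  (matched via the $ branch)
print(workaround.search("quit"))                     # None (q is followed by u)
print(repr(workaround.search("iraq quit").group()))  # 'q ' (the space is part of the match)
```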

This answer tries to give an overview of some important standard situations which are best solved with lookarounds.

Let's start by looking at quoted strings. The usual way to match them is with something like "[^"]*" (not with ".*?"). After the opening ", we simply repeat as many non-quote characters as possible and then match the closing quote. Again, a negated character class is perfectly fine. But there are cases where a negated character class doesn't cut it:

Multi-character delimiters

Now what if we don't have double quotes to delimit our substring of interest, but a multi-character delimiter? For instance, we are looking for ---sometext---, where single and double - are allowed within sometext. Now you can't just use [^-]*, because that would forbid single -. The standard technique is to use a negative lookahead at every position, and only consume the next character if it is not the beginning of ---. Like so:

---(?:(?!---).)*---

This might look a bit complicated if you haven't seen it before, but it's certainly nicer (and usually more efficient) than the alternatives.
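As a rough illustration in Python (the sample text is made up), the pattern above picks out both delimited pieces, including the one that contains single and double dashes:

```python
import re

# Consume one character at a time, but only while we are not standing at "---".
dashed = re.compile(r"---(?:(?!---).)*---")

sample = "x ---some-text--with-dashes--- y ---second---"
pieces = dashed.findall(sample)
print(pieces)  # ['---some-text--with-dashes---', '---second---']
```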

Different delimiters

You get a similar case where your delimiter is only one character, but could be one of two (or more) different characters. For instance, say in our initial example we want to allow for both single- and double-quoted strings. Of course, you could use '[^']*'|"[^"]*", but it would be nice to treat both cases without an alternative. The surrounding quotes can easily be taken care of with a backreference: (['"])[^'"]*\1. This makes sure that the match ends with the same character it began with. But now we're too restrictive - we'd like to allow " in single-quoted and ' in double-quoted strings. Something like [^\1] doesn't work, because a backreference will in general contain more than one character. So we use the same technique as above:

(['"])(?:(?!\1).)*\1

That is, after the opening quote, before consuming each character we make sure that it is not the same as the opening character. We do that as long as possible, and then match the opening character again.
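Here is a minimal Python sketch of the backreference version; the sample sentence is invented for the demonstration. Each match is allowed to contain the other kind of quote:

```python
import re

# Group 1 remembers which quote opened the string; (?!\1) stops at the same quote.
quoted = re.compile(r"""(['"])(?:(?!\1).)*\1""")

sample = "He said \"it's fine\" and 'she said \"ok\"'."
found_strings = [m.group() for m in quoted.finditer(sample)]
for s in found_strings:
    print(s)
# "it's fine"
# 'she said "ok"'
```

finditer is used rather than findall because, with a capturing group in the pattern, findall would return only the captured quote character.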

Overlapping matches

This is a (completely different) problem which generally cannot be solved at all without lookarounds. If you search for a match globally (or want to regex-replace something globally), you may have noticed that matches can never overlap. I.e. if you search for ... in abcdefghi you get abc, def, ghi and not bcd, cde and so on. This can be a problem if you want to make sure that your match is preceded (or surrounded) by something else.

Say you have a CSV file like

aaa,111,bbb,222,333,ccc

and you want to extract the fields that are entirely numerical. For simplicity, I'll assume that there is no leading or trailing whitespace anywhere. Without lookarounds, we might go with capturing and try:

(?:^|,)(\d+)(?:,|$)

So we make sure that we have the start of a field (start of string or ,), then only digits, and then the end of a field (, or end of string). In between we capture the digits into group 1. Unfortunately, this will not give us 333 in the above example, because the , that precedes it was already part of the match ,222, - and matches cannot overlap. Lookarounds solve the problem:

(?<=^|,)\d+(?=,|$)

Or if you prefer double negation over alternation, this is equivalent to

(?<![^,])\d+(?![^,])

In addition to being able to get all matches, we get rid of the capturing, which can generally improve performance. (Thanks to Adrian Pronk for this example.)
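The CSV example can be reproduced with a short Python sketch. One caveat: Python's re module only supports fixed-width lookbehind, so the sketch uses the double-negation form rather than the (?<=^|,) variant.

```python
import re

line = "aaa,111,bbb,222,333,ccc"

# Capturing version: each match consumes its surrounding commas.
capturing = re.compile(r"(?:^|,)(\d+)(?:,|$)")
# Lookaround version: the commas are only inspected, never consumed.
lookaround = re.compile(r"(?<![^,])\d+(?![^,])")

with_capturing = capturing.findall(line)
with_lookaround = lookaround.findall(line)

print(with_capturing)   # ['111', '222']  (333 is lost: its leading comma was already used)
print(with_lookaround)  # ['111', '222', '333']
```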

Multiple independent conditions

Another classic example of when to use lookarounds (in particular lookaheads) is when we want to check multiple conditions on an input at the same time. Say we want to write a single regex that makes sure our input contains a digit, a lower case letter, an upper case letter, a character that is none of those, and no whitespace (say, for password security). Without lookarounds you'd have to consider all permutations of digit, lower case/upper case letter, and symbol. Like:

\S*\d\S*[a-z]\S*[A-Z]\S*[^0-9a-zA-Z]\S*|\S*\d\S*[A-Z]\S*[a-z]\S*[^0-9a-zA-Z]\S*|...

Those are only 2 of the 24 necessary permutations. If you also want to ensure a minimum string length in the same regex, you'd have to distribute it over all possible combinations of the \S* - it simply becomes impossible to do in a single regex.

Lookahead to the rescue! We can simply use several lookaheads at the beginning of the string to check all of these conditions:

^(?=.*\d)(?=.*[a-z])(?=.*[A-Z])(?=.*[^0-9a-zA-Z])(?!.*\s)

Because lookaheads don't actually consume anything, after checking each condition the engine resets to the beginning of the string and can start looking at the next one. If we wanted to add a minimum string length (say 8), we could simply append (?=.{8}). Much simpler, much more readable, much more maintainable.
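A small Python sketch of the lookahead-based check, including the minimum length of 8. The name single_regex_check and the sample passwords are made up for the example:

```python
import re

single_regex_check = re.compile(
    r"^(?=.*\d)"            # at least one digit
    r"(?=.*[a-z])"          # at least one lower case letter
    r"(?=.*[A-Z])"          # at least one upper case letter
    r"(?=.*[^0-9a-zA-Z])"   # at least one character that is none of those
    r"(?!.*\s)"             # no whitespace anywhere
    r"(?=.{8})"             # at least 8 characters
)

for candidate in ["Abcdef1!", "abcdef1!", "Abc def1!", "Ab1!"]:
    ok = single_regex_check.match(candidate) is not None
    print(repr(candidate), ok)
# 'Abcdef1!' True
# 'abcdef1!' False
# 'Abc def1!' False
# 'Ab1!' False
```

Every lookahead starts from the same position (the beginning of the string), which is why the order of the conditions does not matter.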

Important note: this is not the best general approach to check these conditions in any real setting. If you are making the check programmatically, it's usually better to have one regex for each condition and check them separately - this lets you return a much more useful error message. However, the above is sometimes necessary, if you have some fixed framework that lets you do validation only by supplying a single regex. In addition, it's worth knowing the general technique, if you ever have independent criteria for a string to match.
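For comparison, the "one regex per condition" approach might look roughly like this in Python. The function name, the rule list and the message wording are all hypothetical; only the conditions themselves come from the example above.

```python
import re

# Hypothetical rule list: (pattern that must be found, message if it is missing).
REQUIRED = [
    (re.compile(r"\d"), "needs a digit"),
    (re.compile(r"[a-z]"), "needs a lower case letter"),
    (re.compile(r"[A-Z]"), "needs an upper case letter"),
    (re.compile(r"[^0-9a-zA-Z]"), "needs a character that is not a letter or digit"),
]

def password_problems(password, min_length=8):
    """Return a list of human-readable problems; an empty list means the password passes."""
    problems = [message for pattern, message in REQUIRED if not pattern.search(password)]
    if re.search(r"\s", password):
        problems.append("must not contain whitespace")
    if len(password) < min_length:
        problems.append(f"must be at least {min_length} characters long")
    return problems

print(password_problems("Abcdef1!"))  # []
print(password_problems("abcdef1!"))  # ['needs an upper case letter']
```

Each pattern is trivial on its own, and the caller can tell the user exactly which rule failed.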

I hope these examples give you a better idea of why people like to use lookarounds. There are a lot more applications (another classic is inserting commas into numbers), but it's important to realise that there is a difference between (?!u) and [^u] and that there are cases where negated character classes are not powerful enough at all.
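For the curious, the comma-insertion classic mentioned above can be sketched like this in Python. This is one common formulation, not the only one, and it assumes plain integers (it would also insert commas into the fractional part of a decimal number):

```python
import re

# Match the empty position that has a digit before it and a multiple of
# three digits (and then no further digit) after it.
thousands = re.compile(r"(?<=\d)(?=(?:\d{3})+(?!\d))")

def add_commas(text):
    """Insert thousands separators into runs of digits in text."""
    return thousands.sub(",", text)

print(add_commas("1234567"))  # 1,234,567
print(add_commas("123"))      # 123
```

The whole pattern is zero-width: nothing is consumed, the comma is simply inserted at each matching position.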
