Regex runtimes:<br><div style="margin-left: 40px;">A quick introduction:<br></div><div style="margin-left: 80px;"><a href="https://secure.wikimedia.org/wikipedia/en/wiki/Regular_expressions#Implementations_and_running_times">https://secure.wikimedia.org/wikipedia/en/wiki/Regular_expressions#Implementations_and_running_times</a><br>

</div><div style="margin-left: 40px;"><br>A little more depth, focusing on larger-set run times:   (From 2007, graphs may not be indicative of performance of current versions of regex engines)<br><div style="margin-left: 40px;">

<a href="http://swtch.com/~rsc/regexp/regexp1.html">http://swtch.com/~rsc/regexp/regexp1.html</a><br></div></div><div style="margin-left: 40px;"><br></div>The idea here is that runtimes (Big-O notation) for some regex engines increase exponentially with regex pattern length, while others grow logarithmically.  Since we have (or anticipate having) long-ish regexes, it might behoove us to know what runtime graph of our regex engine is.  (Long-ish regex example from UOregon.xml -  http://(oregoncis|blackboard|duckweb|hr2|ir|pcs|budgetmotel|brp|libweb|lcb|odt|scholarsbank|wiki|systems\.cs|www2\.lcb|www\.(cs|law|lcb))\.uoregon\.edu/   )<br>

<br>To find this out, we would need a test harness around some of the matching code....   Or ask one of the original developers.  I'm hoping that someone with a good grasp of HTTPSEverywhere's internals sees this discussion and can put the matter to rest.  <br>

<br><br><div class="gmail_quote">On Wed, Dec 22, 2010 at 9:15 PM, Drake, Brian <span dir="ltr"><<a href="mailto:brian2@drakefamily.tk">brian2@drakefamily.tk</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">

Just to be clear about where these details lie, I think HTTPS Everywhere just uses the regex engine built into Firefox, but that still leaves the question about what calls it makes (see my previous post).<br><br>I’ve seen that sort of analysis before, but not for pattern-matching. Can you elaborate? I don’t see how it’s possible (but then I have no idea how regex engines actually work when optimisations are included).<br>


<br><div class="gmail_quote"><div class="im">On Wed, Dec 22, 2010 at 2101 (UTC-8), Whizz Mo <span dir="ltr"><<a href="mailto:https@whizzmo.com" target="_blank">https@whizzmo.com</a>></span> wrote:<br></div><blockquote class="gmail_quote" style="margin:0pt 0pt 0pt 0.8ex;border-left:1px solid rgb(204, 204, 204);padding-left:1ex">

<div class="im">

Disclosure:  I haven't the faintest notion about the exact algorithmic details of the NoScript/HTTPSEverywhere engine.  <br><br>In broad strokes, a naive pattern matcher coupled with a simple tree structure <i>might </i>be able to check matches in logarithmic (not linear) time, and <i>should </i>stop processing string characters as soon as a determinate match has been made (i.e. leaf node has been reached).  Throwing regex into the mix adds considerable power and flexibility, though usually at some expense to execution time.  I am curious as to the scale of the tradeoff (bytes of memory vs execution cycles).  With Mozilla making concerted efforts to bring Firefox's speed into the running with Chrome, it might be worth a look.  <br>


<br>For a marginally-relevant, holiday-themed illustration, try XKCD:   <a href="https://www.xkcd.com/835/" target="_blank">https://www.xkcd.com/835/</a></div><div><div></div><div class="h5"><div><div></div><div><br><br>

<br><div class="gmail_quote">

On Wed, Dec 22, 2010 at 2030 (UTC-8), Drake, Brian <span dir="ltr"><<a href="mailto:brian2@drakefamily.tk" target="_blank">brian2@drakefamily.tk</a>></span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0pt 0pt 0pt 0.8ex;border-left:1px solid rgb(204, 204, 204);padding-left:1ex">How does this work? Will it continue to check every rule, even if it has already found a match? I assume that it’s just calling String.replace(), so the answer is yes. In that case, I don’t see how the naive patterns can possibly be faster.<br>


<br>Still, someone should look into it.<br><br><div class="gmail_quote"><div>On Wed, Dec 22, 2010 at 2016 (UTC-8), Whizz Mo <span dir="ltr"><<a href="mailto:https@whizzmo.com" target="_blank">https@whizzmo.com</a>></span> wrote:<br>


</div><blockquote class="gmail_quote" style="margin:0pt 0pt 0pt 0.8ex;border-left:1px solid rgb(204, 204, 204);padding-left:1ex"><div>

At the risk of bringing up space-time tradeoff, has anyone examined the engine's regex vs naive pattern matching speed?  <br><br></div><div><div class="gmail_quote"><div><div></div><div>On Wed, Dec 22, 2010 at 1841 (UTC-8), Drake, Brian <span dir="ltr"><<a href="mailto:brian2@drakefamily.tk" target="_blank">brian2@drakefamily.tk</a>></span> wrote:<br>


</div></div><blockquote class="gmail_quote" style="margin:0pt 0pt 0pt 0.8ex;border-left:1px solid rgb(204, 204, 204);padding-left:1ex"><div><div></div><div>That sounds good to me. I use far more complex “regexy” patterns than that. That’s what regexp is for, isn’t it?<br>


<br>

<div class="gmail_quote"><div>On Wed, Dec 22, 2010 at 1041 (UTC-8), Osama Khalid <span dir="ltr"><<a href="mailto:osamak@gnu.org" target="_blank">osamak@gnu.org</a>></span> wrote:<br>

</div><blockquote class="gmail_quote" style="margin:0pt 0pt 0pt 0.8ex;border-left:1px solid rgb(204, 204, 204);padding-left:1ex"><div>Hello,<br>

<br>

Most rules use "(www\.)?" to match URLs with and without the 'www.'<br>

prefix but few (~57 vs. 394) have two different patterns for each<br>

case.<br>

<br>

I suggest changing the few patterns to the regexy way.<br>

<br>

Should I send a patch?<br>

</div><font color="#888888"><br>

--Osama Khalid[snip]</font><br></blockquote></div><br>--<br>Brian Drake<br clear="all">[snip]</div></div></blockquote></div></div></blockquote></div><div><br>--<br>Brian Drake<br>[snip]</div></blockquote></div></div></div>


</div></div></blockquote></div><div><div></div><div class="h5"><br>--<br>Brian Drake<br><br>Alternate (slightly less secure) e-mail: <a href="mailto:brian@drakefamily.tk" target="_blank">brian@drakefamily.tk</a><br>Alternate (old) e-mail: <a href="mailto:brianriab@gmail.com" target="_blank">brianriab@gmail.com</a><br>


<br>Facebook profile: <a href="https://ssl.facebook.com/profile.php?id=100001206642672" target="_blank">Profile ID 100001206642672</a><br>

Twitter username: <a href="https://twitter.com/BrianJDrake" target="_blank">BrianJDrake</a><br>Wikimedia project username: <a href="https://secure.wikimedia.org/wikipedia/meta/wiki/User:Brianjd" target="_blank">Brianjd</a> (been inactive for a while)<br>


<br>All content created by me Copyright © 2010 Brian Drake. All rights reserved.<br>

</div></div></blockquote></div><br>