[HTTPS-Everywhere] The standard pattern for 'www.'

Whizz Mo https at whizzmo.com
Wed Dec 22 23:00:09 PST 2010


Regex runtimes:
A quick introduction:
https://secure.wikimedia.org/wikipedia/en/wiki/Regular_expressions#Implementations_and_running_times

A little more depth, focusing on larger-set run times:   (From 2007, graphs
may not be indicative of performance of current versions of regex engines)
http://swtch.com/~rsc/regexp/regexp1.html

The idea here is that runtimes (Big-O notation) for some regex engines
increase exponentially with regex pattern length, while others grow
logarithmically.  Since we have (or anticipate having) long-ish regexes, it
might behoove us to know what runtime graph of our regex engine is.
(Long-ish regex example from UOregon.xml -
http://(oregoncis|blackboard|duckweb|hr2|ir|pcs|budgetmotel|brp|libweb|lcb|odt|scholarsbank|wiki|systems\.cs|www2\.lcb|www\.(cs|law|lcb))\.uoregon\.edu/
)

To find this out, we would need a test harness around some of the matching
code....   Or ask one of the original developers.  I'm hoping that someone
with a good grasp of HTTPSEverywhere's internals sees this discussion and
can put the matter to rest.


On Wed, Dec 22, 2010 at 9:15 PM, Drake, Brian <brian2 at drakefamily.tk> wrote:

> Just to be clear about where these details lie, I think HTTPS Everywhere
> just uses the regex engine built into Firefox, but that still leaves the
> question about what calls it makes (see my previous post).
>
> I’ve seen that sort of analysis before, but not for pattern-matching. Can
> you elaborate? I don’t see how it’s possible (but then I have no idea how
> regex engines actually work when optimisations are included).
>
> On Wed, Dec 22, 2010 at 2101 (UTC-8), Whizz Mo <https at whizzmo.com> wrote:
>
>> Disclosure:  I haven't the faintest notion about the exact algorithmic
>> details of the NoScript/HTTPSEverywhere engine.
>>
>> In broad strokes, a naive pattern matcher coupled with a simple tree
>> structure *might *be able to check matches in logarithmic (not linear)
>> time, and *should *stop processing string characters as soon as a
>> determinate match has been made (i.e. leaf node has been reached).  Throwing
>> regex into the mix adds considerable power and flexibility, though usually
>> at some expense to execution time.  I am curious as to the scale of the
>> tradeoff (bytes of memory vs execution cycles).  With Mozilla making
>> concerted efforts to bring Firefox's speed into the running with Chrome, it
>> might be worth a look.
>>
>> For a marginally-relevant, holiday-themed illustration, try XKCD:
>> https://www.xkcd.com/835/
>>
>>
>>
>> On Wed, Dec 22, 2010 at 2030 (UTC-8), Drake, Brian <brian2 at drakefamily.tk
>> > wrote:
>>
>>> How does this work? Will it continue to check every rule, even if it has
>>> already found a match? I assume that it’s just calling String.replace(), so
>>> the answer is yes. In that case, I don’t see how the naive patterns can
>>> possibly be faster.
>>>
>>> Still, someone should look into it.
>>>
>>> On Wed, Dec 22, 2010 at 2016 (UTC-8), Whizz Mo <https at whizzmo.com>wrote:
>>>
>>>> At the risk of bringing up space-time tradeoff, has anyone examined the
>>>> engine's regex vs naive pattern matching speed?
>>>>
>>>> On Wed, Dec 22, 2010 at 1841 (UTC-8), Drake, Brian <
>>>> brian2 at drakefamily.tk> wrote:
>>>>
>>>>> That sounds good to me. I use far more complex “regexy” patterns than
>>>>> that. That’s what regexp is for, isn’t it?
>>>>>
>>>>> On Wed, Dec 22, 2010 at 1041 (UTC-8), Osama Khalid <osamak at gnu.org>wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> Most rules use "(www\.)?" to match URLs with and without the 'www.'
>>>>>> prefix but few (~57 vs. 394) have two different patterns for each
>>>>>> case.
>>>>>>
>>>>>> I suggest changing the few patterns to the regexy way.
>>>>>>
>>>>>> Should I send a patch?
>>>>>>
>>>>>> --Osama Khalid[snip]
>>>>>>
>>>>>
>>>>> --
>>>>> Brian Drake
>>>>> [snip]
>>>>>
>>>>
>>> --
>>> Brian Drake
>>> [snip]
>>>
>>
> --
> Brian Drake
>
> Alternate (slightly less secure) e-mail: brian at drakefamily.tk
> Alternate (old) e-mail: brianriab at gmail.com
>
> Facebook profile: Profile ID 100001206642672<https://ssl.facebook.com/profile.php?id=100001206642672>
> Twitter username: BrianJDrake <https://twitter.com/BrianJDrake>
> Wikimedia project username: Brianjd<https://secure.wikimedia.org/wikipedia/meta/wiki/User:Brianjd>(been inactive for a while)
>
> All content created by me Copyright © 2010 Brian Drake. All rights
> reserved.
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.eff.org/pipermail/https-everywhere/attachments/20101222/757feadb/attachment.html>


More information about the HTTPS-everywhere mailing list