[HTTPS-E Rulesets] Wildcard confusion and data structure question

Thu Aug 11 15:51:26 PDT 2011

On Thu, Aug 11, 2011 at 02:58:15PM -0700, Adam Fisk wrote:
> Clearly
> 
> <target host="google.*" />
> 
> covers a superset of
> 
> <target host="google.co.*" />

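Unlike other portions of the ruleset, these are not regular expressions.  A
* matches a run of characters that does not include ".".  As a special case,
a * at the leftmost end also matches strings that include ".".  So google.*
matches google.com but not google.co.uk, which is why both targets appear.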
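
To make the wildcard rule concrete, here is a rough sketch of a matcher.  It
is not code from HTTPS Everywhere; targetMatches is a hypothetical helper,
and it assumes that a * always stands for a whole dot-separated label and
matches at least one character.

```javascript
// Hypothetical helper illustrating the target-host wildcard rule.
// Assumptions: "*" occupies an entire label and matches one or more characters.
function targetMatches(target, host) {
  const parts = target.split(".").map((label, i) => {
    if (label === "*") {
      // A leftmost "*" may span dots; any other "*" may not.
      return i === 0 ? ".+" : "[^.]+";
    }
    // Escape regex metacharacters in literal labels.
    return label.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  });
  const re = new RegExp("^" + parts.join("\\.") + "$");
  return re.test(host);
}
```

Under these assumptions, targetMatches("google.*", "google.co.uk") is false
while targetMatches("google.co.*", "google.co.uk") is true, so neither
target makes the other redundant.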

> 
> I don't understand why the rule set includes both.
> 
> I'm also curious about what data structure you're using for those
> wildcard searches at the end. Are you using prefix matching with a
> Trie? How are you able to do this efficiently without iterating and
> regexing every target for every request?

Hi Adam:

The data structure is simply a JavaScript object (i.e., a hash table).  Each
target host XML element creates an entry in that table.  When a request is
made to www.any.thing.com, we look up each of the following keys in the
table:

www.any.thing.com
*.any.thing.com
www.*.thing.com
www.any.*.com
www.any.thing.*
*.thing.com
*.com
*

Any ruleset that registered itself with one of these target hosts will be
considered "potentially applicable", and the regular expressions in its rules
will be applied to this request.
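
The lookup described above could be sketched roughly as follows.  This is an
illustration only, not the actual HTTPSRules.js code: candidateTargets,
registerTarget and lookupRulesets are hypothetical names, and rulesets are
represented by plain strings.

```javascript
// Build the list of table keys to try for a host, in the order listed above.
function candidateTargets(host) {
  const labels = host.split(".");
  const candidates = [host];
  // Replace each label in turn with "*": *.any.thing.com, www.*.thing.com, ...
  for (let i = 0; i < labels.length; i++) {
    const copy = labels.slice();
    copy[i] = "*";
    candidates.push(copy.join("."));
  }
  // Drop leading labels behind a leftmost "*": *.thing.com, *.com, *
  for (let i = 2; i <= labels.length; i++) {
    candidates.push(["*"].concat(labels.slice(i)).join("."));
  }
  return candidates;
}

// Record that a ruleset (here just a name) declared the given target host.
function registerTarget(table, target, ruleset) {
  if (!(target in table)) {
    table[target] = [];
  }
  table[target].push(ruleset);
}

// Collect every ruleset registered under any candidate key, without duplicates.
function lookupRulesets(table, host) {
  const found = [];
  for (const key of candidateTargets(host)) {
    for (const ruleset of table[key] || []) {
      if (!found.includes(ruleset)) {
        found.push(ruleset);
      }
    }
  }
  return found;
}
```

The sketch assumes the table is created with Object.create(null) so that
host names cannot collide with inherited object properties.  The number of
hash lookups grows with the number of labels in the host name, not with the
number of rulesets, so nothing has to be iterated or regex-matched per
target.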

The implementation is in the potentiallyApplicableRulesets function in this file:

https://gitweb.torproject.org/https-everywhere.git/blob/HEAD:/src/chrome/content/code/HTTPSRules.js

-- 
Peter Eckersley                            pde at eff.org
Technology Projects Director      Tel  +1 415 436 9333 x131
Electronic Frontier Foundation    Fax  +1 415 436 9993