Thursday, 13 November 2014

About Proxy PAC Systems and RegEx

Different proxy systems

When I browse the web, a few things are important to me:
  1. My traffic should be encrypted
  2. No web server should see my real IP
  3. I don't want any company to collect data about my browsing behaviour
  4. I don't want to be locked out of content that is only available in certain countries
Today nearly every application has a set of options for routing its traffic through a proxy, which makes the setup very easy.



Most tools offer options for different proxy protocols, including SOCKS, HTTPS as well as HTTP.
Being able to set a single static proxy is a nice start, but what are we supposed to do if we want to use different proxies without switching them manually each time?

The answer is pretty easy: PAC (Proxy auto-config)
Proxy auto-config is a proxy mode which evaluates the site we are visiting and chooses a proxy depending on user-specified conditions.
The evaluation process is defined within a .pac file which we point our browser to. The code is written in JavaScript:

 function FindProxyForURL(url, host)  
 {  
      return "DIRECT";  
 }  



Every time our browser connects to a page, it calls FindProxyForURL with the URL as well as the host as parameters. Let's look at an example:
1. We are visiting http://www.youtube.com/watch?v=Mh_cLN66zHA

2. Our browser will call FindProxyForURL with the following parameters:
URL: http://www.youtube.com/watch?v=Mh_cLN66zHA
Host: www.youtube.com

3. The return value of FindProxyForURL defines how we connect to the page:
"DIRECT" -> make a direct connection
"PROXY 127.0.0.1:81" -> connect over the HTTP proxy at 127.0.0.1:81
"SOCKS5 127.0.0.1:1028" -> connect over the SOCKS5 proxy at 127.0.0.1:1028
etc.

4. Using the above script, we will connect directly no matter what the URL or host is. Later on I will show how to extend the script.
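
The call sequence above can be reproduced outside the browser. This is only a minimal sketch, assuming a JavaScript runtime that provides the standard URL class (for example Node.js). The helper name simulateBrowserCall is made up for illustration and is not part of PAC itself.

```javascript
// The minimal PAC function from above: always connect directly.
function FindProxyForURL(url, host) {
    return "DIRECT";
}

// Hypothetical helper: mimics what the browser does before a request.
// It extracts the host from the URL and passes both to FindProxyForURL.
function simulateBrowserCall(url) {
    var host = new URL(url).hostname;
    return { url: url, host: host, result: FindProxyForURL(url, host) };
}

var call = simulateBrowserCall("http://www.youtube.com/watch?v=Mh_cLN66zHA");
console.log(call.host);   // www.youtube.com
console.log(call.result); // DIRECT
```

A real browser does this internally; the helper just makes the two parameters visible.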

Creating an extended pac

If I visit YouTube, my traffic is routed over an American server without any encryption, since videos aren't blocked for copyright reasons over there. Encryption isn't important here since the information being transferred to YouTube isn't sensitive, and it would only slow down the connection.

For every other kind of connection I use an encrypted tunnel to a server located in the Netherlands, since their internet policy is very tolerant.

1. Recognise a url which belongs to YouTube:

We could check whether the URL contains "youtube", but the result wouldn't be very accurate:
https://www.google.de/?gws_rd=ssl#q=youtube would be recognised as YouTube, for example.
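
To make the problem concrete, here is a small sketch of such a substring check. The function name looksLikeYouTubeNaive is a hypothetical name chosen for this example.

```javascript
// Naive check: true whenever "youtube" appears anywhere in the URL.
function looksLikeYouTubeNaive(url) {
    return url.toLowerCase().indexOf("youtube") !== -1;
}

var realYouTube = "http://www.youtube.com/watch?v=Mh_cLN66zHA";
var googleSearch = "https://www.google.de/?gws_rd=ssl#q=youtube";

console.log(looksLikeYouTubeNaive(realYouTube));  // true
console.log(looksLikeYouTubeNaive(googleSearch)); // true, a false positive
```

Both URLs pass the check, although only the first one actually points to YouTube.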

Instead of just searching for a string we take advantage of regular expressions. A few useful links about RegEx:
http://en.wikipedia.org/wiki/Regular_expression
http://regexlib.com/CheatSheet.aspx
http://www.regexr.com/

Don't be intimidated at first glance. It's actually quite an approachable topic. To give you a little push in the right direction, we will define a pattern which finds YouTube URLs:

 ^https?:\/\/(www\.)?([^/]+\.)?youtube\.com  

^ represents the start of a string, so our string has to start with http.
A ? after a group in parentheses or after a single character means that the group or character is optional: it can occur once but doesn't have to.
Our string can begin with either:
  • http
  • https


Backslashes (\) are used to escape characters which would otherwise have a special meaning.
Backslash-slash (\/) represents a literal slash (/), for example.
So far, our matched string can be:

  • http://
  • https://
(www\.)? states that www. can occur once but doesn't have to. Our possibilities (shown for http, the same applies to https):
  • http://
  • http://www.
([^/]+\.)?:
As we learned previously, the ? means that the content of the parentheses can occur at most once.
[^/] represents any character except a slash (/).
The + after [^/] states that we need at least one, and can have arbitrarily many, characters which aren't slashes.
[^/]+ represents:
  • ...
  • a
  • aa
  • aaa
  • abc
  • ...
but it doesn't represent:
  • ...
  • /
  • a/
  • aa/
  • aaa/
  • ...
In conclusion: we can have at most one substring that consists of one or more characters which aren't slashes and that ends in a dot (.).

A few possible matches for "^https?:\/\/(www\.)?([^/]+\.)?":
  • http://www.aaa.
  • https://www.a.
  • http://aaaaaa.
  • https://aa.
The remaining part of our RegEx is just youtube\.com, which means that youtube.com has to follow. No special rules here. Simple as that.
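
The finished pattern can be tried out against a few URLs. This is just a quick sketch using a regex literal; the sample URLs, apart from the two that appear earlier in this post, are assumptions picked for illustration.

```javascript
// The pattern from above, written as a JavaScript regex literal.
var pattern = /^https?:\/\/(www\.)?([^/]+\.)?youtube\.com/;

var samples = [
    "http://youtube.com/",
    "https://www.youtube.com/watch?v=Mh_cLN66zHA",
    "http://m.youtube.com/",
    "https://www.google.de/?gws_rd=ssl#q=youtube"
];

// Collect one true/false result per sample URL and print them side by side.
var results = samples.map(function (url) {
    return pattern.test(url);
});

samples.forEach(function (url, index) {
    console.log(results[index] + "  " + url);
});
```

The first three URLs match; the Google search URL does not, because youtube.com never follows directly after the optional subdomain part.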

It might look a bit complicated at first glance, but you can work your way through this topic by trial and error, using the previously linked RegEx cheat sheet as well as the RegEx tester. Every rule I used in my expression is also explained on the sheet (and yes, I had never used RegEx before, so my expression might not be perfect).

Having explained a sample expression, I continue by defining two RegExp objects inside my PAC file:

 // Backslashes are doubled because the pattern is written as a string literal.  
 var youtubeRegEx = new RegExp("^https?:\\/\\/(www\\.)?([^/]+\\.)?youtube\\.com", "i");  
 var ytStreamRegEx = new RegExp("^http:\\/\\/[^/]+\\.googlevideo\\.com", "i");  

The second parameter "i" makes the expression case insensitive (a = A).
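
One detail is worth checking when a pattern is passed to new RegExp() as a string: a single backslash before a dot is swallowed by the string literal, so the dot ends up unescaped and matches any character. The following sketch uses deliberately shortened patterns (not the full ones from this post) to show the difference, along with the effect of the "i" flag.

```javascript
// In a string literal "\." is just ".", so this dot matches ANY character.
var loose = new RegExp("^https?://youtube\.com", "i");

// Doubling the backslash keeps "\." in the pattern, so only a real dot matches.
var strict = new RegExp("^https?://youtube\\.com", "i");

console.log(loose.test("http://youtubeXcom/"));  // true
console.log(strict.test("http://youtubeXcom/")); // false
console.log(strict.test("HTTP://YOUTUBE.COM/")); // true, because of the "i" flag
```

Writing the pattern as a regex literal (/.../i) avoids the double escaping altogether.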

2. Extend FindProxyForURL(url, host)


 function FindProxyForURL(url, host)  
 {  
      if (youtubeRegEx.test(url) || ytStreamRegEx.test(url))  
      {  
           alert("YT proxy for URL: " + url);  
           return "SOCKS5 X.X.X.X:1080";  
      }  
      return "SOCKS5 127.0.0.1:5555";  
 }  

test() is a method of RegExp objects that takes a string parameter. If the expression matches somewhere inside the given string (in our case the URL), the method returns true.
If the current URL belongs to YouTube, our browser will use the SOCKS5 proxy at X.X.X.X:1080 (the American proxy server).
Otherwise it will use my local SSL-Socks5 proxy (replaceable with DIRECT for a direct connection).
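
Both branches can be exercised without a browser. The sketch below assumes a plain JavaScript runtime such as Node.js, where alert() does not exist, so a stand-in is defined that simply records the messages. The alertLog array and the Wikipedia test URL are assumptions for this example; X.X.X.X remains the placeholder from above.

```javascript
var youtubeRegEx = /^https?:\/\/(www\.)?([^/]+\.)?youtube\.com/i;
var ytStreamRegEx = /^http:\/\/[^/]+\.googlevideo\.com/i;

// Stand-in for the browser's alert(): records messages instead of showing them.
var alertLog = [];
function alert(message) {
    alertLog.push(message);
}

function FindProxyForURL(url, host) {
    if (youtubeRegEx.test(url) || ytStreamRegEx.test(url)) {
        alert("YT proxy for URL: " + url);
        return "SOCKS5 X.X.X.X:1080";
    }
    return "SOCKS5 127.0.0.1:5555";
}

var ytResult = FindProxyForURL("http://www.youtube.com/watch?v=Mh_cLN66zHA", "www.youtube.com");
var otherResult = FindProxyForURL("http://en.wikipedia.org/wiki/Regular_expression", "en.wikipedia.org");

console.log(ytResult);    // SOCKS5 X.X.X.X:1080
console.log(otherResult); // SOCKS5 127.0.0.1:5555
console.log(alertLog);    // one entry, for the YouTube URL only
```

Running a few known URLs through the function like this catches pattern mistakes before the file is loaded into the browser.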

3. Verify our pac-script

Debugging a PAC script can be very annoying and isn't even possible in every browser. For SRWare Iron (or rather Chrome) you can do the following:

If you scroll up, you will see that we call alert() inside the if block. This will give us some debug information:


Chromium-based browsers will change the system proxy settings by default. You should get a proxy extension like "Proxy SwitchySharp" or "Proxy Helper" so the changes are only applied to Chrome / SRWare Iron.