2013-05-03

How to scrape an obfuscated site? (2)



## Problem

How to scrape an obfuscated site (such as `spys.ru`).
The site tries to stop scrapers by using JavaScript variables that are generated randomly.
I don't care what math it uses; I just translated the JavaScript to Python word by word.
What I learned is that I should refactor the obfuscated JavaScript code first.

## Solution

### JavaScript code

    eval(
        function(p, x){
        
            var r = o = 60;
            var s = {};
            x = x.split('^');
        
            function y(c){
                return (c < r ? '' : y(parseInt(c / r))) + ((c = c % r) > 35 ? String.fromCharCode(c + 29) : c.toString(36));
            };
        
            while(o--){
                s[y(o)] = x[o] || y(o);
            }
        
            return p.replace(new RegExp('\\b\\w+\\b','g'), function(y){return s[y]});
        }(
            'i=D^C;s=B^E;d=F^A;b=3;m=H^G;r=7;n=8;p=9;c=4;e=5;q=J^y;o=u^x;h=1;l=z^w;a=v^I;j=0;k=2;f=S^V;t=6;g=R^Q;K=j^l;M=h^i;U=k^g;P=b^a;O=c^d;N=e^m;L=t^s;X=r^q;W=n^o;T=p^f;',
            '^^^^^^^^^^TwoTwoSeven^Six^Two^Seven2One^Eight^ThreeSixZero^SevenEightSix^Seven^Eight9Five^Five^One^NineSevenNine^Four3Three^Three^Nine5Four^Four^EightOneTwo^Nine^SevenNineEight^Zero^6940^2561^80^8085^88^5703^8000^9276^1337^10079^9090^3313^8909^4852^808^2581^SixFourOneNine^Four2FourFive^TwoEightSixSix^ThreeSevenFiveThree^NineOneNineEight^Zero0ThreeFour^1080^9943^5391^OneTwoTwoOne^FiveOneZeroTwo^8118^Nine1EightSeven^One3SevenZero'
        )
    )
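
The helper `y` turns an index into the token used in the packed string: digits first, then lowercase letters, then uppercase letters starting at index 36. A minimal Python sketch of that mapping, assuming only the 60 indices used here (the name `token` is made up for illustration and is not part of the site's code):

```python
import string

def token(c):
    # Hypothetical stand-in for the packer's y(c), valid for 0 <= c < 60.
    if c > 35:
        return chr(c + 29)  # 36 -> 'A', 59 -> 'X'
    return (string.digits + string.ascii_lowercase)[c]

# The 60 tokens, in the same order as the '^'-separated word list.
tokens = [token(c) for c in range(60)]
print(''.join(tokens))
```

Token number `n` is replaced by word number `n` of the second argument, or left as it is when that word is empty.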

### Python code

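    #!/usr/bin/env python
    # crack
    # by Kev++@2013-05-03T18:35:06

    import re
    import string
    from functools import reduce  # a builtin on Python 2, must be imported on Python 3
    from pprint import pprint

    def num_to_str(x, r, tbl=string.digits + string.ascii_lowercase):
        # Render the non-negative integer x in base r, like JavaScript's toString(r).
        return ((x == 0) and tbl[0]) or (num_to_str(x // r, r, tbl).lstrip(tbl[0]) + tbl[x % r])

    def build_lookup_table(p, x):

        r = o = 60
        s = {}
        x = x.split('^')

        def y(c):
            # Token for index c: 0-9, a-z, then A-X for 36..59.
            return ('' if c < r else y(c // r)) + (chr(c % r + 29) if c % r > 35 else num_to_str(c % r, 36))

        # Map each token to its word, or to itself when the word is empty.
        for i in range(o):
            s[y(i)] = x[i] or y(i)

        # Substitute the words back into the packed source.
        p = re.sub(r'\b\w+\b', lambda m: s[m.group(0)], p)

        tbl = dict()

        # Evaluate each "name=a^b" assignment in order; an operand is either
        # an integer literal or a name assigned earlier.
        for i in p.strip(';').split(';'):
            k, v = i.split('=')
            tbl[k] = reduce(lambda x, y: x ^ y, [int(tbl.get(j, j)) for j in v.split('^')])

        return tbl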
    
    pprint(build_lookup_table(
        'i=D^C;s=B^E;d=F^A;b=3;m=H^G;r=7;n=8;p=9;c=4;e=5;q=J^y;o=u^x;h=1;l=z^w;a=v^I;j=0;k=2;f=S^V;t=6;g=R^Q;K=j^l;M=h^i;U=k^g;P=b^a;O=c^d;N=e^m;L=t^s;X=r^q;W=n^o;T=p^f;',
        '^^^^^^^^^^TwoTwoSeven^Six^Two^Seven2One^Eight^ThreeSixZero^SevenEightSix^Seven^Eight9Five^Five^One^NineSevenNine^Four3Three^Three^Nine5Four^Four^EightOneTwo^Nine^SevenNineEight^Zero^6940^2561^80^8085^88^5703^8000^9276^1337^10079^9090^3313^8909^4852^808^2581^SixFourOneNine^Four2FourFive^TwoEightSixSix^ThreeSevenFiveThree^NineOneNineEight^Zero0ThreeFour^1080^9943^5391^OneTwoTwoOne^FiveOneZeroTwo^8118^Nine1EightSeven^One3SevenZero'
    ))


## Result

    {'Eight': 5,
     'Eight9Five': 8806,
     'EightOneTwo': 2637,
     'Five': 0,
     'FiveOneZeroTwo': 8941,
     'Four': 9,
     'Four2FourFive': 1976,
     'Four3Three': 12345,
     'Nine': 7,
     'Nine1EightSeven': 1153,
     'Nine5Four': 1161,
     'NineOneNineEight': 5045,
     'NineSevenNine': 5655,
     'One': 2,
     'One3SevenZero': 2634,
     'OneTwoTwoOne': 2736,
     'Seven': 1,
     'Seven2One': 5041,
     'SevenEightSix': 8943,
     'SevenNineEight': 1982,
     'Six': 3,
     'SixFourOneNine': 5655,
     'Three': 8,
     'ThreeSevenFiveThree': 12348,
     'ThreeSixZero': 2745,
     'Two': 4,
     'TwoEightSixSix': 8807,
     'TwoTwoSeven': 2345,
     'Zero': 6,
     'Zero0ThreeFour': 2346}
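
One entry can be checked by hand. The first packed assignment is `i=D^C`; tokens `i`, `D` and `C` are indices 18, 39 and 38, which hold `Eight9Five`, `10079` and `1337` in the word list. A small sketch of that single step (the three-entry `words` dict is a hand-picked excerpt for illustration, not the full table):

```python
import re

# Excerpt of the token -> word mapping, taken from the word list above.
words = {'i': 'Eight9Five', 'D': '10079', 'C': '1337'}

packed = 'i=D^C'
unpacked = re.sub(r'\b\w+\b', lambda m: words[m.group(0)], packed)

# XOR the operands on the right-hand side together.
name, expr = unpacked.split('=')
value = 0
for operand in expr.split('^'):
    value ^= int(operand)

print(unpacked)     # Eight9Five=10079^1337
print(name, value)  # Eight9Five 8806
```

This agrees with the `'Eight9Five': 8806` entry in the table.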

## Links

- http://spys.ru/free-proxy-list/CN/