## Problem
How to scrape an obfuscated site (such as `spys.ru`).
This site tries to stop scrapers by using JavaScript variables that are generated randomly.
I don't care what math it uses; I just translate the JavaScript to Python word for word.
What I learned is that I should refactor the obfuscated JavaScript code first.
## Solution
### JavaScript code

    eval(
        function(p, x){
            var r = o = 60;
            var s = {};
            x = x.split('^');
            // Token for slot c: '0'-'9', 'a'-'z', then 'A'-'X' for 36..59.
            function y(c){
                return (c < r ? '' : y(parseInt(c / r))) + ((c = c % r) > 35 ? String.fromCharCode(c + 29) : c.toString(36));
            };
            while(o--){
                s[y(o)] = x[o] || y(o);
            }
            return p.replace(new RegExp('\\b\\w+\\b','g'), function(y){return s[y]});
        }(
            'i=D^C;s=B^E;d=F^A;b=3;m=H^G;r=7;n=8;p=9;c=4;e=5;q=J^y;o=u^x;h=1;l=z^w;a=v^I;j=0;k=2;f=S^V;t=6;g=R^Q;K=j^l;M=h^i;U=k^g;P=b^a;O=c^d;N=e^m;L=t^s;X=r^q;W=n^o;T=p^f;',
            '^^^^^^^^^^TwoTwoSeven^Six^Two^Seven2One^Eight^ThreeSixZero^SevenEightSix^Seven^Eight9Five^Five^One^NineSevenNine^Four3Three^Three^Nine5Four^Four^EightOneTwo^Nine^SevenNineEight^Zero^6940^2561^80^8085^88^5703^8000^9276^1337^10079^9090^3313^8909^4852^808^2581^SixFourOneNine^Four2FourFive^TwoEightSixSix^ThreeSevenFiveThree^NineOneNineEight^Zero0ThreeFour^1080^9943^5391^OneTwoTwoOne^FiveOneZeroTwo^8118^Nine1EightSeven^One3SevenZero'
        )
    )
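To see what the packed call hands to `eval` before any arithmetic happens, the substitution step can be run on its own. The sketch below is only an illustration: the helper names `token` and `unpack` are mine, and it assumes exactly 60 slots, as in the snippet above.

```python
import re
import string

# The two arguments passed to the packed function, copied from the page.
P = 'i=D^C;s=B^E;d=F^A;b=3;m=H^G;r=7;n=8;p=9;c=4;e=5;q=J^y;o=u^x;h=1;l=z^w;a=v^I;j=0;k=2;f=S^V;t=6;g=R^Q;K=j^l;M=h^i;U=k^g;P=b^a;O=c^d;N=e^m;L=t^s;X=r^q;W=n^o;T=p^f;'
X = '^^^^^^^^^^TwoTwoSeven^Six^Two^Seven2One^Eight^ThreeSixZero^SevenEightSix^Seven^Eight9Five^Five^One^NineSevenNine^Four3Three^Three^Nine5Four^Four^EightOneTwo^Nine^SevenNineEight^Zero^6940^2561^80^8085^88^5703^8000^9276^1337^10079^9090^3313^8909^4852^808^2581^SixFourOneNine^Four2FourFive^TwoEightSixSix^ThreeSevenFiveThree^NineOneNineEight^Zero0ThreeFour^1080^9943^5391^OneTwoTwoOne^FiveOneZeroTwo^8118^Nine1EightSeven^One3SevenZero'

def token(c):
    """Token for slot c (hypothetical helper): 0-9, a-z, then A-X for 36..59."""
    digits = string.digits + string.ascii_lowercase
    return chr(c + 29) if c > 35 else digits[c]

def unpack(p, x):
    """Replace every token in p with its word or number from x."""
    words = x.split('^')
    # Empty slots keep their own token, mirroring `x[o] || y(o)`.
    table = {token(i): (w or token(i)) for i, w in enumerate(words)}
    return re.sub(r'\b\w+\b', lambda m: table[m.group(0)], p)

unpacked = unpack(P, X)
print(unpacked.split(';')[:3])
# ['Eight9Five=10079^1337', 'SevenNineEight=9276^9090', 'Seven2One=3313^8000']
```

The unpacked string is a plain list of `name=a^b` assignments, which is much easier to reason about than the packed form.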
### Python code

    #!/usr/bin/env python
    # crack
    # by Kev++@2013-05-03T18:35:06
    import re
    import string
    from functools import reduce
    from pprint import pprint

    def num_to_str(x, r, tbl=string.digits + string.ascii_lowercase):
        # Render the non-negative integer x in base r using the digits in tbl.
        return ((x == 0) and tbl[0]) or (num_to_str(x // r, r, tbl).lstrip(tbl[0]) + tbl[x % r])

    def build_lookup_table(p, x):
        r = o = 60
        s = {}
        x = x.split('^')
        def y(c):
            # Same token scheme as the JavaScript y(): base-36 digits, then 'A'.. above 35.
            return ('' if c < r else y(c // r)) + (chr(c % r + 29) if c % r > 35 else num_to_str(c % r, 36))
        for i in range(o):
            s[y(i)] = x[i] or y(i)
        # Expand every token in p into its word or number.
        p = re.sub(r'\b\w+\b', lambda m: s[m.group(0)], p)
        tbl = dict()
        # Evaluate the assignments in order; later ones may refer to earlier names.
        for i in p.strip(';').split(';'):
            k, v = i.split('=')
            tbl[k] = reduce(lambda x, y: x ^ y, [int(tbl.get(j, j)) for j in v.split('^')])
        return tbl

    pprint(build_lookup_table(
        'i=D^C;s=B^E;d=F^A;b=3;m=H^G;r=7;n=8;p=9;c=4;e=5;q=J^y;o=u^x;h=1;l=z^w;a=v^I;j=0;k=2;f=S^V;t=6;g=R^Q;K=j^l;M=h^i;U=k^g;P=b^a;O=c^d;N=e^m;L=t^s;X=r^q;W=n^o;T=p^f;',
        '^^^^^^^^^^TwoTwoSeven^Six^Two^Seven2One^Eight^ThreeSixZero^SevenEightSix^Seven^Eight9Five^Five^One^NineSevenNine^Four3Three^Three^Nine5Four^Four^EightOneTwo^Nine^SevenNineEight^Zero^6940^2561^80^8085^88^5703^8000^9276^1337^10079^9090^3313^8909^4852^808^2581^SixFourOneNine^Four2FourFive^TwoEightSixSix^ThreeSevenFiveThree^NineOneNineEight^Zero0ThreeFour^1080^9943^5391^OneTwoTwoOne^FiveOneZeroTwo^8118^Nine1EightSeven^One3SevenZero'
    ))
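The evaluation loop is the part that is easiest to get wrong, so here it is in isolation on three assignments (`j=0`, `l=z^w`, `K=j^l`) written in their unpacked form. This is a minimal sketch, not a replacement for the script above; it assumes, as the script does, that every name is defined before it is used.

```python
from functools import reduce
from operator import xor

# Three assignments in unpacked form; the last one reuses the first two.
statements = [
    'Five=0',
    'NineSevenNine=5703^80',
    'SixFourOneNine=Five^NineSevenNine',
]

tbl = {}
for stmt in statements:
    name, expr = stmt.split('=')
    # An operand is either a number or a name that is already in tbl.
    operands = [int(tbl.get(tok, tok)) for tok in expr.split('^')]
    tbl[name] = reduce(xor, operands)

print(tbl)
# {'Five': 0, 'NineSevenNine': 5655, 'SixFourOneNine': 5655}
```

`tbl.get(tok, tok)` is what lets one expression mix literals and earlier variables: a known name resolves to its value, and anything else is passed to `int` unchanged.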
## Result

    {'Eight': 5,
     'Eight9Five': 8806,
     'EightOneTwo': 2637,
     'Five': 0,
     'FiveOneZeroTwo': 8941,
     'Four': 9,
     'Four2FourFive': 1976,
     'Four3Three': 12345,
     'Nine': 7,
     'Nine1EightSeven': 1153,
     'Nine5Four': 1161,
     'NineOneNineEight': 5045,
     'NineSevenNine': 5655,
     'One': 2,
     'One3SevenZero': 2634,
     'OneTwoTwoOne': 2736,
     'Seven': 1,
     'Seven2One': 5041,
     'SevenEightSix': 8943,
     'SevenNineEight': 1982,
     'Six': 3,
     'SixFourOneNine': 5655,
     'Three': 8,
     'ThreeSevenFiveThree': 12348,
     'ThreeSixZero': 2745,
     'Two': 4,
     'TwoEightSixSix': 8807,
     'TwoTwoSeven': 2345,
     'Zero': 6,
     'Zero0ThreeFour': 2346}
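The table only becomes useful once it is applied to the expressions the page builds a port from. This post does not show that markup, so the sketch below rests on an assumption: that each digit is written as one `(name^name)` pair and the pairs are joined with `+`. The expression string and the `decode_port` helper are made up for illustration.

```python
import re

# A few entries copied from the result table above.
tbl = {
    'Nine1EightSeven': 1153,
    'Nine5Four': 1161,
    'SixFourOneNine': 5655,
    'NineSevenNine': 5655,
}

# Hypothetical port expression; the real page markup may differ.
expr = ('(Nine1EightSeven^Nine5Four)+(SixFourOneNine^NineSevenNine)+'
        '(Nine1EightSeven^Nine5Four)+(SixFourOneNine^NineSevenNine)')

def decode_port(expr, tbl):
    """XOR each (a^b) pair into one digit, then join the digits (assumed format)."""
    digits = []
    for a, b in re.findall(r'\((\w+)\^(\w+)\)', expr):
        digits.append(str(tbl[a] ^ tbl[b]))
    return int(''.join(digits))

port = decode_port(expr, tbl)
print(port)  # 8080
```

If the real expressions use a different shape, only the regular expression in `decode_port` should need to change.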
## Links
- http://spys.ru/free-proxy-list/CN/
2013-05-03
How to scrape an obfuscated site? (Part 2)