2013-04-24

How to scrape an obfuscated site? (Part 1)


## Problem

How do you scrape an obfuscated site such as `www.hidemyass.com`?

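As the sample below shows, the site injects random HTML tags around the real value, and many of them are hidden with CSS. I have broken the markup into multiple indented lines for readability.
You need to strip out the hidden decoys to recover the real IP address (rendered in the browser as `88.200.222.238`).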

> Some people (including me), when confronted with a problem, think
> “I know, I'll use regular expressions.” Now they have two problems.

    s = '''<span>
        <style>
            .n8jQ{display:none}
            .p1Qr{display:inline}
            .E3lv{display:none}
            .I0ja{display:inline}
            .oRy_{display:none}
            .FYOA{display:inline}
            .oldO{display:none}
            .NQ2o{display:inline}
        </style>
        <span class="n8jQ">54</span>
        <span></span>
        <div style="display:none">60</div>
        <span class="p1Qr">88</span>
        <span style="display:none">143</span>
        <span class="oRy_">143</span>
        <span></span>
        <span class="n8jQ">160</span>
        <div style="display:none">160</div>
        .
        <span style="display:none">41</span>
        <span class="oldO">41</span>
        <div style="display:none">41</div>
        <span class="NQ2o">200</span>
        <span class="I0ja">.</span>
        <span style="display:none">27</span>
        <span style="display:none">63</span>
        <div style="display:none">63</div>
        <span style="display:none">178</span>
        <span style="display:none">191</span>
        <span class="47">222</span>
        .
        <div style="display:none">34</div>
        <span style="display:none">45</span>
        <span class="n8jQ">45</span>
        <span class="oldO">229</span>
        <span></span>
        <span style="display: inline">238</span>
    </span>'''
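
To see why the hidden pieces have to be filtered out rather than just the tags, here is a minimal sketch that strips every tag and keeps all of the text. The `sample` string is a shortened, hypothetical snippet in the same style as the one above, and `strip_tags_naively` is a name made up for this illustration.

```python
import re

# Shortened, hypothetical sample in the same style as the snippet above.
sample = (
    '<span><style>.n8jQ{display:none}.p1Qr{display:inline}</style>'
    '<span class="n8jQ">54</span>'
    '<div style="display:none">60</div>'
    '<span class="p1Qr">88</span>'
    '.'
    '<span style="display:none">41</span>'
    '<span style="display: inline">200</span>'
    '</span>'
)

def strip_tags_naively(html):
    """Drop the <style> block and every tag, keeping all text, hidden or not."""
    without_style = re.sub(r'<style>.*?</style>', '', html, flags=re.S)
    return re.sub(r'<[^>]+>', '', without_style)

naive = strip_tags_naively(sample)
print(naive)  # prints 546088.41200
```

The visible text of this sample is `88.200`, but the decoys `54`, `60` and `41` end up in the output, so the `display` rules have to be taken into account.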

## Solution

    import re

    def parse_ipaddr(s):
        """Return the visible text (the IP address) of an obfuscated snippet."""
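        # normalize tags: treat every <div> as a <span>, then wrap bare text
        # (digits or dots sitting directly between tags) in a visible <span>
        txt = re.sub(r'\bdiv\b', 'span', s)
        txt = re.sub(r'(?<=>)\s*([.0-9]+)\s*((?=<)(?!</)|(?=</span>$))', r'<span style="display:inline">\g<1></span>', txt)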
    
        # extract style sheet: map each class name to True if it is displayed
        css = {}
        start, end = s.find('<style>'), s.rfind('</style>')
        for line in s[start + len('<style>'):end].strip().splitlines():
            m = re.search(r'\.(?P<key>[^{]+)\{display:(?P<val>none|inline)\}', line)
            if m:
                css[m.group('key')] = m.group('val') == 'inline'
    
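        # collect ip parts: keep a span if its class is not hidden (unknown
        # classes count as visible) or if its inline style says "inline"
        ip_parts = []
        for attr, value, text in re.findall(r'<span (class|style)="([^"]+)">([^<>]+)</span>', txt):
            if attr == 'class' and css.get(value, True):
                ip_parts.append(text)
            elif attr == 'style' and 'inline' in value:
                ip_parts.append(text)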
    
        return ''.join(ip_parts)
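
For comparison, the same idea can be sketched without regex-based tag matching, using `html.parser` from the Python 3 standard library. This is an alternative illustration, not the solution above: `VisibleTextParser`, `parse_ipaddr_with_parser` and the shortened `sample` are hypothetical names and data, and the sketch assumes well-nested tags and one-rule-per-class stylesheets like the ones in the snippet.

```python
import re
from html.parser import HTMLParser

class VisibleTextParser(HTMLParser):
    """Collect text that is not inside an element hidden by display:none."""

    def __init__(self):
        super().__init__()
        self.hidden_classes = set()  # class names declared display:none
        self.stack = []              # one bool per open tag: is it hidden?
        self.in_style = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == 'style':
            self.in_style = True
            return
        attrs = dict(attrs)
        style = (attrs.get('style') or '').replace(' ', '')
        hidden = 'display:none' in style or attrs.get('class') in self.hidden_classes
        self.stack.append(hidden)

    def handle_endtag(self, tag):
        if tag == 'style':
            self.in_style = False
        elif self.stack:
            self.stack.pop()

    def handle_data(self, data):
        if self.in_style:
            # Remember every class the stylesheet hides.
            self.hidden_classes.update(re.findall(r'\.([^{\s]+)\{display:none\}', data))
        elif not any(self.stack):
            # Text is visible only if no enclosing element is hidden.
            self.parts.append(data.strip())

def parse_ipaddr_with_parser(html):
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return ''.join(parser.parts)

# Shortened, hypothetical sample in the same style as the snippet above.
sample = '''<span>
    <style>
        .n8jQ{display:none}
        .p1Qr{display:inline}
    </style>
    <span class="n8jQ">54</span>
    <div style="display:none">60</div>
    <span class="p1Qr">88</span>
    .
    <span style="display:none">41</span>
    <span class="47">200</span>
    <span class="p1Qr">.</span>
    <span></span>
    <span style="display: inline">222</span>
    .
    <span class="n8jQ">45</span>
    <span style="display: inline">238</span>
</span>'''

print(parse_ipaddr_with_parser(sample))  # prints 88.200.222.238
```

The parser tracks visibility with a stack, so bare text between tags needs no special wrapping step, at the cost of a little more code.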

## Result

    >>> parse_ipaddr(s)
    '88.200.222.238'
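
A scraped value is only useful if it is a well-formed address, so a quick sanity check can catch the case where decoys leak into the output. The following is a small sketch using the standard `ipaddress` module (Python 3.3+); `is_valid_ipv4` is a hypothetical helper name.

```python
import ipaddress

def is_valid_ipv4(text):
    """Return True if text parses as a dotted-quad IPv4 address."""
    try:
        ipaddress.IPv4Address(text)
    except ValueError:
        return False
    return True

print(is_valid_ipv4('88.200.222.238'))  # prints True
print(is_valid_ipv4('546088.41200'))    # prints False (decoys left in)
```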

## Links

- http://www.hidemyass.com/proxy-list/
- http://regex.info/blog/2006-09-15/247