## Problem
How do you scrape an obfuscated site such as `www.hidemyass.com`?
As the sample below shows, the site injects random HTML tags into the markup. I have split it across multiple lines and indented it for readability.
You need to filter out the hidden elements to recover the real IP address (the browser displays it as `88.200.222.238`).
> Some people (including me), when confronted with a problem, think
> “I know, I'll use regular expressions.” Now they have two problems.

    s = '''<span>
      <style>
        .n8jQ{display:none}
        .p1Qr{display:inline}
        .E3lv{display:none}
        .I0ja{display:inline}
        .oRy_{display:none}
        .FYOA{display:inline}
        .oldO{display:none}
        .NQ2o{display:inline}
      </style>
      <span class="n8jQ">54</span>
      <span></span>
      <div style="display:none">60</div>
      <span class="p1Qr">88</span>
      <span style="display:none">143</span>
      <span class="oRy_">143</span>
      <span></span>
      <span class="n8jQ">160</span>
      <div style="display:none">160</div>
      .
      <span style="display:none">41</span>
      <span class="oldO">41</span>
      <div style="display:none">41</div>
      <span class="NQ2o">200</span>
      <span class="I0ja">.</span>
      <span style="display:none">27</span>
      <span style="display:none">63</span>
      <div style="display:none">63</div>
      <span style="display:none">178</span>
      <span style="display:none">191</span>
      <span class="47">222</span>
      .
      <div style="display:none">34</div>
      <span style="display:none">45</span>
      <span class="n8jQ">45</span>
      <span class="oldO">229</span>
      <span></span>
      <span style="display: inline">238</span>
    </span>'''

## Solution
    import re

    def parse_ipaddr(s):
        # Normalize tags: treat every <div> as a <span>.
        txt = re.sub(r'\bdiv\b', 'span', s)
        # Wrap bare text between tags (e.g. a lone '.') in a visible <span>.
        txt = re.sub(r'(?<=>)\s*([.0-9]+)\s*((?=<)(?!</)|(?=</span>$))',
                     r'<span style="display:inline">\g<1></span>', txt)
        # Extract the style sheet: map each class name to True if it is visible.
        css = {}
        l, r = s.find('<style>'), s.rfind('</style>')
        for i in s[l+7:r].strip().splitlines():
            m = re.search(r'\.(?P<key>[^{]+)\{display:(?P<val>none|inline)\}', i)
            if m:
                d = m.groupdict()
                css[d['key']] = d['val'] == 'inline'
        # Collect the visible parts; a class without a rule counts as visible.
        ip_parts = []
        for j in re.findall(r'<span (class|style)="([^"]+)">([^<>]+)</span>', txt):
            if j[0] == 'class' and css.get(j[1], True):
                ip_parts.append(j[2])
            elif j[0] == 'style' and 'inline' in j[1]:
                ip_parts.append(j[2])
        return ''.join(ip_parts)

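
For comparison, the same visibility rules can be applied with the standard library's `html.parser` instead of regular expressions on the markup. This is a minimal sketch rather than a drop-in replacement: `VisibleTextParser`, `visible_text`, and the shortened `sample` string are names made up for this illustration. It assumes that the `<style>` block precedes the elements it governs, that every tag is explicitly closed, and that rules take the simple `.name{display:...}` form.

```python
import re
from html.parser import HTMLParser

# Matches simple one-property rules such as ".n8jQ{display:none}".
RULE_RE = re.compile(r'\.([^{\s]+)\s*\{\s*display\s*:\s*(none|inline)\s*\}')


class VisibleTextParser(HTMLParser):
    """Collect text that is not inside a hidden element or a <style> block."""

    def __init__(self):
        super().__init__()
        self.hidden_classes = set()  # class names whose rule is display:none
        self.stack = []              # one "hidden?" flag per currently open tag
        self.in_style = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == 'style':
            self.in_style = True
            return
        attrs = dict(attrs)
        style = (attrs.get('style') or '').replace(' ', '')
        classes = (attrs.get('class') or '').split()
        hidden = ('display:none' in style
                  or any(c in self.hidden_classes for c in classes))
        # Assumption: every tag is closed, so each push gets a matching pop.
        self.stack.append(hidden)

    def handle_endtag(self, tag):
        if tag == 'style':
            self.in_style = False
        elif self.stack:
            self.stack.pop()

    def handle_data(self, data):
        if self.in_style:
            # Record which classes the style sheet hides.
            for name, value in RULE_RE.findall(data):
                if value == 'none':
                    self.hidden_classes.add(name)
        elif not any(self.stack):
            # Keep text only if no enclosing element is hidden.
            self.parts.append(data.strip())


def visible_text(html):
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return ''.join(parser.parts)


# Shortened version of the sample markup, for illustration only.
sample = '''<span>
<style>
.n8jQ{display:none}
.p1Qr{display:inline}
</style>
<span class="n8jQ">54</span>
<span class="p1Qr">88</span>
<div style="display:none">143</div>
.
<span style="display: inline">200</span>
<span class="p1Qr">.</span>
<span class="47">222</span>
.
<span class="n8jQ">45</span>
<span style="display: inline">238</span>
</span>'''

print(visible_text(sample))  # 88.200.222.238
```

Tracking a stack of "hidden" flags means text nested inside a hidden element is skipped as well, and bare text such as a lone `.` needs no special wrapping step.
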
## Result
    >>> parse_ipaddr(s)
    '88.200.222.238'

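
As an optional sanity check, the extracted string can be validated with the standard library's `ipaddress` module, which would flag a result where a hidden fragment leaked through. The helper name `is_valid_ipv4` is hypothetical, and this sketch is not part of the original solution.

```python
import ipaddress


def is_valid_ipv4(text):
    """Return True if text parses as a dotted-quad IPv4 address."""
    try:
        ipaddress.IPv4Address(text)
    except ValueError:  # AddressValueError is a subclass of ValueError
        return False
    return True


print(is_valid_ipv4('88.200.222.238'))     # True
print(is_valid_ipv4('54.88.200.222.238'))  # False: a hidden octet leaked in
```
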
## Links
- http://www.hidemyass.com/proxy-list/
- http://regex.info/blog/2006-09-15/247