2016-07-18

Scraping dynamic web page data with selenium + phantomjs

I received a technical assessment task from fang88. The requirements were:

Use python3
Write a single crawl.py file that takes the url as an argument

Example urls:
https://www.lennar.com/new-homes/washington/seattle
https://www.lennar.com/new-homes/texas/houston
The scraped data is then sent via ajax to a specified endpoint.

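I scrape data fairly often, so I took a quick look: just one site, which seemed simple enough.
I normally write crawlers with scrapy on py2, so this was a good chance to get reacquainted with py3. I also skipped bs4, because I recently found that xpath parsing with lxml is quite pleasant.
Straight to the code, hehe!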

First, per the requirements, the script needs to read its argument. The command line looks like this:
python3 crawl.py https://www.lennar.com/new-homes/washington/seattle

from sys import argv

if len(argv) != 2:
	print('Invalid argument format')
else:
	url = argv[1]
	print(url)
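As a slightly stricter variation on the check above, here is a minimal sketch that uses argparse and rejects anything that is not an absolute http(s) url. The function name parse_url_arg and the error message are assumptions made for illustration; they are not part of the original crawl.py.

```python
import argparse
from urllib.parse import urlparse


def parse_url_arg(args=None):
    """Parse the single positional url argument and sanity-check it."""
    parser = argparse.ArgumentParser(description='crawl a listing page')
    parser.add_argument('url', help='page to crawl')
    ns = parser.parse_args(args)  # args=None means "read sys.argv[1:]"
    parts = urlparse(ns.url)
    # Reject anything that is not an absolute http(s) url.
    if parts.scheme not in ('http', 'https') or not parts.netloc:
        parser.error('url must start with http:// or https://')
    return ns.url
```

In the script itself this would be called as url = parse_url_arg(). On bad input argparse prints a usage message and exits with a non-zero status.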

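Next came analyzing the site and scraping the data. Inspecting the page showed that the content is loaded dynamically, so I chose selenium and phantomjs to scrape it.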

from lxml import etree
from selenium import webdriver

# Render the page with PhantomJS so the dynamically loaded content is present
driver = webdriver.PhantomJS(executable_path='/Users/apple/Downloads/phantomjs-2.1.1-macosx/bin/phantomjs')
driver.get(url)
selector = etree.HTML(driver.page_source.encode('utf-8'))
items = selector.xpath('//div[@class="comm-item clearfix"]')
for item in items:
	# Each xpath call returns a list, so fall back to '' when nothing matches
	temp = item.xpath('./h1/a/text()')
	name = temp[0] if len(temp) > 0 else ''
	temp = item.xpath('./h2/a/text()')
	name = name + ' ' + (temp[0] if len(temp) > 0 else '')
	print(name)
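The name extraction can be tried without a browser. The sketch below runs the same idea on a small, well-formed sample, using the standard library's xml.etree.ElementTree as a stand-in for lxml (so .text replaces the text() step). The markup in SAMPLE and the helper names are assumptions made for illustration. The real page is much larger and may not parse as strict XML, which is one reason lxml's HTML parser is used in the real code.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup that only mimics the class name and
# the h1/h2 layout targeted by the xpath expressions.
SAMPLE = """
<html><body>
  <div class="comm-item clearfix">
    <h1><a>Community A</a></h1>
    <h2><a>Seattle, WA</a></h2>
  </div>
  <div class="comm-item clearfix">
    <h1><a>Community B</a></h1>
  </div>
</body></html>
"""


def first_text(node, path):
    """Return the text of the first match for path, or '' if there is none."""
    found = node.find(path)
    return (found.text or '') if found is not None else ''


def extract_names(markup):
    root = ET.fromstring(markup.strip())
    names = []
    for item in root.findall(".//div[@class='comm-item clearfix']"):
        title = first_text(item, './h1/a')
        subtitle = first_text(item, './h2/a')
        # strip() drops the trailing space when the h2 part is missing
        names.append((title + ' ' + subtitle).strip())
    return names


print(extract_names(SAMPLE))
```

Unlike the loop above, this version strips the joined name, so an item without an h2 link does not end in a stray space.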

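With the code above I quickly got the content I needed from the first page. Next I wanted the paginated results, but the pagination links turned out to be # anchors. I inspected the requests in the Network panel of Chrome's developer tools and found the relevant POST endpoint, but it takes too many parameters. So I tried to reuse the simulated-click approach from an earlier simulated-login project: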

driver.find_element_by_xpath('//div[@class="sptop"]//a[@class="ir next"]').click()

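Running it raised an error:
ElementNotVisibleException
Going by past experience I added a delay, but it made no difference. Puzzled, I tried simulated clicks on other links, and those worked. In the end I fell back on the most primitive and most effective approach: calling into the page's js. (If anyone knows the cause, please tell me. Many thanks!)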
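For reference, one workaround that is sometimes used when a simulated click raises ElementNotVisibleException is to trigger the click from JavaScript through execute_script. This is not what the post ends up doing, and it has not been verified against this page, so treat it as a sketch. FakeDriver is a hypothetical stand-in that lets the snippet run without a browser; with selenium, driver would be the real webdriver and element a located WebElement.

```python
def js_click(driver, element):
    """Click an element through JavaScript instead of a simulated mouse click."""
    # arguments[0] refers to the element passed as the extra parameter.
    driver.execute_script('arguments[0].click();', element)


class FakeDriver:
    """Hypothetical stand-in that records calls instead of driving a browser."""

    def __init__(self):
        self.calls = []

    def execute_script(self, script, *args):
        self.calls.append((script, args))


driver = FakeDriver()
js_click(driver, 'next-link')  # a real call would pass a WebElement here
print(driver.calls)
```

A click triggered this way skips selenium's visibility check, so it can also click things a real user could not. It is worth confirming afterwards that the page actually changed.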

with open('test.html', 'w+') as f:
	f.write(driver.page_source)

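I saved the page to inspect the relevant js and the contents of its variables. It turned out that two endpoints are called. Starting with the first one: its parameters come from js variables in the page, so I read them through js: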

facetParams = driver.execute_script("return facetContextJSON.params")
pageState = driver.execute_script("return pageState")

Then I captured the headers with postman and copied them in, got the total number of pages with xpath, and fetched the json data with requests.

import json
import requests

def get_otherpage(page, facetParams, pageState, headers):
	pageState['pn'] = page
	payload = json.dumps({'searchState':facetParams, 'pageState': pageState})
	url = 'https://cn.lennar.com/Services/REST/Facets.svc/GetFacetResults'
	response = requests.request("POST", url, data=payload, headers=headers)
	return response
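The payload construction can be separated from the network call, which makes it easy to check what would be sent for each page. The sketch below is a variation, not the original code: it copies pageState instead of mutating it, and the values in facet_params and page_state are made up for illustration. Only the 'pn' key and the 'searchState' / 'pageState' wrapper come from the function above.

```python
import json


def build_facet_payload(page, facet_params, page_state):
    """Return the JSON body for one results page without touching the inputs."""
    state = dict(page_state)  # shallow copy so the caller's dict is unchanged
    state['pn'] = page
    return json.dumps({'searchState': facet_params, 'pageState': state})


# Hypothetical values; the real ones are read from the page's js variables.
facet_params = {'example': 'value'}
page_state = {'pn': 1}

# Build the bodies for pages 2 and 3.
payloads = [build_facet_payload(p, facet_params, page_state) for p in range(2, 4)]
for body in payloads:
    print(body)
```

Each body could then be posted with requests, reusing the same headers for every page.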

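That returned the json result successfully. Next came the second endpoint, whose parameters are part of the data returned by the first one. The code: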

# `response` is the first endpoint's response; its 'fr' field feeds the second call
url = "https://cn.lennar.com/Services/Rest/SearchMethods.svc/GetCommunityDetails"
pageState.update({"pt":"C","ic":19,"ss":0,"attr":"None","ius":False})
payload = json.dumps({'facetResults': response.json()['fr'], 'pageState':pageState})
response = requests.request("POST", url, data=payload, headers=headers)
for item in response.json():
	name = item['cnm']
	if item['mcm']:
		name += item['mcm']
	print(name)

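At this point all the required data can be retrieved. Since this was only a test, I did not wrap the code in a class, and I used the cookie captured by postman directly.
The test code is on github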