Tags: crawler, python, python crawler, python crawler in practice, Autohome crawler
1. The crawl is split into a crawl-flow layer and a content-parsing layer
1) The crawl-flow layer controls the dispatch of requests for the on-sale (在售), upcoming (即将销售) and discontinued (停售) tabs
2) The content-parsing layer loops over the specs on the current page and follows its pagination; a minimal sketch of this layout follows
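A minimal sketch of that two-callback layout (the class name and selectors here are illustrative only; the project's actual code appears further down):

import scrapy

# Illustrative skeleton: parse() is the crawl-flow layer that dispatches one
# request per sale-state tab, parseSpec() is the content-parsing layer that
# handles one list page plus its pagination.
class LayoutSketchSpider(scrapy.Spider):
    name = "layoutSketch"
    host = "https://car.autohome.com.cn%s"

    def parse(self, response):
        # crawl-flow control: one request per tab that actually has a link
        for link in response.css(".tab-nav ul li a::attr(href)").getall():
            yield scrapy.Request(self.host % link, callback=self.parseSpec)

    def parseSpec(self, response):
        # content parsing would extract the specs on this page here, then
        # follow the pagination within the same sale state
        nextPage = response.css(".page a.page-item-next::attr(href)").get()
        if nextPage and "javascript" not in nextPage:
            yield scrapy.Request(self.host % nextPage, callback=self.parseSpec)

The status-tab markup that the flow layer has to interpret looks like this: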
<div class="tab-nav border-t-no">
<!--状态tab、排序-->
<div class="brandtab-cont-sort">
<a href="/price/series-4741-0-2-0-0-0-0-1.html" class="ma-r15 current">最热门<i class="icon10 icon10-up"></i></a>
<a href="/price/series-4741-1-2-0-0-0-0-1.html">按价格<span class="icon-cont"><i class="icon10 icon10-sjt"></i><i class="icon10 icon10-sjb"></i></span></a>
</div>
<ul data-trigger="click">
<li class="disabled"><span title="在售是指官方已经公布售价且正式在国内销售的车型">在售</span></li>
<li class="current"><a href="/price/series-4741-0-2-0-0-0-0-1.html" data-toggle="tab" data-target="#brandtab-2" rel="nofollow" target="_self" title="即将销售是指近期即将在国内销售的车型">即将销售</a></li>
<li class="disabled"><span title="停售是指厂商已停产并且经销商处已无新车销售的车型">停售</span></li>
</ul>
</div>
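Note that tabs with no cars are rendered as a bare <span> inside <li class="disabled">, while tabs that can be opened carry an <a href>. A standalone sketch of that classification, using parsel (the selector library Scrapy itself builds on) and a hypothetical fragment trimmed from the markup above:

from parsel import Selector

# Only the tab rendered as <a href> yields a crawlable link; disabled tabs
# stay at the '-1' sentinel, exactly as in the spider's parse() below.
html = '''
<div class="tab-nav border-t-no"><ul>
  <li class="disabled"><span>在售</span></li>
  <li class="current"><a href="/price/series-4741-0-2-0-0-0-0-1.html">即将销售</a></li>
  <li class="disabled"><span>停售</span></li>
</ul></div>'''
links = {'在售': '-1', '即将销售': '-1', '停售': '-1'}
for li in Selector(text=html).xpath("//ul/li"):
    a = li.xpath("a")
    if a:
        links[a.xpath("text()").get()] = a.xpath("@href").get()
print(links)  # only 即将销售 carries a real link in this fragment

The spec list itself is a ul.interval01-list of <li> entries: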
<ul class="interval01-list">
<li data-value="33885">
<div class="interval01-list-cars">
<div class="interval01-list-cars-infor">
<p id="p33885"><a href="//www.autohome.com.cn/spec/33885/#pvareaid=2042128" target="_blank">2018款 基本型</a></p>
<p></p>
<p><span></span><span></span></p>
</div>
</div>
<div class="interval01-list-attention">
<div class="attention">
<span id="spgzd33885" class="attention-value" style="width:100%"></span>
</div>
</div>
<div class="interval01-list-guidance">
<div>
<a href="//j.autohome.com.cn/pcplatform/staticpage/loan/index.html?specid=33885&pvareaid=2020994" target="_blank" title="购车费用计算"><i class="icon16 icon16-calendar"></i></a> 16.99万
</div>
</div>
<div class="interval01-list-lowest">
<div>
<span class="red price-link">暂无报价</span>
<a class="js-xj btn btn-mini btn-blue btn-disabled" id="pxj33885" name="pxjs4741" target="_blank">询价</a>
</div>
</div>
<div class="interval01-list-related">
<div>
<span id="spspk33885">口碑</span>
<a class="fn-hide" target="blank" id="spk33885">口碑</a>
<a href="/pic/series-s33885/4741.html#pvareaid=100678" target="_blank">图片</a>
<span id="spsps33885">视频</span>
<a class="fn-hide" target="blank" id="sps33885">视频</a>
<span>配置</span>
</div>
</div></li>
</ul>
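From each <li> above, the spider needs the spec ID (the data-value attribute) plus the name and detail link from the anchor inside the <p> whose id is "p" plus that ID. A small standalone version of the extraction done by extractSpecItem() further down, on a hypothetical trimmed fragment:

from parsel import Selector

# The "#pvareaid" tracking fragment and the trailing slash are stripped from
# the detail link, matching what the spider stores.
html = '''<ul class="interval01-list"><li data-value="33885">
  <p id="p33885"><a href="//www.autohome.com.cn/spec/33885/#pvareaid=2042128">2018款 基本型</a></p>
</li></ul>'''
for li in Selector(text=html).css(".interval01-list").xpath("li"):
    specId = li.xpath("@data-value").get()
    a = li.css("#p" + specId).xpath("a")
    name = a.xpath("text()").get()
    link = ("https:%s" % a.xpath("@href").get()).split("#")[0].rstrip("/")
    print(specId, name, link)
# 33885 2018款 基本型 https://www.autohome.com.cn/spec/33885

The pagination block at the bottom of each list page: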
<div class="page">
<a class="page-item-prev page-disabled" href="javascript:void(0)">上一页</a>
<a href="javascript:void(0);" class="current">1</a>
<a href="/price/series-65-0-3-0-0-0-0-2.html">2</a>
<a href="/price/series-65-0-3-0-0-0-0-3.html">3</a>
<a href="/price/series-65-0-3-0-0-0-0-4.html">4</a>
<a href="/price/series-65-0-3-0-0-0-0-5.html">5</a>
<a href="/price/series-65-0-3-0-0-0-0-6.html">6</a>
<a class="page-item-next" href="/price/series-65-0-3-0-0-0-0-2.html">下一页</a>
</div>
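The last <a> in the ".page" div is always the next-page link, and on the final page its href degrades to "javascript:void(0)"; that is what the href.find("java") == -1 test in the spider checks. In isolation, with a hypothetical two-link fragment:

from parsel import Selector

html = '''<div class="page">
  <a href="javascript:void(0);" class="current">1</a>
  <a class="page-item-next" href="/price/series-65-0-3-0-0-0-0-2.html">下一页</a>
</div>'''
lastHref = Selector(text=html).css(".page").xpath("a")[-1].xpath("@href").get()
if lastHref.find("java") == -1:
    print("follow https://car.autohome.com.cn" + lastHref)
else:
    print("last page reached")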
Because the guide price and user score live on the spec detail page, the spec table has to be updated while crawling each spec's configuration page. The relevant markup:
<span class="factoryprice">厂商指导价:12.08<em>万元</em></span>
<div class="spec-content">
<div class="koubei-con">
<div class="koubei-left">
<div class="koubei-data">
<span>网友评分:<a href="//k.autohome.com.cn/spec/33409/#pvareaid=3454572" class="scroe">4.28分</a></span>
<span>口碑印象:<a href="//k.autohome.com.cn/spec/33409/#pvareaid=3454573" class="count">5人参与评价</a></span>
</div>
<div class="koubei-tags">
<a href="//k.autohome.com.cn/spec/33409/?summarykey=530800&g#pvareaid=3454574" class="athm-tags athm-tags--blue">油耗满意</a>
<a href="//k.autohome.com.cn/spec/33409/?summarykey=457074&g#pvareaid=3454574" class="athm-tags athm-tags--default">胎噪很硬</a>
</div>
</div>
<div class="koubei-right">
<p class="koubei-user"> <span> <a href="javascript:void(0)" id="koubei_user" data-userid="32543097" target="_blank"> </a> <i>发表</i> </span> <span> <b title="2018款 118i 时尚型">2018款 118i 时尚型</b> <i>口碑</i> </span> <span><em>车主已追加口碑</em></span> </p>
<p class="koubei-info"> <span>裸车价:<em>15.9万</em></span> <span>购车时间:<em>2018年3月</em></span> <span> 耗 电 量: <em>暂无</em> </span> </p>
<p class="koubei-list"> <a href="//k.autohome.com.cn/spec/33409/view_2013088_1.html#pvareaid=3454575"> 【最满意的一点】 1、颜值。这个毋庸置疑,尤其是车头真长,显得整车就没有那么小了,而且蓝色虽然已经烂大街了,不过真的...<i>详细 ></i> </a> </p>
</div>
</div>
</div>
<div class="spec-content">
<!-- 空数据 -->
<div class="koubei-blank">
<p>本车型暂无优秀口碑,发表优秀口碑赢丰富好礼</p>
<p><a href="//k.autohome.com.cn/form/carinput/add/31960#pvareaid=3454571" class="athm-btn athm-btn--mini athm-btn--blue-outline">发表口碑</a></p>
</div>
</div>
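A sketch of the detail-page parsing (the spider keeps a commented-out parseScoreAndPrice() version of this at the end of the listing): the guide price sits in ".factoryprice", the score in the first anchor of ".koubei-data", and a spec without reviews renders ".koubei-blank" instead, so both lookups must tolerate missing nodes. The helper name and default values below are illustrative:

import re
from parsel import Selector

def parseScoreAndPrice(html):
    sel = Selector(text=html)
    price, score = 0, 0
    priceText = sel.css(".factoryprice").xpath("text()").get()
    if priceText:
        price = re.split("[::]", priceText)[1]  # tolerate ASCII or fullwidth colon
    scoreText = sel.css(".koubei-data").xpath("span/a/text()").get()
    if scoreText:
        score = scoreText.rstrip("分")           # "4.28分" -> "4.28"
    return price, score

print(parseScoreAndPrice(
    '<span class="factoryprice">厂商指导价:12.08<em>万元</em></span>'
    '<div class="koubei-data"><span>网友评分:<a class="scroe">4.28分</a></span></div>'))
# ('12.08', '4.28')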
Core class code:
import scrapy, pymysql, re
from ..mySqlUtils import MySqlUtils
from ..items import SpecItem, SeriesItem
from ..pipelines import SpecPipeline

# Requests the spec list under each series and fills in the details
class specSpider(scrapy.Spider):
    name = "specSpider"
    https = "https:%s"
    host = "https://car.autohome.com.cn%s"
    count = 0
    ruleId = 2  # crawl policy: 1 = only crawl specs missing from the database, 2 = update everything
    chexingIdSet = None  # spec IDs already crawled, loaded from the database

    # Parse one spec-list page and save the results to the database
    def parseSpec(self, response):
        seriesParams = response.meta['seriesParams']
        specList = self.extractSpecItem(response)
        # Save to the database
        for specItem in specList:
            yield specItem
        # After parsing the current page, check for pagination; if present,
        # request the next page back into this method
        pageData = response.css(".page")
        if pageData:
            # Take the last <a>, i.e. the next-page link
            pageList = pageData.xpath("a")
            nextPage = pageList[len(pageList) - 1].xpath("@href").extract_first()
            # Only follow a real link, not "javascript:void(0)"
            if nextPage.find("java") == -1:
                pageLink = self.host % nextPage
                request = scrapy.Request(url=pageLink, callback=self.parseSpec)
                request.meta['seriesParams'] = seriesParams  # (brand ID, series ID)
                yield request
    # Parse the series page: series summary plus its spec list
    def parse(self, response):
        seriesItem = SeriesItem()
        seriesParams = response.meta['seriesParams']
        # Parse the series summary block
        seriesData = response.css(".lever-ul").xpath("*")
        # Vehicle class, e.g. '级\xa0\xa0别:中型SUV'
        lever = seriesData[0].xpath("string(.)").extract_first()
        lever = lever.split(":")[1].strip()
        # Guide price range
        minPrice = 0
        maxPrice = 0
        seriesDataRight = response.css(".main-lever-right").xpath("*")
        price = seriesDataRight[0].xpath("span/span/text()").extract_first()
        if price.find("-") != -1:
            price = price.rstrip("万")
            price = price.split("-")
            minPrice = price[0]
            maxPrice = price[1]
        # User score
        userScore = 0
        userScoreStr = seriesDataRight[1].xpath("string(.)").extract_first()
        if re.search(r'\d+', userScoreStr) is not None:
            userScore = userScoreStr.split(":")[1]
        # Save the series summary to the database
        seriesItem['minMoney'] = minPrice
        seriesItem['maxMoney'] = maxPrice
        seriesItem['score'] = userScore
        seriesItem['jibie'] = lever
        seriesItem['chexiID'] = seriesParams[1]
        # self.log(seriesItem)
        yield seriesItem
        # Parse the spec list on the current page
        specList = self.extractSpecItem(response)
        # self.log(specList)
        # Save to the database
        for specItem in specList:
            yield specItem
        # Crawl logic for the sale-state tabs:
        # 1. collect the on-sale, upcoming and discontinued links
        # 2. for each state with a link, decide whether this response already is that page
        # 3. parse the current state's data
        # 4. keep requesting while pagination exists
        # 1.1 the three links
        sellingLink = '-1'   # on sale
        sellWaitLink = '-1'  # upcoming
        sellStopLink = '-1'  # discontinued
        # 1.2 extract the three states
        statusData = response.css(".tab-nav.border-t-no")
        statusList = statusData.xpath("ul/li")
        for statusItem in statusList:
            status = statusItem.xpath("a")
            if status:
                statusDes = status.xpath("text()").extract_first()
                link = status.xpath("@href").extract_first()
                if statusDes == '在售':
                    sellingLink = link
                if statusDes == '即将销售':
                    sellWaitLink = link
                if statusDes == '停售':
                    sellStopLink = link
        # self.log("-------------------------->status")
        statusPrint = (sellingLink, sellWaitLink, sellStopLink)
        # self.log(statusPrint)
        # 2.2 upcoming state
        if sellWaitLink != '-1':
            # If the on-sale tab has a link, this response is the on-sale page and
            # the upcoming page still has to be requested; otherwise this response
            # already is the upcoming page and was parsed above
            if sellingLink != '-1':
                request = scrapy.Request(url=self.host % sellWaitLink, callback=self.parseSpec)
                request.meta['seriesParams'] = seriesParams  # (brand ID, series ID)
                yield request
        # 2.3 discontinued state
        if sellStopLink != '-1':
            # If the on-sale or upcoming tab has a link, this response is not the
            # discontinued page, which must be requested before it can be parsed;
            # otherwise this response is the discontinued page, already parsed above,
            # and only its pagination remains
            if sellingLink != '-1' or sellWaitLink != '-1':
                request = scrapy.Request(url=self.host % sellStopLink, callback=self.parseSpec)
                request.meta['seriesParams'] = seriesParams  # (brand ID, series ID)
                yield request
            else:
                # Check for pagination; if present, keep requesting
                pageData = response.css(".page")
                if pageData:
                    pageList = pageData.xpath("a")
                    nextPage = pageList[len(pageList) - 1].xpath("@href").extract_first()
                    # Only follow a real next-page link
                    if nextPage.find("java") == -1:
                        pageLink = self.host % nextPage
                        request = scrapy.Request(url=pageLink, callback=self.parseSpec)
                        request.meta['seriesParams'] = seriesParams  # (brand ID, series ID)
                        yield request
    def start_requests(self):
        self.chexingIdSet = MySqlUtils.parseToChexingIdSet(MySqlUtils.querySpec())
        # Read the series table to get the spec-list link of every series
        seriesItems = MySqlUtils.querySeriesLink()
        # seriesItems=["https://car.autohome.com.cn/price/series-4171.html"]  # test: discontinued
        # seriesItems=["https://car.autohome.com.cn/price/series-4887.html"]  # test: upcoming, spec ID 35775
        # Resume crawling from the breakpoint
        waitingCrawlItems = list()
        for id in SpecPipeline.waitingCrawlSeriesIdSet:
            for item in seriesItems:
                if id == item[1]:
                    waitingCrawlItems.append(item)
                    break
        # waitingCrawItems=MySqlUtils.findChexiInChexiSet(seriesItems,SpecPipeline.waitingCrawlSeriesIdSet)
        for item in waitingCrawlItems:
            # Count the series crawled so far
            SpecPipeline.crawledSeriesCount += 1
            SpecPipeline.crawledSeriesIdSet.add(item[1])
            url = item[2]
            # url = item
            request = scrapy.Request(url=url, callback=self.parse)
            request.meta['seriesParams'] = (item[0], item[1])  # (brand ID, series ID)
            # request.meta['seriesParams'] = ('122', '4887')  # (brand ID, series ID)
            yield request
    # Extract the spec items from a spec-list page
    def extractSpecItem(self, response):
        seriesParams = response.meta['seriesParams']
        specDataGroups = response.css(".interval01-list")
        specList = list()
        for specDataGroup in specDataGroups:
            for specDataItem in specDataGroup.xpath("li"):
                # Spec ID
                specId = specDataItem.xpath("@data-value").extract_first()
                specNameData = specDataItem.css("#p" + specId).xpath("a")
                # Spec name
                specName = specNameData.xpath("text()").extract_first()
                # Spec detail link; drop the "#pvareaid=..." fragment and the trailing
                # slash (split is safer than slicing on find("#"), which breaks when
                # no fragment is present)
                specLink = self.https % specNameData.xpath("@href").extract_first()
                specLink = specLink.split("#")[0].rstrip("/")
                specItem = SpecItem()
                specItem['pinpaiID'] = seriesParams[0]
                specItem['chexiID'] = seriesParams[1]
                specItem['chexingID'] = specId
                specItem['name'] = specName
                specItem['url'] = specLink
                specItem['sqlType'] = '1'
                # self.log(specItem)
                # Count newly discovered specs
                if specId not in self.chexingIdSet:
                    SpecPipeline.addSpecCount += 1
                # With ruleId == 1 only new specs are saved; existing ones are skipped
                if self.ruleId == 1:
                    if specId in self.chexingIdSet:
                        continue
                self.log("yieldCount:%d" % self.count)
                # Queue the spec for the database
                self.count += 1
                specList.append(specItem)
        return specList
    # Batch-save variant, kept commented out for reference: only callbacks
    # registered through scrapy.Request can yield items, so this version wrote
    # to the database directly via MySqlUtils
    # def parseSellingSpec(self,response):
    #     print(">>>>>>>>>>>>>>>>>>>>>>>>>>>parseSellingSpec")
    #     t=type(response)
    #     self.log(t)
    #     # parse
    #     seriesParams = response.meta['seriesParams']
    #     specDataGroups = response.css(".interval01-list")
    #     self.log(seriesParams)
    #     self.log(specDataGroups)
    #     specList=list()
    #     for specDataGroup in specDataGroups:
    #         for specDataItem in specDataGroup.xpath("li"):
    #             # spec ID
    #             specId = specDataItem.xpath("@data-value").extract_first()
    #             specNameData = specDataItem.css("#p" + specId).xpath("a")
    #             # spec name
    #             specName = specNameData.xpath("text()").extract_first()
    #             # spec detail link
    #             specLink = self.https % specNameData.xpath("@href").extract_first()
    #             pingpaiID=seriesParams[0]
    #             chexiID=seriesParams[1]
    #             chexingID=specId
    #             specItem=(chexingID,pingpaiID,chexiID,specName,specLink)
    #             specList.append(specItem)
    #             self.log(specItem)
    #             # count the saved specs
    #             self.count += 1
    #             self.log("saveCount:%d" % self.count)
    #             # yield specItem  # only callbacks registered via scrapy.Request support yield
    #     # save the batch through mysqlUtils
    #     MySqlUtils.insertSpecItemList(specList)
    # def parseScoreAndPrice(self,response):
    #     # fetch the spec item passed along in the request meta
    #     specItem=response.meta['specItem']
    #     # parse the user score
    #     scoreData = response.css(".koubei-data")
    #     score=0
    #     if scoreData:
    #         score = scoreData.xpath("span/a")[0].xpath("text()").extract_first()
    #         score = score[0:score.find("分")]
    #     # parse the guide price
    #     priceData = response.css(".factoryprice")
    #     price=0
    #     if priceData:
    #         price = priceData.xpath("text()").extract_first()
    #         price=price.split(":")[1]
    #     specItem['money']=price
    #     specItem['score']=score
    #     self.log(specItem)
    #     # save the spec to the database
    #     yield specItem
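The MySqlUtils and pipeline helpers are not shown in this post; from the way start_requests() indexes its rows, the series query presumably returns (brand ID, series ID, URL) tuples:

# Assumed row shape of MySqlUtils.querySeriesLink(), inferred from the
# indexing in start_requests(); the real helper lives in mySqlUtils.py,
# which is not shown here.
exampleRow = ('33', '4741', 'https://car.autohome.com.cn/price/series-4741.html')
# item[0] -> brand ID, item[1] -> series ID, item[2] -> series spec-list URL

# With the project's items and pipelines in place, the spider runs with:
#   scrapy crawl specSpider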