Scraping Tech News with Python

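Having tasted success with my Python production-monitoring script, I came to appreciate how practical this language is. I read tech news on the bus every day, so I thought: wouldn't it be nice if the news were collected and sent to me automatically? So I rolled up my sleeves and wrote a tech news crawler. It can crawl at fixed times, and you can write your own algorithms to filter for the news you care about. The code is simple and uses no complicated libraries; it doesn't even need BeautifulSoup.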
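The filtering idea mentioned above can be as simple as keyword matching on titles. The following is a minimal sketch, not the crawler's actual code: the function name `filter_titles` and the sample keywords are hypothetical and chosen only for illustration.

```python
def filter_titles(titles, include, exclude=()):
    """Keep titles containing any `include` keyword and no `exclude` keyword.

    Matching is case-insensitive substring matching. All names here are
    hypothetical and only illustrate the filtering idea.
    """
    include = [k.lower() for k in include]
    exclude = [k.lower() for k in exclude]
    kept = []
    for title in titles:
        lowered = title.lower()
        if any(k in lowered for k in exclude):
            continue  # drop titles that mention an unwanted keyword
        if any(k in lowered for k in include):
            kept.append(title)
    return kept


sample = [
    "Apple releases new iPhone",
    "Python 3 adoption keeps growing",
    "Celebrity gossip roundup",
]
print(filter_titles(sample, include=["python", "apple"], exclude=["gossip"]))
```

A function like this could be applied to the list of titles in `getnews` before it is handed to `sendmail`.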

Implementation:

# readhub.py (module name taken from the readhub.getnews() call in schedule.py)
import smtplib
from email.header import Header
from email.mime.text import MIMEText

import requests


# Send a request to Readhub
def readhubRequest(url, params, headers=None, method='POST'):
    status_code = 0
    json = 'no json'
    method = method.upper()

    try:
        if len(url) == 0:
            return (status_code, json)
        # requests accepts headers=None, so no separate branch is needed
        if method == 'POST':
            resp = requests.post(url=url, params=params, headers=headers)
        elif method == 'GET':
            resp = requests.get(url=url, params=params, headers=headers)
        else:
            return (status_code, json)
        status_code = resp.status_code
        json = resp.json()
    except Exception as e:
        print(e)
    # print(json)  # uncomment to inspect the result
    return (status_code, json)


# Email the tech news to yourself
def sendmail(content):
    # Third-party SMTP service
    mail_host = "smtp.qq.com"  # SMTP server
    mail_user = ""   # username
    mail_pass = ""   # password

    sender = 'sender'  # sender address
    receivers = []  # recipient addresses

    message = MIMEText('\n'.join(content), 'plain', 'utf-8')
    message['From'] = Header("Sender", 'utf-8')
    message['To'] = Header("Recipient", 'utf-8')

    subject = 'Tech News -- Python Crawler'
    message['Subject'] = Header(subject, 'utf-8')

    try:
        smtpObj = smtplib.SMTP_SSL(mail_host, 465)
        # smtpObj.connect(mail_host, 25)
        smtpObj.login(mail_user, mail_pass)
        smtpObj.sendmail(sender, receivers, message.as_string())
        smtpObj.quit()
        # print("sent successfully")
    except smtplib.SMTPException as e:
        print(e)


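def getnews():
    # Fetch the data. This crawls Readhub, but it could be swapped for Toutiao or another source
    (code, json) = readhubRequest("https://api.readhub.me/topic", {"lastCursor": "", "pageSize": 20}, None, 'GET')
    # Got the data; tidy it up, then send an email or do something else with it
    news = []
    # On failure readhubRequest returns a plain string, so check the type first
    if isinstance(json, dict) and "data" in json:
        for new in json["data"]:
            if "title" in new:
                news.append(new["title"])
    # print(news)
    if len(news) > 0:
        sendmail(news)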
	
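# schedule.py: timing script that controls when to crawl
import time
from datetime import datetime

import readhub

# Times of day at which to crawl, in ascending order.
# Plain integers are used because literals such as 01 are a syntax error in Python 3.
schedulelist = [
    {"hour": 0, "minute": 1, "second": 20},
    {"hour": 0, "minute": 2, "second": 20},
    {"hour": 9, "minute": 30, "second": 0},
    {"hour": 23, "minute": 59, "second": 30},
]


# Advance the index, wrapping back to 0 at the end of the list
def addCount(count, total):
    count = count + 1
    if count == total:
        count = 0
    return count


# Return today's datetime for the given schedule entry
def nextTime(item):
    curTime = datetime.now()
    hour = item["hour"]
    minute = item["minute"]
    second = item["second"]
    desTime = curTime.replace(hour=hour, minute=minute, second=second, microsecond=0)
    return desTime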

def run():
    index = 0
    total_count = len(schedulelist)
    while True:
        try:
            curTime = datetime.now()
            item = schedulelist[index]
            # print("current time: " + str(curTime))

            desTime = nextTime(item)
            skipSeconds = (desTime - curTime).total_seconds()
            # print("%d seconds until the next task" % skipSeconds)

            if skipSeconds < 0:
                # This entry has already passed today, so move on to the next one.
                # (The original jumped straight to tomorrow, which skipped any
                # later entries when the script was started in the middle of the day.)
                index = addCount(index, total_count)
                if index == 0:
                    # All of today's tasks are done; sleep until tomorrow
                    curTime = datetime.now()
                    tmptime = curTime.replace(hour=23, minute=59, second=59, microsecond=0)
                    skipSeconds = (tmptime - curTime).total_seconds()
                    # print("%d seconds until tomorrow" % skipSeconds)
                    time.sleep(skipSeconds + 1)
                continue

            # print("skipSeconds = %d" % skipSeconds)
            time.sleep(skipSeconds)
            index = addCount(index, total_count)
            # Time's up, get to work. weekday() is 0-6 for Monday through Sunday
            today = datetime.now().weekday()
            if today == 5 or today == 6:
                # No work on weekends
                pass
            else:
                # Weekend's over, back to work
                readhub.getnews()
        except Exception as e:
            print(e)


if __name__ == '__main__':
    run()
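The loop above walks the schedule by index and sleeps until midnight once the day's entries are used up. Another way to get the same effect, shown here only as a sketch, is to compute the wait until the nearest upcoming entry directly and roll over to tomorrow when needed. `seconds_until_next` is a hypothetical helper that is not part of the original script. It assumes entries use the same `hour`/`minute`/`second` keys as `schedulelist`, and it ignores daylight saving time.

```python
from datetime import datetime, timedelta


def seconds_until_next(schedule, now):
    """Return the seconds from `now` until the earliest upcoming schedule entry."""
    candidates = []
    for item in schedule:
        target = now.replace(hour=item["hour"], minute=item["minute"],
                             second=item["second"], microsecond=0)
        if target <= now:
            # Already passed today, so the next occurrence is tomorrow.
            target += timedelta(days=1)
        candidates.append(target)
    return (min(candidates) - now).total_seconds()


schedule = [
    {"hour": 9, "minute": 30, "second": 0},
    {"hour": 23, "minute": 59, "second": 30},
]
now = datetime(2024, 1, 1, 10, 0, 0)
print(seconds_until_next(schedule, now))
```

With this approach the loop body reduces to sleeping for the returned number of seconds and then running the job, and the schedule list no longer needs to be kept in order.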

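This is only very basic scraping: there is no need to even fake request headers, let alone deal with User-Agent checks, IP restrictions, and the like.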
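Should a site start checking request headers, a browser-like User-Agent can be supplied through the `headers` parameter that `readhubRequest` already accepts. The sketch below is only an illustration under assumptions: `browser_headers` is a hypothetical helper, and the User-Agent string is just an example value. It uses the standard library's `urllib.request.Request` purely to show the headers being attached, without sending anything over the network.

```python
import urllib.request


def browser_headers(user_agent=None):
    """Build a headers dict resembling what a browser would send."""
    return {
        # Example value only; any realistic browser User-Agent string would do.
        "User-Agent": user_agent or "Mozilla/5.0 (X11; Linux x86_64)",
        "Accept": "application/json",
    }


headers = browser_headers()
# Constructing the request object does not touch the network.
req = urllib.request.Request("https://api.readhub.me/topic", headers=headers)
# urllib stores header names with only the first letter capitalized.
print(req.get_header("User-agent"))
```

The same dict could be passed straight to the crawler as `readhubRequest(url, params, headers, 'GET')`.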
