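Last time I ran into the following problems:
The site only allows browsing up to page 19 (from page 20 onward it returns a 403 error)
The online MySQL database has a connection limit (500 per hour)
So this time I reworked the whole program
What was updated and handled in this version:
Changed the crawling method to look up articles directly by article number
Added more collected fields (article number, brand, main category, subcategory, score)
Switched the database to a local SQLite database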
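Below is the program architecture for this version:
There are five functions in total:
1. menu()
Prints the main menu
2. printpages()
Asks the user for the article numbers to look up and runs the crawler
3. search_category
Finds the article's main category, category, and subcategory
4. allin
The crawler (extracts 10 data fields)
5. connect_sqlite
Checks for duplicate data and adds a new record if there is none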
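1. Main program and main menu
Main menu:
Prints the menu shown to the user
Exception handling is done here
If the user enters something that is not a number (for example, a string), or a number that is not one of the options, the program prints "請輸入選項的代號" ("Please enter an option number")
Main program:
Imports every module the functions need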
# Main menu
def menu():
    print("========urcosme爬蟲========")
    print()
    print("請輸入選項代號")
    print("1.輸入文章編號")
    print("2.結束")
    print()
    print()
    print("========urcosme爬蟲========")

# Main program
from urllib import request                # import modules
from bs4 import BeautifulSoup
from urllib.parse import urlparse
from urllib.request import urlopen
from urllib.error import HTTPError
import re
import sqlite3

# The functions shown in the following sections must be defined before this loop runs
while True:
    menu()
    try:
        choice = int(input("請輸入你的選擇:"))   # raises ValueError on non-numeric input
        if choice == 2:
            break
        elif choice == 1:
            printpages()
        else:
            print("請輸入選項的代號")
    except Exception:
        # Exception instead of a bare except, so Ctrl+C can still stop the program
        print("請輸入選項的代號")
    x = input("請按Enter鍵回主選單")
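Things to watch out for:
Because the input is converted with int() right away,
a non-numeric entry raises an error immediately and never reaches the else branch,
so try/except is needed to handle it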
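The point above can be shown in isolation. This is a minimal sketch, not the program's own code: parse_choice is a hypothetical helper name, and it catches only ValueError, which is what int() raises for text it cannot convert.

```python
def parse_choice(raw):
    """Return the menu choice as an int, or None if raw is not a whole number."""
    try:
        return int(raw)
    except ValueError:
        # int("abc"), int("1.5") and int("") all raise ValueError
        return None


# Try a few inputs a user might type at the menu prompt
for raw in ["1", "2", "abc", "1.5", ""]:
    print(repr(raw), "->", parse_choice(raw))
```

Returning None lets the caller decide what message to print, instead of mixing parsing and printing in one place.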
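2. The user enters the article numbers to crawl, and the program loops over them
Called from the main loop when the user picks option 1
Asks the user for a starting and an ending article number
Shows the article number currently being crawled
Then calls allin() to run the crawler for each number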
# Choose the article numbers
def printpages():
    global page
    try:
        pages1 = int(input("請輸入要爬取的起始文章編號"))
        pages2 = int(input("請輸入要爬取的結尾文章編號"))
    except ValueError:
        print("請輸入數值")
        return   # without this, pages1/pages2 would be undefined below
    if 1 <= pages1 <= 1109000 and pages2 >= pages1:
        for page in range(pages1, pages2 + 1):
            print("目前正在爬編號:", page)
            try:
                allin('https://www.urcosme.com/reviews/{}'.format(page))
            except HTTPError as e:
                print("該網址沒有文章")
    else:
        print("請輸入介於 1 ~ 1109000 之間")
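3. Main category, category, and subcategory
Called from allin()
This was a big job
The site's entire category tree is typed in so that the data can be classified
It uses a fairly crude approach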
# Main category (mcate), category (cate), subcategory (catt)
# The name is spelled search_categpry here and at its call site in allin()
def search_categpry():
    global mcate
    global cate
    global catt
    # Subcategory -> category
    # If catt matches no branch, cate and mcate keep whatever value they had before
    if catt in ('洗面皂', '洗面乳', '洗顏粉', '洗顏慕斯', '其它洗顏'):
        cate = '洗臉'
    elif catt in ('卸妝乳', '卸妝油', '卸妝露', '卸妝水', '卸妝霜', '眼唇卸妝', '卸妝棉', '其它卸妝'):
        cate = '卸妝'
    elif catt == '乳液':
        cate = '乳液'
    elif catt == '乳霜':
        cate = '乳霜'
    elif catt == '凝霜':
        cate = '凝霜'
    elif catt == '凝膠':
        cate = '凝膠'
    elif catt == '化妝水':
        cate = '化妝水'
    elif catt in ('導入液', '前導精華', '其它前導'):
        cate = '前導'
    elif catt in ('精華液', '精華油', '安瓶', '其它精華'):
        cate = '精華'
    elif catt in ('保養面膜', '清潔面膜'):
        cate = '面膜'
    elif catt == '多功能保養':
        cate = '多功能保養'
    elif catt in ('臉部去角質', '唇部去角質', '其它去角質'):
        cate = '去角質'
    elif catt in ('眼霜', '眼膜', '眼部精華', '睫毛液', '其它眼部保養'):
        cate = '眼睫保養'
    elif catt in ('護唇膏', '護唇精華', '唇膜', '其它唇部保養'):
        cate = '唇部保養'
    elif catt == '臉部防曬':
        cate = '臉部防曬'
    elif catt == '身體防曬':
        cate = '身體防曬'
    elif catt in ('隔離霜', '眼部打底', '其它妝前'):
        cate = '妝前'
    elif catt in ('遮瑕膏', '遮瑕筆', '眼部遮瑕', '其它遮瑕'):
        cate = '遮瑕'
    elif catt in ('粉底液', '粉餅', '粉霜', '氣墊粉餅', 'BB霜', 'CC霜', '其它粉底'):
        cate = '粉底'
    elif catt in ('蜜粉', '蜜粉餅', '其它定妝'):
        cate = '定妝'
    elif catt in ('眉筆', '眉粉', '染眉膏', '其它眉彩'):
        cate = '眉彩'
    elif catt in ('眼線筆', '眼線液', '眼線膠', '其它眼線'):
        cate = '眼線'
    elif catt in ('眼影盤', '眼影膏', '眼影筆', '眼影蜜', '其它眼影'):
        cate = '眼影'
    elif catt in ('睫毛膏', '睫毛底膏', '睫毛定型'):
        cate = '睫毛'
    elif catt in ('腮紅', '腮紅霜', '腮紅蜜', '氣墊腮紅', '其它腮紅'):
        cate = '頰彩'
    elif catt in ('修容棒', '修容餅', '其它修容'):
        cate = '修容'
    elif catt in ('唇膏', '唇筆', '唇線筆', '唇蜜', '唇釉', '唇露', '其它唇彩'):
        cate = '唇彩'
    elif catt in ('指甲油', '基底油', '護甲油', '去光水', '其它美甲工具'):
        cate = '美甲'
    elif catt == '多功能彩妝':
        cate = '多功能彩妝'
    elif catt in ('身體乳液', '身體乳霜', '身體按摩油', '身體去角質', '其它美體保養'):
        cate = '美體保養'
    elif catt in ('護手霜', '指緣油', '手膜', '其它手部保養'):
        cate = '手部保養'
    elif catt in ('足膜', '足部舒緩', '其它腿足保養'):
        cate = '腿足保養'
    elif catt in ('美胸霜', '其它保養'):
        cate = '其它部位保養'
    elif catt in ('私密保養', '私密清潔'):
        cate = '私密護理'
    elif catt in ('沐浴乳', '沐浴露', '肥皂', '入浴劑', '其它沐浴清潔'):
        cate = '沐浴清潔'
    elif catt in ('止汗膏', '爽身噴霧', '其它爽身制汗'):
        cate = '爽身制汗'
    elif catt in ('美白牙膏', '其它牙齒保養'):
        cate = '牙齒保養'
    elif catt in ('洗髮乳', '乾洗髮', '其它洗髮'):
        cate = '洗髮'
    elif catt in ('潤髮乳', '其它潤髮'):
        cate = '潤髮'
    elif catt in ('護髮乳', '護髮霜', '髮膜', '護髮油', '護髮素', '其它護髮'):
        cate = '護髮'
    elif catt == '頭皮護理':
        cate = '頭皮護理'
    elif catt in ('染髮劑', '泡泡染', '其它染髮'):
        cate = '染髮'
    elif catt in ('定型噴霧', '髮蠟', '髮膠', '髮乳', '慕斯', '髮妝水', '其它頭髮造型'):
        cate = '頭髮造型'
    elif catt in ('香水', '淡香水', '香精', '淡香精'):
        cate = '香水香精'
    elif catt == '其它香水香氛':
        cate = '其它香水香氛'
    elif catt in ('化妝棉', '洗臉工具', '臉部按摩', '其它臉部保養工具'):
        cate = '臉部保養工具'
    elif catt in ('刷具', '睫毛夾', '海綿粉撲', '假睫毛', '用具清潔', '其它彩妝工具'):
        cate = '彩妝工具'
    elif catt in ('沐浴工具', '身體按摩', '其它身體保養工具'):
        cate = '身體保養工具'
    elif catt in ('梳子', '洗髮工具', '頭皮按摩', '其它美髮工具'):
        cate = '美髮工具'
    elif catt in ('洗臉機', '吹風機', '其它美容家電'):
        cate = '美容家電'
    # Category -> main category
    if cate in ('洗臉', '卸妝', '化妝水', '乳液', '乳霜', '凝霜', '凝膠', '前導', '精華', '面膜', '多功能保養'):
        mcate = '基礎保養'
    elif cate in ('去角質', '眼睫保養', '唇部保養', '進階保養'):
        mcate = '進階護膚'
    elif cate in ('臉部防曬', '身體防曬'):
        mcate = '防曬'
    elif cate in ('妝前', '遮瑕', '粉底', '定妝'):
        mcate = '底妝'
    elif cate in ('眉彩', '眼線', '眼影', '睫毛', '頰彩', '修容', '唇彩', '美甲', '多功能彩妝'):
        mcate = '彩妝'
    elif cate in ('美體保養', '手部保養', '腿足保養', '其它部位保養', '私密護理', '沐浴清潔', '爽身制汗', '牙齒保養'):
        mcate = '身體保養'
    elif cate in ('洗髮', '潤髮', '護髮', '頭皮護理', '染髮', '頭髮造型'):
        mcate = '美髮'
    elif cate in ('香水香精', '其它香水香氛'):
        mcate = '香水香氛'
    elif cate in ('臉部保養工具', '彩妝工具', '身體保養工具', '美髮工具', '美容家電'):
        mcate = '美容工具'
    print("類別:", mcate)
    print("分類:", cate)
Possible optimization:
Crawl the site's category menu
to replace the hand-typed mapping above
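Whether the category tree is typed by hand or crawled later, the lookup itself could be driven by data rather than by branches. The following is only a sketch under that assumption: CATEGORY_TREE, build_lookup and find_category are hypothetical names, and the tree holds just a small excerpt of the mapping used in this post.

```python
# Excerpt of the category tree: {main category: {category: [subcategories]}}
# (hypothetical structure; a crawler could fill it in instead of typing it by hand)
CATEGORY_TREE = {
    "基礎保養": {
        "洗臉": ["洗面皂", "洗面乳", "洗顏粉", "洗顏慕斯", "其它洗顏"],
        "面膜": ["保養面膜", "清潔面膜"],
    },
    "防曬": {
        "臉部防曬": ["臉部防曬"],
        "身體防曬": ["身體防曬"],
    },
}


def build_lookup(tree):
    """Flatten the tree into {subcategory: (main category, category)}."""
    lookup = {}
    for main_category, categories in tree.items():
        for category, subcategories in categories.items():
            for subcategory in subcategories:
                lookup[subcategory] = (main_category, category)
    return lookup


LOOKUP = build_lookup(CATEGORY_TREE)


def find_category(subcategory):
    """Return (main category, category), or None for an unknown subcategory."""
    return LOOKUP.get(subcategory)


print(find_category("洗面乳"))
print(find_category("清潔面膜"))
print(find_category("不存在的分類"))
```

An unknown subcategory returns None here, so the caller can skip the article explicitly rather than reuse values left over from the previous one.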
4. The crawler
Extracts 10 data fields
After extracting the subcategory, it calls search_category() to find the main category and the category
# The crawler
def allin(linkkk):
    global mcate
    global cate
    global catt
    global bbb
    global ttt
    global imgg
    global ct
    global ci
    global sc
    try:
        url = linkkk   # the URL to crawl
        user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 6.1; zh-CN; rv:1.9.2.15) Gecko/20110303 Firefox/3.6.15'
        headers = {'User-Agent': user_agent}
        data_res = request.Request(url=url, headers=headers)
        data = request.urlopen(data_res)
        data = data.read().decode('utf-8')
        sp = BeautifulSoup(data, "lxml")
        # Title (the 4th itemprop="name" span; stored as the product name)
        title = sp.findAll("span", {"itemprop": "name"})
        tt = []
        for t in title:
            tt.append(t)
        try:
            ttt = tt[3].text
        except:
            print("此心得文網址有誤")
            return
        print('標籤:', ttt)
        # Brand (the 3rd itemprop="name" span)
        brand = sp.findAll("span", {"itemprop": "name"})
        bb = []
        for b in brand:
            bb.append(b)
        bbb = bb[2].text
        print(bbb)
        # Subcategory
        try:
            categorys = sp.find("div", {"class": "uc-container uc-search-final-sidebar-ranking"}).findAll("div", {"class": "uc-container-title"})
            for category in categorys:
                # Note: str.strip removes any of these characters from both ends, not just the suffix
                catt = category.text.strip('排行榜')
                print(catt)
        except:
            print("特殊分類不納入資料庫")
            return
        # Main category and category
        search_categpry()
        # Main image (raw strings avoid the invalid "\/" escape)
        print('主圖片連結:')
        img1 = sp.find("div", {"class": "main-image"}).findAll("img", src=re.compile(r"/review_image/"))
        img2 = sp.find("div", {"class": "main-image"}).findAll("img", src=re.compile(r"/product_image/"))
        for img in img1:
            imgg = img['src']
            print(img['src'])
        for img in img2:
            imgg = img['src']
            print(img['src'])
        if img1 == [] and img2 == []:
            imgg = '無圖片'
            print('無圖片')
        # Review text
        print('內文:')
        contents = sp.findAll("div", {"class": "review-content"})
        for content in contents:
            # Remove the same emoji from the stored text as from the printed text
            ct = content.text.replace("✨", "").replace("😂", "").replace("👍🏻", "")
            print(ct)
        # Images inside the review text
        print('內文圖片:')
        c_img1 = sp.find("div", {"class": "review-content"}).findAll("img", src=re.compile(r"/review"))
        cc_img = []
        if c_img1 != []:   # the pattern also matches /review_image/, so check that something was found first
            for c_img in c_img1:
                cc_img.append(c_img['src'])
            # Convert the list to str, then replace the extra symbols with spaces
            ci = str(cc_img[:]).replace("[", " ").replace("]", " ").replace("'", " ").replace(", ", " ")
            print(ci)
        else:
            ci = '無圖片'
            print(ci)
        # Score (findAll returns an empty list rather than raising, so set the default first)
        sc = "無評分"
        scores = sp.findAll("div", {"class": "review-score"})
        for score in scores:
            sc = score.text
        print("分數:", sc)
        connect_sqlite()
    except HTTPError:
        print("該網址無效")
Things to watch out for:
If the page returns a 404 error, an exception is raised
Catch it with except HTTPError so the program does not stop
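The 404 handling can be tried without touching the network. This is a rough sketch, not the crawler itself: fetch_html is a hypothetical helper, and fake_missing and fake_ok are stand-in openers invented for the demo. Only urllib's real Request, urlopen and HTTPError are used.

```python
import io
from urllib import request
from urllib.error import HTTPError


def fetch_html(url, opener=request.urlopen):
    """Return the page body as text, or None if the server answers with an HTTP error."""
    req = request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    try:
        with opener(req) as resp:
            return resp.read().decode("utf-8")
    except HTTPError as e:
        # e.code holds the status, e.g. 404 for a missing article or 403 for a blocked page
        print("HTTP error", e.code, "for", url)
        return None


# Offline demo: stand-in openers so the example runs without network access
def fake_missing(req):
    raise HTTPError(req.full_url, 404, "Not Found", None, io.BytesIO(b""))


def fake_ok(req):
    return io.BytesIO("<html>ok</html>".encode("utf-8"))


print(fetch_html("https://www.urcosme.com/reviews/1", opener=fake_missing))
print(fetch_html("https://www.urcosme.com/reviews/1", opener=fake_ok))
```

Passing the opener as a parameter is only a convenience for testing; in the real crawler the default urlopen would be used.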
5. Connecting to the SQLite database
Called from allin()
# Connect to the SQLite database
def connect_sqlite():
    global mcate
    global cate
    global catt
    global bbb
    global ttt
    global imgg
    global ct
    global ci
    global sc
    global page
    db = sqlite3.connect('urcosme心得.sqlite')
    cursor = db.cursor()
    try:
        # "?" placeholders let sqlite3 quote the values, so quotes in the text do not break the SQL
        # str(page) keeps the article number in the same text form as the earlier '%s' version
        cursor.execute("SELECT * FROM `urcosme心得` WHERE 文章編號 = ?", (str(page),))
        results = cursor.fetchall()
        if len(results) == 0:
            try:
                cursor.execute('INSERT INTO `urcosme心得` (`文章編號`,`類別`,`分類`,`子分類`,`品牌`,`商品名稱`, `主圖片`, `內文`, `內文圖片`, `評分`) VALUES (?,?,?,?,?,?,?,?,?,?)', (str(page), mcate, cate, catt, bbb, ttt, imgg, ct, ci, sc))
                print("成功儲存新文章")
            except Exception:
                print("此文章有特殊字元")
        else:
            print('已有重複文章')
    except Exception:
        print("此文章有例外情況無法存入")
    finally:
        # Commit and close exactly once, whichever branch ran
        db.commit()
        db.close()
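Things to watch out for:
Last time, the main image was used to detect duplicate articles
But some articles have no image
So this time the article number is used to avoid duplicates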
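As a possible alternative to the SELECT-then-INSERT check, the database itself could enforce uniqueness on the article number. This is a sketch, not the program's code: the reviews table, its two columns and save_review are hypothetical, and it uses an in-memory database so it can run anywhere.

```python
import sqlite3


def save_review(db, article_id, content):
    """Insert a review unless its article number already exists; return True if inserted."""
    cur = db.execute(
        # OR IGNORE skips the row when the primary key is already present
        "INSERT OR IGNORE INTO reviews (article_id, content) VALUES (?, ?)",
        (article_id, content),
    )
    db.commit()
    return cur.rowcount == 1


# Demo with an in-memory database (hypothetical table and column names)
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE reviews (article_id INTEGER PRIMARY KEY, content TEXT)")

print(save_review(db, 1001, 'text with "quotes" and \'apostrophes\''))
print(save_review(db, 1001, "same article number again"))
```

Because the values are passed as parameters, text containing quotes is stored as-is, and the primary key makes a second insert with the same article number a no-op.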
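Known optimization targets so far:
1. Get the categories by crawling instead of typing them in
What to do next:
1. Keep optimizing
2. Keep testing and finding bugs
3. Write an app that reads the contents of the database and displays them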