Last time a number of problems turned up, so this round adds quite a few optimizations.
A new function, use_category(), has been added.
Below is this version's program flow chart:
There are 8 functions this time:
1.Main program and main menu: menu()
2.Category menu: showall()
3.User input of the pages and category to crawl: howmanypages()
4.Category selection: choose_category()
5.Looping over category and pages: use_category()
6.Crawl every review URL from the chosen category's pages: pageurl()
7.Run the crawler (given a review URL): allin()
8.Connect to the database and upload: connect_mysql()
1.Main program and main menu
Main menu:
Prints the menu shown to the user.
Exception handling is done here:
if the user enters something non-numeric, or a string, the program prints "請輸入選項的代號".
Main program:
Imports all the modules the functions need,
and sets up the global variables that will be used.
※Global variable: a variable shared by every function.
# _*_ coding: utf-8 _*_
from urllib import request              # import modules
from bs4 import BeautifulSoup
from urllib.parse import urlparse
from urllib.request import urlopen
import pymysql
from urllib.error import HTTPError
import re

# Module-level variables shared by all functions.
# (A bare `global` statement at module scope has no effect; each function
# that assigns to these declares `global` inside its own body instead.)
ttt = imgg = ct = ci = None
category = pages1 = pages2 = page = number = None
while True:
    menu()
    try:
        choice = int(input("請輸入你的選擇:"))
        if choice == 2:
            break
        elif choice == 1:
            showall()
        else:
            print("請輸入選項的代號")
    except ValueError:
        print("請輸入選項的代號")
    x = input("請按Enter鍵回主選單")
Things to note:
because the input is converted with int(choice) right away,
a non-numeric input raises ValueError before the if/else is ever reached,
so the conversion has to be wrapped in try/except.
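The validation pattern above can be isolated into a tiny sketch. parse_choice is a hypothetical helper, not part of the original program; it shows how int() raises ValueError for non-numeric input and how try/except turns that into a value you can branch on:

```python
# Minimal sketch of the menu's input validation: int() raises ValueError
# for non-numeric input, and try/except converts the crash into a signal.
def parse_choice(raw):
    try:
        return int(raw)      # raises ValueError for non-numeric input
    except ValueError:
        return None          # signal "invalid" instead of crashing

print(parse_choice("2"))     # 2
print(parse_choice("abc"))   # None
```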
2.Category menu
Called from menu().
First prints the menu shown to the user,
together with usage instructions,
and finally calls howmanypages().
def showall():
    print("========分類選單========")
    print("a.基礎保養:1.洗臉 2.卸妝 3.化妝水 4.乳液 5.乳霜 6.凝霜")
    print("           7.凝膠 8.前導 9.精華 10.面膜 11.多功能")
    print("b.防曬 :12.臉部防曬 13.身體防曬")
    print()
    print("c.底妝 :14.妝前 15.遮瑕 16.粉底 17.定妝")
    print()
    print("d.彩妝 :18.眉彩 19.眼線 20.眼影 21.睫毛 22.頰彩 23.修容")
    print("        24.唇彩 25.美甲 26.多功能彩妝")
    print("all.全部")
    print()
    print(" 說明:請先輸入要爬取的分類代號,再輸入要爬取的初始與結尾頁數")
    print(" ,若選擇大類別,則該類別下所有分類皆會爬取同樣多的頁數")
    print()
    print()
    howmanypages()
3.User input of the pages and category to crawl
Called from showall().
Lets the user enter the category to crawl,
then the start and end page numbers.
All three values are stored as global variables so the other functions can use them.
The start and end pages are converted with int() so they can be passed to range().
Loops over the pages, calling choose_category() to pick the category.
def howmanypages():
    global category              # use global variables
    global pages1
    global pages2
    global page
    global number
    number = input("請輸入要爬取的分類")
    pages1 = input("請輸入要爬取的初始頁數")
    pages2 = input("請輸入要爬取的結尾頁數")
    pages3 = int(pages1)
    pages4 = int(pages2)
    for page in range(pages3, pages4 + 1):
        choose_category(number)
Things to note:
input() always returns a string, not a number,
so the page values have to be converted with int()
before they can be passed into range().
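A standalone sketch of the conversion above (the two string literals stand in for what input() would return):

```python
# input() yields strings, and range() only accepts integers,
# so the page bounds must be converted before building the loop.
pages1, pages2 = "2", "4"                       # what input() would return
pages = list(range(int(pages1), int(pages2) + 1))
print(pages)                                    # [2, 3, 4]
```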
4.Category selection
I haven't come up with a better approach yet, so this can probably be optimized further later.
First declare category as a global variable so the other functions can use it.
number is the argument passed in from howmanypages(), i.e. what the user typed when choosing a category.
Depending on number, category is set to the site URL for the corresponding category.
Then use_category() is called to combine the chosen category with the page loop.
def choose_category(number):
    global category              # declare global variable
    if number == '1':
        c_number = 11
        category = 'http://www.urcosme.com/tags/{}/reviews?page='.format(c_number)
        use_category()
    elif number == '2':
        c_number = 12
        category = 'http://www.urcosme.com/tags/{}/reviews?page='.format(c_number)
        use_category()
    elif number == '3':
        c_number = 13
        category = 'http://www.urcosme.com/tags/{}/reviews?page='.format(c_number)
        use_category()
    elif number == '4':
        c_number = 14
        category = 'http://www.urcosme.com/tags/{}/reviews?page='.format(c_number)
        use_category()
    elif number == '5':
        c_number = 15
        category = 'http://www.urcosme.com/tags/{}/reviews?page='.format(c_number)
        use_category()
    elif number == '6':
        c_number = 16
        category = 'http://www.urcosme.com/tags/{}/reviews?page='.format(c_number)
        use_category()
    elif number == '7':
        c_number = 17
        category = 'http://www.urcosme.com/tags/{}/reviews?page='.format(c_number)
        use_category()
    elif number == '8':
        c_number = 18
        category = 'http://www.urcosme.com/tags/{}/reviews?page='.format(c_number)
        use_category()
    elif number == '9':
        c_number = 19
        category = 'http://www.urcosme.com/tags/{}/reviews?page='.format(c_number)
        use_category()
    elif number == '10':
        c_number = 20
        category = 'http://www.urcosme.com/tags/{}/reviews?page='.format(c_number)
        use_category()
    elif number == '11':
        c_number = 21
        category = 'http://www.urcosme.com/tags/{}/reviews?page='.format(c_number)
        use_category()
    elif number == '12':
        c_number = 26
        category = 'http://www.urcosme.com/tags/{}/reviews?page='.format(c_number)
        use_category()
    elif number == '13':
        c_number = 27
        category = 'http://www.urcosme.com/tags/{}/reviews?page='.format(c_number)
        use_category()
    elif number == '14':
        c_number = 28
        category = 'http://www.urcosme.com/tags/{}/reviews?page='.format(c_number)
        use_category()
    elif number == '15':
        c_number = 29
        category = 'http://www.urcosme.com/tags/{}/reviews?page='.format(c_number)
        use_category()
    elif number == '16':
        c_number = 30
        category = 'http://www.urcosme.com/tags/{}/reviews?page='.format(c_number)
        use_category()
    elif number == '17':
        c_number = 31
        category = 'http://www.urcosme.com/tags/{}/reviews?page='.format(c_number)
        use_category()
    elif number == '18':
        c_number = 32
        category = 'http://www.urcosme.com/tags/{}/reviews?page='.format(c_number)
        use_category()
    elif number == '19':
        c_number = 33
        category = 'http://www.urcosme.com/tags/{}/reviews?page='.format(c_number)
        use_category()
    elif number == '20':
        c_number = 34
        category = 'http://www.urcosme.com/tags/{}/reviews?page='.format(c_number)
        use_category()
    elif number == '21':
        c_number = 35
        category = 'http://www.urcosme.com/tags/{}/reviews?page='.format(c_number)
        use_category()
    elif number == '22':
        c_number = 36
        category = 'http://www.urcosme.com/tags/{}/reviews?page='.format(c_number)
        use_category()
    elif number == '23':
        c_number = 37
        category = 'http://www.urcosme.com/tags/{}/reviews?page='.format(c_number)
        use_category()
    elif number == '24':
        c_number = 38
        category = 'http://www.urcosme.com/tags/{}/reviews?page='.format(c_number)
        use_category()
    elif number == '25':
        c_number = 39
        category = 'http://www.urcosme.com/tags/{}/reviews?page='.format(c_number)
        use_category()
    elif number == '26':
        c_number = 40
        category = 'http://www.urcosme.com/tags/{}/reviews?page='.format(c_number)
        use_category()
    elif number == 'a':
        c_numbers = ['11','12','13','14','15','16','17','18','19','20','21']
        for c_number in c_numbers:
            category = 'http://www.urcosme.com/tags/{}/reviews?page='.format(c_number)
            use_category()
    elif number == 'b':
        c_numbers = ['26','27']
        for c_number in c_numbers:
            category = 'http://www.urcosme.com/tags/{}/reviews?page='.format(c_number)
            use_category()
    elif number == 'c':
        c_numbers = ['28','29','30','31']
        for c_number in c_numbers:
            category = 'http://www.urcosme.com/tags/{}/reviews?page='.format(c_number)
            use_category()
    elif number == 'd':
        c_numbers = ['32','33','34','35','36','37','38','39','40']
        for c_number in c_numbers:
            category = 'http://www.urcosme.com/tags/{}/reviews?page='.format(c_number)
            use_category()
    elif number == 'all':
        c_numbers = ['11','12','13','14','15','16','17','18','19','20','21','26','27','28','29','30','31','32','33','34','35','36','37','38','39','40']
        for c_number in c_numbers:
            category = 'http://www.urcosme.com/tags/{}/reviews?page='.format(c_number)
            use_category()
Things to note:
if a large category (one containing several sub-categories) is chosen,
the sub-categories are iterated with a for loop.
For example, category b contains 12 and 13,
so entering b first runs through all results of category 12, then all results of category 13.
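Since the long if/elif chain only maps the user's choice to a tag number, one possible way to shrink it is a lookup table. This is a hedged sketch only; TAGS, GROUPS, and build_urls are hypothetical names, not part of the original program:

```python
# Map each menu choice '1'..'26' to its site tag number (11..21, 26..40),
# and map the letter choices to whole groups of menu choices.
TAGS = {str(i): t
        for i, t in enumerate(list(range(11, 22)) + list(range(26, 41)), start=1)}
GROUPS = {
    'a': [str(i) for i in range(1, 12)],    # 基礎保養: choices 1-11
    'b': ['12', '13'],                      # 防曬
    'c': ['14', '15', '16', '17'],          # 底妝
    'd': [str(i) for i in range(18, 27)],   # 彩妝: choices 18-26
}
GROUPS['all'] = [str(i) for i in range(1, 27)]

def build_urls(number):
    # A letter expands to its group; a digit stands alone.
    choices = GROUPS.get(number, [number])
    return ['http://www.urcosme.com/tags/{}/reviews?page='.format(TAGS[c])
            for c in choices]

print(build_urls('12'))   # ['http://www.urcosme.com/tags/26/reviews?page=']
```

Each URL returned by build_urls could then be assigned to category before calling use_category(), replacing all 31 branches.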
5.Looping over category and pages
Called from choose_category().
Combines category with the page loop
to form the full URL for the "chosen category" + "requested page",
then calls pageurl()
to crawl that URL for review URLs.
def use_category():
    global category              # use global variables
    global pages1
    global pages2
    global page
    global number
    final = category + str(page)
    print("爬取的分類:", number)
    print("從第", pages1, "頁至第", pages2, "頁")
    print("目前爬取到第", page, "頁")
    pageurl(final)
Things to note:
final is built by string concatenation, so page has to be converted with str() first.
6.Crawl all review URLs on the page
Called from use_category().
Takes the URL it is given,
collects every review URL found on that page,
puts them into a list,
then loops over the list, calling allin() on each review URL to crawl it.
One piece of exception handling: if a collected review URL looks wrong,
the program prints "此心得文網址有誤".
def pageurl(cc):
    url = cc
    user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 6.1; zh-CN; rv:1.9.2.15) Gecko/20110303 Firefox/3.6.15'
    headers = {'User-Agent': user_agent}
    data_res = request.Request(url=url, headers=headers)
    data = request.urlopen(data_res)
    data = data.read().decode('utf-8')
    sp = BeautifulSoup(data, "lxml")             # parse the page with BeautifulSoup
    domain = "http://www.urcosme.com"            # the hrefs are relative, so prepend the domain ourselves
    all_links = sp.find_all('a')                 # find every <a> tag
    morelink = []
    for link in all_links:                       # read the href attribute of each <a>
        href = link.get('href')
        if href != None and href.startswith('/reviews/'):   # keep hrefs that start with /reviews/
            more = domain + href
            morelink.append(more)
    mls = morelink
    for ml in mls:
        if ml.startswith('http://www.urcosme.com/reviews/'):
            allin(ml)
        else:
            print("此心得文網址有誤")
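The domain-prepending step above can also be done with the standard library's urljoin, which handles relative hrefs correctly even when a href already carries a domain. A small self-contained sketch (the hrefs list stands in for what the page would yield):

```python
# Joining relative review links onto the site domain with urljoin,
# filtering out None hrefs and non-review links as pageurl() does.
from urllib.parse import urljoin

domain = "http://www.urcosme.com"
hrefs = ['/reviews/123', '/brands/9', None, '/reviews/456']
reviews = [urljoin(domain, h) for h in hrefs
           if h is not None and h.startswith('/reviews/')]
print(reviews)
# ['http://www.urcosme.com/reviews/123', 'http://www.urcosme.com/reviews/456']
```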
7.Running the crawler
Declares the global variables it needs;
after scraping the data it calls connect_mysql() to connect to the database.
Testing this time surfaced a problem:
some of the site's review URLs redirect straight back to the site's front page.
The review URL itself is valid, so this case cannot be filtered out earlier in pageurl(),
and it makes the crawler scrape nothing and crash.
Exception handling was therefore added at the "tags" step:
if no tags can be scraped, the program prints "此心得文網址有誤"
and uses return to exit the function.
def allin(linkkk):
    global ttt
    global imgg
    global ct
    global ci
    url = linkkk                 # the review URL to scrape
    user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 6.1; zh-CN; rv:1.9.2.15) Gecko/20110303 Firefox/3.6.15'
    headers = {'User-Agent': user_agent}
    data_res = request.Request(url=url, headers=headers)
    data = request.urlopen(data_res)
    data = data.read().decode('utf-8')
    sp = BeautifulSoup(data, "lxml")
    # tags
    title = sp.findAll("span", {"itemprop": "name"})
    tt = []
    for t in title:
        tt.append(t)
    try:
        ttt = tt[4].text
    except IndexError:           # the page redirected to the front page: no tags
        print("此心得文網址有誤")
        return
    print('標籤:', ttt)
    # main image
    print('主圖片連結:')
    img1 = sp.find("div", {"class": "main-image"}).findAll("img", src=re.compile(r"/review_image/"))
    img2 = sp.find("div", {"class": "main-image"}).findAll("img", src=re.compile(r"/product_image/"))
    for img in img1:
        imgg = img['src']
        print(img['src'])
    for img in img2:             # bug fix: this loop previously tested img1 instead of img2
        imgg = img['src']
        print(img['src'])
    if img1 == [] and img2 == []:
        imgg = '無圖片'
        print('無圖片')
    # review body
    print('內文:')
    contents = sp.findAll("div", {"class": "review-content"})
    emoji = ["😊", "🔺", "😘", "😌", "😍", "😳", "😅", "😆", "✨", "😂", "👍🏻"]
    for content in contents:
        ct = content.text
        for e in emoji:          # strip emoji the database cannot store
            ct = ct.replace(e, "")
        print(ct)
    # images inside the review body
    print('內文圖片:')
    c_img1 = sp.find("div", {"class": "review-content"}).findAll("img", src=re.compile(r"/review"))
    cc_img = []
    if c_img1 != []:             # the search may find nothing, so check first
        for c_img in c_img1:
            cc_img.append(c_img['src'])
        # convert to str, then replace the list punctuation with spaces
        print(str(cc_img[:]).replace("[", " ").replace("]", " ").replace("'", " ").replace(", ", " "))
    else:
        print('無圖片')
    ci = str(cc_img[:]).replace("[", " ").replace("]", " ").replace("'", " ").replace(", ", " ")
    connect_mysql()
Things to note:
return is how you exit a function early.
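Instead of maintaining a hand-picked emoji list, one broader option is a regex that drops every character outside the Basic Multilingual Plane, which covers most emoji. A hedged sketch (strip_emoji is a hypothetical helper; note that BMP symbols such as ✨ are not astral-plane characters, so they would still need an explicit list):

```python
# Strip astral-plane characters (U+10000 and above), which include most
# emoji like 😍 and 🔺. BMP symbols such as ✨ are NOT matched by this.
import re

NON_BMP = re.compile(r'[\U00010000-\U0010FFFF]')

def strip_emoji(text):
    return NON_BMP.sub('', text)

print(strip_emoji('好用😍推薦'))   # 好用推薦
```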
8.Connect to the database and upload
Declares the global variables so the content the crawler just collected can be used.
The destination database table is chosen according to category.
fetchall() is used to check whether the same main-image URL already exists, to avoid storing duplicates.
def connect_mysql():
    global category
    global ttt
    global imgg
    global ct
    global ci
    if category == 'http://www.urcosme.com/tags/11/reviews?page=':
        cate = '`洗臉`'
    elif category == 'http://www.urcosme.com/tags/12/reviews?page=':
        cate = '`卸妝`'
    elif category == 'http://www.urcosme.com/tags/13/reviews?page=':
        cate = '`化妝水`'
    elif category == 'http://www.urcosme.com/tags/14/reviews?page=':
        cate = '`乳液`'
    elif category == 'http://www.urcosme.com/tags/15/reviews?page=':
        cate = '`乳霜`'
    elif category == 'http://www.urcosme.com/tags/16/reviews?page=':
        cate = '`凝霜`'
    elif category == 'http://www.urcosme.com/tags/17/reviews?page=':
        cate = '`凝膠`'
    elif category == 'http://www.urcosme.com/tags/18/reviews?page=':
        cate = '`前導`'
    elif category == 'http://www.urcosme.com/tags/19/reviews?page=':
        cate = '`精華`'
    elif category == 'http://www.urcosme.com/tags/20/reviews?page=':
        cate = '`面膜`'
    elif category == 'http://www.urcosme.com/tags/21/reviews?page=':
        cate = '`多功能保養`'
    elif category == 'http://www.urcosme.com/tags/26/reviews?page=':
        cate = '`臉部防曬`'
    elif category == 'http://www.urcosme.com/tags/27/reviews?page=':
        cate = '`身體防曬`'
    elif category == 'http://www.urcosme.com/tags/28/reviews?page=':
        cate = '`妝前`'
    elif category == 'http://www.urcosme.com/tags/29/reviews?page=':
        cate = '`遮瑕`'
    elif category == 'http://www.urcosme.com/tags/30/reviews?page=':
        cate = '`粉底`'
    elif category == 'http://www.urcosme.com/tags/31/reviews?page=':
        cate = '`定裝`'
    elif category == 'http://www.urcosme.com/tags/32/reviews?page=':
        cate = '`眉彩`'
    elif category == 'http://www.urcosme.com/tags/33/reviews?page=':
        cate = '`眼線`'
    elif category == 'http://www.urcosme.com/tags/34/reviews?page=':
        cate = '`眼影`'
    elif category == 'http://www.urcosme.com/tags/35/reviews?page=':
        cate = '`睫毛`'
    elif category == 'http://www.urcosme.com/tags/36/reviews?page=':
        cate = '`頰彩`'
    elif category == 'http://www.urcosme.com/tags/37/reviews?page=':
        cate = '`修容`'
    elif category == 'http://www.urcosme.com/tags/38/reviews?page=':
        cate = '`唇彩`'
    elif category == 'http://www.urcosme.com/tags/39/reviews?page=':
        cate = '`美甲`'
    elif category == 'http://www.urcosme.com/tags/40/reviews?page=':
        cate = '`多功能彩妝`'
    db = pymysql.connect(
        host='xxxxxx',
        user='xxxxxx',
        passwd='xxxxxx',
        database='xxxxxx',
        charset='utf8',)
    cursor = db.cursor()
    sqlstr = "SELECT * FROM %s WHERE 主圖片 = '%s'" % (cate, imgg)
    try:
        cursor.execute(sqlstr)
        results = cursor.fetchall()
        if len(results) == 0:
            try:
                cursor.execute('INSERT INTO %s (`標籤名稱`, `主圖片`, `內文`, `內文圖片`)values("%s","%s","%s","%s")' % (cate, ttt, imgg, ct, ci))
                print("成功儲存新文章")
            except:
                print("此文章有特殊字元")
            db.commit()
            db.close()
        elif len(results) >= 1:
            print('已有重複文章')
            db.commit()
            db.close()
    except:
        print("此文章有例外情況無法存入")
        db.commit()
        db.close()
Things to note:
last time fetchall() matched on the review body, but special symbols made the comparison fail
and duplicate articles got stored.
Matching on the main-image URL instead has eliminated that problem.
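The "special characters break the INSERT" problem largely disappears if values are passed as query parameters instead of being formatted into the SQL string, because the driver escapes them itself. A hedged sketch using the stdlib sqlite3 driver ("?" placeholders); pymysql follows the same cursor.execute(sql, params) pattern with "%s" placeholders. Note that table names still cannot be parameterized, so a whitelist like cate remains necessary:

```python
# Parameterized queries: the driver escapes quotes and other special
# characters in the values, so they cannot break the SQL statement.
import sqlite3

db = sqlite3.connect(":memory:")
cur = db.cursor()
cur.execute("CREATE TABLE reviews (tag TEXT, body TEXT)")

body = 'quotes " and \' no longer break the INSERT'
cur.execute("INSERT INTO reviews (tag, body) VALUES (?, ?)", ("洗臉", body))

cur.execute("SELECT body FROM reviews WHERE tag = ?", ("洗臉",))
print(cur.fetchall()[0][0] == body)   # True
```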
Known optimization targets:
1.Special characters and emoji in the review body still raise exceptions, and even when they don't they get stored as mojibake.
2.choose_category() is highly repetitive; see whether it can be streamlined.
Next steps:
1.Keep optimizing.
2.Keep testing and hunting for bugs.
3.Write an app that pulls the content back out of the database and displays it.
