Python Django Hands-On (2): Crawling and Displaying Data with Pagination - IvanKao's Blog

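This exercise covers the following items:

Install the packages needed for crawling
Create the project
Parse JSON data and display it with pagination
Parse web page data and display it with pagination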

I. Install the packages needed for crawling:

1. Install the beautifulsoup4 package

2. Install the requests package

pip install beautifulsoup4

pip install requests

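II. Create the project:

Create a project named crawler
Create an app named crawlerapp
Create the templates and static directories
Run makemigrations to generate the migration files, then run migrate to sync the models with the database
Finish the configuration in <settings.py>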
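The post does not spell out the settings step. As a rough sketch only, assuming the default layout that `django-admin startproject` generates, the entries that usually need attention in <settings.py> might look like the fragment below. The names `crawlerapp`, `templates` and `static` come from the list above; everything else is an assumption about a typical setup, not the author's actual file.

```python
from pathlib import Path

# Assumption: stand-in for the BASE_DIR line that startproject writes,
# which is normally Path(__file__).resolve().parent.parent.
BASE_DIR = Path.cwd()

INSTALLED_APPS = [
    "django.contrib.admin",
    "django.contrib.auth",
    "django.contrib.contenttypes",
    "django.contrib.sessions",
    "django.contrib.messages",
    "django.contrib.staticfiles",
    "crawlerapp",  # the app created in the list above
]

TEMPLATES = [
    {
        "BACKEND": "django.template.backends.django.DjangoTemplates",
        "DIRS": [BASE_DIR / "templates"],  # project-level templates directory
        "APP_DIRS": True,
        "OPTIONS": {
            "context_processors": [
                "django.template.context_processors.request",
                "django.contrib.auth.context_processors.auth",
                "django.contrib.messages.context_processors.messages",
            ],
        },
    },
]

STATIC_URL = "static/"
STATICFILES_DIRS = [BASE_DIR / "static"]  # project-level static directory
```

Only the `crawlerapp` entry and the two directory lines differ from what a freshly generated project would typically contain.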

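III. Parse JSON data and display it:

The example here is the animal adoption JSON feed from the government open data platform.

1. Parse the JSON

To display a single record, each field is first collected into its own list, so that a value can later be picked out by index.

The steps are:

Import the packages needed for parsing
Connect to the URL
Parse the JSON
Create empty lists
Append each element to its list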

import json
import requests
res = requests.get('http://data.coa.gov.tw/Service/OpenData/AnimalOpenData.aspx')  # connect to the URL
ress = res.text
jd = json.loads(ress)  # parse the JSON

animal_place=[]  # create the empty lists
animal_kind=[]
animal_sex=[]
animal_bodytype=[]
animal_age=[]
album_file=[]
shelter_name=[]
shelter_address=[]
shelter_tel=[]

for item in jd:  # append each record's fields to the lists
    animal_place.append(item['animal_place'])
    animal_kind.append(item['animal_kind'])
    animal_sex.append(item['animal_sex'])
    animal_bodytype.append(item['animal_bodytype'])
    animal_age.append(item['animal_age'])
    album_file.append(item['album_file'])
    shelter_name.append(item['shelter_name'])
    shelter_address.append(item['shelter_address'])
    shelter_tel.append(item['shelter_tel'])
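The nine near-identical list and append lines above can also be written as one small helper. The sketch below is only an illustration: `collect_fields`, `SAMPLE` and `FIELDS` are hypothetical names, and the two records are made up so that the snippet runs without a network connection.

```python
import json

# Two made-up records shaped like the entries the loop above reads.
SAMPLE = '''[
  {"animal_place": "Shelter A", "animal_kind": "dog", "animal_sex": "M"},
  {"animal_place": "Shelter B", "animal_kind": "cat", "animal_sex": "F"}
]'''

FIELDS = ["animal_place", "animal_kind", "animal_sex"]


def collect_fields(records, fields):
    """Return {field name: [value of that field in each record]}."""
    # .get() yields None instead of raising KeyError when a key is missing.
    return {name: [rec.get(name) for rec in records] for name in fields}


jd = json.loads(SAMPLE)
columns = collect_fields(jd, FIELDS)
print(columns["animal_kind"][0])  # prints: dog
```

Here `columns["animal_kind"]` plays the same role as the `animal_kind` list in the code above.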

2. Display a single record

In <views.py>, define a view function named crawler

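import json

import requests
from django.shortcuts import render


def crawler(request):
    res = requests.get('http://data.coa.gov.tw/Service/OpenData/AnimalOpenData.aspx')  # connect to the URL
    ress = res.text
    jd = json.loads(ress)  # parse the JSON


    animal_place=[]  # create the empty lists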
    animal_kind=[]
    animal_sex=[]
    animal_bodytype=[]
    animal_age=[]
    album_file=[]
    shelter_name=[]
    shelter_address=[]
    shelter_tel=[]


    for item in jd:  # append each record's fields to the lists
        animal_place.append(item['animal_place'])
        animal_kind.append(item['animal_kind'])
        animal_sex.append(item['animal_sex'])
        animal_bodytype.append(item['animal_bodytype'])
        animal_age.append(item['animal_age'])
        album_file.append(item['album_file'])
        shelter_name.append(item['shelter_name'])
        shelter_address.append(item['shelter_address'])
        shelter_tel.append(item['shelter_tel'])

    return render(request,"crawler.html",locals())

Create a template named <crawler.html>

The template reads the first item (index 0) of each list.

Note that Django templates do not accept the Python syntax "list[0]"; the index is written with a dot instead, as in "list.0".

<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8">
    <title>Display the first record</title>
</head>
<body>
    <h2>Display a single JSON record in Django</h2>

    Animal's actual location: {{animal_place.0}}  <br  />
    Animal type: {{animal_kind.0}}  <br  />
    Sex: {{animal_sex.0}}  <br  />
    Body size: {{animal_bodytype.0}}  <br  />
    Age: {{animal_age.0}}  <br  />
    Shelter name: {{shelter_name.0}}  <br  />
    Photo: <br  />
    <img src="{{album_file.0}}">      <br  />
    Address: {{shelter_address.0}}  <br  />
    Phone: {{shelter_tel.0}}  <br  />





</body>
</html>
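Why does "list.0" work in a template? For each dot, Django's template language tries a dictionary lookup, then an attribute lookup, then a list-index lookup. The function below is a simplified imitation of that order in plain Python, not Django's real code; `resolve_dot` is a hypothetical name used only for this illustration.

```python
def resolve_dot(value, bit):
    """Resolve one ``value.bit`` step roughly the way a Django template does."""
    try:
        return value[bit]  # 1. dictionary-style lookup: value["bit"]
    except (TypeError, KeyError, IndexError, AttributeError, ValueError):
        pass
    try:
        return getattr(value, bit)  # 2. attribute lookup: value.bit
    except (TypeError, AttributeError):
        pass
    try:
        return value[int(bit)]  # 3. list-index lookup: value[0]
    except (TypeError, KeyError, IndexError, ValueError):
        return ""  # by default a failed lookup renders as an empty string


animal_place = ["Taipei", "Tainan"]
print(resolve_dot(animal_place, "0"))  # prints: Taipei
print(resolve_dot({"animal_kind": "dog"}, "animal_kind"))  # prints: dog
```

The same order explains why a template can read a key of a JSON record with a dot, for example a record's animal_kind field.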

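Add a path for crawler in <urls.py>, then open the page in the browser.

3. Display multiple records

In <views.py>, define a view function named crawlerall

Multiple records can be rendered directly with a for loop in the template, so the fields do not need to be copied into separate lists.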

def crawlerall(request):
    res = requests.get('http://data.coa.gov.tw/Service/OpenData/AnimalOpenData.aspx')
    ress = res.text
    jd = json.loads(ress)

    return render(request,"crawlerall.html",locals())

Create a template named <crawlerall.html>

<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8">
    <title>Display all records</title>
</head>
<body>
    <h2>Display all JSON records</h2>
    <table border="3" cellpadding="2" cellspacing="2">
        <tr>
            <th>Animal's actual location</th>
            <th>Animal type</th>
            <th>Sex</th>
            <th>Body size</th>
            <th>Age</th>
            <th>Photo</th>
            <th>Shelter name</th>
            <th>Address</th>
            <th>Phone</th>
        </tr>

        {% for i in jd %}  <!-- jd holds every JSON record; i is the loop variable -->
        <tr>
            <td>{{i.animal_place}}</td>  <!-- the Django template language uses '.' to read a sub-item -->
            <td>{{i.animal_kind}}</td>
            <td>{{i.animal_sex}}</td>
            <td>{{i.animal_bodytype}}</td>
            <td>{{i.animal_age}}</td>
            <td><img src="{{i.album_file}}" width="60" height="100"></td>
            <td>{{i.shelter_name}}</td>
            <td>{{i.shelter_address}}</td>
            <td>{{i.shelter_tel}}</td>
        </tr>
        {% endfor %}
    </table>
</body>
</html>

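Add a path for crawlerall in <urls.py>, then open the page in the browser.

4. Display multiple records with pagination

In <views.py>, define a view function named crawlerpage

As in Learning Notes (8), the records are split into pages so that a single page does not hold too much data and take too long to load.

The difference from that earlier post is that the pagination here works on parsed JSON data.

A variable named jd2 holds the 10-record slice of the parsed jd list for the current page, and jd2 is what gets passed to the template.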

import math

page1 = 1  # current page number, kept in a module-level global
def crawlerpage(request,pageindex=None):
    global page1
    res = requests.get('http://data.coa.gov.tw/Service/OpenData/AnimalOpenData.aspx')
    ress = res.text
    jd = json.loads(ress)

    datasize = len(jd)  # number of records
    pagesize = 10  # records per page
    totpage = math.ceil(datasize / pagesize)

    if pageindex is None:  # no parameter: show the first page
        page1 = 1
    elif pageindex == '1':  # previous page
        if page1 > 1:  # move back only when a previous page exists
            page1 -= 1
    elif pageindex == '2':  # next page
        if page1 * pagesize < datasize:  # move on only when a next page exists
            page1 += 1
    # pageindex == '3' (returning from a detail page) keeps the current page

    start = (page1-1)*pagesize  # index of the first record on the page
    jd2 = jd[start:start+pagesize]  # always defined, even at the first or last page
    currentpage = page1

    return render(request,"crawlerpage.html",locals())
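The view above remembers the current page in the module-level global `page1`, which is shared by everyone who visits the site. As an alternative sketch (not the approach used in this post), the slice can be computed from a page number that travels with the request. `page_slice` is a hypothetical helper and the 25 integers stand in for the parsed records.

```python
import math


def page_slice(items, page, page_size=10):
    """Return (rows_for_page, clamped_page, total_pages) for a 1-based page."""
    total_pages = max(1, math.ceil(len(items) / page_size))
    page = min(max(1, page), total_pages)  # clamp out-of-range requests
    start = (page - 1) * page_size  # index of the first row on this page
    return items[start:start + page_size], page, total_pages


records = list(range(1, 26))  # stand-in for the jd list: 25 fake records
rows, page, total = page_slice(records, 3)
print(rows, page, total)  # prints: [21, 22, 23, 24, 25] 3 3
```

With a helper like this, the "previous" and "next" links would carry page - 1 and page + 1 instead of the fixed codes 1 and 2.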

Create a template named <crawlerpage.html>

This template also shows the current page number.

<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8">
    <title>JSON records, paginated</title>
</head>
<body>
    <h2>JSON records, paginated</h2>
    <table border="3" cellpadding="2" cellspacing="2">
        <tr>
            <th>Animal's actual location</th>
            <th>Animal type</th>
            <th>Sex</th>
            <th>Body size</th>
            <th>Age</th>
            <th>Photo</th>
            <th>Shelter name</th>
            <th>Address</th>
            <th>Phone</th>
        </tr>

        {% for i in jd2 %}  <!-- jd2 holds the records of the current page; i is the loop variable -->
        <tr>
            <td>{{i.animal_place}}</td>  <!-- the Django template language uses '.' to read a sub-item -->
            <td>{{i.animal_kind}}</td>
            <td>{{i.animal_sex}}</td>
            <td>{{i.animal_bodytype}}</td>
            <td>{{i.animal_age}}</td>
            <td><img src="{{i.album_file}}" width="60" height="100"></td>
            <td>{{i.shelter_name}}</td>
            <td>{{i.shelter_address}}</td>
            <td>{{i.shelter_tel}}</td>
        </tr>
        {% endfor %}
    </table>

    <h3><div class="topfunction" align="center">
        {% if currentpage > 1 %}
            <a href="/crawlerpage/1/" title="Previous page">Previous page</a>
        {% endif %}
        &nbsp;&nbsp;- Page {{currentpage}} -&nbsp;&nbsp;
        {% if currentpage < totpage %}
            <a href="/crawlerpage/2/" title="Next page">Next page</a>
        {% endif %}
    </div></h3>
</body>
</html>

Add the paths for crawlerpage in <urls.py>, then open the page in the browser:

    path('crawlerpage/',views.crawlerpage),
    re_path(r'^crawlerpage/(\d+)/$',views.crawlerpage),  # re_path must be imported from django.urls
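The view compares pageindex against '1' and '2' as strings, not integers, because a group captured by a regular-expression route reaches the view as text. The snippet below checks this with Python's own `re` module, using a pattern like the one in the route, written here as a raw string with a leading ^ anchor. It only illustrates the regex behaviour and does not call Django.

```python
import re

# A raw string keeps "\d" intact on its way to the regex engine.
pattern = re.compile(r"^crawlerpage/(\d+)/$")

match = pattern.search("crawlerpage/2/")
pageindex = match.group(1)  # the captured group is a str

print(repr(pageindex))  # prints: '2'
print(pageindex == "2", pageindex == 2)  # prints: True False
print(pattern.search("crawlerpage/") is None)  # prints: True
```

The last line shows why the plain path('crawlerpage/', ...) route is still needed: the regex route does not match a URL without a number.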

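IV. Parse web page data and display it:

This example uses the Apple Daily real-time news page: the data is crawled first and then displayed through a Django page.

1. Parse the web page

Inspect the page with F12 and locate five pieces of data: date, time, category, title, and URL.

The date is in the <h1 class="dddd"> tag

The time is in the <time> tags under <ul class="rtddd slvl">

The category is in the <h2> tags under <ul class="rtddd slvl">

The title is in the <h1> tags under <ul class="rtddd slvl">

The URL is in the <a href> tags under <ul class="rtddd slvl">

In <views.py>, define a view function named newscrawler

The crawler is written in the following steps:

Import the packages needed for crawling
Connect to the URL
Extract the page data
Create empty lists
Append each element to its list
Combine the lists with zip()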
from bs4 import BeautifulSoup
import re
import urllib.request  # "import urllib" alone does not load the request submodule

def newscrawler(request):

    url = 'https://tw.appledaily.com/new/realtime'   # target URL
    user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 6.1; zh-CN; rv:1.9.2.15) Gecko/20110303 Firefox/3.6.15' # pose as a browser
    headers = {'User-Agent':user_agent}
    data_res = urllib.request.Request(url=url,headers=headers)
    data = urllib.request.urlopen(data_res)
    data = data.read().decode('utf-8')
    sp = BeautifulSoup(data, "html.parser")


    # today's date, as plain text so the template can print it directly
    date = sp.find("h1",{"class":"dddd"}).text

    # time of each item
    time= []
    times = sp.find("ul",{"class":"rtddd slvl"}).find_all("time")
    for time1 in times:
        time.append(time1.text)

    # category
    category=[]
    categorys = sp.find("ul",{"class":"rtddd slvl"}).find_all("h2")
    for category1 in categorys:
        category.append(category1.text)

    # title
    title=[]
    titles = sp.find("ul",{"class":"rtddd slvl"}).find_all("h1")
    for title1 in titles:
        title.append(title1.text)

    # URL
    link=[]
    links = sp.find("ul",{"class":"rtddd slvl"}).find_all("a",href = re.compile('appledaily'))
    for link1 in links:
        link.append(link1['href'])
    all = zip(time,category,title,link)  # combine the four lists row by row
    return render(request,"newscrawler.html",locals())
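One thing to watch with the zip() step: the four lists are filled by four separate searches, and zip() stops at the shortest list without raising an error. If one search finds fewer elements than the others, rows are dropped silently. The sketch below shows this with made-up values; `zip_checked` is a hypothetical helper, not part of the original view. On Python 3.10 and later, `zip(..., strict=True)` performs a similar check.

```python
time = ["10:01", "10:05", "10:09"]
category = ["Life", "Sports", "World"]
title = ["Title A", "Title B", "Title C"]
link = ["https://example.com/a", "https://example.com/b"]  # one link missing

rows = list(zip(time, category, title, link))
print(len(rows))  # prints: 2  (the third news item is silently dropped)


def zip_checked(*columns):
    """zip() the columns, but refuse to run when their lengths differ."""
    lengths = {len(col) for col in columns}
    if len(lengths) > 1:
        raise ValueError(f"column lengths differ: {sorted(lengths)}")
    return list(zip(*columns))
```

Calling `zip_checked(time, category, title, link)` with the lists above raises ValueError instead of returning two rows.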

Create a template named <newscrawler.html>

<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8">
    <title>Apple Daily real-time news</title>
</head>
<body>
    <h2>Apple Daily real-time news {{date}}</h2>
    <table border="3" cellpadding="2" cellspacing="2">
        <tr>
            <th>Time</th>
            <th>Category</th>
            <th>Title</th>
            <th>URL</th>
        </tr>


        {% for time,category,title,link in all %} <!-- once the lists are combined with zip, they can be read together -->
        <tr>
            <td>{{time}}</td>
            <td>{{category}}</td>
            <td><a target="_blank" href="{{link}}">{{title}}</a></td>  <!-- open the link in a new window -->
            <td>{{link}}</td>
        </tr>
        {% endfor %}
    </table>
</body>
</html>

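Add a path for newscrawler in <urls.py>, then open the page in the browser.

2. Display multiple records with pagination

In <views.py>, define a view function named newspage

As in Learning Notes (8), the records are split into pages so that a single page does not hold too much data and take too long to load.

The difference this time is that the several lists produced by the crawler are first combined with zip(),

and the zip object is then converted into a list named listall.

Each element of listall is a tuple, so within one element i the fields are read by position: time is i[0], category is i[1], title is i[2], and URL is i[3].

A variable named listall2 holds the 6-record slice of listall for the current page, and listall2 is what gets passed to the template.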

import math

page2 = 1  # current page number, kept in a module-level global
def newspage(request,pageindex2=None):  # Apple Daily real-time news crawler
    global page2

    url = 'https://tw.appledaily.com/new/realtime'   # target URL
    user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 6.1; zh-CN; rv:1.9.2.15) Gecko/20110303 Firefox/3.6.15' # pose as a browser
    headers = {'User-Agent':user_agent}
    data_res = urllib.request.Request(url=url,headers=headers)
    data = urllib.request.urlopen(data_res)
    data = data.read().decode('utf-8')
    sp = BeautifulSoup(data, "html.parser")
    # today's date
    dated = sp.find("h1",{"class":"dddd"})
    date = dated.text
    # time of each item
    time= []
    times = sp.find("ul",{"class":"rtddd slvl"}).find_all("time")
    for time1 in times:
        time.append(time1.text)
    # category
    category=[]
    categorys = sp.find("ul",{"class":"rtddd slvl"}).find_all("h2")
    for category1 in categorys:
        category.append(category1.text)
    # title
    title=[]
    titles = sp.find("ul",{"class":"rtddd slvl"}).find_all("h1")
    for title1 in titles:
        title.append(title1.text)
    # URL
    link=[]
    links = sp.find("ul",{"class":"rtddd slvl"}).find_all("a",href = re.compile('appledaily'))
    for link1 in links:
        link.append(link1['href'])
    all = zip(time,category,title,link)
    listall = list(all)  # a zip object cannot be sliced, so turn it into a list

    datasize = len(listall)  # number of rows
    pagesize = 6  # rows per page
    totpage = math.ceil(datasize / pagesize)

    if pageindex2 is None:  # no parameter: show the first page
        page2 = 1
    elif pageindex2 == '1':  # previous page
        if page2 > 1:  # move back only when a previous page exists
            page2 -= 1
    elif pageindex2 == '2':  # next page
        if page2 * pagesize < datasize:  # move on only when a next page exists
            page2 += 1
    # pageindex2 == '3' (returning from a detail page) keeps the current page

    start = (page2-1)*pagesize  # index of the first row on the page
    listall2 = listall[start:start+pagesize]  # always defined, even at the first or last page
    currentpage = page2

    return render(request,"newspage.html",locals())
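The list() call in this view matters: zip() returns a one-shot iterator, which supports neither slicing nor len(), so it cannot be cut into pages directly. A short sketch with made-up values (only two columns, to keep it small):

```python
time = ["10:01", "10:05", "10:09"]
title = ["Title A", "Title B", "Title C"]

pairs = zip(time, title)
try:
    pairs[0:2]  # a zip object cannot be sliced
except TypeError as exc:
    print("zip object:", exc)

listall = list(zip(time, title))  # materialise the tuples once
listall2 = listall[0:2]  # now slicing works
first = listall2[0]
print(first[0], first[1])  # prints: 10:01 Title A
```

Each element is a plain tuple, and that is why the template for this view can read the fields by position with a dot, such as i.0 and i.1.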

Create a template named <newspage.html>

<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8">
    <title>Apple Daily real-time news, paginated</title>
</head>
<body>
    <h2>Apple Daily real-time news, paginated {{date}}</h2>
    <table border="3" cellpadding="2" cellspacing="2">
        <tr>
            <th>Time</th>
            <th>Category</th>
            <th>Title</th>
            <th>URL</th>
        </tr>


        {% for i in listall2 %}
        <tr>
            <td>{{i.0}}</td>
            <td>{{i.1}}</td>
            <td><a target="_blank" href="{{i.3}}">{{i.2}}</a></td>  <!-- open the link in a new window -->
            <td>{{i.3}}</td>
        </tr>
        {% endfor %}
    </table>
    <h3><div class="topfunction" align="center">
        {% if currentpage > 1 %}
            <a href="/newspage/1/" title="Previous page">Previous page</a>
        {% endif %}
        &nbsp;&nbsp;- Page {{currentpage}} -&nbsp;&nbsp;
        {% if currentpage < totpage %}
            <a href="/newspage/2/" title="Next page">Next page</a>
        {% endif %}
    </div></h3>
</body>
</html>

Add the paths for newspage in <urls.py>, then open the page in the browser:

    path('newspage/',views.newspage),
    re_path(r'^newspage/(\d+)/$',views.newspage),

ivankao
