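Introduction

Scraping means collecting data from websites. The robot that does the collecting is called a crawler.
Some sites allow scraping and some do not, so check before you start.
Automated interactions such as logging in or clicking buttons are also possible (with libraries such as Selenium), and so is taking screenshots of web pages.

This page explains the Python libraries Beautiful Soup and Selenium.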

Beautiful Soup

The name comes from Alice in Wonderland.
It parses HTML and XML and extracts data from them.

Official documentation
https://www.crummy.com/software/BeautifulSoup/bs4/doc/ (as of January 2023)

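First steps for parsing

The component that does the parsing is called a parser.
Besides the HTML parser in the Python standard library, Beautiful Soup supports third-party Python parsers such as lxml and html5lib.
I have not looked into the differences between them in detail.

How to do it

A document is parsed through a soup object.
Pass the document to the BeautifulSoup constructor to create the soup object.
The constructor converts the document to Unicode.

Creating a soup object from an HTML file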

from bs4 import BeautifulSoup

with open("index.html") as fp:
    soup = BeautifulSoup(fp, 'html.parser')

'html.parser' is the parser included in the Python standard library.

Creating a soup object from an HTML string

soup = BeautifulSoup("<html>a web page</html>", 'html.parser')

Objects you work with

Just four kinds of objects are enough to manipulate HTML.

BeautifulSoup and Tag objects can be printed with indentation using the prettify() method.
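As a rough illustration of prettify(), the sketch below parses a small made-up HTML string (not from the notes above) and prints it with one tag or string per line.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for this illustration.
soup = BeautifulSoup("<html><body><p>Hello</p></body></html>", "html.parser")

# prettify() returns a str with each tag and string on its own line,
# indented according to its depth in the tree.
pretty = soup.prettify()
print(pretty)
```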

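BeautifulSoup object

The object that represents the parsed document as a whole.

Its name attribute is '[document]'.

Tag object

From the BeautifulSoup object you can access Tag objects that have the same names as the HTML tags.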

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
type(tag)
# <class 'bs4.element.Tag'>

name attribute

The name of the HTML tag.
It can be overwritten, in which case the tag is output under the new name.
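A minimal sketch of overwriting the name, reusing the same markup as the Tag example above. The new name "blockquote" is just an arbitrary choice for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
print(tag.name)  # b

# Assigning to .name renames the tag in the generated markup.
tag.name = "blockquote"
print(tag)  # <blockquote class="boldest">Extremely bold</blockquote>
```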

attrs attribute

All HTML attributes of the Tag object, as a dictionary.

Accessing HTML attributes

Attributes can be handled as if the tag were a dictionary.

tag = BeautifulSoup('<b id="boldest">bold</b>', 'html.parser').b
tag['id']
# 'boldest'

Attributes can also be added, removed, and modified.
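Adding, changing, and deleting attributes might look like the sketch below. The attribute name "another-attribute" and its value are made up for the example.

```python
from bs4 import BeautifulSoup

tag = BeautifulSoup('<b id="boldest">bold</b>', 'html.parser').b

tag['id'] = 'verybold'            # modify an existing attribute
tag['another-attribute'] = "1"    # add a new (hypothetical) attribute
attrs_after_edit = dict(tag.attrs)
print(attrs_after_edit)

del tag['id']                     # remove attributes, as with a dict
del tag['another-attribute']
print(tag)            # <b>bold</b>
print(tag.get('id'))  # None
```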

An attribute can also be retrieved as a list.

id_soup = BeautifulSoup('<p id="my id"></p>', 'html.parser')
id_soup.p.get_attribute_list('id')
# ["my id"]

When an HTML attribute has multiple values

HTML attributes that can hold multiple values (class, rel, rev, accept-charset, headers, accesskey, etc.) are handled as lists.
To set multiple values through Beautiful Soup, pass a list as well.

css_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser')
css_soup.p['class']
# ['body', 'strikeout']
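Setting a multi-valued attribute goes the other way round: assign a list and it is joined with spaces on output. A small sketch, where the class name "highlight" is only an example:

```python
from bs4 import BeautifulSoup

css_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser')

# Assign a list; Beautiful Soup joins the values with spaces when printing.
css_soup.p['class'] = ['body', 'highlight']
print(css_soup.p)  # <p class="body highlight"></p>
```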

To keep the value as a single string instead of a list, pass multi_valued_attributes=None to the BeautifulSoup constructor.

no_list_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser', multi_valued_attributes=None)
no_list_soup.p['class']
# 'body strikeout'

NavigableString object

The object that represents text.
It can be obtained from the string attribute of a Tag object.

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
tag.string
# 'Extremely bold'
type(tag.string)
# <class 'bs4.element.NavigableString'>

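A NavigableString cannot be edited in place, but it can be swapped for another string with replace_with().

Replacing the string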

tag.string.replace_with("No longer bold")
tag
# <b class="boldest">No longer bold</b>

Comment object

A Comment is a special type of NavigableString.

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, 'html.parser')
comment = soup.b.string
type(comment)
# <class 'bs4.element.Comment'>

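Parsing a web page

Checking whether scraping is allowed

Appending '/robots.txt' to the site's URL, as below, usually brings up the page that declares the site's policy.
https://hogehoge.com/robots.txt

robots.txt is only a declaration addressed to crawlers (the machines that do the scraping), so nothing technically stops you from fetching a disallowed page.

Example: the robots.txt page of a certain site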

User-agent: *
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-admin/

User-agent: *
Applies to all crawlers.

Allow
Pages that crawlers may access.

Disallow
Pages that crawlers must not access.
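The rules above can also be checked in code. The notes do not use it, but Python's standard library has urllib.robotparser. The sketch below feeds it the three example lines directly (no network access) and asks about a few paths on the placeholder domain hogehoge.com.

```python
from urllib.robotparser import RobotFileParser

# The example robots.txt from above, given as a list of lines.
robots_lines = [
    "User-agent: *",
    "Allow: /wp-admin/admin-ajax.php",
    "Disallow: /wp-admin/",
]

parser = RobotFileParser()
parser.parse(robots_lines)  # parse the rules without fetching anything

# can_fetch(user_agent, url) reports whether the rules permit the URL.
allowed_ajax = parser.can_fetch("*", "https://hogehoge.com/wp-admin/admin-ajax.php")
allowed_admin = parser.can_fetch("*", "https://hogehoge.com/wp-admin/")
allowed_top = parser.can_fetch("*", "https://hogehoge.com/")
print(allowed_ajax, allowed_admin, allowed_top)  # True False True
```

For a real site, set_url() followed by read() would download the file instead of parse().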

Fetching data from a web page

This uses the requests library, so install it first.

Example: fetching the content of hogehoge.com

from bs4 import BeautifulSoup
import requests


url = "https://hogehoge.com"
req = requests.get(url)

soup = BeautifulSoup(req.text, 'html.parser')
print(soup.text)

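Basic search methods

# First match only
soup.find(name, attrs, recursive, string, **kwargs)

# All matches (the number can be capped with the limit argument)
soup.find_all(name, attrs, recursive, string, limit, **kwargs)

name:

The tag name.

**kwargs

Searches by attribute.
Pass it as an argument in the form attr='value'.
Because class is a reserved word in Python, use 'class_' to specify the class attribute.
A tag is returned if any one of its classes matches.
For class="ho ge", find_all(class_='ho ge') matches the exact attribute string, but a different order such as find_all(class_='ge ho') does not match.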
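How class_ matching behaves can be checked with a small sketch. The markup is made up, reusing the "ho ge" class names from the notes.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="ho ge">text</p>', 'html.parser')

single = soup.find_all(class_='ho')      # matches one of the classes
exact = soup.find_all(class_='ho ge')    # matches the exact attribute string
swapped = soup.find_all(class_='ge ho')  # different order: no match

print(single)   # [<p class="ho ge">text</p>]
print(exact)    # [<p class="ho ge">text</p>]
print(swapped)  # []
```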

attrs

Some attributes (name, data-*, etc.) cannot be used as keyword arguments.
For these, put them in a dictionary and pass it as the attrs argument.

soup.find_all(attrs={"data-foo": "value"})

string

Returns elements whose .string matches.

limit: int

The maximum number of results.

recursive=True:

Whether to examine all descendants.
Specify False to look at direct children only.
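A short sketch of limit and recursive, using a made-up document. With recursive=False only the direct children of the tag you call find_all() on are examined.

```python
from bs4 import BeautifulSoup

html = ("<html><head><title>HOGE</title></head>"
        "<body><p>a</p><p>b</p><p>c</p></body></html>")
soup = BeautifulSoup(html, 'html.parser')

deep = soup.html.find_all('title')                      # searches all descendants
shallow = soup.html.find_all('title', recursive=False)  # direct children of <html> only
first_two = soup.find_all('p', limit=2)                 # stop after two matches

print(deep)       # [<title>HOGE</title>]
print(shallow)    # []
print(first_two)  # [<p>a</p>, <p>b</p>]
```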

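Filters can be applied to the data retrieved from a web page.

String, regular expression

Matches anything that matches the given string or pattern.
The string argument is available from version 4.4.0; it was previously called text.

Function

Allows conditions such as "has a class attribute and no id attribute". The function must return a boolean.

True
Matches everything.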
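The function and True filters might be used as in the sketch below. The markup is invented, and has_class_but_no_id is simply a name chosen to spell out the condition mentioned above.

```python
from bs4 import BeautifulSoup

html = '<p class="title" id="top">A</p><p class="story">B</p><p>C</p>'
soup = BeautifulSoup(html, 'html.parser')

def has_class_but_no_id(tag):
    # Called once per tag; return True to keep the tag in the results.
    return tag.has_attr('class') and not tag.has_attr('id')

by_function = soup.find_all(has_class_but_no_id)
all_tags = soup.find_all(True)  # True matches every tag

print(by_function)    # [<p class="story">B</p>]
print(len(all_tags))  # 3
```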

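Usage examples

Searching for the title tag

soup.find_all('title')

Searching for id="hoge"

soup.find_all(id='hoge')

Searching for tags whose href contains "hoge" (requires import re)

soup.find_all(href=re.compile('hoge'))

Searching for tags that have an id attribute

soup.find_all(id=True)

Searching with multiple attributes as an AND condition

soup.find_all(id='hoge-id', href=re.compile('hoge'))

To search for multiple classes as an AND condition, use select() instead of find_all(). It takes a CSS selector.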

soup.select("p.ho.ge")
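A quick check of the multi-class selector, on made-up markup. The order of the classes in the selector does not matter.

```python
from bs4 import BeautifulSoup

html = '<p class="ho ge">both</p><p class="ho">only ho</p>'
soup = BeautifulSoup(html, 'html.parser')

both = soup.select('p.ho.ge')      # must have both classes
reversed_order = soup.select('p.ge.ho')

print(both)            # [<p class="ho ge">both</p>]
print(reversed_order)  # [<p class="ho ge">both</p>]
```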

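Searching text with string

# Exact match
soup.find_all(string='hoge')

# Partial match
soup.find_all(string=re.compile('ho'))

Note that searching by string alone returns only the strings themselves.
The first example returns ['hoge'].

Searching by tag and string as an AND condition

soup.find_all('a', string='hoge')

In versions before 4.4.0 the string argument was called text; in current versions use string="".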
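The string searches above can be tried on a tiny made-up document. Note how the result changes from strings to tags once a tag name is given.

```python
import re
from bs4 import BeautifulSoup

html = '<a href="/hoge">hoge</a><a href="/fuga">hoge fuga</a>'
soup = BeautifulSoup(html, 'html.parser')

exact = soup.find_all(string='hoge')               # strings equal to 'hoge'
partial = soup.find_all(string=re.compile('ho'))   # strings containing 'ho'
tags = soup.find_all('a', string='hoge')           # <a> tags whose .string is 'hoge'

print(exact)    # ['hoge']
print(partial)  # ['hoge', 'hoge fuga']
print(tags)     # [<a href="/hoge">hoge</a>]
```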

Searching for tags at the same level that come before or after

# Preceding siblings: all matches, then first match only
find_previous_siblings(name, attrs, string, limit, **kwargs)
find_previous_sibling(name, attrs, string, **kwargs)

# Following siblings: all matches, then first match only
find_next_siblings(name, attrs, string, limit, **kwargs)
find_next_sibling(name, attrs, string, **kwargs)
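A sketch of the sibling search methods, on a trimmed-down version of the "three sisters" markup (shortened here for illustration).

```python
from bs4 import BeautifulSoup

html = ('<p class="story">'
        '<a class="sister" id="link1">Elsie</a>'
        '<a class="sister" id="link2">Lacie</a>'
        '<a class="sister" id="link3">Tillie</a>'
        '</p>')
soup = BeautifulSoup(html, 'html.parser')

middle = soup.find(id='link2')
prev_id = middle.find_previous_sibling('a')['id']
next_id = middle.find_next_sibling('a')['id']

first = soup.find(id='link1')
following_ids = [a['id'] for a in first.find_next_siblings('a')]

print(prev_id, next_id)  # link1 link3
print(following_ids)     # ['link2', 'link3']
```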

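Searching with CSS selectors

This seems to be the easier way to search for a specific hierarchy.

Example:

# A tag anywhere beneath another tag
soup.select('body a')

# A tag directly beneath another tag
soup.select('p > a')

# Following siblings of a tag
soup.select('#link1 ~ .sister')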
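The three selectors above can be run against a shortened version of the "three sisters" markup (trimmed here for illustration). select() returns a list of Tag objects.

```python
from bs4 import BeautifulSoup

html = ('<body><p class="story">'
        '<a class="sister" id="link1">Elsie</a>'
        '<a class="sister" id="link2">Lacie</a>'
        '<a class="sister" id="link3">Tillie</a>'
        '</p></body>')
soup = BeautifulSoup(html, 'html.parser')

descendants = [a['id'] for a in soup.select('body a')]          # anywhere under <body>
direct = [a['id'] for a in soup.select('p > a')]                # direct children of <p>
later = [a['id'] for a in soup.select('#link1 ~ .sister')]      # siblings after #link1

print(descendants)  # ['link1', 'link2', 'link3']
print(direct)       # ['link1', 'link2', 'link3']
print(later)        # ['link2', 'link3']
```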

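Navigating the tree

Each tag name is available as an attribute, so you can follow tags by name.
If there are several tags with the same name, the first one is returned.
The topmost parent is the BeautifulSoup object.

soup.body.b

Navigating multiple children

To get them as a list

soup.contents

To iterate over them

for child in soup.children:
  print(child)

Navigating descendants that are not direct children

for child in soup.descendants:
  print(child)

Looking at a child's string

When the tag has a single child that has a string

The parent's .string is the same as the child's .string.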

<head><title>HOGE</title></head>

head_tag.string
# 'HOGE'

When there are multiple children, it is unclear which one to return, so .string is None.
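Both cases can be seen side by side in a small sketch. The markup is made up; the paragraph has two children (a string and a tag), so its .string is None.

```python
from bs4 import BeautifulSoup

html = '<head><title>HOGE</title></head><p>a<b>b</b></p>'
soup = BeautifulSoup(html, 'html.parser')

head_string = soup.head.string  # single child <title>, which has a single string
p_string = soup.p.string        # two children ('a' and <b>), so None

print(head_string)  # HOGE
print(p_string)     # None
```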

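Printing all strings (generators)

# Includes whitespace-only strings such as newlines
for string in soup.strings:
  print(repr(string))

# Skips whitespace-only strings and strips surrounding whitespace
for string in soup.stripped_strings:
  print(repr(string))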
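The difference between the two generators shows up clearly on markup with line breaks. A small sketch with invented content:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>\n<b> one </b>\n<i>two</i>\n</p>', 'html.parser')

raw = list(soup.strings)             # every string, whitespace included
clean = list(soup.stripped_strings)  # stripped, whitespace-only strings dropped

print(raw)    # ['\n', ' one ', '\n', 'two', '\n']
print(clean)  # ['one', 'two']
```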

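Navigating to the parent

title_tag.parent

The parent of the BeautifulSoup object is None.

Navigating siblings

sibling_soup.next_sibling
sibling_soup.previous_sibling

Strings between tags (including newline characters) also count as siblings.
So to reach the adjacent tag you may need to call the sibling attribute twice.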

<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>

link = soup.a
link
# <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>

link.next_sibling
# '\n'
link.next_sibling.next_sibling
# <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>

Iterating

The plural forms (.next_siblings, .previous_siblings) are generators.

for sib in sibling_soup.next_siblings:
  print(sib)

Navigating elements

soup.next_element
soup.previous_element

Example:
.html

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

.py

soup = BeautifulSoup(html_doc, 'html.parser')
last_a_tag = soup.find("a", id="link3")

for element in last_a_tag.next_elements:
    print(repr(element))
# 'Tillie'
# ';\nand they lived at the bottom of a well.'
# '\n'
# <p class="story">...</p>
# '...'
# '\n'
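The difference between .next_element and .next_sibling is easy to mix up, so here is a small sketch on a one-line excerpt of the markup (shortened for illustration). .next_sibling skips over the tag's contents, while .next_element follows parse order and so enters the tag.

```python
from bs4 import BeautifulSoup

html = '<p><a id="link3">Tillie</a>; and they lived at the bottom of a well.</p>'
soup = BeautifulSoup(html, 'html.parser')
last_a_tag = soup.find('a', id='link3')

sibling = last_a_tag.next_sibling       # what follows the whole <a>...</a>
element = last_a_tag.next_element       # what was parsed right after the <a> start tag
previous = last_a_tag.previous_element  # what was parsed right before it: the <p> tag

print(repr(sibling))  # '; and they lived at the bottom of a well.'
print(repr(element))  # 'Tillie'
print(previous.name)  # p
```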