How BeautifulSoup alters documents

BeautifulSoup

Beautiful Soup alters the HTML and XML documents it loads in the following ways.

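If a closing tag is missing, it inserts one at the position it judges appropriate
It converts the character encoding to Unicode
It changes the encoding declaration to utf-8
It converts tag names to lowercase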
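The first and last items in the list above can be illustrated without Beautiful Soup itself. The following is a minimal sketch using only Python 3's standard `html.parser` module; it is not Beautiful Soup's implementation. The `Normalizer` class and `normalize` function are hypothetical names introduced for illustration, and the sketch drops attributes and does not treat void elements such as `<meta>` specially.

```python
from html.parser import HTMLParser


class Normalizer(HTMLParser):
    """Hypothetical helper: re-serialize HTML with lowercase tag names
    and close any tags that were left open."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []        # pieces of the normalized document
        self.open_tags = []  # stack of tags that have not been closed yet

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes the tag name already converted to lower case.
        # Attributes are dropped here to keep the sketch short.
        self.out.append("<%s>" % tag)
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray end tag with no matching start tag: ignore it
        # Close everything opened after the matching start tag, then the tag.
        while self.open_tags:
            current = self.open_tags.pop()
            self.out.append("</%s>" % current)
            if current == tag:
                break

    def handle_data(self, data):
        self.out.append(data)


def normalize(html):
    parser = Normalizer()
    parser.feed(html)
    parser.close()
    # Close whatever is still open at the end of the input.
    while parser.open_tags:
        parser.out.append("</%s>" % parser.open_tags.pop())
    return "".join(parser.out)


print(normalize("<HTML><H1>Heading</H1><P>text</HTML>"))
# -> <html><h1>Heading</h1><p>text</p></html>
```

Here the unclosed `<P>` is closed just before `</html>`, which is one plausible reading of "the position it judges appropriate"; a real parser applies more detailed nesting rules.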

The original document may carry information about its own encoding. Because Beautiful Soup converts the document to UTF-8, it rewrites that declaration to match.

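# coding: UTF-8
# Written for Python 2 and Beautiful Soup 3 (the BeautifulSoup module).
from BeautifulSoup import BeautifulSoup

# Build a Shift_JIS byte string whose meta tag declares charset Shift_JIS.
html = u"<meta http-equiv=\"Content-type\" content=\"text/html; charset=\"Shift_JIS\" ><html><h1>Heading</h1><p>ねこや書店</p></html>"
soup = BeautifulSoup(html.encode('shift-jis'), fromEncoding='shift-jis')

# Serialize the parsed tree; str() produces UTF-8 output by default.
print str(soup)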

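The script above reads HTML encoded as Shift-JIS. When it runs, Beautiful Soup rewrites the charset declaration to utf-8 to match the encoding it uses for its internal data structure, as the output below shows. The stray shift_jis="Shift_JIS" attribute in that output most likely comes from the nested quotation marks around Shift_JIS in the input string, not from the encoding conversion itself.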

<meta http-equiv="Content-type" content="text/html; charset=utf-8" shift_jis="Shift_JIS" /><html><h1>Heading</h1><p>ねこや書店</p></html>
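The same two steps, converting the text to Unicode and rewriting the declaration, can be reproduced by hand with the standard library. This is a sketch in Python 3, not Beautiful Soup's implementation: the function name `to_utf8` and the regular expression are assumptions made for illustration, and the input uses an unquoted `charset=Shift_JIS` rather than the nested quotes in the script above.

```python
import re

# Matches "charset=<name>" in a declaration. A simplification for
# illustration only; it does not handle a quoted charset value.
CHARSET_RE = re.compile(r"(charset\s*=\s*)[\w.:-]+", re.IGNORECASE)


def to_utf8(raw, source_encoding):
    """Decode raw bytes, rewrite the charset declaration, re-encode as UTF-8."""
    text = raw.decode(source_encoding)          # bytes -> Unicode text
    text = CHARSET_RE.sub(r"\g<1>utf-8", text)  # declaration now says utf-8
    return text.encode("utf-8")                 # Unicode text -> UTF-8 bytes


html = ('<meta http-equiv="Content-type" content="text/html; charset=Shift_JIS">'
        '<html><h1>Heading</h1><p>ねこや書店</p></html>')
raw = html.encode("shift_jis")  # simulate a Shift_JIS document on disk

print(to_utf8(raw, "shift_jis").decode("utf-8"))
```

If the declaration were left unchanged, the UTF-8 output would still claim to be Shift_JIS and a browser trusting that claim would display the Japanese text incorrectly, which is presumably why Beautiful Soup rewrites it.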

▼ Property

datePublished	2011-01-01
dateModified	2018-06-27
author	アセンブラの魔女
headline	An explanation of how BeautifulSoup, the HTML/XML parser for Python, alters documents
keywords	BeautifulSoup
keywords	Python
keywords	XML parser
keywords	HTML parser
keywords	encoding
keywords	tag name
publisher	name= wiredFish, logo.name= wiredFish, logo.url= https://books-nekoya.jp/Programming/chigu-hagu-title-01.png size= 208 pixel x 50 pixel
image.url	url= https://books-nekoya.jp/Programming/chigu-hagu-title-01.png , size= 208 pixel x 50 pixel