A Quick Summary of Using BeautifulSoup

I have recently been working on parsing HTML documents and used BeautifulSoup to extract data from web pages, so here is a brief summary.

BeautifulSoup, imported as bs4, is one of the libraries most commonly used when writing Python crawlers. Its main job is parsing HTML markup.

1. Initialization

pip install beautifulsoup4

from bs4 import BeautifulSoup

soup = BeautifulSoup("<html>A Html Text</html>", "html.parser")

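It takes two arguments: the first is the HTML text to parse, and the second names the parser to use. For HTML, "html.parser" is a convenient choice: it is the parser from Python's standard library, so nothing extra has to be installed.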

If an HTML or XML document is malformed, different parsers may return different results for it.
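
The difference between parsers can be seen with a small experiment. The sketch below is only illustrative: the helper name parse_with_each and the sample markup are assumptions, not part of the original post, and lxml and html5lib are optional third-party packages that may not be installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

BROKEN_HTML = "<a></p>"  # stray closing tag with no matching <p>


def parse_with_each(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser_name: serialized tree}, or None for parsers that are not installed."""
    results = {}
    for name in parsers:
        try:
            results[name] = str(BeautifulSoup(markup, name))
        except FeatureNotFound:
            # lxml and html5lib are optional; skip them if missing
            results[name] = None
    return results


parser_results = parse_with_each(BROKEN_HTML)
for name, output in parser_results.items():
    print(f"{name:12} -> {output}")
```

With html.parser the stray closing tag is simply dropped, while the other parsers, when available, wrap the fragment in a full document, so it is worth fixing the parser explicitly when results must be reproducible.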

Formatted output

print(soup.prettify())  # prettify is a method: the parentheses are required, and it returns the indented markup as a string
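
A minimal sketch of what happens when prettify is called versus merely referenced. The sample markup and variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><body><p>Hi</p></body></html>", "html.parser")

pretty = soup.prettify()    # calling the method returns the formatted markup as a str
not_called = soup.prettify  # without parentheses this is only the bound method object

print(type(pretty).__name__)
print(callable(not_called))
print(pretty)
```

Printing not_called directly would show a method representation rather than the document, which is why the call parentheses matter.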

2. Basic usage

from bs4 import BeautifulSoup

# Build a sample HTML document

html_doc = """
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="title">The Dormouse's story</p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
and they lived at the bottom of a well.</p>
...
</body>
</html>
"""

2.1 Getting a tag (the first match in the document)

res = BeautifulSoup(html_doc, 'lxml')  # the 'lxml' parser requires the lxml package (pip install lxml); 'html.parser' also works here

print(res.a)

2.2 Getting the text inside a tag

print(res.a.text)
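
The .text property is shorthand for get_text(), which also accepts a separator and a strip flag. The following is a small sketch with made-up markup showing how surrounding whitespace can be cleaned up.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p> Hello <b>world</b> </p>", "html.parser")

raw_text = soup.p.text                         # same as soup.p.get_text()
clean_text = soup.p.get_text(" ", strip=True)  # strip each text piece, then join with a space

print(repr(raw_text))
print(repr(clean_text))
```

The stripped form is usually the more convenient one when scraped text is stored or compared.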

2.3 Getting a tag's attributes (returned as a dict)

print(res.a.attrs)

2.4 Getting the value of a specific attribute

print(res.a.attrs.get('href'))

print(res.a.get('href'))
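
Besides .get(), a tag also supports dictionary-style indexing. The sketch below uses a single made-up link (the extra "big" class is an assumption added for illustration) to show how the two differ when an attribute is missing, and that class comes back as a list.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<a href="http://example.com/elsie" class="sister big" id="link1">Elsie</a>',
    "html.parser",
)
link = soup.a

href = link["href"]          # indexing raises KeyError if the attribute is missing
missing = link.get("title")  # .get() returns None instead of raising
classes = link["class"]      # class is multi-valued, so it is returned as a list

print(href, missing, classes)
```

Using .get() is the safer choice when an attribute is optional on the pages being scraped.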

2.5 Iterating over a tag's direct children

for i in res.p.children:
    print(i)

2.6 Getting a tag's direct children as a list

print(res.p.contents)
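
To make the relationship between .children and .contents concrete, here is a small sketch on made-up markup. It also shows .descendants, which is not covered in the original text and is included only for contrast, since it recurses into nested tags.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>world</b>!</p>", "html.parser")
p = soup.p

direct_children = [str(node) for node in p.children]     # iterator over direct children
contents_list = p.contents                               # the same nodes, as a list
all_descendants = [str(node) for node in p.descendants]  # also walks into <b>

print(direct_children)
print(len(contents_list))
print(all_descendants)
```

Text nodes count as children too, which is why the list has three entries even though there is only one nested tag.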

2.7 Getting a tag's parent

print(res.p.parent)

2.8 Iterating over all of a tag's ancestors, up to the document root

for i in res.p.parents:
    print(i)
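
Printing every ancestor in full produces a lot of output, so a compact way to inspect the chain is to print only the tag names. A minimal sketch on made-up markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><body><div><p>x</p></div></body></html>", "html.parser")

direct_parent = soup.p.parent.name                     # nearest enclosing tag
ancestor_names = [tag.name for tag in soup.p.parents]  # innermost first, ending at the document

print(direct_parent)
print(ancestor_names)
```

The last entry is the BeautifulSoup object itself, whose name is the special value "[document]".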

3. Core bs4 search methods

3.1 find

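find returns only the first element that matches the conditions. The result is a single tag object, or None if nothing matches.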

3.1.1 Find a tag by name (only the first match is returned)

print(res.find(name='p'))

3.1.2 Find a tag that has a specific attribute (only the first match is returned)

print(res.find(name='a', id='link2'))  # no <p> in html_doc has an id, so this example searches the links instead

3.1.3 Because class is a reserved word in Python, the keyword argument is spelled class_ with a trailing underscore

print(res.find(name='p', class_='title'))

3.1.4 Passing a dict through the attrs parameter avoids the keyword conflict entirely

print(res.find(name='p', attrs={'class': 'title'}))
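
Since find returns None when nothing matches, it is worth checking the result before using it. The sketch below, on made-up markup with illustrative variable names, also shows that the class_ and attrs forms locate the same tag.

```python
from bs4 import BeautifulSoup

html = '<p class="title">Story</p><p class="story">Once upon a time</p>'
soup = BeautifulSoup(html, "html.parser")

by_keyword = soup.find("p", class_="story")
by_attrs = soup.find("p", attrs={"class": "story"})
no_match = soup.find("p", id="nope")  # nothing matches, so this is None

# Guard against None before touching the result
story_text = by_keyword.get_text() if by_keyword is not None else ""

print(by_keyword is by_attrs)
print(no_match)
print(story_text)
```

Skipping the None check is a common source of AttributeError in scrapers when a page layout changes.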

3.2 find_all

find_all returns every tag that matches the conditions. The result behaves like a list and is empty when nothing matches.

3.2.1 Find all tags with a given name (the result is a list)

print(res.find_all('a'))
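
A typical use of find_all is collecting one attribute from every match. The following sketch reuses the three sample links in a standalone snippet. The limit argument is part of find_all but is not discussed in the original text.

```python
from bs4 import BeautifulSoup

html = """
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
"""
soup = BeautifulSoup(html, "html.parser")

hrefs = [a["href"] for a in soup.find_all("a")]                  # every link target
first_two = [a.get_text() for a in soup.find_all("a", limit=2)]  # stop after two matches
no_divs = soup.find_all("div")                                   # empty result, not None

print(hrefs)
print(first_two)
print(len(no_divs))
```

Because an unmatched find_all yields an empty result instead of None, it can be looped over without a prior check.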

That wraps up this summary of using bs4!
