String and Text Processing

Python 2020-07-01 1235

1. Matching strings with wildcards

The fnmatch module provides two functions, fnmatch() and fnmatchcase(), that perform shell-style wildcard matching.

>>> from fnmatch import fnmatch, fnmatchcase  
>>> fnmatch('fo.txt', '*.txt')  
True  
>>> fnmatch('fo.txt', '?o.txt')  
True

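fnmatch() follows the case-sensitivity rules of the underlying operating system, so the same call can give different results on different platforms.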

# macOS  
>>> fnmatch('foo.txt', '*.TXT')  
False  
# Windows  
>>> fnmatch('foo.txt', '*.TXT')  
True

fnmatchcase(), by contrast, always compares case exactly as written in the pattern, regardless of platform.

>>> fnmatchcase('foo.txt', '*.TXT')  
False
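
As a small illustration of the functions above, the following sketch filters a list of names with fnmatchcase(). The file names are made up for the example and do not come from any real directory.

```python
from fnmatch import fnmatchcase

# Hypothetical file names, used only for illustration.
names = ['foo.txt', 'bar.TXT', 'data1.csv', 'data2.csv', 'notes.txt']

# fnmatchcase() compares case exactly, so 'bar.TXT' is not matched here.
txt_files = [name for name in names if fnmatchcase(name, '*.txt')]

# '[0-9]' matches a single digit, in the style of shell wildcards.
csv_files = [name for name in names if fnmatchcase(name, 'data[0-9].csv')]

print(txt_files)  # ['foo.txt', 'notes.txt']
print(csv_files)  # ['data1.csv', 'data2.csv']
```

Using fnmatchcase() rather than fnmatch() here keeps the result the same on every platform.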

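2. Normalizing Unicode text to a standard form

Some characters can be represented by more than one valid sequence of code points, so two strings that look identical may not compare equal.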

>>> s = 'Spicy Jalape\u00f1o'  
>>> s  
'Spicy Jalapeño'  
>>> s2 = 'Spicy Jalapen\u0303o'  
>>> s2  
'Spicy Jalapeño'  
>>> s == s2  
False

To solve this problem, first normalize the text to a standard form. The unicodedata module can do this:

>>> import unicodedata  
>>> a = unicodedata.normalize('NFC', s)  
>>> b = unicodedata.normalize('NFC', s2)  
>>> a  
'Spicy Jalapeño'  
>>> b  
'Spicy Jalapeño'  
>>> a == b  
True  
>>> a = unicodedata.normalize('NFD', s)  
>>> b = unicodedata.normalize('NFD', s2)  
>>> a == b  
True

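NFC means characters should be fully composed (using a single code point where possible). NFD means characters should be fully decomposed into a base character followed by combining characters.
There are also NFKC and NFKD, which additionally apply compatibility mappings to certain characters.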
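
The sketch below shows two things normalization makes possible: stripping diacritics after an NFD decomposition, and the extra compatibility mapping that NFKC performs. The ligature string is an assumption chosen for illustration; it is not part of the example above.

```python
import unicodedata

s = 'Spicy Jalape\u00f1o'

# Decompose first so the tilde becomes a separate combining character.
decomposed = unicodedata.normalize('NFD', s)

# unicodedata.combining() returns a nonzero class for combining characters,
# so filtering on it drops the tilde and keeps the base letter 'n'.
stripped = ''.join(c for c in decomposed if not unicodedata.combining(c))
print(stripped)  # Spicy Jalapeno

# Illustrative string starting with the 'fi' ligature, U+FB01.
ligature = '\ufb01le'

# NFC leaves the ligature alone; NFKC replaces it with the letters 'f' and 'i'.
nfc = unicodedata.normalize('NFC', ligature)
nfkc = unicodedata.normalize('NFKC', ligature)
print(nfc == ligature)  # True
print(nfkc)             # file
```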

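3. Filtering and cleaning text

This section introduces the string method translate(), which maps or deletes characters according to a translation table.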

>>> s = 'python\fis\tawesome\r\n'  
>>> s  
'python\x0cis\tawesome\r\n'  
>>> remap = {ord('\f'): ' ', ord('\t'): ' ', ord('\r'): None}  
>>> a = s.translate(remap)  
>>> a  
'python is awesome\n'
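
The table passed to translate() can also be built with str.maketrans(), which avoids writing ord() for every key. The following is a minimal sketch of that alternative on a string of the same kind; it is one possible way to write it, not the only one.

```python
s = 'python\fis\tawesome\r\n'

# With a single dict argument, str.maketrans() accepts one-character string
# keys and converts them to code points; None marks a character for deletion.
table = str.maketrans({'\f': ' ', '\t': ' ', '\r': None})
cleaned = s.translate(table)
print(repr(cleaned))  # 'python is awesome\n'

# The three-argument form does the same: map each character in the first
# string to the one at the same position in the second, delete the third.
table2 = str.maketrans('\f\t', '  ', '\r')
print(s.translate(table2) == cleaned)  # True
```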

4. Reformatting text to a fixed number of columns

>>> text = "Binzhou Medical University is a common medical university at provincial level in Shandong Province，and its predecessor was the Public Medical School of Shandong University originally established in 1946. The university follows its school-running tradition that “teaching comes first, and quality is prior to any others” and that “the education of man is a fundamental, and moral education has the priority”, and puts into practices its university dictum of “benevolent mind and wonderful skills”. Sticking to the centeredness on the cultivation of talents and that..."  
>>> import textwrap  
>>> textwrap.fill(text, 80)  
'Binzhou Medical University is a common medical university at provincial level in\nShandong Province，and its predecessor was the Public Medical School of Shandong\nUniversity originally established in 1946. The university follows its school-\nrunning tradition that “teaching comes first, and quality is prior to any\nothers” and that “the education of man is a fundamental, and moral education has\nthe priority”, and puts into practices its university dictum of “benevolent mind\nand wonderful skills”. Sticking to the centeredness on the cultivation of\ntalents and that...'

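fill() also accepts the optional initial_indent and subsequent_indent parameters: initial_indent sets the indentation of the first line, and subsequent_indent sets the indentation of every other line.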
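
A minimal sketch of the two indentation parameters follows. The sample sentence and the width of 40 are arbitrary choices made for this example.

```python
import textwrap

# Arbitrary sample text, used only for illustration.
text = ('The quick brown fox jumps over the lazy dog, '
        'then rests under a tree for the afternoon.')

# The first line is indented by four spaces, every following line by two.
# The indentation counts toward the 40-column width.
wrapped = textwrap.fill(text, width=40,
                        initial_indent='    ',
                        subsequent_indent='  ')
print(wrapped)
```

The same keyword arguments are accepted by textwrap.wrap(), which returns a list of lines instead of a single string.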

5. Handling HTML and XML entities

s = '<html></html>'  
import html  
print(html.escape(s))  # &lt;html&gt;&lt;/html&gt;  
from html import unescape  
print(unescape(html.escape(s)))  # <html></html>

Decoding XML entities:
from xml.sax.saxutils import unescape
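
As a rough sketch of how the XML helpers behave, the example below escapes and then restores a small string with xml.sax.saxutils. The input string is invented for the example.

```python
from xml.sax.saxutils import escape, unescape

# Invented sample text containing characters that are special in XML.
raw = '<item name="a & b">'

# escape() replaces '&', '<' and '>' by default; quotes are left untouched.
escaped = escape(raw)
print(escaped)  # &lt;item name="a &amp; b"&gt;

# Extra replacements can be supplied through the optional entities dict.
quoted = escape(raw, {'"': '&quot;'})
print(quoted)   # &lt;item name=&quot;a &amp; b&quot;&gt;

# unescape() reverses the default replacements.
restored = unescape(escaped)
print(restored == raw)  # True
```

Unlike html.unescape(), the saxutils version only handles &amp;amp;, &amp;lt; and &amp;gt; unless more entities are passed in explicitly.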
