1. Matching strings with wildcards
The fnmatch module provides two functions, fnmatch() and fnmatchcase(), for this kind of matching.
>>> from fnmatch import fnmatch, fnmatchcase
>>> fnmatch('fo.txt', '*.txt')
True
>>> fnmatch('fo.txt', '?o.txt')
True
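fnmatch() follows the case-sensitivity rules of the underlying operating system.
# macOS
>>> fnmatch('foo.txt', '*.TXT')
False
# Windows
>>> fnmatch('foo.txt', '*.TXT')
True
fnmatchcase(), by contrast, always matches case-sensitively, exactly as the pattern is written.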
>>> fnmatchcase('foo.txt', '*.TXT')
False
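A common use of these functions is filtering a list of names. The following is a minimal sketch, assuming a hypothetical helper filter_names() and made-up file names; it uses fnmatchcase() so the result is the same on every platform.

```python
from fnmatch import fnmatchcase

def filter_names(names, pattern):
    """Return the names matching pattern, case-sensitively on any OS.

    filter_names is a hypothetical helper written for this illustration.
    """
    return [name for name in names if fnmatchcase(name, pattern)]

# Made-up sample data
names = ['Dat1.csv', 'Dat2.csv', 'config.ini', 'foo.py', 'NOTES.CSV']

csv_files = filter_names(names, 'Dat*.csv')
print(csv_files)  # ['Dat1.csv', 'Dat2.csv']

# 'NOTES.CSV' is skipped because the extension differs in case
lower_csv = filter_names(names, '*.csv')
print(lower_csv)  # ['Dat1.csv', 'Dat2.csv']
```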
2. Normalizing Unicode text to a standard form
Some characters can be represented by more than one valid sequence of code points:
>>> s = 'Spicy Jalape\u00f1o'
>>> s
'Spicy Jalapeño'
>>> s2 = 'Spicy Jalapen\u0303o'
>>> s2
'Spicy Jalapeño'
>>> s == s2
False
To fix this, normalize the text to a standard form first, using the unicodedata module:
>>> import unicodedata
>>> a = unicodedata.normalize('NFC', s)
>>> b = unicodedata.normalize('NFC', s2)
>>> a
'Spicy Jalapeño'
>>> b
'Spicy Jalapeño'
>>> a == b
True
>>> a = unicodedata.normalize('NFD', s)
>>> b = unicodedata.normalize('NFD', s2)
>>> a == b
True
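NFC means characters should be fully composed (using a single code point where possible); NFD means characters should be fully decomposed, using combining characters.
In addition, there are NFKC and NFKD, which also apply compatibility decompositions.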
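To make the difference between the forms concrete, here is a small sketch. The ligature character and the accent-stripping step are illustrative choices, not part of the original notes: NFKC replaces a compatibility character such as the "fi" ligature with plain letters, and NFD combined with unicodedata.combining() can be used to drop accent marks.

```python
import unicodedata

s = 'Spicy Jalape\u00f1o'
lig = '\ufb01'  # LATIN SMALL LIGATURE FI, chosen only for illustration

# NFC leaves the ligature alone; NFKC replaces it with the letters 'f' + 'i'
nfc = unicodedata.normalize('NFC', lig)
nfkc = unicodedata.normalize('NFKC', lig)
print(nfc == lig)  # True
print(nfkc)        # fi

# NFD splits n-with-tilde into 'n' + U+0303; filtering out combining
# characters then removes the accent
decomposed = unicodedata.normalize('NFD', s)
stripped = ''.join(c for c in decomposed if not unicodedata.combining(c))
print(stripped)    # Spicy Jalapeno
```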
3. Filtering and cleaning text
This section introduces the string method translate().
>>> s = 'python\fis\tawesome\r\n'
>>> s
'python\x0cis\tawesome\r\n'
>>> remap = {ord('\f'): ' ', ord('\t'): ' ', ord('\r'): None}
>>> a = s.translate(remap)
>>> a
'python is awesome\n'
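The same mapping can also be built with str.maketrans(), which accepts single-character keys directly so ord() is not needed. This is a minimal sketch; clean_whitespace() is a hypothetical helper name and the input string is made up.

```python
def clean_whitespace(text):
    """Map form feeds and tabs to spaces and delete carriage returns.

    clean_whitespace is a hypothetical helper written for this illustration.
    """
    # A value of None deletes the character
    remap = str.maketrans({'\f': ' ', '\t': ' ', '\r': None})
    return text.translate(remap)

cleaned = clean_whitespace('hello\fworld\tagain\r\n')
print(repr(cleaned))  # 'hello world again\n'
```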
4. Reformatting text to a fixed number of columns
>>> text = "Binzhou Medical University is a common medical university at provincial level in Shandong Province,and its predecessor was the Public Medical School of Shandong University originally established in 1946. The university follows its school-running tradition that “teaching comes first, and quality is prior to any others” and that “the education of man is a fundamental, and moral education has the priority”, and puts into practices its university dictum of “benevolent mind and wonderful skills”. Sticking to the centeredness on the cultivation of talents and that..."
>>> import textwrap
>>> textwrap.fill(text, 80)
'Binzhou Medical University is a common medical university at provincial level in\nShandong Province,and its predecessor was the Public Medical School of Shandong\nUniversity originally established in 1946. The university follows its school-\nrunning tradition that “teaching comes first, and quality is prior to any\nothers” and that “the education of man is a fundamental, and moral education has\nthe priority”, and puts into practices its university dictum of “benevolent mind\nand wonderful skills”. Sticking to the centeredness on the cultivation of\ntalents and that...'
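fill() also accepts the optional initial_indent and subsequent_indent parameters: initial_indent controls the indentation of the first line, and subsequent_indent controls the indentation of all other lines.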
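A short sketch of the two indent parameters, using a made-up sample sentence and an arbitrary width of 40 columns. Note that the indent strings count toward the line width.

```python
import textwrap

# Made-up sample text for illustration
sample = ("The quick brown fox jumps over the lazy dog "
          "and keeps running through the field.")

wrapped = textwrap.fill(
    sample,
    width=40,
    initial_indent='    ',    # first line starts with four spaces
    subsequent_indent='  ',   # every later line starts with two spaces
)
print(wrapped)
```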
5. Handling HTML and XML entities
s = '<html></html>'
import html
print(html.escape(s)) # &lt;html&gt;&lt;/html&gt;
from html import unescape
print(unescape(html.escape(s))) # <html></html>
Decoding XML entities:
from xml.sax.saxutils import unescape
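As a rough illustration of the XML helpers, the sketch below round-trips a made-up string through escape() and unescape() from xml.sax.saxutils. By default these functions handle only &amp;, &lt; and &gt;; other entities, such as quotes, have to be supplied through the optional entities dictionary.

```python
from xml.sax.saxutils import escape, unescape

raw = 'a < b && "c" > d'  # made-up sample text

escaped = escape(raw)
print(escaped)   # a &lt; b &amp;&amp; "c" &gt; d

restored = unescape(escaped)
print(restored == raw)  # True

# Quote entities are not handled by default; pass them in explicitly
extra = {'&quot;': '"', '&apos;': "'"}
with_quotes = unescape('&quot;hi&quot; &apos;there&apos;', extra)
print(with_quotes)  # "hi" 'there'
```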