【NLP】02 Basic Text Processing

Regular expressions

A formal language for specifying text strings.

[A-Z] [a-z] [0-9]
[^A-Z] [^e] [^!] [^A-Za-z]
groundhog|woodchuck a|b|c [gG]roundhog|[wW]oodchuck
colou?r oo*h! o+h! baa+ beg.n
^[A-Z] [A-Z]$ ^[^A-Za-z] \.$ .$
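
The patterns above can be tried out directly. The following is a minimal sketch using Python's standard `re` module; the sample strings are made up for illustration and are not part of the original notes.

```python
import re

# Each entry: (pattern, sample text). Sample strings are illustrative only.
examples = [
    (r"[wW]oodchuck", "Woodchuck or woodchuck"),                 # character class
    (r"[^A-Z]", "Oyfn"),                                         # negated class
    (r"groundhog|woodchuck", "a groundhog, not a woodchuck"),    # disjunction
    (r"colou?r", "color colour"),                                # optional character
    (r"baa+", "ba baa baaaa"),                                   # one or more
    (r"beg.n", "begin begun beg3n"),                             # any single character
]

matches = {pattern: re.findall(pattern, text) for pattern, text in examples}
for pattern, found in matches.items():
    print(f"{pattern!r:28} -> {found}")

# Anchors: ^ ties the match to the start of the string, $ to the end.
starts_upper = bool(re.search(r"^[A-Z]", "Palo Alto"))
ends_period = bool(re.search(r"\.$", "The end."))   # escaped dot: a literal period
ends_any = bool(re.search(r".$", "The end?"))       # bare dot: any final character
```

Note that an unescaped `.` matches any character, so a literal period has to be written as `\.`.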

Error

False positives (Type I):
Matching strings that we should not have matched
False negatives (Type II):
Not matching things that we should have matched

Reducing the error rate involves two antagonistic efforts:

Increasing accuracy or precision (minimizing false positives)
Increasing coverage or recall (minimizing false negatives)
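
To make the trade-off concrete, here is a small sketch that searches for the word "the" with three successively refined patterns. The sample sentence is invented for illustration, and `\b` (Python's word-boundary shorthand) is an assumption of this sketch rather than something taken from the notes.

```python
import re

# Illustrative sentence: contains "The", "the" (twice), "other", and "theology".
text = "The other day the theology student said: the end."

def find(pattern):
    """Return every non-overlapping match of pattern in the sample text."""
    return re.findall(pattern, text)

# Misses the capitalized "The" (false negative) and also
# matches inside "other" and "theology" (false positives).
naive = find(r"the")

# Fixes the false negative, but the false positives remain.
cased = find(r"[tT]he")

# Word boundaries remove the matches inside longer words.
bounded = find(r"\b[tT]he\b")

print(len(naive), len(cased), bounded)
```

Each refinement targets one kind of error: the second pattern raises recall, the third raises precision.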

Summary

Regular expressions play a surprisingly large role.
For many hard tasks, we use machine learning classifiers.
Even then, regular expressions are often used as features in the classifiers.
They can be very useful in capturing generalizations.
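
One way to picture regular expressions as classifier features is to turn each pattern into a binary indicator. The sketch below is only an illustration: the feature names, the patterns, and the `regex_features` helper are hypothetical and do not come from the notes or from any particular library.

```python
import re

# Hypothetical feature names and patterns, chosen only for illustration.
FEATURE_PATTERNS = {
    "has_number": re.compile(r"[0-9]+"),
    "starts_capitalized": re.compile(r"^[A-Z]"),
    "ends_with_question_mark": re.compile(r"\?$"),
}

def regex_features(text):
    """Map a string to binary features: 1 if the pattern matches, else 0."""
    return {
        name: int(bool(pattern.search(text)))
        for name, pattern in FEATURE_PATTERNS.items()
    }

features = regex_features("Is room 42 free?")
print(features)
```

A dictionary like this could then be passed, alongside other features, to whatever classifier is being trained.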

Regular expressions in practical NLP

Word tokenization

Text normalization (standardization):

Segmenting/tokenizing words in running text
Normalizing word formats
Segmenting sentences in running text
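
The three steps can be sketched with plain regular expressions. This is a deliberately naive illustration: the sample text, the helper names, and the patterns are assumptions of this sketch, and lowercasing is used as just one simple form of normalization.

```python
import re

# Illustrative input text, not taken from the notes.
text = "The cat sat. It didn't move! Why not?"

def split_sentences(s):
    """Naive sentence segmentation: split on whitespace that follows ., ! or ?"""
    return re.split(r"(?<=[.!?])\s+", s.strip())

def tokenize(sentence):
    """Words (with an optional apostrophe part), digit runs, or single punctuation marks."""
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?|[0-9]+|[^\sA-Za-z0-9]", sentence)

def normalize(tokens):
    """Simplest possible normalization: lowercase every token."""
    return [token.lower() for token in tokens]

sentences = split_sentences(text)
processed = [normalize(tokenize(sentence)) for sentence in sentences]
for tokens in processed:
    print(tokens)
```

A splitter this simple breaks on abbreviations such as "Mr." or "U.S.A.", which is one reason sentence segmentation is treated as a task of its own.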