25 Oct 2018
【NLP】02 Basic Text Processing
Regular expressions
A regular expression is a formal language for specifying text strings.
- [A-Z] [a-z] [0-9]
- [^A-Z] [^e] [^!] [^A-Za-z]
- groundhog|woodchuck a|b|c [gG]roundhog|[wW]oodchuck
- colou?r oo*h! o+h! baa+ beg.n
- ^[A-Z] [A-Z]$ ^[^A-Za-z] \.$ .$
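To see what the patterns above actually match, here is a minimal sketch using Python's built-in `re` module. The sample strings and the helper name `find_all` are made up for illustration and are not from the lecture.

```python
import re

def find_all(pattern, text):
    """Return every non-overlapping match of pattern in text."""
    return re.findall(pattern, text)

# (pattern from the notes, hypothetical sample text)
examples = [
    (r"[gG]roundhog|[wW]oodchuck", "Woodchuck and groundhog"),  # disjunction
    (r"colou?r", "color colour"),        # ? : optional previous char
    (r"oo*h!", "oh! ooh! oooh!"),        # * : zero or more of previous char
    (r"o+h!", "oh! ooh!"),               # + : one or more of previous char
    (r"baa+", "ba baa baaaa"),           # needs at least two a's
    (r"beg.n", "begin begun beg3n"),     # . : any single character
    (r"^[A-Z]", "Palo Alto"),            # ^ : anchor at start of string
    (r"[A-Z]$", "Vitamin C"),            # $ : anchor at end of string
    (r"^[^A-Za-z]", "1 apple"),          # starts with a non-letter
    (r"\.$", "The end."),                # literal period at the end
    (r".$", "The end?"),                 # any character at the end
]

for pattern, text in examples:
    print(f"{pattern!r:32} on {text!r:28} -> {find_all(pattern, text)}")
```

Note the difference between the last two patterns: `\.$` only matches a real period, while `.$` matches whatever the last character is.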
Errors
- False positives (Type I):
Matching strings that we should not have matched
- False negatives (Type II):
Not matching things that we should have matched
Reducing the error rate involves two antagonistic efforts:
- Increasing accuracy or precision (minimizing false positives)
- Increasing coverage or recall (minimizing false negatives)
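The trade-off can be made concrete by scoring a few patterns that try to find the word "the". This is only a sketch: the token list, the gold labels, and the function name `precision_recall` are assumptions chosen for illustration.

```python
import re

def precision_recall(pattern, tokens, gold):
    """Score a pattern against tokens.

    tokens: list of strings to test one by one.
    gold:   set of indices that SHOULD match.
    """
    predicted = {i for i, tok in enumerate(tokens) if re.search(pattern, tok)}
    true_pos = len(predicted & gold)
    # precision drops with false positives, recall drops with false negatives
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical data: only indices 0 and 4 are really the word "the".
tokens = ["The", "other", "one", "is", "the", "theology", "book"]
gold = {0, 4}

# "the" misses "The" (false negative) and hits "other", "theology" (false positives)
print(precision_recall(r"the", tokens, gold))
# "[tT]he" fixes the false negative but keeps the false positives
print(precision_recall(r"[tT]he", tokens, gold))
# anchoring both ends removes the false positives as well
print(precision_recall(r"^[tT]he$", tokens, gold))
```

Each refinement of the pattern fixes one kind of error, which is the usual way a regular expression gets tuned in practice.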
Summary
Regular expressions play a surprisingly large role.
For many hard tasks, we use machine learning classifiers.
But regular expressions are used as features in the classifiers.
Can be very useful in capturing generalizations.
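As a rough idea of what "regular expressions as features" could look like, the sketch below turns a few patterns into 0/1 features that a classifier might consume. The feature names and the `regex_features` helper are hypothetical, and no particular classifier or library is assumed.

```python
import re

# Hypothetical feature set: each name maps to one pattern.
FEATURE_PATTERNS = {
    "starts_with_capital": r"^[A-Z]",
    "contains_digit": r"[0-9]",
    "ends_with_period": r"\.$",
}

def regex_features(text):
    """Map text to a dict of binary features, one per pattern."""
    return {
        name: int(bool(re.search(pattern, text)))
        for name, pattern in FEATURE_PATTERNS.items()
    }

print(regex_features("Room 101."))
print(regex_features("hello"))
```

Each pattern captures one generalization about the string, and the classifier decides how much weight each one deserves.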
Regular expressions in practical NLP
Word tokenization
Text normalization (standardization):
- Segmenting/tokenizing words in running text
- Normalizing word formats
- Segmenting sentences in running text
- Segmenting sentences in running text
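The three steps above can be sketched with plain regular expressions. This is a deliberately naive version, not the method from the lecture: the function names are made up, lowercasing stands in for "normalizing word formats", and the sentence splitter will break on abbreviations such as "Dr." or numbers such as "3.14".

```python
import re

def split_sentences(text):
    """Naive sentence segmentation: split on whitespace after . ! or ?"""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence):
    """Words (optionally with an internal apostrophe) or single punctuation marks."""
    return re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?|[^\sA-Za-z0-9]", sentence)

def normalize(tokens):
    """Simplest possible normalization: case folding."""
    return [t.lower() for t in tokens]

text = "Woodchucks chuck wood. Don't they? Yes!"
for sentence in split_sentences(text):
    print(normalize(tokenize(sentence)))
```

Real tokenizers handle many more cases (clitics, hyphens, URLs, languages without spaces), so treat this only as a starting point.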
Til next time,
gentlesnow
at 12:01
