clean Package¶
duplicate
Module¶
html
Module¶
HTML Clean Module
Unescapes HTML entities (named or code-point form) to Unicode characters and removes HTML tags.
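The behaviour described above can be sketched with the standard library. This is only an illustration, not the module's actual code: the function name `html_clean`, the tag regex, and the order of operations (tags stripped before entities are unescaped) are all assumptions.

```python
import html
import re

# Assumed, deliberately simple tag pattern: anything between "<" and ">".
TAG_RE = re.compile(r"<[^>]+>")


def html_clean(sentence):
    """Hypothetical helper: strip HTML tags, then unescape entities."""
    # Strip tags first so that escaped markup such as "&lt;b&gt;"
    # is kept as literal text instead of being removed as a tag.
    without_tags = TAG_RE.sub("", sentence)
    # html.unescape handles both named (&eacute;) and numeric (&#38;) forms.
    return html.unescape(without_tags)
```

For instance, `html_clean("<p>Caf&eacute; &#38; bar</p>")` would yield `Café & bar` under these assumptions.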
identical
Module¶
length_diff
Module¶
Predicate Module: Length Distance
length_limit
Module¶
Predicate Module: Length Limit
lowercase
Module¶
Lowercase module for the corpus clean tool.
newline
Module¶
regex
Module¶
Regular expression clean module.
-
class
corpustools.clean.regex.
RegexClean
(clean, step)[source]¶ Bases:
object
RegexClean runs regular-expression cleaning on the source and target corpus.
-
compile_relist
()[source]¶ Compile the regular expressions into re objects before they are used, to improve performance. Each compiled pattern is assigned back to its clean step, replacing the string form of the pattern.
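A minimal sketch of this compile-once idea, assuming a clean step is a plain dict with the `pattern`, `unicode`, and `case_sensitive` keys used in this module's example step. The helper name `compile_step_pattern` and the mapping from keys to `re` flags are assumptions, not the library's confirmed behaviour.

```python
import re


def compile_step_pattern(step):
    """Hypothetical helper: replace step["pattern"] with a compiled re object."""
    flags = 0
    if step.get("unicode", False):
        flags |= re.UNICODE
    if not step.get("case_sensitive", True):
        flags |= re.IGNORECASE
    # Overwrite the string form so later calls reuse the compiled pattern.
    step["pattern"] = re.compile(step["pattern"], flags)
    return step
```

Compiling up front avoids re-parsing the same pattern string for every sentence in the corpus.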
-
re_clean
(sentence)[source]¶ Clean the sentence according to the clean step and return the cleaned corpus sentence.
Parameters: sentence – unicode string, corpus sentence.
Example of a clean step:
{ "description": "delete cdata", "action": "replace", "pattern" : "CDATA", "repl" : "", "apply_to": "source", "unicode": true, "case_sensitive": true, "log": "detail" }
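How a `replace` step like the one above might be applied can be sketched as follows. The function `apply_replace_step` is hypothetical, and the handling of `case_sensitive` is an assumption; the `apply_to` and `log` keys are ignored here because the source does not describe how they are processed.

```python
import re

# The example clean step from the documentation, as a Python dict.
STEP = {
    "description": "delete cdata",
    "action": "replace",
    "pattern": "CDATA",
    "repl": "",
    "apply_to": "source",
    "unicode": True,
    "case_sensitive": True,
    "log": "detail",
}


def apply_replace_step(sentence, step):
    """Hypothetical helper: substitute step["repl"] for every pattern match."""
    flags = 0 if step["case_sensitive"] else re.IGNORECASE
    return re.sub(step["pattern"], step["repl"], sentence, flags=flags)
```

With this step, only the literal text `CDATA` is deleted; surrounding brackets are left in place, and lowercase `cdata` is untouched because the step is case sensitive.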
-
re_del
(sentence, pattern)[source]¶ Return an empty string if the pattern matches the sentence.
Parameters: - sentence – unicode string, corpus sentence.
- pattern – re object.
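The delete behaviour can be sketched as below. Two details are assumptions because the documentation does not state them: that the pattern may match anywhere in the sentence (`search` rather than `match`), and that an unmatched sentence is returned unchanged.

```python
import re


def re_del(sentence, pattern):
    """Sketch: drop the whole sentence when the compiled pattern matches."""
    # Assumption: a match anywhere in the sentence triggers deletion.
    if pattern.search(sentence):
        return ""
    # Assumption: non-matching sentences pass through untouched.
    return sentence
```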
-
sentence_ratio
Module¶
Predicate Module: Sentences Ratio
similar
Module¶
tokenize
Module¶
Tokenize module in corpus clean tools
-
corpustools.clean.tokenize.
run
(clean, tools, step)[source]¶ Clean module interface function, run tokenization for corpus files.
-
corpustools.clean.tokenize.
tokenize
(clean, tools, step, lang)[source]¶ Tokenize the corpus files in the corpus clean working directory.
This function works as a router that dispatches the request to the tokenizers in the token subpackage. The modules in the token subpackage are adapters to external tokenizer tools.
Parameters: - clean – corpus clean configuration.
- tools – external tools configuration.
- step – clean step.
- lang – the language of the corpus to be tokenized.
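The router pattern described for this function can be illustrated with a simple dispatch table. This is a much-simplified sketch: it works on strings instead of corpus files, it omits the `clean`, `tools`, and `step` arguments, and the two toy tokenizers and the `TOKENIZERS` registry are hypothetical stand-ins for the adapters in the token subpackage.

```python
def whitespace_tokenizer(text):
    """Toy tokenizer: split on whitespace."""
    return text.split()


def char_tokenizer(text):
    """Toy tokenizer: one token per non-space character."""
    return [ch for ch in text if not ch.isspace()]


# Hypothetical registry mapping a language code to its tokenizer adapter.
TOKENIZERS = {
    "en": whitespace_tokenizer,
    "zh": char_tokenizer,
}


def tokenize(text, lang):
    """Dispatch the request to the tokenizer registered for `lang`."""
    try:
        tokenizer = TOKENIZERS[lang]
    except KeyError:
        raise ValueError("no tokenizer registered for language %r" % lang)
    return tokenizer(text)
```

Keeping the router separate from the adapters means a new language only needs a new registry entry, not a change to the calling code.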
url
Module¶
URL Clean Module
Removes URL-like text from the corpus on a best-effort basis.
-
class
corpustools.clean.url.
URLClean
(clean, step)[source]¶ Bases:
object
Class for cleaning URL-like text from the corpus.
-
COUNTRY_ROOT
= ['uk', 'eu', 'au', 'br', 'ca', 'cn', 'ch', 'cz', 'de', 'dk', 'es', 'fi', 'fr', 'gr', 'hk', 'hu', 'ie', 'il', 'in', 'it', 'jp', 'kr', 'nl', 'no', 'pl', 'pt', 'ro', 'ru', 'se', 'sg', 'tr', 'tw', 'ua', 'us']¶
-
GENERAL_ROOT
= ['com', 'org', 'net', 'edu', 'gov', 'info', 'int', 'tv']¶
-
PROTOCAL
= ['http', 'https', 'ftp', 'ldap']¶
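One way such lists could be combined into a URL-matching regular expression is sketched below. The three lists are copied from the class attributes above (including the original `PROTOCAL` spelling); the pattern itself and the `strip_urls` helper are assumptions for illustration and are not taken from the module's source.

```python
import re

COUNTRY_ROOT = ['uk', 'eu', 'au', 'br', 'ca', 'cn', 'ch', 'cz', 'de', 'dk',
                'es', 'fi', 'fr', 'gr', 'hk', 'hu', 'ie', 'il', 'in', 'it',
                'jp', 'kr', 'nl', 'no', 'pl', 'pt', 'ro', 'ru', 'se', 'sg',
                'tr', 'tw', 'ua', 'us']
GENERAL_ROOT = ['com', 'org', 'net', 'edu', 'gov', 'info', 'int', 'tv']
PROTOCAL = ['http', 'https', 'ftp', 'ldap']

# Assumed pattern: optional protocol, dotted host labels, a known root
# domain, and an optional path made of non-space characters.
URL_RE = re.compile(
    r"(?:(?:%s)://)?(?:[a-z0-9-]+\.)+(?:%s)\b(?:/\S*)?"
    % ("|".join(PROTOCAL), "|".join(GENERAL_ROOT + COUNTRY_ROOT)),
    re.IGNORECASE,
)


def strip_urls(text):
    """Hypothetical helper: delete every URL-like span from the text."""
    return URL_RE.sub("", text)
```

Restricting matches to a fixed set of root domains reduces false positives on ordinary dotted text, at the cost of missing URLs under other top-level domains.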
-
zstring
Module¶
ZString Sequence Clean Module
Support conversion for Adobe’s ZString escape sequence.