clean Package

duplicate Module

html Module

HTML Clean Module

Unescape HTML entities (named or code point form) to Unicode characters and remove HTML tags.

corpustools.clean.html.clean_html(parser, line)[source]

Unescape XML escape sequences and HTML entities, and remove HTML tags.
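As a rough illustration of the unescape-then-strip idea, the sketch below uses only the Python standard library. It is not the implementation of clean_html: the function name clean_html_sketch and the tag pattern are assumptions made for this example, and the parser argument of the real function is not modelled.

```python
import html
import re

# Hypothetical pattern: anything shaped like <tag ...> or </tag>.
TAG_RE = re.compile(r"</?[A-Za-z][^<>]*>")


def clean_html_sketch(line):
    """Unescape entities, then drop tag-like spans (illustrative only)."""
    # html.unescape handles named (&eacute;), decimal (&#233;) and
    # hexadecimal (&#xe9;) forms.
    text = html.unescape(line)
    # Stripping after unescaping means escaped tags such as &lt;b&gt;
    # are removed as well.
    return TAG_RE.sub("", text)


print(clean_html_sketch("<b>caf&eacute;</b> &amp; t&#233;"))
```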

corpustools.clean.html.clean_htmltag(line)[source]

Remove HTML tags.

corpustools.clean.html.run(clean_config, corpustools_config, step)[source]

Entry function of the module.

corpustools.clean.html.validate(step)[source]

identical Module

length_diff Module

Predicate Module: Length Distance

corpustools.clean.length_diff.predicate(source, target, constraint)[source]

Return True if the distance between source and target is beyond the limit.
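A predicate of this kind could look like the sketch below. The text does not say how the distance is measured or how the constraint is laid out, so the sketch assumes the distance is the absolute difference in whitespace-separated token counts and that the limit sits under a hypothetical "diff" key.

```python
def length_diff_predicate_sketch(source, target, constraint):
    """Return True when the token-count gap exceeds the limit.

    Assumptions: lengths are whitespace-token counts, and the limit is
    stored under a hypothetical "diff" key in the constraint dict.
    """
    gap = abs(len(source.split()) - len(target.split()))
    return gap > constraint["diff"]


# Five tokens against one token is a gap of four.
print(length_diff_predicate_sketch("a b c d e", "a", {"diff": 3}))
```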

corpustools.clean.length_diff.validate(step)[source]

length_limit Module

Predicate Module: Length Limit

corpustools.clean.length_limit.predicate(source, target, constraint)[source]

Return True if the length of source and/or target is beyond the limit.

The length limit for GIZA++ in Moses is 100 tokens.
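The check can be sketched as follows. The 100-token default comes from the text above; counting whitespace-separated tokens and reading the limit from a hypothetical "max" key are assumptions of this example, not the documented constraint format.

```python
GIZA_MAX_TOKENS = 100  # limit quoted in the text above


def length_limit_predicate_sketch(source, target, constraint):
    """Return True when either side has more tokens than the limit.

    Assumption: the constraint may carry a hypothetical "max" key;
    otherwise the GIZA++ limit is used.
    """
    limit = constraint.get("max", GIZA_MAX_TOKENS)
    return len(source.split()) > limit or len(target.split()) > limit


long_sentence = " ".join(["w"] * 101)
print(length_limit_predicate_sketch(long_sentence, "w", {}))
```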

corpustools.clean.length_limit.validate(step)[source]

lowercase Module

Lowercase module for the corpus clean tool.

corpustools.clean.lowercase.lowercase_corpus(clean, lang, ext)[source]

Lowercase the corpus files by dispatching the request to the lowercase module in corpustools.case.

corpustools.clean.lowercase.run(clean, tools, step)[source]

Clean module interface function; lowercases the corpus files.

newline Module

regex Module

Regular expression clean module.

class corpustools.clean.regex.RegexClean(clean, step)[source]

Bases: object

RegexClean runs regular expression cleaning on the source and target corpus.

compile_relist()[source]

Compile the regular expressions into re objects before use to improve performance. Each compiled pattern is assigned back to its clean step, replacing the string form of the pattern.
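The compile-once idea can be sketched as below. The keys "pattern", "unicode" and "case_sensitive" are the ones used in the clean step example on this page; how the real method maps them to re flags is not documented here, so the mapping and the function name compile_step_sketch are assumptions.

```python
import re


def compile_step_sketch(step):
    """Replace the string pattern in a clean step with a compiled one."""
    flags = 0
    if step.get("unicode", False):
        # Redundant for str patterns in Python 3, kept for clarity.
        flags |= re.UNICODE
    if not step.get("case_sensitive", False):
        flags |= re.IGNORECASE
    # Assign the compiled object back to the step, as described above.
    step["pattern"] = re.compile(step["pattern"], flags)
    return step


step = compile_step_sketch(
    {"pattern": "cdata", "unicode": True, "case_sensitive": False}
)
print(step["pattern"].sub("", "x CDATA y"))
```

Compiling once per step avoids re-parsing the pattern for every sentence in the corpus.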

re_clean(sentence)[source]

Clean the sentence with the clean step and return the cleaned corpus sentence.

Parameters: sentence – unicode string, corpus sentence.

Example of a clean step:

{
  "description": "delete cdata",
  "action": "replace",
  "pattern" : "CDATA",
  "repl" : "",
  "apply_to": "source",
  "unicode": true,
  "case_sensitive": true,
  "log": "detail"
}
re_del(sentence, pattern)[source]

Return an empty string if the pattern matched.

Parameters:
  • sentence – unicode string, corpus sentence.
  • pattern – re object.
re_repl(sentence, pattern, repl)[source]

Return the substituted sentence.

Parameters:
  • sentence – unicode string, corpus sentence.
  • pattern – re object.
  • repl – unicode string.
relist_clean(line)[source]

Clean the line with a list of re steps.
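How the delete and replace helpers combine over a list of steps can be sketched as follows. The "replace" action and the "pattern" and "repl" keys appear in the clean step example above; the "delete" action name, the use of search rather than match, and the _sketch function names are assumptions of this example.

```python
import re


def re_del_sketch(sentence, pattern):
    # Drop the whole sentence when the pattern is found anywhere in it.
    return "" if pattern.search(sentence) else sentence


def re_repl_sketch(sentence, pattern, repl):
    # Substitute every occurrence of the pattern.
    return pattern.sub(repl, sentence)


def relist_clean_sketch(line, steps):
    """Apply each step in order; later steps see earlier results."""
    for step in steps:
        if step["action"] == "delete":  # hypothetical action name
            line = re_del_sketch(line, step["pattern"])
        elif step["action"] == "replace":
            line = re_repl_sketch(line, step["pattern"], step["repl"])
    return line


steps = [
    {"action": "replace", "pattern": re.compile("CDATA"), "repl": ""},
    {"action": "delete", "pattern": re.compile("<!--")},
]
print(repr(relist_clean_sketch("x CDATA y", steps)))
print(repr(relist_clean_sketch("a <!-- b", steps)))
```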

run()[source]

Clean the corpus.

corpustools.clean.regex.run(clean_config, corpustools_config, step)[source]

Entry function of the module.

corpustools.clean.regex.validate(step)[source]

sentence_ratio Module

Predicate Module: Sentences Ratio

corpustools.clean.sentence_ratio.predicate(source, target, constraint)[source]

Return True if the sentence ratio is beyond the threshold.

sentence ratio = source length / target length, or target length / source length

The ratio threshold in the Moses system is 9.
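A minimal sketch of the ratio check follows. The threshold of 9 is taken from the text above; measuring length in whitespace-separated tokens, treating an empty side as failing, and the hypothetical "ratio" constraint key are assumptions of this example.

```python
MOSES_MAX_RATIO = 9  # threshold quoted in the text above


def sentence_ratio_predicate_sketch(source, target, constraint):
    """Return True when the larger of the two ratios exceeds the threshold."""
    src_len = len(source.split())
    tgt_len = len(target.split())
    if src_len == 0 or tgt_len == 0:
        # Assumption: an empty side cannot be aligned, so it fails.
        return True
    ratio = max(src_len / tgt_len, tgt_len / src_len)
    return ratio > constraint.get("ratio", MOSES_MAX_RATIO)


ten_tokens = " ".join(["w"] * 10)
print(sentence_ratio_predicate_sketch(ten_tokens, "w", {}))
```

Taking the maximum of both directions makes the check symmetric, so it does not matter which side is the longer one.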

corpustools.clean.sentence_ratio.validate(step)[source]

similar Module

tokenize Module

Tokenize module for the corpus clean tools.

corpustools.clean.tokenize.run(clean, tools, step)[source]

Clean module interface function; runs tokenization on the corpus files.

corpustools.clean.tokenize.tokenize(clean, tools, step, lang)[source]

Tokenize the corpus files in the corpus clean working directory.

This function works as a router that dispatches the request to the tokenizers in the token subpackage. The modules in the token subpackage are adapters to external tokenizer tools.

Parameters:
  • clean – corpus clean configuration.
  • tools – external tools configuration.
  • step – clean step.
  • lang – language of the corpus to be tokenized.

url Module

URL Clean Module

Remove URL-like text from the corpus on a best-effort basis.

class corpustools.clean.url.URLClean(clean, step)[source]

Bases: object

Class for cleaning URL-like text from the corpus.

COUNTRY_ROOT = ['uk', 'eu', 'au', 'br', 'ca', 'cn', 'ch', 'cz', 'de', 'dk', 'es', 'fi', 'fr', 'gr', 'hk', 'hu', 'ie', 'il', 'in', 'it', 'jp', 'kr', 'nl', 'no', 'pl', 'pt', 'ro', 'ru', 'se', 'sg', 'tr', 'tw', 'ua', 'us']
GENERAL_ROOT = ['com', 'org', 'net', 'edu', 'gov', 'info', 'int', 'tv']
PROTOCAL = ['http', 'https', 'ftp', 'ldap']
prepare_pattern()[source]
run()[source]

Run the URL clean process.

urlclean_line(line, lineno)[source]
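
One way the protocol and domain-root lists above could feed a single compiled pattern is sketched below. The behaviour of prepare_pattern and urlclean_line is not documented here, so the regular expression, the shortened country list and the function name prepare_pattern_sketch are all assumptions made for illustration.

```python
import re

PROTOCAL = ['http', 'https', 'ftp', 'ldap']  # spelling as in the class
GENERAL_ROOT = ['com', 'org', 'net', 'edu', 'gov', 'info', 'int', 'tv']
COUNTRY_ROOT = ['uk', 'de', 'jp']  # shortened subset for the example


def prepare_pattern_sketch():
    """Build one pattern for protocol URLs and bare host names."""
    protocols = "|".join(PROTOCAL)
    roots = "|".join(GENERAL_ROOT + COUNTRY_ROOT)
    # Form 1: protocol://anything-up-to-whitespace
    with_protocol = r"(?:%s)://\S+" % protocols
    # Form 2: dotted host ending in a known root, with an optional path
    bare_host = r"\b(?:[\w-]+\.)+(?:%s)\b(?:/\S*)?" % roots
    return re.compile("%s|%s" % (with_protocol, bare_host), re.IGNORECASE)


pattern = prepare_pattern_sketch()
print(repr(pattern.sub("", "see http://example.com/a now")))
print(repr(pattern.sub("", "visit www.example.co.uk/page today")))
```

Restricting bare host names to known roots keeps ordinary dotted text, such as abbreviations, from being treated as a URL.
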
corpustools.clean.url.run(clean_config, corpustools_config, step)[source]

Entry function of the module.

corpustools.clean.url.validate(step)[source]

zstring Module

ZString Sequence Clean Module

Supports conversion of Adobe’s ZString escape sequences.

corpustools.clean.zstring.run(clean_config, corpustool_config, step)[source]

Entry function of the module.

corpustools.clean.zstring.validate(step)[source]
corpustools.clean.zstring.zstring_unescape(line, zdict)[source]

Unescape both the name form and the number form of ZString escape sequences.