clean Package¶

`duplicate` Module¶

`html` Module¶

HTML Clean Module

Unescape HTML entities (named or code point form) to Unicode characters and remove HTML tags.

corpustools.clean.html.clean_html(parser, line)[source]¶: Unescape XML escape sequences and HTML entities, and remove HTML tags.

corpustools.clean.html.clean_htmltag(line)[source]¶: Remove HTML tags.

corpustools.clean.html.run(clean_config, corpustools_config, step)[source]¶: Entry function.

corpustools.clean.html.validate(step)[source]¶
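
The behaviour described for this module can be sketched roughly as follows. This is a minimal illustration, not the module's actual code: the names `TAG_RE`, `strip_tags` and `unescape_and_strip` are hypothetical, the tag pattern is an assumption, and the standard-library `html.unescape` stands in for whatever parser `clean_html` receives.

```python
import html
import re

# Hypothetical pattern: matches anything that looks like an HTML/XML tag.
TAG_RE = re.compile(r"</?[A-Za-z][^<>]*>")


def strip_tags(line):
    """Remove tag-like spans from a line (illustrative only)."""
    return TAG_RE.sub("", line)


def unescape_and_strip(line):
    """Unescape named and numeric entities, then remove tags."""
    # html.unescape handles forms such as &amp;, &eacute;, &#169; and &#xA9;.
    return strip_tags(html.unescape(line))
```

Unescaping before stripping means that escaped markup such as `&lt;b&gt;` is also removed; whether the real module behaves this way is not stated in the documentation.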

`identical` Module¶

`length_diff` Module¶

Predicate Module: Length Distance

corpustools.clean.length_diff.predicate(source, target, constraint)[source]¶: Return True if the length distance between source and target exceeds the limit.

corpustools.clean.length_diff.validate(step)[source]¶
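
A predicate of this shape might look like the sketch below. It assumes that "distance" means the absolute difference in whitespace-separated token counts and that the limit is stored under a hypothetical `"diff"` key of the constraint; neither detail is confirmed by the documentation.

```python
def length_diff_predicate(source, target, constraint):
    """Return True when the token-count difference exceeds the limit."""
    # Assumption: distance = absolute difference in token counts.
    # Assumption: the limit is stored under the hypothetical key "diff".
    diff = abs(len(source.split()) - len(target.split()))
    return diff > constraint["diff"]
```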

`length_limit` Module¶

Predicate Module: Length Limit

corpustools.clean.length_limit.predicate(source, target, constraint)[source]¶

Return True if the length of the source and/or the target exceeds the limit.

The length limit for GIZA++ in Moses is 100 tokens.

corpustools.clean.length_limit.validate(step)[source]¶
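
As a rough sketch of such a check, assuming length is counted in whitespace-separated tokens and a single upper limit applies to both sides (the function name and the `limit` parameter are hypothetical, not the module's signature):

```python
GIZA_TOKEN_LIMIT = 100  # the GIZA++ limit quoted above


def exceeds_length_limit(source, target, limit=GIZA_TOKEN_LIMIT):
    """Return True if either sentence has more than `limit` tokens."""
    # Assumption: tokens are whitespace-separated.
    return len(source.split()) > limit or len(target.split()) > limit
```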

`lowercase` Module¶

Lowercase module for the corpus clean tool

corpustools.clean.lowercase.lowercase_corpus(clean, lang, ext)[source]¶: Lowercase the corpus files by dispatching the request to the lowercase module in corpustools.case.

corpustools.clean.lowercase.run(clean, tools, step)[source]¶: Clean module interface function, lowercase corpus files.

`newline` Module¶

`regex` Module¶

Regular expression clean module.

class corpustools.clean.regex.RegexClean(clean, step)[source]¶

Bases: object

RegexClean runs regular-expression cleaning on the source and target corpus.

compile_relist()[source]¶: Compile the regular expressions to re objects before use to improve performance. Each compiled pattern is assigned back to its clean step, replacing the string form of the pattern.

re_clean(sentence)[source]¶

Clean the sentence with the clean step and return the cleaned corpus sentence.

Parameters:	sentence – unicode string, corpus sentence.

Example of a clean step:

{
  "description": "delete cdata",
  "action": "replace",
  "pattern" : "CDATA",
  "repl" : "",
  "apply_to": "source",
  "unicode": true,
  "case_sensitive": true,
  "log": "detail"
}
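
One way a step like the example above could be compiled and applied is sketched here. The helper names `compile_step` and `apply_step` are hypothetical, the `"delete"` action name is an assumption (only `"replace"` appears in the example), and using `search` rather than `match` for deletion is also an assumption.

```python
import re

step = {
    "description": "delete cdata",
    "action": "replace",
    "pattern": "CDATA",
    "repl": "",
    "apply_to": "source",
    "unicode": True,
    "case_sensitive": True,
    "log": "detail",
}


def compile_step(step):
    """Compile the step's pattern once, mapping its options onto re flags."""
    flags = 0
    if step.get("unicode"):
        flags |= re.UNICODE
    if not step.get("case_sensitive", True):
        flags |= re.IGNORECASE
    return re.compile(step["pattern"], flags)


def apply_step(sentence, step, pattern):
    """Apply one step to a sentence and return the result."""
    if step["action"] == "replace":
        return pattern.sub(step["repl"], sentence)
    # Assumed action name: blank the whole sentence when the pattern is found.
    if step["action"] == "delete" and pattern.search(sentence):
        return ""
    return sentence
```

Compiling once per step and reusing the compiled object for every sentence is the optimisation that `compile_relist()` describes.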

re_del(sentence, pattern)[source]¶

Return an empty string if the pattern matches.

Parameters:	sentence – unicode string, corpus sentence. pattern – re object.

re_repl(sentence, pattern, repl)[source]¶

Return the substituted sentence.

Parameters:	sentence – unicode string, corpus sentence. pattern – re object. repl – unicode string.

relist_clean(line)[source]¶: Clean the line with a list of re steps.

run()[source]¶: Clean the corpus.

corpustools.clean.regex.run(clean_config, corpustools_config, step)[source]¶: Entry function.

corpustools.clean.regex.validate(step)[source]¶

`sentence_ratio` Module¶

Predicate Module: Sentence Ratio

corpustools.clean.sentence_ratio.predicate(source, target, constraint)[source]¶

Return True if the sentence ratio exceeds the threshold.

sentence ratio = source length / target length, or target length / source length

The ratio threshold in the Moses system is 9.

corpustools.clean.sentence_ratio.validate(step)[source]¶
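
The ratio check can be sketched as below. It assumes length is measured in whitespace-separated tokens, that the larger of the two ratios is compared with the threshold, and that a pair with an empty side is rejected; the function name and `threshold` parameter are hypothetical.

```python
def ratio_exceeds(source, target, threshold=9.0):
    """Return True if the length ratio in either direction exceeds the threshold."""
    src_len = len(source.split())
    tgt_len = len(target.split())
    if src_len == 0 or tgt_len == 0:
        # Assumption: an empty side counts as beyond any threshold.
        return True
    ratio = max(src_len / tgt_len, tgt_len / src_len)
    return ratio > threshold
```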

`similar` Module¶

`tokenize` Module¶

Tokenize module in the corpus clean tools

corpustools.clean.tokenize.run(clean, tools, step)[source]¶: Clean module interface function, run tokenization for corpus files.

corpustools.clean.tokenize.tokenize(clean, tools, step, lang)[source]¶

Tokenize the corpus files in the corpus clean working directory.

This function works as a router that dispatches the request to the tokenizers in the token subpackage. The modules in the token subpackage are adapters to external tokenizer tools.

Parameters:	clean – corpus clean configuration. tools – external tools configuration. step – clean step. lang – language of the corpus to tokenize.
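
The routing idea can be illustrated with a small dispatch table. Everything here is hypothetical: the two toy tokenizers merely stand in for the adapters in the token subpackage, and the language codes and function names are assumptions chosen for the example.

```python
def tokenize_whitespace(text):
    """Toy stand-in for an adapter: split on whitespace."""
    return text.split()


def tokenize_chars(text):
    """Toy stand-in for an adapter: one token per non-space character."""
    return [ch for ch in text if not ch.isspace()]


# Hypothetical routing table from language code to tokenizer.
TOKENIZERS = {"en": tokenize_whitespace, "zh": tokenize_chars}


def route_tokenize(text, lang):
    """Dispatch the text to the tokenizer registered for `lang`."""
    try:
        tokenizer = TOKENIZERS[lang]
    except KeyError:
        raise ValueError(f"no tokenizer registered for language {lang!r}")
    return " ".join(tokenizer(text))
```

Keeping the router separate from the adapters means a new language only needs a new table entry.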

`url` Module¶

URL Clean Module

Clean URL-like text from the corpus on a best-effort basis.

class corpustools.clean.url.URLClean(clean, step)[source]¶

Bases: object

Class for cleaning URL-like text from the corpus.

COUNTRY_ROOT = ['uk', 'eu', 'au', 'br', 'ca', 'cn', 'ch', 'cz', 'de', 'dk', 'es', 'fi', 'fr', 'gr', 'hk', 'hu', 'ie', 'il', 'in', 'it', 'jp', 'kr', 'nl', 'no', 'pl', 'pt', 'ro', 'ru', 'se', 'sg', 'tr', 'tw', 'ua', 'us']¶

GENERAL_ROOT = ['com', 'org', 'net', 'edu', 'gov', 'info', 'int', 'tv']¶

PROTOCAL = ['http', 'https', 'ftp', 'ldap']¶

prepare_pattern()[source]¶

run()[source]¶: Run the URL clean process.

urlclean_line(line, lineno)[source]¶

corpustools.clean.url.run(clean_config, corpustools_config, step)[source]¶: entry function.

corpustools.clean.url.validate(step)[source]¶
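
The regular expression built by `prepare_pattern()` is not documented, so the following is only a guess at how the three class attributes could be combined into one pattern. The `COUNTRY_ROOT` list is shortened for brevity, and `build_url_pattern` and `strip_urls` are hypothetical names.

```python
import re

PROTOCAL = ["http", "https", "ftp", "ldap"]
GENERAL_ROOT = ["com", "org", "net", "edu", "gov", "info", "int", "tv"]
COUNTRY_ROOT = ["uk", "eu", "de", "fr", "jp"]  # shortened for this sketch


def build_url_pattern():
    """Build a pattern for scheme-prefixed URLs and bare host names."""
    protocols = "|".join(PROTOCAL)
    roots = "|".join(GENERAL_ROOT + COUNTRY_ROOT)
    # Either scheme://<non-space text>, or a dotted host ending in a
    # known root, optionally followed by a path.
    return re.compile(
        r"\b(?:(?:%s)://\S+|(?:[\w-]+\.)+(?:%s)\b(?:/\S*)?)" % (protocols, roots),
        re.IGNORECASE,
    )


URL_RE = build_url_pattern()


def strip_urls(line, pattern=URL_RE):
    """Remove URL-like spans from a line."""
    return pattern.sub("", line)
```

A pattern like this is deliberately loose: it removes anything that resembles a URL rather than validating real ones.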

`zstring` Module¶

ZString Sequence Clean Module

Supports conversion of Adobe’s ZString escape sequences.

corpustools.clean.zstring.run(clean_config, corpustool_config, step)[source]¶: entry function.

corpustools.clean.zstring.validate(step)[source]¶

corpustools.clean.zstring.zstring_unescape(line, zdict)[source]¶: Unescape the name form and number form of ZString escape sequences.
