2. Moses corpus clean tool

2.1. Overview

A common task in SMT is cleaning up the corpus files: dropping overlong sentences and replacing illegal characters before feeding the corpus to training or to language model creation in the Moses system. Beyond that, the tool cleans or replaces strings with regular expressions according to user-defined rules, to improve the trained translation model.

Please refer to the clean_corpus Module for the command line syntax and to Clean Config for writing your own clean steps.

Most clean steps can be implemented as a Regular Expression Clean, while others can be implemented as a Predicate Clean, which drops a corpus align when the predicate returns True. The predicate clean approach simplifies the code for this kind of step.

The Moses corpus clean tool is designed to be highly extensible through external clean modules. Users can write their own clean module to implement a new clean step.

Some external corpus tools are needed, e.g. various tokenizers or segmenters for different languages. These external tools should be installed separately and configured in the corpus tools config.

2.2. Clean Config

A configuration file describes the user-defined clean steps in JSON format. Because it is JSON, the file is easy to edit: steps can be added or removed, and the attributes of a step modified, even in a simple text editor.

Other attributes in the CleanConfig instance represent further aspects of a cleaning process, e.g. files, directories, languages. Please refer to the config.clean_config Module.

Reference:
A sample configuration of clean steps.
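A minimal sketch of such a configuration is shown below. Only the regex step keys (description, action, pattern, apply_to) are documented in this section; the top-level list layout and the keys used to select a predicate module (shown as ext) and to pass its limit (shown as constraint) are assumptions for illustration.

```json
[
    {
        "description": "delete control characters",
        "action": "delete",
        "pattern": "[\\x00-\\x1f]",
        "apply_to": "both"
    },
    {
        "description": "drop aligns over the length limit",
        "ext": "length_limit",
        "constraint": 100
    }
]
```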

2.3. Clean Steps

2.3.1. Regular Expression Clean

Most cleanup consists of deleting or replacing strings in aligned sentences. Users specify a regular expression in the configuration. A typical regex clean step includes at least description, action, and pattern. The value of action can be delete_line, delete, or replace; if the action is replace, repl is also required. pattern is a regular expression, but in JSON format every backslash must be escaped, e.g. the regex \d is written as \\d in the JSON configuration. The only other character that needs escaping is the double quote (").

The following additional options can be specified:

  • apply_to indicates which sentence the step applies to; default is both.
  • unicode indicates whether the regular expression is Unicode-aware; default is true.
  • case_sensitive indicates whether matching is case sensitive; default is false (insensitive).
{
  "description": "integer",
  "action": "replace",
  "pattern" : "\\d+",
  "repl" : "num",
  "apply_to": "source",
  "unicode": true,
  "case_sensitive": true
}
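The double-backslash escaping above can be checked quickly: once the JSON is decoded, Python sees the plain regex. The snippet below is a small illustration of that round trip, not part of the tool itself.

```python
import json
import re

# The JSON text stores the pattern with an escaped backslash ("\\d+");
# after decoding, Python sees the plain regex \d+.
step = json.loads('{"pattern": "\\\\d+", "repl": "num"}')
assert step["pattern"] == r"\d+"

pattern = re.compile(step["pattern"])
print(pattern.sub(step["repl"], "chapter 42"))  # chapter num
```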

2.3.2. Predicate Clean

The following predicate clean modules are built in. In a predicate clean module, the function predicate must be implemented. Please refer to predicate clean for the signatures of these functions.

  • clean aligns beyond the length limit
  • clean aligns with a bad sentence length ratio
  • clean aligns with a too-large length difference

2.3.3. Write own clean module

You can write your own clean module to extend the corpus clean tool with new clean rules. The new clean module should be placed in the sub-package corpustools.clean and expose an entry function run(clean, tools, step). The whole clean configuration is available from clean, the external corpus tools from the tools configuration, and the parameter step holds the configuration for the current step. Please refer to the source code for examples.

For a predicate clean, the corpus clean tool already implements the common code, so you only need to provide a function that returns True or False for the current sentence align. Please also refer to the built-in modules for examples.
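A skeleton of such a module might look as follows. The module name and the per-sentence helper are hypothetical; only the run(clean, tools, step) entry point is prescribed by the tool.

```python
# Hypothetical module corpustools/clean/strip_html.py -- the name and the
# helper below are illustrative, not part of the shipped tool.
import re

TAG = re.compile(r"<[^>]+>")

def run(clean, tools, step):
    """Entry function called by the clean driver.

    `clean` is the clean configuration, `tools` the external tools
    configuration, and `step` the dict for this clean step. A real module
    would iterate over the working files named in `clean`; only the
    per-sentence transformation is shown here.
    """
    pass

def strip_tags(sentence):
    """Remove anything that looks like an HTML/XML tag."""
    return TAG.sub("", sentence)

print(strip_tags("<b>hello</b> world"))  # hello world
```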

2.4. Module API Documentation

2.4.1. clean_corpus Module

Corpus Clean Tool

Clean a bitext file according to the clean steps. The corpus file should be named like ‘path/filename.en-zhcn.bitext’. The clean steps config file is a JSON-style file. A working directory as well as an output directory can be specified on the command line; otherwise all intermediate result files are put in the same folder as the bitext file.

Users can implement their own cleanup modules in Python and put them into the folder “corpustools.clean”. Most cleanup steps can be implemented as a regular expression clean; some can be implemented as a predicate clean. Sometimes tokenization and lowercasing are needed as cleanup steps; these are implemented by calling external tools.

Currently supported external tools:
  • Tokenizer : Stanford Chinese Segmentor (Chinese)
  • Tokenizer : Chasen (Japanese)
  • Tokenizer : Moses tokenizer (multilingual European languages)
  • Caser : Moses case tool

Command line Syntax:

Usage: clean-corpus.py [options] corpus_file clean_steps

Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  -c FILE, --config=FILE
                        specified corpus tools config
  -w DIR, --working-dir=DIR
                        working directory
  -o DIR, --output-dir=DIR
                        output directory

Args:
    corpus_file:    The path to corpus file.
    clean_steps:    Configuration file of clean steps.
corpustools.clean_corpus.main(argv)[source]

entry function.

corpustools.clean_corpus.argv2conf(argv)[source]

Parse command line arguments, and construct the corpus clean and external corpus tools configuration.

For the external tools configuration, the system-wide and per-user default configuration files are read first. If the user gives an external tools configuration file on the command line, it is loaded as well. A configuration file for the cleanup steps must be provided as the second command line argument.

Parameters:argv – command line arguments.
Returns:Exits the program if the arguments are wrong or the configurations could not be constructed; otherwise returns a tuple of the corpus tools configuration and the cleanup configuration, (corpustools_config, corpusclean_config).
corpustools.clean_corpus.clean_corpus(corpustools_config, clean_config)[source]

Clean the bitext file.

Copy the corpus file into the working directory, run the user-specified clean steps, keep the result of every step, and finally put the cleaned corpus file into the output directory.

corpustools.clean_corpus.predicate_clean(clean_config, step, predicate)[source]

Clean the corpus in a way called ‘predicate clean’.

Predicate clean covers those clean rules that accept or drop TUs from the corpus according to the result of a predicate function (a function returning True or False). The align is dropped if the predicate returns True.
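The shared loop can be sketched as a filter over aligned pairs. The function and constraint shapes below are illustrative; the real implementation works on corpus files, not in-memory lists.

```python
def predicate_clean(aligns, predicate, constraint):
    """Keep a (source, target) pair only when the predicate returns False.

    `aligns` is assumed to be a list of (source, target) string pairs.
    """
    return [(s, t) for (s, t) in aligns if not predicate(s, t, constraint)]

# Hypothetical predicate: drop aligns where either side exceeds the limit.
too_long = lambda s, t, limit: len(s.split()) > limit or len(t.split()) > limit

pairs = [("a b", "x y"), ("a b c d", "x")]
print(predicate_clean(pairs, too_long, 3))  # [('a b', 'x y')]
```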

2.4.2. config.clean_config Module

2.4.3. clean.regex Module

Regular expression clean module.

corpustools.clean.regex.run(clean_config, corpustools_config, step)[source]

entry function.

class corpustools.clean.regex.RegexClean(clean, step)[source]

Class RegexClean runs the regular expression clean on the source and target corpus.

run()[source]

clean the corpus.

compile_relist()[source]

Compile the regular expressions into re objects before use, to improve performance. The compiled pattern is assigned back to the clean step, replacing the string form of the pattern.
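A sketch of how the documented step options could map onto re flags, using the defaults stated earlier (Unicode on, case-insensitive). The helper name is hypothetical:

```python
import re

def compile_step(step):
    """Build re flags from the step options and compile the pattern.

    Defaults follow the documentation: unicode=True, case_sensitive=False.
    """
    flags = 0
    if step.get("unicode", True):
        flags |= re.UNICODE
    if not step.get("case_sensitive", False):
        flags |= re.IGNORECASE
    return re.compile(step["pattern"], flags)

pat = compile_step({"pattern": "cdata"})
print(bool(pat.search("<![CDATA[x]]>")))  # True: case-insensitive by default
```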

relist_clean(line)[source]

Clean the line with a list of re steps.

re_clean(sentence)[source]

Clean the sentence with the clean step and return the cleaned corpus sentence.

Parameters:sentence – unicode string, corpus sentence.

Example of clean step.

{
  "description": "delete cdata",
  "action": "replace",
  "pattern" : "CDATA",
  "repl" : "",
  "apply_to": "source",
  "unicode": true,
  "case_sensitive": true,
  "log": "detail"
}
re_del(sentence, pattern)[source]

Return empty string if pattern matched.

Parameters:
  • sentence – unicode string, corpus sentence.
  • pattern – re object.
re_repl(sentence, pattern, repl)[source]

Return substituted sentence.

Parameters:
  • sentence – unicode string, corpus sentences.
  • pattern – re object.
  • repl – unicode string.

2.4.4. predicate clean

corpustools.clean.length_diff.predicate(source, target, constraint)[source]

Return True if the length difference between source and target is beyond the limit.

corpustools.clean.length_limit.predicate(source, target, constraint)[source]

Return True if the length of source and/or target is beyond the limit.

The length limit for GIZA++ in Moses is 100 tokens.

corpustools.clean.sentence_ratio.predicate(source, target, constraint)[source]

Return True if the sentence length ratio is beyond the threshold.

sentence ratio = source length / target length or target length / source length, whichever is larger

The ratio threshold in the Moses system is 9.
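Under the documented signature, this predicate can be sketched as below. Token counts as the length measure and the handling of an empty side are assumptions:

```python
def predicate(source, target, constraint):
    """Return True (drop the align) if the length ratio between the two
    sides exceeds the threshold, e.g. 9 as in the Moses system."""
    slen = len(source.split())
    tlen = len(target.split())
    if slen == 0 or tlen == 0:
        return True  # assumption: an empty side is always dropped
    return max(slen / tlen, tlen / slen) > constraint

print(predicate("one two three", "x " * 30, 9))  # True: 30/3 = 10 > 9
```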

2.4.5. clean.tokenize Module

Tokenize module in corpus clean tools

corpustools.clean.tokenize.tokenize(clean, tools, step, lang)[source]

Tokenize the corpus files in the corpus clean working directory.

This function works as a router, dispatching the request to the tokenizers in the token subpackage. The modules in the token subpackage are adapters for external tokenizer tools.

Parameters:
  • clean – corpus clean configuration.
  • tools – external tools configuration.
  • step – clean step.
  • lang – the language of the corpus to be tokenized.
corpustools.clean.tokenize.run(clean, tools, step)[source]

Clean module interface function, run tokenization for corpus files.

2.4.6. clean.lowercase Module

Lowercase module for corpus clean tool

corpustools.clean.lowercase.run(clean, tools, step)[source]

Clean module interface function, lowercase corpus files.

corpustools.clean.lowercase.lowercase_corpus(clean, lang, ext)[source]

Lowercase corpus files; dispatches the lowercase request to the lowercase module in corpustools.case.