2. Moses corpus clean tool¶
2.1. Overview¶
A common task in SMT is cleaning up the corpus files: removing overly long sentences and replacing illegal characters before feeding the corpus files to training or language model creation in the Moses system. In addition, the tool cleans or replaces strings via regular expressions, according to rules intended to improve the trained translation model.
Please refer to the clean_corpus Module for the command line syntax and to Clean Config for writing your clean steps.
Most clean steps can be implemented as a Regular Expression Clean, while others can be implemented as a Predicate Clean, which drops a corpus align when the predicate returns True. The predicate clean style simplifies the code for this kind of clean.
The Moses corpus clean tool is designed to be extensible via external clean modules. Users can write their own clean module to implement a new clean step; see Write own clean module.
Some external corpus tools are needed, e.g. various tokenizers or segmenters for different languages. These external corpus tools must be installed separately and configured in the corpus tools config.
2.2. Clean Config¶
A configuration file describes the user-defined clean steps in JSON format. Because it is JSON, the file is easy to edit: steps can be added or removed, and the attributes of a step modified, even in a simple text editor.
Other attributes in the CleanConfig instance represent further factors of a cleaning process, e.g. files, directories, languages, etc. Please refer to the config.clean_config Module.
- Reference:
- A sample configuration of clean steps.
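The referenced sample is not reproduced here; as a sketch, a clean-steps file might take the following shape. The attribute names in the second step (ext, constraint, max_length) are assumptions based on the step descriptions in this section, not the tool's documented schema:

```json
[
    {
        "description": "remove control characters",
        "action": "delete",
        "pattern": "[\\x00-\\x1f]"
    },
    {
        "description": "drop overlong aligns",
        "ext": "length_limit",
        "constraint": { "max_length": 80 }
    }
]
```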
2.3. Clean Steps¶
2.3.1. Regular Expression Clean¶
Most cleanup operations delete or replace strings in aligned sentences, and users specify the regular expressions in the configuration. A typical regex clean step includes at least description, action, and pattern. The value of action can be delete_line, delete, or replace; if action is replace, repl is also required. pattern is a regular expression, but in JSON every backslash must be escaped, e.g. write the regex \d as \\d in the JSON configuration. The backslash is the only character that needs this extra escaping.
The following additional options can be specified:
- apply_to : indicates which sentence should be cleaned; default Both.
- unicode : indicates whether the regular expression is Unicode-aware; default true.
- case_sensitive : indicates whether the search is case sensitive; default false (insensitive).
{
"description": "integer",
"action": "replace",
"pattern" : "\\d+",
"repl" : "num",
"apply_to": "source",
"unicode": true,
"case_sensitive": true
}
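The step above could be applied with Python's standard re module roughly as follows. This is a sketch: apply_regex_step is a hypothetical helper, not part of the tool's API, and the option handling mirrors the unicode and case_sensitive defaults described above.

```python
import re

def apply_regex_step(step, sentence):
    """Apply one regex clean step to a sentence (hypothetical helper)."""
    flags = 0
    if step.get("unicode", True):
        flags |= re.UNICODE
    if not step.get("case_sensitive", False):
        flags |= re.IGNORECASE
    pattern = re.compile(step["pattern"], flags)
    action = step["action"]
    if action == "delete_line":
        # Drop the whole sentence if the pattern matches anywhere.
        return None if pattern.search(sentence) else sentence
    if action == "delete":
        return pattern.sub("", sentence)
    if action == "replace":
        return pattern.sub(step["repl"], sentence)
    raise ValueError("unknown action: " + action)

step = {"description": "integer", "action": "replace",
        "pattern": r"\d+", "repl": "num"}
print(apply_regex_step(step, "chapter 12"))  # -> chapter num
```

Note that the JSON string "\\d+" deserializes to exactly the pattern `\d+` used here.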
2.3.2. Predicate Clean¶
The following predicate clean modules are built in. A predicate clean module must implement the function predicate. Please refer to predicate clean for the signatures of these functions.
- clean aligns beyond the length limit
- clean aligns with a wrong sentence ratio
- clean aligns with too large a length difference
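As a sketch of the first rule, a length-limit predicate might look like the following. The constraint key name max_length is an assumption; the real signatures are documented in the predicate clean module API.

```python
def predicate(source, target, constraint):
    """Return True (drop the align) if either side exceeds the length limit.

    Hypothetical sketch; 'max_length' is an assumed constraint key.
    """
    limit = constraint.get("max_length", 80)
    return len(source.split()) > limit or len(target.split()) > limit

print(predicate("a b c", "x y", {"max_length": 2}))  # -> True
```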
2.3.3. Write own clean module¶
You can write your own clean module to extend the corpus clean tool with new clean rules. The new clean module should be put into the sub-package corpustools.clean and must have an entry function run(clean, tools, step). The whole clean configuration is available through the clean parameter, and the external corpus tools through the tools parameter. The parameter step holds the configuration for the current step. Please refer to the source code for an example.
For a predicate clean, the corpus clean tool already implements the common code, so you only need to provide a function that returns True or False for the current sentence align. See the built-in modules for examples.
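A skeleton of such a module might look like the following. Only the run(clean, tools, step) entry point is prescribed by the text; the in-memory aligns attribute and the max_length step option are illustrative assumptions, since a real module would read and write the corpus files named in the clean configuration.

```python
"""Hypothetical custom clean module (would live under corpustools/clean/)."""

def run(clean, tools, step):
    """Entry point the clean tool calls for this step (sketch).

    clean -- the whole clean configuration
    tools -- external corpus tools configuration
    step  -- the configuration dict for the current step
    """
    aligns = clean["aligns"]               # assumed in-memory representation
    max_len = step.get("max_length", 100)  # assumed step option
    clean["aligns"] = [
        (s, t) for (s, t) in aligns
        if len(s.split()) <= max_len and len(t.split()) <= max_len
    ]

clean = {"aligns": [("a b", "x y"), ("a " * 200, "x")]}
run(clean, {}, {"max_length": 100})
print(len(clean["aligns"]))  # -> 1
```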
2.4. Module API Documentation¶
2.4.1. clean_corpus Module¶
Corpus Clean Tool
Clean a bitext file according to the clean steps. The corpus file should be named like ‘path/filename.en-zhcn.bitext’. The config file of clean steps is a JSON-style file. A working directory and an output directory can be specified on the command line; otherwise all intermediate result files are put in the same folder as the bitext file.
Users can implement their own cleanup modules in Python and put them into the folder “corpustools.clean”. Most cleanup steps can be implemented as regular expression cleans, and some as predicate cleans. Sometimes tokenization and lowercasing are needed as cleanup steps; these are implemented by calling external tools.
- Currently supported external tools:
- Tokenizer : Stanford Chinese Segmentor (Chinese)
- Tokenizer : Chasen (Japanese)
- Tokenizer : Moses tokenizer (multilingual European languages)
- Caser : Moses case tool
Command line Syntax:
Usage: clean-corpus.py [options] corpus_file clean_steps
Options:
--version show program's version number and exit
-h, --help show this help message and exit
-c FILE, --config=FILE
specified corpus tools config
-w DIR, --working-dir=DIR
working directory
-o DIR, --output-dir=DIR
output directory
Args:
corpus_file: The path to corpus file.
clean_steps: Configuration file of clean steps.
corpustools.clean_corpus.argv2conf(argv)[source]¶
Parse command line arguments and construct the corpus clean and external corpus tools configurations.
For the external tools configuration, the system-wide and user default configuration files are read first. If the user gives an external tools configuration file on the command line, it is loaded as well. A configuration file for the cleanup steps must be provided as the second argument on the command line.
Parameters: argv – command line arguments. Returns: Exits the program if the arguments are wrong or the configurations cannot be constructed; otherwise returns a tuple of the corpus tools configuration and the cleanup configuration, (corpustools_config, corpusclean_config).
corpustools.clean_corpus.clean_corpus(corpustools_config, clean_config)[source]¶
Clean the bitext file.
Copy the corpus file into the working directory, run the user-specified clean steps, keep the result of every step, and finally put the cleaned corpus file into the output directory.
corpustools.clean_corpus.predicate_clean(clean_config, step, predicate)[source]¶
Clean the corpus in the style called ‘predicate clean’.
Predicate clean is invoked for clean rules that accept or drop TUs from the corpus according to the result returned by a predicate function (a function returning True or False). The align is dropped if the predicate returns True.
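The drop-if-True convention can be sketched as follows. This is a hypothetical in-memory version; the real function operates on the corpus files named in the clean configuration.

```python
def predicate_clean(aligns, constraint, predicate):
    """Keep only the aligns for which the predicate returns False (sketch).

    Mirrors the documented behaviour: the align is dropped when the
    predicate is True.
    """
    return [(s, t) for (s, t) in aligns if not predicate(s, t, constraint)]

# Toy predicate: drop aligns whose word-count difference exceeds a limit.
too_long = lambda s, t, c: abs(len(s.split()) - len(t.split())) > c["limit"]
pairs = [("a b c", "x"), ("a b", "x y")]
print(predicate_clean(pairs, {"limit": 1}, too_long))  # -> [('a b', 'x y')]
```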
2.4.2. config.clean_config Module¶
2.4.3. clean.regex Module¶
Regular expression clean module.
class corpustools.clean.regex.RegexClean(clean, step)[source]¶
Class RegexClean runs the regular expression clean on the source and target corpus.
compile_relist()[source]¶
Compile the regular expressions into re objects before using them, to improve performance. The compiled pattern is assigned back to the clean step, replacing the string form of the pattern.
re_clean(sentence)[source]¶
Clean the sentence with the clean step; return the cleaned corpus sentence.
Parameters: sentence – unicode string, a corpus sentence.
Example of a clean step:
{
    "description": "delete cdata",
    "action": "replace",
    "pattern" : "CDATA",
    "repl" : "",
    "apply_to": "source",
    "unicode": true,
    "case_sensitive": true,
    "log": "detail"
}
2.4.4. predicate clean¶
corpustools.clean.length_diff.predicate(source, target, constraint)[source]¶
Return True if the length difference between source and target is beyond the limit.
2.4.5. clean.tokenize Module¶
Tokenize module in the corpus clean tools.
corpustools.clean.tokenize.tokenize(clean, tools, step, lang)[source]¶
Tokenize the corpus files in the corpus clean working directory.
This function actually works as a router, dispatching the request to the tokenizers in the token subpackage. The modules in the token subpackage are adapters to the external tokenizer tools.
Parameters: - clean – corpus clean configuration.
- tools – external tools configuration.
- step – clean step.
- lang – the language of the corpus to be tokenized.
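The router pattern described above can be sketched with a table of adapter callables. The language codes and adapter return values here are placeholders; the real adapters wrap the external tools listed in the overview and live in the token subpackage.

```python
def make_router(adapters, default):
    """Return a tokenize() that dispatches on language code (sketch)."""
    def tokenize(clean, tools, step, lang):
        adapter = adapters.get(lang, adapters[default])
        return adapter(clean, tools, step, lang)
    return tokenize

# Toy adapters standing in for the real external-tool wrappers.
adapters = {
    "zh": lambda c, t, s, l: "stanford-segmenter",
    "ja": lambda c, t, s, l: "chasen",
    "en": lambda c, t, s, l: "moses-tokenizer",
}
tokenize = make_router(adapters, default="en")
print(tokenize({}, {}, {}, "ja"))  # -> chasen
```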