corpustools Package
clean_corpus Module
Corpus Clean Tool
Clean a bitext file according to a sequence of clean steps. The corpus file should be named like ‘path/filename.en-zhcn.bitext’. The clean steps are configured in a JSON file. A working directory and an output directory can be specified on the command line; otherwise, all intermediate result files are put in the same folder as the bitext file.
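As an illustration of the file naming convention, the sketch below splits a corpus file name into its parts. The helper name `split_bitext_name` is hypothetical, and reading the middle component as a source-target language pair is an assumption based only on the example name ‘filename.en-zhcn.bitext’; the tool itself may parse names differently.

```python
import os


def split_bitext_name(path):
    """Split 'path/name.src-tgt.bitext' into (name, src, tgt).

    Hypothetical helper: assumes the last two dot-separated components
    are a 'source-target' language pair and the literal suffix 'bitext'.
    """
    base = os.path.basename(path)
    # Split from the right so that dots inside the stem are preserved.
    stem, lang_pair, ext = base.rsplit(".", 2)
    if ext != "bitext":
        raise ValueError("not a bitext file: %s" % path)
    source_lang, target_lang = lang_pair.split("-", 1)
    return stem, source_lang, target_lang


parts = split_bitext_name("path/filename.en-zhcn.bitext")
```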
Users can implement their own cleanup modules in Python and put them in the folder “corpustools.clean”. Most cleanup steps can be implemented as a regular expression clean; some can be implemented as a predicate clean. Some cleanup steps also need tokenization or lowercasing, which are performed by calling external tools.
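To make the idea of a regular expression clean concrete, here is a minimal sketch that applies one substitution to both sides of every sentence pair. The function name `regex_clean`, the representation of the corpus as a list of `(source, target)` tuples, and the tag-stripping pattern are all assumptions for illustration; they are not the interface that modules in “corpustools.clean” actually expose.

```python
import re


def regex_clean(pairs, pattern, repl=""):
    """Apply one regular-expression substitution to both sides of each pair.

    Hypothetical sketch: `pairs` is a list of (source, target) strings.
    """
    compiled = re.compile(pattern)
    return [(compiled.sub(repl, src), compiled.sub(repl, tgt))
            for src, tgt in pairs]


# Example rule: strip simple HTML-like tags from both sides.
pairs = [("Hello <b>world</b>", "Bonjour <b>monde</b>")]
cleaned = regex_clean(pairs, r"</?[a-zA-Z]+>")
```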
Currently supported external tools:
- Tokenizer: Stanford Chinese Segmenter (Chinese)
- Tokenizer: ChaSen (Japanese)
- Tokenizer: Moses tokenizer (European languages)
- Caser: Moses case tool
Command-line syntax:
  Usage: clean-corpus.py [options] corpus_file clean_steps

  Options:
    --version             show program's version number and exit
    -h, --help            show this help message and exit
    -c FILE, --config=FILE
                          specified corpus tools config
    -w DIR, --working-dir=DIR
                          working directory
    -o DIR, --output-dir=DIR
                          output directory
Args:
corpus_file: The path to corpus file.
clean_steps: Configuration file of clean steps.
corpustools.clean_corpus.argv2conf(argv)
Parse the command-line arguments and construct the corpus clean configuration and the external corpus tools configuration.
For the external tools configuration, the system-wide and user default configuration files are read first. If the user gives an external tools configuration file on the command line, it is loaded as well. A configuration file for the cleanup steps must be provided as the second command-line argument.
Parameters: argv – command line arguments.
Returns: a tuple (corpustools_config, corpusclean_config) holding the corpus tools configuration and the cleanup configuration. The program exits if the arguments are wrong or the configurations cannot be constructed.
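The option handling described above could look roughly like the following sketch, built on the standard-library `optparse` module. It only covers parsing; loading the system-wide, user, and command-line configuration files is left out. The function name `parse_command_line`, the returned dictionary, and the fallback of the output directory to the corpus folder are assumptions for illustration, not the behaviour of `argv2conf` itself.

```python
import os
from optparse import OptionParser


def parse_command_line(argv):
    """Parse options in the style of the usage text (hypothetical sketch)."""
    parser = OptionParser(
        usage="%prog [options] corpus_file clean_steps",
        version="%prog (sketch)")
    parser.add_option("-c", "--config", dest="config", metavar="FILE",
                      help="specified corpus tools config")
    parser.add_option("-w", "--working-dir", dest="working_dir",
                      metavar="DIR", help="working directory")
    parser.add_option("-o", "--output-dir", dest="output_dir",
                      metavar="DIR", help="output directory")
    options, args = parser.parse_args(argv)
    if len(args) != 2:
        # parser.error() prints the usage message and exits the program.
        parser.error("expected two arguments: corpus_file clean_steps")
    corpus_file, clean_steps = args
    # Assumption: both directories default to the folder of the bitext file.
    corpus_dir = os.path.dirname(os.path.abspath(corpus_file))
    return {
        "corpus_file": corpus_file,
        "clean_steps": clean_steps,
        "config": options.config,
        "working_dir": options.working_dir or corpus_dir,
        "output_dir": options.output_dir or corpus_dir,
    }


conf = parse_command_line(
    ["-w", "work", "data/demo.en-zhcn.bitext", "steps.json"])
```

Here `argv` is the argument list without the program name, for example `sys.argv[1:]`.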
corpustools.clean_corpus.clean_corpus(corpustools_config, clean_config)
Clean the bitext file.
Copy the corpus file into the working directory, run the user-specified clean steps, keep the result of every step, and finally put the cleaned corpus file into the output directory.
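The copy, run, keep, and publish flow can be sketched as below. This is not the actual implementation: the function name `run_clean_steps`, the intermediate file naming, the `.clean` suffix, and the treatment of each line as one record passed to a step function are all assumptions made for illustration.

```python
import os
import shutil


def run_clean_steps(corpus_path, steps, working_dir, output_dir):
    """Run (name, function) steps in order, keeping every intermediate file.

    Hypothetical sketch: each step function maps one line to a cleaned
    line, or returns None to drop the line.
    """
    os.makedirs(working_dir, exist_ok=True)
    os.makedirs(output_dir, exist_ok=True)
    name = os.path.basename(corpus_path)
    current = os.path.join(working_dir, name)
    # Copy the corpus into the working directory unless it is already there.
    if os.path.abspath(corpus_path) != os.path.abspath(current):
        shutil.copy(corpus_path, current)
    for index, (step_name, step_func) in enumerate(steps, start=1):
        result = os.path.join(
            working_dir, "%s.%02d.%s" % (name, index, step_name))
        with open(current, encoding="utf-8") as src, \
                open(result, "w", encoding="utf-8") as dst:
            for line in src:
                cleaned = step_func(line.rstrip("\n"))
                if cleaned is not None:
                    dst.write(cleaned + "\n")
        current = result  # the next step reads this step's output
    final = os.path.join(output_dir, name + ".clean")
    shutil.copy(current, final)
    return final
```

Because each step writes a new file instead of overwriting its input, the output of any single step can be inspected after the run.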
corpustools.clean_corpus.predicate_clean(clean_config, step, predicate)
Clean the corpus using a ‘predicate clean’.
A predicate clean can be used for clean rules that only accept or drop TUs (translation units) from the corpus according to the result of a predicate function (a function that returns True or False). The aligned pair is dropped if the predicate returns True.
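The drop-if-True convention can be illustrated with a short sketch. The names `predicate_filter` and `too_long`, the `(source, target)` tuple representation, and the word limit are hypothetical; the real `predicate_clean` takes a clean configuration and a step description, which are not modelled here.

```python
def predicate_filter(pairs, predicate):
    """Keep only the pairs for which predicate(source, target) is False."""
    return [(src, tgt) for src, tgt in pairs if not predicate(src, tgt)]


def too_long(source, target, limit=5):
    """Example predicate: True when either side has more than `limit` words."""
    return len(source.split()) > limit or len(target.split()) > limit


pairs = [
    ("a short sentence", "une phrase courte"),
    ("this source sentence has far too many words", "trop long"),
]
kept = predicate_filter(pairs, too_long)
```

Note the polarity: the predicate describes what to remove, so a pair survives only when the predicate returns False.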