corpustools Package

clean_corpus Module

Corpus Clean Tool

Clean a bitext file according to a sequence of clean steps. The corpus file name should follow the pattern ‘path/filename.en-zhcn.bitext’. The clean steps are configured in a JSON file. A working directory and an output directory can be specified on the command line; otherwise all intermediate result files are put in the same folder as the bitext file.

Users can implement their own cleanup modules in Python and put them into the folder “corpustools.clean”. Most cleanup steps can be implemented as regular expression cleans, and some as predicate cleans. Some cleanup steps also require tokenization or lowercasing; these are implemented by calling external tools.
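To illustrate what a regular expression clean could look like, here is a minimal sketch. It assumes each translation unit is held in memory as a (source, target) pair of strings; the function name `regex_clean` and the tag-stripping pattern are hypothetical and are not part of the corpustools API.

```python
import re


def regex_clean(pairs, pattern, repl=""):
    """Apply one regex substitution to both sides of every pair.

    `pairs` is a list of (source, target) strings. This in-memory
    representation is an assumption made for the sketch.
    """
    compiled = re.compile(pattern)
    return [
        (compiled.sub(repl, source), compiled.sub(repl, target))
        for source, target in pairs
    ]


# Hypothetical rule: strip simple markup tags from both sides.
pairs = [("Hello <b>world</b>", "你好 <b>世界</b>")]
cleaned = regex_clean(pairs, r"</?[a-zA-Z][^>]*>")
```

A rule of this kind rewrites the text of each unit but never removes a unit, which is what separates it from a predicate clean.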

Currently supported external tools:
  • Tokenizer : Stanford Chinese Segmenter (Chinese)
  • Tokenizer : ChaSen (Japanese)
  • Tokenizer : Moses tokenizer (multilingual European languages)
  • Caser : Moses case tool

Command line Syntax:

Usage: clean-corpus.py [options] corpus_file clean_steps

Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  -c FILE, --config=FILE
                        specified corpus tools config
  -w DIR, --working-dir=DIR
                        working directory
  -o DIR, --output-dir=DIR
                        output directory

Args:
    corpus_file:    The path to corpus file.
    clean_steps:    Configuration file of clean steps.

corpustools.clean_corpus.argv2conf(argv)

Parse the command line arguments and construct the corpus clean configuration and the external corpus tools configuration.

For the external tools configuration, the system-wide and user default configuration files are read first. If the user gives an external tools configuration file on the command line, it is loaded as well. A configuration file for the cleanup steps must be provided as the second argument on the command line.

Parameters: argv – command line arguments.
Returns: A tuple of the corpus tools configuration and the cleanup configuration, (corpustools_config, corpusclean_config). Exits the program if the arguments are wrong or the configurations cannot be constructed.
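The option layout shown in the usage text can be reproduced with the standard library's optparse module. The sketch below is an approximation under that assumption, not the actual body of argv2conf; the names `build_parser` and `parse_args` and the `dest` attributes are hypothetical, and loading the configuration files is left out.

```python
import os
from optparse import OptionParser


def build_parser():
    """Build a parser that mirrors the documented options (sketch only)."""
    parser = OptionParser(
        usage="%prog [options] corpus_file clean_steps",
        version="%prog (sketch)",
    )
    parser.add_option("-c", "--config", dest="config", metavar="FILE",
                      help="specified corpus tools config")
    parser.add_option("-w", "--working-dir", dest="working_dir",
                      metavar="DIR", help="working directory")
    parser.add_option("-o", "--output-dir", dest="output_dir",
                      metavar="DIR", help="output directory")
    return parser


def parse_args(argv):
    """Return (options, corpus_file, clean_steps); exit on bad arguments."""
    parser = build_parser()
    options, args = parser.parse_args(argv[1:])
    if len(args) != 2:
        # parser.error() prints the usage text and exits with status 2.
        parser.error("expected two arguments: corpus_file clean_steps")
    corpus_file, clean_steps = args
    if options.working_dir is None:
        # Documented fallback: use the folder that holds the bitext file.
        options.working_dir = os.path.dirname(os.path.abspath(corpus_file))
    return options, corpus_file, clean_steps
```

Calling `parse_args(["clean-corpus.py", "-w", "work", "data/sample.en-zhcn.bitext", "steps.json"])` would return the parsed options together with the two positional arguments.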

corpustools.clean_corpus.clean_corpus(corpustools_config, clean_config)

Clean the bitext file.

Copy the corpus file into the working directory, run the user-specified clean steps while keeping the result of every step, and finally put the cleaned corpus file into the output directory.
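The workflow described above can be sketched roughly as follows. This is not the real implementation: it assumes each step is a plain function from a list of lines to a list of lines, and the intermediate file naming (`<corpus name>.<step name>`) and the `.clean` suffix are hypothetical.

```python
import os
import shutil


def run_clean_steps(corpus_path, steps, working_dir, output_dir):
    """Run (name, function) steps in order, keeping every intermediate file."""
    os.makedirs(working_dir, exist_ok=True)
    os.makedirs(output_dir, exist_ok=True)
    base = os.path.basename(corpus_path)

    # Copy the corpus into the working directory unless it is already there.
    current = os.path.join(working_dir, base)
    if os.path.abspath(current) != os.path.abspath(corpus_path):
        shutil.copyfile(corpus_path, current)

    for name, step in steps:
        with open(current, encoding="utf-8") as handle:
            lines = handle.read().splitlines()
        # Each step writes its own result file, so earlier results survive.
        result = os.path.join(working_dir, "{}.{}".format(base, name))
        with open(result, "w", encoding="utf-8") as handle:
            handle.writelines(line + "\n" for line in step(lines))
        current = result

    # The last result becomes the clean corpus in the output directory.
    final = os.path.join(output_dir, base + ".clean")
    shutil.copyfile(current, final)
    return final
```

Keeping one file per step makes it possible to inspect what each rule changed or dropped, at the cost of extra disk space for large corpora.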

corpustools.clean_corpus.main(argv)

Entry function.

corpustools.clean_corpus.predicate_clean(clean_config, step, predicate)

Clean the corpus using a ‘predicate clean’.

A predicate clean can be used for clean rules that only keep or drop TUs from the corpus according to the result returned by a predicate function (a function that returns True or False). The alignment is dropped if the predicate returns True.
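The drop-if-True convention can be illustrated with a short sketch. It assumes each TU is a (source, target) pair of strings and that the predicate receives both sides; the real predicate_clean takes `clean_config` and `step` and works on files, so `predicate_filter` and `has_empty_side` here are hypothetical helpers.

```python
def predicate_filter(pairs, predicate):
    """Keep only the pairs for which the predicate returns False."""
    return [pair for pair in pairs if not predicate(*pair)]


def has_empty_side(source, target):
    """Hypothetical predicate: True when either side is blank."""
    return not source.strip() or not target.strip()


pairs = [("Hello", "你好"), ("", "空"), ("Bye", "   ")]
kept = predicate_filter(pairs, has_empty_side)
```

Because a True result means "drop", predicates are best named after the defect they detect (such as `has_empty_side`) rather than after the property that good units have.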

lines Module

merge_corpus Module

tmx2txt Module