4. External corpus tools

4.1. Corpus tools config

The corpus tools configuration is an INI-style file that holds the options for the external tools. The most important option is the path to each tool's executable. Other programs, such as the corpus clean tool, read this information and then call the external tools in the appropriate way.

To access a value, the class CorpusToolsConfig supports a more intuitive form: use section.option as the key, e.g. tools["moses.scripts_path"].
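The section.option lookup can be illustrated with a minimal sketch built on Python's standard configparser. This is not the actual CorpusToolsConfig implementation: the class name DottedConfig, the [stanford_segmenter] section, and all paths are hypothetical and chosen only for illustration.

```python
import configparser

# Hypothetical configuration content; section names and paths are assumptions.
SAMPLE_CONFIG = """
[moses]
scripts_path = /opt/moses/scripts

[stanford_segmenter]
path = /opt/stanford-segmenter
"""


class DottedConfig:
    """Hypothetical stand-in for CorpusToolsConfig: looks up 'section.option' keys."""

    def __init__(self, text):
        self._parser = configparser.ConfigParser()
        self._parser.read_string(text)

    def __getitem__(self, key):
        # Split on the first dot only: the part before it is the section,
        # the rest is the option name.
        section, _, option = key.partition(".")
        if not option:
            raise KeyError(f"expected 'section.option', got {key!r}")
        try:
            return self._parser.get(section, option)
        except (configparser.NoSectionError, configparser.NoOptionError):
            raise KeyError(key) from None


tools = DottedConfig(SAMPLE_CONFIG)
print(tools["moses.scripts_path"])
```

Raising KeyError for an unknown section or option keeps the object behaving like a dictionary, which is what the tools["..."] syntax suggests to a caller.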

The repository contains a sample external tools configuration file, corpustools.conf, in which the paths of the essential tools (the Moses scripts and two tokenizers) are configured. It is each external tool’s responsibility to write the correct information into this configuration file.

4.2. External Tools

4.2.1. Tokenizer

For tokenization, we call an external tokenizer suited to the specified language(s):

  • The tokenizer Perl script in moses
  • Stanford Word Segmenter
  • Chasen Japanese Segmenter

You can download the latest release from the official Stanford Word Segmenter website. Put it somewhere on your system and configure its path in the corpus tools configuration. Currently, corpus tools calls the script provided by the Stanford Word Segmenter directly.

I repackaged Chasen to simplify compilation and to support UTF-8 encoded corpus files directly. Please refer to the sub-package chasen-moses in the Moses Suite project for details. Because this package must be built into a binary executable, you can either follow the instructions in chasen-moses to build and install it, or follow the instructions there to get a pre-compiled binary. After installing, don’t forget to configure it in the tools configuration.
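Since every external tool has to be registered in the tools configuration, a quick check for stale or missing paths can save debugging time. The following is a sketch only, assuming that path-valued options have names ending in "path" (as moses.scripts_path does); the function name missing_tool_paths and that naming convention are assumptions, not part of corpus tools.

```python
import configparser
import os


def missing_tool_paths(config_text):
    """Return 'section.option' keys whose configured path does not exist."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    missing = []
    for section in parser.sections():
        for option, value in parser.items(section):
            # Assumption: options that hold a filesystem path end in "path".
            if option.endswith("path") and not os.path.exists(value):
                missing.append(f"{section}.{option}")
    return missing
```

Running such a check once after installing or moving a tool reports which entries still need to be updated before a tokenizer is called.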

4.3. Module API Documentation

4.3.1. config.corpustools_config Module

4.3.2. token.moses Module

Tokenizer Module for Moses built-in tokenizer

corpustools.token.moses.tokenize(infile, outfile, lang, tools, step)[source]

Call the Moses built-in tokenizer on a corpus.

The Moses built-in tokenizer supports European languages.

Parameters:
  • infile – input filename.
  • outfile – output filename.
  • lang – language of corpus.
  • tools – external tools configuration.
  • step – tokenizer configuration in step.
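As a rough illustration of what calling the Moses tokenizer involves, the sketch below builds the command line and pipes the corpus through it. It is not the module's actual implementation: the helper names are hypothetical, and it assumes that the configured moses.scripts_path points at the Moses scripts directory containing tokenizer/tokenizer.perl, which reads from standard input, writes to standard output, and takes the language via -l.

```python
import os
import subprocess


def build_moses_tokenizer_command(scripts_path, lang):
    """Return the argument list for the Moses tokenizer script."""
    # Assumption: scripts_path is the Moses "scripts" directory,
    # which contains tokenizer/tokenizer.perl.
    script = os.path.join(scripts_path, "tokenizer", "tokenizer.perl")
    return ["perl", script, "-l", lang]


def run_moses_tokenizer(infile, outfile, lang, scripts_path):
    """Tokenize infile into outfile by piping it through tokenizer.perl."""
    command = build_moses_tokenizer_command(scripts_path, lang)
    with open(infile, "rb") as src, open(outfile, "wb") as dst:
        # check=True raises CalledProcessError if the script exits non-zero.
        subprocess.run(command, stdin=src, stdout=dst, check=True)
```

Keeping the command construction separate from the subprocess call makes it possible to inspect or log the exact command without Perl or Moses being installed.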

4.3.3. token.stanford_segmenter Module

Tokenizer Module for Stanford Segmenter

corpustools.token.stanford_segmenter.tokenize(infile, outfile, lang, tools, step)[source]

Call Stanford Segmenter for Chinese text.

Parameters:
  • infile – input filename.
  • outfile – output filename.
  • lang – corpus language.
  • tools – external tools configuration.
  • step – tokenizer configuration in step.

4.3.4. token.chasen Module

Tokenizer Module for Japanese Segmenter Chasen

corpustools.token.chasen.tokenize(infile, outfile, lang, tools, step)[source]

Call chasen (Japanese Segmenter) for Japanese text.

Parameters:
  • infile – input filename.
  • outfile – output filename.
  • lang – language of corpus.
  • tools – external tools configuration.
  • step – tokenizer configuration in step.