token Package

chasen Module

Tokenizer Module for the Japanese Segmenter ChaSen

corpustools.token.chasen.tokenize(infile, outfile, lang, tools, step)[source]

Call ChaSen (Japanese segmenter) on Japanese text.

Parameters:
  • infile – input filename.
  • outfile – output filename.
  • lang – language of corpus.
  • tools – external tools configuration.
  • step – tokenizer configuration for this step.
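The implementation is not shown here, but a wrapper like this typically shells out to the `chasen` binary. A minimal sketch of how the command line might be assembled (the binary name, its location on PATH, and the use of shell redirection are all assumptions, not confirmed by the source):

```python
import shlex


def build_chasen_command(infile, outfile, chasen_bin="chasen"):
    """Assemble a shell command invoking ChaSen on ``infile``.

    Hypothetical helper: the real module may pull the binary path from
    the ``tools`` configuration instead of a keyword argument. ChaSen
    writes segmented text to stdout, so output is captured with a
    redirect here.
    """
    return "{} {} > {}".format(
        chasen_bin, shlex.quote(infile), shlex.quote(outfile)
    )
```

A caller would then hand the resulting string to `subprocess.run(..., shell=True)` or split it for the non-shell form.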

moses Module

Tokenizer Module for the Moses Built-in Tokenizer

corpustools.token.moses.tokenize(infile, outfile, lang, tools, step)[source]

Call the Moses built-in tokenizer on the corpus.

The Moses built-in tokenizer supports European languages.

Parameters:
  • infile – input filename.
  • outfile – output filename.
  • lang – language of corpus.
  • tools – external tools configuration.
  • step – tokenizer configuration for this step.
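Moses ships its tokenizer as a Perl script (`tokenizer.perl`) that takes the language code via `-l` and reads stdin / writes stdout. A minimal sketch of the command such a wrapper might build (the script path is an assumption; in practice it would come from the `tools` configuration):

```python
import shlex


def build_moses_tokenizer_command(infile, outfile, lang,
                                  script="tokenizer.perl"):
    """Assemble a shell command for the Moses tokenizer script.

    Hypothetical helper: ``script`` stands in for the path to Moses'
    ``scripts/tokenizer/tokenizer.perl``. The script is stdin/stdout
    based, hence the redirects.
    """
    return "perl {} -l {} < {} > {}".format(
        script, lang, shlex.quote(infile), shlex.quote(outfile)
    )
```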

stanford_segmenter Module

Tokenizer Module for Stanford Segmenter

corpustools.token.stanford_segmenter.tokenize(infile, outfile, lang, tools, step)[source]

Call the Stanford Segmenter on Chinese text.

Parameters:
  • infile – input filename.
  • outfile – output filename.
  • lang – language of corpus.
  • tools – external tools configuration.
  • step – tokenizer configuration for this step.
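The Stanford Segmenter distribution includes a `segment.sh` driver that takes a model name (`ctb` or `pku`), an input file, an encoding, and an n-best size, printing segmented text to stdout. A minimal sketch of how such a wrapper might build that command (the default model and encoding here are assumptions; the real module likely reads them from the `step` configuration):

```python
def build_stanford_segmenter_command(infile, outfile, model="ctb",
                                     encoding="UTF-8",
                                     script="segment.sh"):
    """Assemble a shell command for the Stanford Segmenter driver.

    Hypothetical helper: ``segment.sh`` expects
    ``segment.sh <model> <file> <encoding> <kBest>``; a kBest of 0
    requests single-best segmentation. Output goes to stdout, so it is
    redirected into ``outfile``.
    """
    return "{} {} {} {} 0 > {}".format(
        script, model, infile, encoding, outfile
    )
```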