6. ChangeLog

6.1. SMT Corpus Tools 1.0

Corpus Clean Tool 1.0

  • Add logging for predicate clean and regex clean.

6.2. SMT Corpus Tools 0.9

Corpus Clean Tool 0.9

  • Support predicate clean.
  • Support regular expression clean.
  • Implement lowercasing in Python, which is faster than the Perl script in Moses.
  • Implement three built-in predicate cleans: length diff, length limit, and sentence ratio.
  • Add an internal language code converter.
  • Support the Moses tokenizer, ChaSen, and the Stanford Word Segmenter.
  • Check that the number of lines in the corpus is unchanged after tokenization.
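
As an illustration of what the three built-in predicates might check, here is a minimal Python sketch. The function names, the whitespace tokenization, and the default thresholds (80 tokens, a difference of 20, a ratio of 3.0) are assumptions made for this example, not the tool's actual API or settings.

```python
def length_limit(source, target, max_tokens=80):
    """Keep the pair only if both sides have between 1 and max_tokens tokens."""
    return all(1 <= len(s.split()) <= max_tokens for s in (source, target))


def length_diff(source, target, max_diff=20):
    """Keep the pair only if the token counts differ by at most max_diff."""
    return abs(len(source.split()) - len(target.split())) <= max_diff


def sentence_ratio(source, target, max_ratio=3.0):
    """Keep the pair only if longer/shorter token count is at most max_ratio."""
    src_len, tgt_len = len(source.split()), len(target.split())
    if src_len == 0 or tgt_len == 0:
        return False  # an empty side makes the ratio undefined, so drop it
    return max(src_len, tgt_len) / min(src_len, tgt_len) <= max_ratio


def clean(pairs, predicates):
    """Return the sentence pairs accepted by every predicate."""
    return [(s, t) for s, t in pairs if all(p(s, t) for p in predicates)]


if __name__ == "__main__":
    corpus = [("a b c", "x y z"), ("a", "x y z w v u"), ("", "x")]
    print(clean(corpus, [length_limit, length_diff, sentence_ratio]))
```

Because each predicate takes a sentence pair and returns a boolean, pairs are filtered on both sides together, so the source and target files stay aligned line by line.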

TMX2Text Converter 1.0

  • Parse a TMX file and extract the aligned sentences for the specified languages.
  • Parse all .tmx files in a directory.
  • Log the results when parsing the .tmx files in a directory.
  • Check the number of lines in the generated corpus.
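
The extraction step can be sketched with Python's standard xml.etree.ElementTree module. This is only an illustration of the idea, not the converter's own code: the function name extract_pairs is hypothetical, and language codes are assumed to match exactly apart from letter case.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def extract_pairs(tmx_text, src_lang, tgt_lang):
    """Return (source, target) sentence pairs from a TMX document string.

    Hypothetical helper: a translation unit is kept only when it has a
    segment for both requested languages.
    """
    src_lang, tgt_lang = src_lang.lower(), tgt_lang.lower()
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.iter("tuv"):
            # Older TMX files use a plain "lang" attribute instead of xml:lang.
            lang = tuv.get(XML_LANG) or tuv.get("lang") or ""
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text nested inside inline markup.
                segments[lang.lower()] = "".join(seg.itertext()).strip()
        if src_lang in segments and tgt_lang in segments:
            pairs.append((segments[src_lang], segments[tgt_lang]))
    return pairs
```

Since every pair is emitted as a unit, writing the two sides to separate files yields the same number of lines in each, which is the property the final line-count check verifies.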