3. TMX2Text Converter

3.1. Overview

Before a TMX file can be fed into the Moses system for training, it must be converted into plain text. The TMX2Text Converter (tmx2txt.py) converts one or more TMX files into UTF-8 encoded plain text files.
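
As a rough illustration of the conversion target, the sketch below writes aligned sentence pairs to two UTF-8 files with one sentence per line, which is the line-aligned layout Moses training expects. The function name write_parallel_text and the "basename.lang" file naming are assumptions made for this example; they are not taken from tmx2txt.py.

```python
import io
import os


def write_parallel_text(pairs, output_dir, basename, source_lang, target_lang):
    """Write (source, target) pairs to two line-aligned UTF-8 text files.

    Hypothetical helper: the file naming scheme is an assumption.
    """
    src_path = os.path.join(output_dir, "%s.%s" % (basename, source_lang))
    tgt_path = os.path.join(output_dir, "%s.%s" % (basename, target_lang))
    with io.open(src_path, "w", encoding="utf-8", newline="\n") as src, \
            io.open(tgt_path, "w", encoding="utf-8", newline="\n") as tgt:
        for source_text, target_text in pairs:
            # Collapse whitespace (including newlines) so that every
            # segment occupies exactly one line in both files.
            src.write(u" ".join(source_text.split()) + u"\n")
            tgt.write(u" ".join(target_text.split()) + u"\n")
    return src_path, tgt_path
```

Keeping both files strictly line-aligned matters because line N of the source file is treated as the translation of line N of the target file.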

Command line syntax:

Usage: tmx2txt.py [options] file|directory source_lang target_lang

Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  -o DIR, --output-dir=DIR
                        output directory
  -l FILE, --log=FILE   log file
  -D, --Debug           logging debug message
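
The help text above has the layout that Python's optparse module produces. The following is a minimal sketch of an option parser that accepts the same options, assuming optparse is used; the function names, the dest names, and the version string are placeholders for this example and are not taken from tmx2txt.py.

```python
from optparse import OptionParser


def build_option_parser():
    """Build a parser that mirrors the documented command line options."""
    usage = "%prog [options] file|directory source_lang target_lang"
    # The version string is a placeholder, not the real program version.
    parser = OptionParser(usage=usage, version="%prog 0.0")
    parser.add_option("-o", "--output-dir", dest="output_dir",
                      metavar="DIR", help="output directory")
    parser.add_option("-l", "--log", dest="log",
                      metavar="FILE", help="log file")
    parser.add_option("-D", "--Debug", dest="debug", action="store_true",
                      default=False, help="logging debug message")
    return parser


def parse_command_line(argv):
    """Parse argv (without the program name) into options and positionals."""
    parser = build_option_parser()
    options, args = parser.parse_args(argv)
    if len(args) != 3:
        # parser.error prints the usage text and exits with status 2.
        parser.error("expected: file|directory source_lang target_lang")
    return options, args
```

For example, parse_command_line(["-o", "out", "corpus.tmx", "en", "zh"]) yields options.output_dir set to "out" and the three positional arguments in order.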

3.2. Module API Documentation

3.2.1. tmx2txt Module

3.2.2. format.tmxparser Module

TMX Parser Module

class corpustools.format.tmxparser.TMXParser

TMXParser reads a TMX file and extracts the aligned sentences for the specified languages.

This TMX parser uses xml.parsers.expat as its XML parser engine.

parse_file(filename, source_lang, target_lang)

Parse the TMX file filename and extract the segments for source_lang and target_lang.

start_element_handler(name, attributes)

Expat parser callback function.

end_element_handler(name)

Expat parser callback function.

char_data_handler(data)

Expat parser callback function.
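
The way three Expat callbacks can cooperate to collect aligned segments is sketched below. This is an illustrative sketch only, not the TMXParser implementation: the class name SketchTMXParser, the parse_string method, and the internal attributes are hypothetical, and language codes are assumed to match exactly (case-insensitively), so "en" does not match "en-US".

```python
import xml.parsers.expat


class SketchTMXParser(object):
    """Hypothetical Expat-based TMX reader that collects sentence pairs."""

    def parse_string(self, data, source_lang, target_lang):
        self.source_lang = source_lang.lower()
        self.target_lang = target_lang.lower()
        self.pairs = []        # collected (source, target) tuples
        self._segments = {}    # language -> text for the current <tu>
        self._lang = None      # language of the current <tuv>
        self._in_seg = False   # True while inside a <seg> element
        parser = xml.parsers.expat.ParserCreate()
        parser.StartElementHandler = self.start_element_handler
        parser.EndElementHandler = self.end_element_handler
        parser.CharacterDataHandler = self.char_data_handler
        parser.Parse(data, True)
        return self.pairs

    def start_element_handler(self, name, attributes):
        if name == "tu":
            self._segments = {}
        elif name == "tuv":
            # TMX 1.4 uses xml:lang; older files may use lang.
            lang = attributes.get("xml:lang") or attributes.get("lang") or ""
            self._lang = lang.lower()
        elif name == "seg":
            self._in_seg = True
            self._segments.setdefault(self._lang, "")

    def end_element_handler(self, name):
        if name == "seg":
            self._in_seg = False
        elif name == "tuv":
            self._lang = None
        elif name == "tu":
            # Keep the unit only if both requested languages are present.
            src = self._segments.get(self.source_lang)
            tgt = self._segments.get(self.target_lang)
            if src is not None and tgt is not None:
                self.pairs.append((src, tgt))

    def char_data_handler(self, data):
        # Expat may deliver one text node in several chunks, so append.
        if self._in_seg:
            self._segments[self._lang] += data
```

Because Expat is event-driven, the handlers have to carry state between calls; here the current language and an "inside seg" flag decide where incoming character data is stored.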