https://raw.githubusercontent.com/LanguageMachines/ucto/master/logo.svg  
  Title
    Ucto Tokeniser Engine  
  Description
     Ucto is a language-independent tokenisation engine: given an external configuration file with tokenisation rules for a specific language, it yields a tokeniser for that language. The tokeniser processes text files by separating words from punctuation and splitting sentences, which is one of the first tasks in almost any Natural Language Processing application. Ucto also offers other basic preprocessing steps, such as case conversion, that make text suitable for further processing such as indexing, part-of-speech tagging, or machine translation. It comes with tokenisation rules for English, Dutch, French, Italian, and Swedish, and is easily extensible to other languages. It recognises dates, times, units, currencies, and abbreviations, as well as paired quote spans, sentences, and paragraphs. Output is UTF-8 encoded and NFC-normalised; other encodings are optionally accepted as input. Text can optionally be converted to all lowercase or all uppercase. Ucto supports FoLiA XML.
  Project
    CLARIN-NL  
    CLARIAH-CORE  
  CLARIN National Project
CLARIN centre
    none yet  
  Research domain
Linguistic Subject
Tool task
Country
    Netherlands  
  Tool Type
Research Phase
Tool status
Input format
Output format
Input Language
Version
    v0.13  
  Access Contact
Project Contact
Creator Contact
    Ko van der Sloot  
  Documentation
Source code
Resource
CMDI File Link
License
    GNU GPL