veer66

#Buddhism #NLP #L10N #FreeSoftware

4 June 2017

Thai Word Break In Various Programming Languages

by

A benchmark of Thai word tokenizers written in various programming languages

Date: 2017/06/04

The origin post was at https://veer66.wordpress.com/2017/01/19/benchmark-thai-word-tokenizers/ posted on 2017/01/19.

I wonder about speed of programs written in different languages. For example, I wonder whether one written in Kotlin and ran on JVM is slower than one written in Go. Although there are several existing benchmarks, this is one may be still important at least for me, because Thai word tokenizer is my real task.

So @iporsut and me wrote some programs in different programming languages and tried to optimize them.

I conducted the experiment on my laptop computer, which has Intel® Core™ i3-4030U CPU @ 1.90GHz × 4, on a 20MB Thai text corpus.

Lang 1 (seconds) 2 (seconds) 3 (seconds) AVG (seconds)
Rust 3.366 3.247 3.241 3.284
Go 5.415 5.405 5.416 5.412
Crystal 5.637 5.679 5.649 5.655
Kotlin+Clojure 6.547 6.743 6.628 6.639
Julia 38.316 38.112 38.237 38.221
JavaScript 49.349 49.084 49.901 49.445
Python 50.624 50.803 50.869 50.765
Clojure+Kotlin 63.502 67.561 67.303 66.122

Additional setup

Env Source
Rust Nightly 2017-01-08 source
Go 1.7.4 source
Crystal 0.20.5 source
Kotlin 1.0.6 + Clojure 1.8.0 + OpenJDK 1.8 source, source
Julia 0.5.0 source
Node.js v6.5.0 source
Python 3.5.2 source
Clojure 1.8.0 + Kotlin 1.0.6 + OpenJDK 1.8 source

Future work

@iporsut has already written multicore versions, so maybe next month I will conduct another experiment.

tags: