[Datasets] Japanese-English Bilingual Corpus

This is a list of language resources that can be used to build a machine translation system for Japanese. I will continue to collect information.

* JParaCrawl 
The largest publicly available English-Japanese parallel corpus, created by NTT by crawling the web and automatically aligning parallel sentences.
License: Research purposes only (contact required for commercial use)

* Japanese-English Subtitle Corpus
JESC is the product of a collaboration between Stanford University, Google Brain, and Rakuten Institute of Technology. It was created by crawling the internet for movie and TV subtitles and aligning their captions. It is one of the largest freely available EN-JA corpora and covers the poorly represented domain of colloquial language.
License: Creative Commons (CC) license.

* Asian Scientific Paper Excerpt Corpus
It consists of a Japanese-English paper abstract corpus of 3M parallel sentences (ASPEC-JE) and a Japanese-Chinese paper excerpt corpus of 680K parallel sentences (ASPEC-JC). This corpus is one of the achievements of the Japanese-Chinese machine translation project which was run in Japan from 2006 to 2010 by the Japan Science and Technology Agency (JST) and the National Institute of Information and Communications Technology (NICT).
License: Research purposes only (commercial use is prohibited)

* Japanese-English Bilingual Corpus of Laws and Regulations
Data crawled from Japanese Law Translation (
License: Free

* Japanese-English Bilingual Corpus | Kaggle
The “Japanese-English Bilingual Corpus of Wikipedia’s Kyoto Articles” mainly aims to support research and development in high-performance multilingual machine translation, information extraction, and other language processing technologies.
License: Creative Commons Attribution-Share-Alike License 3.0.