Python Environment
This tutorial assumes you have a Python 3 SciPy environment installed.
You must have Keras (2.0 or higher) installed with either the TensorFlow or Theano backend.
The tutorial also assumes you have NumPy and Matplotlib installed.
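If you are unsure whether your environment is ready, a quick check like the one below can help. It is a minimal sketch: the helper name check_environment is hypothetical, and it only reports whether each package can be imported and which version string it exposes. It does not verify that your Keras version is 2.0 or higher, or which backend is configured.

```python
import importlib


def check_environment(packages=("scipy", "numpy", "matplotlib", "keras")):
    """Return a dict mapping each package name to its version string.

    A value of None means the package could not be imported.
    'unknown' means it imported but exposes no __version__ attribute.
    """
    versions = {}
    for name in packages:
        try:
            module = importlib.import_module(name)
        except ImportError:
            # package is missing (or one of its own imports failed)
            versions[name] = None
        else:
            versions[name] = getattr(module, "__version__", "unknown")
    return versions
```

Calling check_environment() and printing the returned dictionary shows at a glance which of the four libraries still need to be installed.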
German to English Translation Dataset
We will use a dataset of German to English terms used as the basis for flashcards for language learning.
The dataset is available from the ManyThings.org website, with examples drawn from the Tatoeba Project. The dataset consists of German phrases and their English counterparts and is intended to be used with the Anki flashcard software.
The page provides a list of many language pairs, and I encourage you to explore other languages:
The dataset we will use in this tutorial is available for download here:
Download the dataset to your current working directory and decompress it.
Preparing the Text Data
The next step is to prepare the text data for modeling.
Take a look at the raw data and note what you see that we might need to handle in a data cleaning operation.
For example, here are some observations I note from reviewing the raw data:
- There is punctuation.
- The text contains both uppercase and lowercase letters.
- There are special characters in the German.
- There are duplicate phrases in English with different translations in German.
- The file is ordered by sentence length with very long sentences toward the end of the file.
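Some of the observations above can be checked with a few lines of code rather than by eye. The sketch below is one way to do it, not a required step. It assumes each line of the decompressed file holds an English phrase and its German translation separated by a tab, in that order; check this against your copy of the file. The function name summarize_pairs is hypothetical.

```python
import string
from collections import defaultdict


def summarize_pairs(lines):
    """Count a few properties of tab-separated 'English<TAB>German' lines.

    The column order is an assumption; lines with fewer than two
    tab-separated fields are skipped.
    """
    translations = defaultdict(set)
    stats = {
        "pairs": 0,
        "with_punctuation": 0,
        "with_uppercase": 0,
        "with_non_ascii": 0,
    }
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 2:
            continue
        english, german = parts[0], parts[1]
        text = english + german
        stats["pairs"] += 1
        if any(ch in string.punctuation for ch in text):
            stats["with_punctuation"] += 1
        if any(ch.isupper() for ch in text):
            stats["with_uppercase"] += 1
        if any(ord(ch) > 127 for ch in text):
            stats["with_non_ascii"] += 1
        translations[english].add(german)
    # English phrases that appear with more than one distinct German translation
    stats["english_with_multiple_translations"] = sum(
        1 for german_set in translations.values() if len(german_set) > 1
    )
    return stats
```

To run it on the real data, open the decompressed file with encoding="utf-8" and pass the file object to summarize_pairs, since iterating over a file yields its lines.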
Did you note anything else that could be important?
Let me know in the comments below.
A good text cleaning procedure may handle some or all of these observations.
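As a rough illustration of what such a procedure might look like, the sketch below handles three of the observations: special characters, mixed case, and punctuation. It is one possible approach under stated assumptions, not the only reasonable one. The function name clean_phrase is hypothetical, and the choice to drop every character that has no ASCII equivalent is a simplification.

```python
import string
import unicodedata


def clean_phrase(phrase):
    """Normalize a single phrase to lowercase ASCII words without punctuation."""
    # decompose accented characters (e.g. 'ü' becomes 'u' plus a combining mark)
    normalized = unicodedata.normalize("NFD", phrase)
    # drop everything that is not plain ASCII, including the combining marks
    ascii_text = normalized.encode("ascii", "ignore").decode("ascii")
    # fold case so that 'Hi' and 'hi' are treated as the same word
    lowered = ascii_text.lower()
    # remove ASCII punctuation characters
    stripped = lowered.translate(str.maketrans("", "", string.punctuation))
    # keep only purely alphabetic tokens (this discards numbers)
    tokens = [token for token in stripped.split() if token.isalpha()]
    return " ".join(tokens)
```

Note that this treatment is lossy for German: a character such as 'ß' has no ASCII decomposition and is removed entirely, so you may prefer to map it to 'ss' first or to keep the original characters.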
Data preparation is divided into two subsections:
- Clean Text
- Split Text