
Day 7 - Developing a Neural Machine Translation System from Scratch Part 1

Avnish Yadav

2 Sept, 2018

Hello guys,

This is day 7 of my #100DaysOfMLCode challenge. Today, I am planning to build a Neural Machine Translation system that translates from German to English and from English to German.

Steps involved in this project:

German to English Translation Dataset
Preparing the Text Data
Train Neural Translation Model
Evaluate Neural Translation Model

Python Environment

This project requires a Python 3 SciPy environment.

You must have Keras (2.0 or higher) installed with either the TensorFlow or Theano backend.

The tutorial also assumes you have NumPy and Matplotlib installed.
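A quick way to confirm the environment is to try importing each library and report its version. The sketch below is only an illustration: the helper name `check_environment` and the package list are my own assumptions, not part of any library.

```python
import importlib


def check_environment(packages=("scipy", "numpy", "matplotlib", "keras")):
    """Return a dict mapping each package name to its version string.

    A package that cannot be imported maps to None.
    (Hypothetical helper; the default package list is an assumption.)
    """
    versions = {}
    for name in packages:
        try:
            module = importlib.import_module(name)
            # Most scientific packages expose __version__; fall back if not.
            versions[name] = getattr(module, "__version__", "unknown")
        except Exception:
            # A missing package or a missing backend both end up here.
            versions[name] = None
    return versions
```

Calling `check_environment()` and printing the result shows at a glance which of the libraries still need to be installed.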

German to English Translation Dataset

We will use a dataset of German to English terms used as the basis for flashcards for language learning.

The dataset is available from the ManyThings.org website, with examples drawn from the Tatoeba Project. The dataset consists of German phrases and their English counterparts and is intended to be used with the Anki flashcard software.

The page provides a list of many language pairs, and I encourage you to explore other languages:

Tab-delimited Bilingual Sentence Pairs

The dataset we will use in this tutorial is available for download here:

German – English deu-eng.zip

Download the dataset to your current working directory and decompress it.
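If you prefer to decompress the archive from Python instead of a file manager, the standard `zipfile` module can do it. This is a minimal sketch that assumes the archive was saved as `deu-eng.zip` in the working directory; the function name `extract_dataset` is hypothetical.

```python
import zipfile


def extract_dataset(zip_path="deu-eng.zip", out_dir="."):
    """Extract every file in the archive into out_dir.

    Returns the list of file names stored in the archive.
    (The default path is an assumption about where you saved the download.)
    """
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(out_dir)
        return archive.namelist()
```

The returned list tells you which text file to open in the next step.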

Preparing the Text Data

The next step is to prepare the text data for modeling.

Take a look at the raw data and note what you see that we might need to handle in a data cleaning operation.
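One way to take that look is to load the file into a list of pairs and print a few from each end. The sketch below assumes the extracted file is named `deu.txt`, is UTF-8 encoded, and stores one pair per line with the English phrase first and the German phrase second, separated by a tab. Any extra columns are ignored. Both function names are hypothetical.

```python
def load_pairs(path="deu.txt", encoding="utf-8"):
    """Read a tab-delimited file into a list of (english, german) tuples.

    Assumes column 1 is English and column 2 is German; lines with
    fewer than two columns are skipped.
    """
    pairs = []
    with open(path, encoding=encoding) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 2:
                pairs.append((fields[0], fields[1]))
    return pairs


def summarize(pairs, n=3):
    """Collect a few simple facts that are useful when eyeballing the data."""
    english = [en for en, _ in pairs]
    return {
        "pairs": len(pairs),
        # Fewer unique English phrases than pairs means repeated phrases.
        "unique_english": len(set(english)),
        "head": pairs[:n],
        "tail": pairs[-n:],
    }
```

Comparing `head` with `tail` is a quick check of how the sentences are ordered in the file.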

For example, here are some observations I noted from reviewing the raw data:

There is punctuation.
The text contains uppercase and lowercase.
There are special characters in the German.
There are duplicate phrases in English with different translations in German.
The file is ordered by sentence length with very long sentences toward the end of the file.

Did you note anything else that could be important?
Let me know in the comments below.

A good text cleaning procedure may handle some or all of these observations.
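As an illustration, here is one possible cleaning function that deals with punctuation, case, and special characters. It is a sketch of one reasonable choice, not the only one: it reduces everything to lowercase ASCII letters, which is lossy for German (for example, "ß" is dropped entirely and umlauts lose their dots). The name `clean_sentence` is hypothetical.

```python
import string
import unicodedata

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_sentence(sentence):
    """Lowercase a sentence and keep only purely alphabetic ASCII words."""
    # Split accented letters into base letter + combining mark,
    # then drop everything that is not ASCII (marks, and letters like ß).
    decomposed = unicodedata.normalize("NFD", sentence)
    ascii_text = decomposed.encode("ascii", "ignore").decode("ascii")
    words = ascii_text.lower().split()
    # Strip punctuation from each word.
    words = [word.translate(_PUNCT_TABLE) for word in words]
    # Discard tokens that are empty or contain digits.
    words = [word for word in words if word.isalpha()]
    return " ".join(words)
```

For example, `clean_sentence("Über Äpfel!")` gives `"uber apfel"`. If you want to keep the German characters, skip the ASCII step and only lowercase and strip punctuation.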

Data preparation is divided into two subsections:

Clean Text
Split Text
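The split step can be sketched as follows. Because the file is ordered by sentence length, taking the first `n_samples` pairs keeps only the shortest sentences, which makes a first experiment smaller. The values 10000 and 0.1, the fixed seed, and the name `split_pairs` are illustrative assumptions, not figures from this post.

```python
import random


def split_pairs(pairs, n_samples=10000, test_fraction=0.1, seed=1):
    """Take the first n_samples pairs, shuffle them, and split train/test.

    Returns (train, test). The defaults are placeholders to tune.
    """
    subset = list(pairs[:n_samples])
    # A seeded generator makes the shuffle reproducible between runs.
    random.Random(seed).shuffle(subset)
    n_test = int(len(subset) * test_fraction)
    return subset[n_test:], subset[:n_test]
```

Shuffling before splitting matters here: without it, the test set would contain only the longest sentences of the subset.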

Thanks for reading the first part.
