Attention! This text has been automatically translated!
Since I made so many mistakes in my first blog post, I am writing this post in German and having it translated automatically.
For the translation I use the popular NLP framework from huggingface.co. Their website provides a simple example of a translation application, and I will use it.
As expected, the Markdown syntax does not survive the translation untouched, so I had to make some adjustments beforehand and afterwards.
The code (including pre- and post-processing) I used to translate the markdown files can be found here. But since it is just a few lines of code, we can also look at it here:
```python
from transformers import MarianMTModel, MarianTokenizer

# load pretrained model and tokenizer
model_name = 'Helsinki-NLP/opus-mt-de-en'
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# load german blog post
f_in = open("blog_translator_de.md", "r")
src_text = f_in.readlines()
f_in.close()

# preprocessing
## a bare line break (\n) results in "I don't know." We make it more specific:
src_text = [s.replace('\n', ' <eol>') for s in src_text]

## remove code blocks
code = []
inside_code_block = False
for i, line in enumerate(src_text):
    if line.startswith('```') and not inside_code_block:
        # entering code block
        inside_code_block = True
        code += [line]
        src_text[i] = '<<code_block>>'
    elif inside_code_block and not line.startswith('```'):
        code += [line]
        src_text[i] = '<<code_block>>'
    elif inside_code_block and line.startswith('```'):
        # leaving code block
        code += [line]
        src_text[i] = '<<code_block>>'
        inside_code_block = False

# translate
translated = model.generate(**tokenizer.prepare_seq2seq_batch(src_text, return_tensors="pt"))
tgt_text = [tokenizer.decode(t, skip_special_tokens=True) for t in translated]

# postprocessing
## replace the code_block tags with the original code
for i, line in enumerate(tgt_text):
    if line == '<<code_block>>':
        tgt_text[i] = code.pop(0)

## remove the eol marker (but keep empty list entries / lines)
tgt_text = [s.replace('<eol>', '') for s in tgt_text]

## remove the space between ] and ( to get the md link syntax right
tgt_text = [s.replace('] (', '](') for s in tgt_text]

# write english blog post
with open('2020-12-26-blog-translator.md', 'w') as f_out:
    for line in tgt_text:
        f_out.write("%s\n" % line)
```
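The code-block handling above can be boiled down to two small functions. This is just a minimal sketch with hypothetical helper names (they do not appear in my script), but it shows the round trip: fenced code blocks are swapped for placeholders before translation and restored afterwards, so the translator never sees them.

```python
def protect_code_blocks(lines):
    """Replace every line of a fenced code block with a placeholder."""
    code, out = [], []
    inside = False
    for line in lines:
        is_fence = line.startswith('```')
        if inside or is_fence:
            # store the original line and emit a placeholder instead
            code.append(line)
            out.append('<<code_block>>')
            if is_fence:
                # an opening fence turns protection on, a closing fence off
                inside = not inside
        else:
            out.append(line)
    return out, code

def restore_code_blocks(lines, code):
    """Swap the placeholders back for the stored code lines, in order."""
    return [code.pop(0) if line == '<<code_block>>' else line for line in lines]
```

Running the protected lines through the translator and then through `restore_code_blocks` reproduces the code blocks verbatim.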
Since this is my first NLP application, I have left it at this Hello World code. Surely there are smarter ways to handle the Markdown syntax in the tokenizer. Maybe I'll write a follow-up once I find out.
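Until then, one lightweight extension of the placeholder idea would be to protect inline code spans the same way. This is only a sketch with hypothetical helper names, not part of my script:

```python
import re

def protect_inline_code(line):
    """Swap `inline code` spans for numbered placeholders before translation."""
    spans = []
    def repl(match):
        spans.append(match.group(0))
        return f'<<c{len(spans) - 1}>>'
    return re.sub(r'`[^`]+`', repl, line), spans

def restore_inline_code(line, spans):
    """Put the stored inline code spans back after translation."""
    for i, span in enumerate(spans):
        line = line.replace(f'<<c{i}>>', span)
    return line
```

The numbered placeholders keep multiple spans in one line distinguishable, so they can be restored in the right places even if the translator reorders the sentence.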
By the way, the translation itself made me adapt my German writing style. For example, sarcasm doesn't survive translation well, so I avoided it. Also, a lot depends on the right choice of words (e.g. there is no Markdown "command", but there is Markdown "syntax").
Best regards
Johannes & the Robot