This question concerns Microsoft Cognitive Services, specifically the Azure AI Translator service. The translator exposes two types of API for document translation, synchronous and asynchronous; the differences between them are described in the official documentation.
According to the docs, only the asynchronous API supports .md input. My job is to translate a large set of documentation (written in Markdown) into a specified language using this service. The problem is that the whole document gets translated, including the .md metadata, which always sits at the top of the document in this form:
---
Title: Elasticsearch guidelines
Description: Guidelines for Elasticsearch.
weight: 30
---
I did not find any documentation on a mechanism that lets me tell the API either to leave the metadata out of translation or at least to translate only the values and not the keys (the keys are essential because later in the pipeline we need the metadata to generate HTML from these Markdown files). Has anyone dealt with this before or found a viable solution? I'm open to discussion.
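One option is to preprocess the Markdown documents before sending them for translation: strip the metadata block from each file, translate only the body, and then put the original metadata back on top of the translated output.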
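For a whole documentation tree, the same preprocessing can be applied file by file. The following is a minimal sketch that uses only the standard library and makes a few assumptions: the metadata block is always the first thing in the file and is delimited by `---` lines, and the translated files come back with the same relative paths. The helper names `strip_front_matter` and `restore_front_matter` are made up for this illustration and are not part of any Azure SDK.

```python
import re
from pathlib import Path

# A metadata block: '---' on the first line, up to the next '---' line.
FRONT_MATTER = re.compile(r"\A---\r?\n.*?\r?\n---\r?\n", re.DOTALL)


def strip_front_matter(src_dir: Path, staging_dir: Path) -> dict[str, str]:
    """Copy every .md file under src_dir to staging_dir without its metadata.

    Returns a mapping of relative path -> removed metadata block
    (an empty string for files that have no metadata block).
    """
    removed: dict[str, str] = {}
    for path in sorted(src_dir.rglob("*.md")):
        text = path.read_text(encoding="utf-8")
        match = FRONT_MATTER.match(text)
        header = match.group(0) if match else ""
        rel = path.relative_to(src_dir)
        out = staging_dir / rel
        out.parent.mkdir(parents=True, exist_ok=True)
        # Only the body is written; this is what gets sent for translation.
        out.write_text(text[len(header):], encoding="utf-8")
        removed[rel.as_posix()] = header
    return removed


def restore_front_matter(translated_dir: Path, removed: dict[str, str]) -> None:
    """Prepend the saved metadata block to each translated file, in place."""
    for rel, header in removed.items():
        path = translated_dir / rel
        body = path.read_text(encoding="utf-8")
        path.write_text(header + body, encoding="utf-8")
```

The idea is to call `strip_front_matter` before uploading the staging directory for translation and `restore_front_matter` on the downloaded results, so the keys and values in the metadata never reach the translator.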
Code:
# Requires: pip install azure-ai-translation-document azure-storage-blob
import re
from azure.ai.translation.document import DocumentTranslationClient
from azure.core.credentials import AzureKeyCredential
from azure.storage.blob import ContainerClient

# Translator credentials, plus SAS URLs for the source and target blob containers
# (the asynchronous Document Translation API reads from and writes to Azure Blob Storage)
subscription_key = "YOUR_SUBSCRIPTION_KEY"
endpoint = "YOUR_TRANSLATOR_ENDPOINT"
source_container_url = "YOUR_SOURCE_CONTAINER_SAS_URL"
target_container_url = "YOUR_TARGET_CONTAINER_SAS_URL"
credential = AzureKeyCredential(subscription_key)

# Sample Markdown content
markdown_content = """---
Title: Elasticsearch guidelines
Description: Guidelines for Elasticsearch.
Weight: 30
---
# Introduction
This is the introduction section of your document.
...
"""

# Split the metadata block (including its --- delimiters) from the main content
metadata_pattern = r"\A---\n.*?\n---\n"
metadata_match = re.search(metadata_pattern, markdown_content, re.DOTALL)
if metadata_match:
    metadata_section = metadata_match.group(0)
    main_content = markdown_content[metadata_match.end():]
else:
    raise ValueError("Metadata section not found in the Markdown content.")

# Translate the main content
def translate_content(body, blob_name="document.md", target_language="fr"):
    # Upload only the body, so the metadata never reaches the translator
    source = ContainerClient.from_container_url(source_container_url)
    source.upload_blob(blob_name, body.encode("utf-8"), overwrite=True)
    # Start the translation job and wait for it to finish
    client = DocumentTranslationClient(endpoint, credential)
    poller = client.begin_translation(source_container_url, target_container_url, target_language)
    for document in poller.result():
        if document.status != "Succeeded":
            raise RuntimeError(f"Translation failed: {document.error.message}")
    # Download the translated body (assumed to keep the same blob name in the target container)
    target = ContainerClient.from_container_url(target_container_url)
    return target.download_blob(blob_name).readall().decode("utf-8")

translated_main_content = translate_content(main_content)

# Reassemble the document: original metadata followed by the translated body
translated_markdown = f"{metadata_section}\n{translated_main_content}"
print(translated_markdown)
Original Markdown:
---
Title: Elasticsearch guidelines
Description: Guidelines for Elasticsearch.
Weight: 30
---
# Introduction
After translation (metadata left untouched, body translated):
---
Title: Elasticsearch guidelines
Description: Guidelines for Elasticsearch.
Weight: 30
---
# Translated Introduction
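If the metadata values should be translated too while the keys stay intact, the block can be processed line by line instead of being skipped entirely. Below is a rough sketch, assuming the metadata consists of flat `key: value` lines (not nested YAML). The function name `translate_front_matter_values` and the `translate` callable are hypothetical: `translate` stands for whatever you use to translate a short string, for example a wrapper around a text translation call, and is not an Azure API.

```python
from typing import Callable


def translate_front_matter_values(
    front_matter: str,
    translate: Callable[[str], str],
    keys_to_translate: frozenset[str] = frozenset({"title", "description"}),
) -> str:
    """Translate selected values in a flat `key: value` metadata block.

    Keys, `---` delimiters and all other lines are passed through unchanged.
    `translate` is a caller-supplied function (hypothetical here).
    """
    out = []
    for line in front_matter.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() in keys_to_translate and value.strip():
            # Keep the key exactly as written; only the value is translated.
            out.append(f"{key}: {translate(value.strip())}")
        else:
            out.append(line)
    result = "\n".join(out)
    # Preserve a trailing newline if the input had one.
    return result + "\n" if front_matter.endswith("\n") else result
```

Numeric or structural fields such as `Weight` are left alone because they are not listed in `keys_to_translate`; which keys belong in that set depends on what your HTML generator expects.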