azure-cognitive-services microsoft-translator

Maintaining the keys of Markdown's metadata in the source language

This question concerns Microsoft Cognitive Services, specifically the Azure AI Translator service. Translator exposes two APIs for document translation:

Async translation API
Sync translation API

The differences between the two are described in the official documentation. According to the docs, only the async API accepts Markdown (.md) input, and my job is to translate a large set of documentation written in Markdown into a specified language using this service. The problem is that the whole document gets translated, including the metadata that always sits at the top of each file in this form:

---
Title: Elasticsearch guidelines
Description: Guidelines for Elasticsearch.
weight: 30
---

I did not find any documentation on a mechanism that lets me tell the API either to leave the metadata out of the translation or, at least, to translate only the values and not the keys. The keys are essential: later in the pipeline we use the metadata to generate HTML from these Markdown files. Has anyone dealt with this before or found a workable solution? I'm open to discussion.

Solution

Preprocess the Markdown documents before sending them for translation:

Parse each Markdown document and extract the metadata section.
Translate only the main content of the document using the async translation API.
If necessary, translate the metadata values separately (the keys are fixed and must stay in the source language).
Reassemble the document from the original metadata keys, the (optionally translated) values, and the translated content.
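
The step of translating only the metadata values could be sketched as follows. This is a minimal illustration, assuming flat `key: value` metadata lines as in the question; the function name `translate_front_matter_values` and the `translate` callback are hypothetical, and `str.upper` merely stands in for a real translation call.

```python
from typing import Callable, Iterable


def translate_front_matter_values(
    front_matter: str,
    translate: Callable[[str], str],
    keys_to_translate: Iterable[str] = ("Title", "Description"),
) -> str:
    """Translate the values of selected keys; leave keys and all other lines untouched."""
    wanted = {key.lower() for key in keys_to_translate}
    out = []
    for line in front_matter.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() in wanted and value.strip():
            # Re-emit the original key verbatim, followed by the translated value.
            out.append(f"{key}: {translate(value.strip())}")
        else:
            # Numbers, unknown keys, and anything that is not "key: value" pass through.
            out.append(line)
    return "\n".join(out)


sample = (
    "Title: Elasticsearch guidelines\n"
    "Description: Guidelines for Elasticsearch.\n"
    "weight: 30"
)
# str.upper is a stand-in for a real translation function.
print(translate_front_matter_values(sample, str.upper))
```

Listing the translatable keys explicitly means that numeric or structural fields such as `weight` are never sent to the translator.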

Code:

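import re
from azure.ai.translation.document import DocumentTranslationClient
from azure.core.credentials import AzureKeyCredential
from azure.storage.blob import ContainerClient

# Translator credentials (placeholders)
subscription_key = "YOUR_SUBSCRIPTION_KEY"
endpoint = "YOUR_TRANSLATOR_ENDPOINT"
credential = AzureKeyCredential(subscription_key)

# The async (batch) API reads from and writes to Azure Blob Storage containers.
# Placeholders: container SAS URLs, assumed to grant the permissions used below.
source_container_url = "YOUR_SOURCE_CONTAINER_SAS_URL"
target_container_url = "YOUR_TARGET_CONTAINER_SAS_URL"
blob_name = "document.md"  # illustrative blob name

# Sample Markdown content
markdown_content = """
---
Title: Elasticsearch guidelines
Description: Guidelines for Elasticsearch.
Weight: 30
---

# Introduction
This is the introduction section of your document.
...
"""

# Extract metadata and main content. The pattern is anchored to the start of
# the document so that a later "---" (a horizontal rule) is never matched.
metadata_pattern = r"\A\s*---[ \t]*\n(.*?)\n[ \t]*---[ \t]*(?:\n|\Z)"
metadata_match = re.search(metadata_pattern, markdown_content, re.DOTALL)
if metadata_match:
    metadata_section = metadata_match.group(1).strip()
    main_content = markdown_content[metadata_match.end():].strip()
else:
    raise ValueError("Metadata section not found in the Markdown content.")

# Translate the main content
def translate_content(content, target_language="fr"):
    source = ContainerClient.from_container_url(source_container_url)
    target = ContainerClient.from_container_url(target_container_url)

    # Upload only the body; the metadata never reaches the translator
    source.upload_blob(blob_name, content.encode("utf-8"), overwrite=True)

    client = DocumentTranslationClient(endpoint, credential)
    # The source language is auto-detected; change the target language as needed
    poller = client.begin_translation(source_container_url, target_container_url, target_language)
    for document in poller.result():  # blocks until the operation finishes
        if document.status != "Succeeded":
            raise RuntimeError(f"Translation failed for document {document.id}: {document.error}")

    # The translated file keeps its name in the target container
    return target.download_blob(blob_name).readall().decode("utf-8")

translated_main_content = translate_content(main_content)

# Reassemble the translated document, restoring the "---" delimiters
translated_markdown = f"---\n{metadata_section}\n---\n\n{translated_main_content}"

print(translated_markdown)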
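
Since the task is a whole set of documentation rather than a single string, the same split-and-restore idea can be applied to a directory. The following is a sketch under stated assumptions: the function names `stage_for_translation` and `restore_front_matter` are hypothetical, the translated files are assumed to keep their relative paths, and the translation call itself is left out.

```python
import re
from pathlib import Path

# Metadata block at the very top of a file, tolerating stray spaces around "---".
FRONT_MATTER = re.compile(r"\A---[ \t]*\n(.*?)\n[ \t]*---[ \t]*(?:\n|\Z)", re.DOTALL)


def stage_for_translation(docs_dir: Path, staging_dir: Path) -> dict[str, str]:
    """Write metadata-free copies of every .md file; return {relative path: metadata}."""
    saved = {}
    for path in sorted(docs_dir.rglob("*.md")):
        text = path.read_text(encoding="utf-8")
        match = FRONT_MATTER.match(text)
        rel = path.relative_to(docs_dir)
        saved[rel.as_posix()] = match.group(1) if match else ""
        body = text[match.end():] if match else text
        out = staging_dir / rel
        out.parent.mkdir(parents=True, exist_ok=True)
        out.write_text(body.lstrip("\n"), encoding="utf-8")
    return saved


def restore_front_matter(translated_dir: Path, saved: dict[str, str]) -> None:
    """Prepend the saved, untranslated metadata to each translated file, in place."""
    for rel, front_matter in saved.items():
        path = translated_dir / rel
        if not front_matter or not path.exists():
            continue  # nothing to restore for this file
        body = path.read_text(encoding="utf-8")
        path.write_text(f"---\n{front_matter}\n---\n\n{body}", encoding="utf-8")
```

The staging directory is what would be uploaded for translation; once the translated files are back, `restore_front_matter` puts the original keys and values on top again.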

Use the async translation API to translate the main content only (excluding the metadata), then combine the translated content with the metadata, whose keys stay in the source language and whose values may be translated separately.
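
For the short metadata values, one option is the Translator text endpoint rather than a document job. The sketch below builds such a request with the standard library only; it assumes the global v3.0 endpoint and a key plus region for the `Ocp-Apim-Subscription-Key` and `Ocp-Apim-Subscription-Region` headers. The helper names `build_translate_request` and `translate_texts` are hypothetical.

```python
import json
import urllib.parse
import urllib.request

TEXT_ENDPOINT = "https://api.cognitive.microsofttranslator.com/translate"


def build_translate_request(texts, target_language, key, region, source_language=None):
    """Build a POST request for the Translator v3.0 text endpoint."""
    params = {"api-version": "3.0", "to": target_language}
    if source_language:
        params["from"] = source_language
    url = f"{TEXT_ENDPOINT}?{urllib.parse.urlencode(params)}"
    headers = {
        "Ocp-Apim-Subscription-Key": key,
        "Ocp-Apim-Subscription-Region": region,
        "Content-Type": "application/json",
    }
    # The request body is a JSON array with one {"Text": ...} object per string.
    body = json.dumps([{"Text": text} for text in texts]).encode("utf-8")
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def translate_texts(texts, target_language, key, region, source_language=None):
    """Send the request and return one translated string per input string."""
    request = build_translate_request(texts, target_language, key, region, source_language)
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return [item["translations"][0]["text"] for item in payload]
```

Only the values are passed in, for example `translate_texts(["Elasticsearch guidelines"], "fr", key, region)`, so the keys are never exposed to translation.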

Original Markdown:

---
Title: Elasticsearch guidelines
Description: Guidelines for Elasticsearch.
Weight: 30
---

# Introduction

After Translation:

---
Title: Translated title (Elasticsearch guidelines)
Description: Translated description (Guidelines for Elasticsearch.)
Weight: 30
---

# Translated Introduction
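
Because the later HTML generation depends on the keys, a quick check that they survived may be worth running on every translated file. This is a minimal sketch assuming flat `key: value` metadata; `front_matter_keys` and `keys_preserved` are hypothetical helper names.

```python
import re

# Metadata block at the top of the document, tolerating stray whitespace.
FRONT_MATTER = re.compile(r"\A\s*---[ \t]*\n(.*?)\n[ \t]*---[ \t]*(?:\n|\Z)", re.DOTALL)


def front_matter_keys(markdown: str) -> list[str]:
    """Return the metadata keys of a Markdown document, in order of appearance."""
    match = FRONT_MATTER.match(markdown)
    if not match:
        return []
    keys = []
    for line in match.group(1).splitlines():
        key, sep, _ = line.partition(":")
        if sep and key.strip():
            keys.append(key.strip())
    return keys


def keys_preserved(original: str, translated: str) -> bool:
    """True when both documents carry exactly the same keys in the same order."""
    return front_matter_keys(original) == front_matter_keys(translated)
```

A failing check points to a file where the metadata went through the translator or was lost during reassembly.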