azure-cognitive-services microsoft-translator

Maintaining the keys of Markdown's metadata in the source language


This question concerns the Microsoft Cognitive Services, specifically their Azure AI Translator service. The translator exposes two types of APIs for document translation:

  • Async translation API
  • Sync translation API

The differences between the two are described in the official documentation. According to the docs, only the async API supports .md input, and my job is to translate a large set of documentation (written in Markdown) into a specified language using this service. The problem is that the whole document gets translated, including the Markdown metadata, which always sits at the top of the document in this form:

---
Title: Elasticsearch guidelines
Description: Guidelines for Elasticsearch.
weight: 30
---

I could not find any documentation on a mechanism that tells the API to either leave the metadata out of translation or translate only the values and not the keys. The keys are essential because later in the pipeline we use the metadata to generate HTML from these Markdown files. Has anyone dealt with this before or found a workable solution? I'm open to discussion.


Solution

  • I did not find any documentation on a mechanism that allows me to tell the API to either leave the metadata out of translation or at least only translate the values and not the keys (the keys are essential since later on in the pipeline we need the metadata to generate html from these markdown files).

    Preprocess the Markdown documents before sending them for translation.

    1. Parse the Markdown document to extract the metadata section.
    2. Translate only the main content of the document using the async translation API.
    3. If necessary, translate the metadata values separately; the keys are fixed and must stay in the source language.
    4. Reassemble the document from the metadata (original keys, optionally translated values) and the translated main content.

    Code:

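    import re
    from azure.ai.translation.document import DocumentTranslationClient
    from azure.core.credentials import AzureKeyCredential
    from azure.storage.blob import ContainerClient

    # Translator credentials (placeholders)
    subscription_key = "YOUR_SUBSCRIPTION_KEY"
    endpoint = "YOUR_TRANSLATOR_ENDPOINT"
    credential = AzureKeyCredential(subscription_key)

    # The async (batch) API reads from and writes to Azure Blob Storage containers,
    # so the body is uploaded instead of being passed inline (placeholder SAS URLs).
    source_container_url = "YOUR_SOURCE_CONTAINER_SAS_URL"
    target_container_url = "YOUR_TARGET_CONTAINER_SAS_URL"
    blob_name = "elasticsearch-guidelines.md"  # example file name

    # Sample Markdown content
    markdown_content = """---
    Title: Elasticsearch guidelines
    Description: Guidelines for Elasticsearch.
    Weight: 30
    ---

    # Introduction
    This is the introduction section of your document.
    ...
    """

    # Extract the metadata (front matter) and the main content.
    # The pattern is anchored to the start so a later "---" rule is not matched.
    metadata_pattern = r"\A---[ \t]*\n(.*?)\n---[ \t]*\n"
    metadata_match = re.search(metadata_pattern, markdown_content, re.DOTALL)
    if metadata_match:
        metadata_section = metadata_match.group(1).strip()
        main_content = markdown_content[metadata_match.end():].strip()
    else:
        raise ValueError("Metadata section not found in the Markdown content.")

    # Translate the main content only
    def translate_content():
        target_language = "fr"  # Change this to your desired target language

        # Upload the body, without its metadata, to the source container
        source_container = ContainerClient.from_container_url(source_container_url)
        source_container.upload_blob(blob_name, main_content, overwrite=True)

        # Start the job (it translates every document in the source container)
        # and wait for it to finish; the source language is auto-detected
        client = DocumentTranslationClient(endpoint, credential)
        poller = client.begin_translation(source_container_url, target_container_url, target_language)
        for document in poller.result():
            if document.status != "Succeeded":
                raise RuntimeError(f"Translation failed for document {document.id}")

        # The translated file keeps its name in the target container
        target_container = ContainerClient.from_container_url(target_container_url)
        return target_container.download_blob(blob_name).readall().decode("utf-8")

    translated_main_content = translate_content()

    # Reassemble the translated document, restoring the --- delimiters
    translated_markdown = f"---\n{metadata_section}\n---\n\n{translated_main_content}"

    print(translated_markdown)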
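
Since the question is about a whole set of documentation rather than a single string, the same split can be applied to a directory tree before anything is sent for translation. The following is a minimal sketch using only the standard library. The function names `stage_bodies` and `restore` and the staging-directory layout are assumptions made for illustration, not part of any Azure SDK.

```python
import re
from pathlib import Path

# Front matter: a block delimited by --- lines at the very start of the file
FRONT_MATTER = re.compile(r"\A---[ \t]*\n(.*?)\n---[ \t]*\n", re.DOTALL)

def stage_bodies(docs_dir, staging_dir):
    """Write each .md file's body (front matter removed) under staging_dir.

    Returns {relative path: front matter text}; files without front matter
    map to None. staging_dir is assumed to lie outside docs_dir.
    """
    docs_dir, staging_dir = Path(docs_dir), Path(staging_dir)
    front_matter = {}
    for path in sorted(docs_dir.rglob("*.md")):
        text = path.read_text(encoding="utf-8")
        match = FRONT_MATTER.match(text)
        relative = path.relative_to(docs_dir)
        front_matter[relative.as_posix()] = match.group(1) if match else None
        body = text[match.end():] if match else text
        staged = staging_dir / relative
        staged.parent.mkdir(parents=True, exist_ok=True)
        staged.write_text(body, encoding="utf-8")
    return front_matter

def restore(front_matter_text, translated_body):
    """Put the untouched front matter back in front of a translated body."""
    if front_matter_text is None:
        return translated_body
    return f"---\n{front_matter_text}\n---\n{translated_body}"
```

The staged files contain no metadata, so whatever translates them cannot alter the keys. The returned dictionary is what gets re-attached afterwards with `restore`.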
    
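    • Use the async translation API to translate the main content only (excluding the metadata), then combine the translated main content with the metadata. The metadata keys stay in the source language; the values can be translated separately if needed.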
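
If the metadata values also need translating, one possible approach is to process the metadata line by line and pass only the value part to a translator. This sketch assumes flat `key: value` lines (no nested or multi-line YAML). The `translate_text` argument is a hypothetical callable standing in for whatever text translation call you use; it is not an Azure SDK function.

```python
def translate_metadata_values(metadata_section, translate_text, keys_to_translate):
    """Translate the values of the selected keys; every key is left untouched.

    translate_text is a hypothetical callable (str -> str) supplied by the caller.
    """
    wanted = {key.lower() for key in keys_to_translate}
    translated_lines = []
    for line in metadata_section.splitlines():
        key, separator, value = line.partition(":")
        if separator and key.strip().lower() in wanted and value.strip():
            translated_lines.append(f"{key}: {translate_text(value.strip())}")
        else:
            # Not a selected key (for example a numeric weight): keep as-is
            translated_lines.append(line)
    return "\n".join(translated_lines)

# Stand-in translator for demonstration only: upper-cases the text
demo = translate_metadata_values(
    "Title: Elasticsearch guidelines\nweight: 30", str.upper, ["Title"]
)
print(demo)
# Title: ELASTICSEARCH GUIDELINES
# weight: 30
```

Listing the translatable keys explicitly keeps values such as `weight` from being sent to the translator at all.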

    Original Markdown:

    ---
    Title: Elasticsearch guidelines
    Description: Guidelines for Elasticsearch.
    Weight: 30
    ---
    
    # Introduction
    

    After Translation:

    ---
    Title: Translated title (Elasticsearch guidelines)
    Description: Translated description (Guidelines for Elasticsearch.)
    Weight: 30
    ---
    
    # Translated Introduction
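
Because the later HTML generation depends on the keys, it may be worth checking each translated file before it moves on in the pipeline. The following is a small sketch of such a check. The helper names `front_matter_keys` and `keys_preserved` are made up for illustration, and it assumes flat `key: value` metadata as in the example above.

```python
import re

def front_matter_keys(markdown_text):
    """Return the metadata keys, in order, or [] if there is no front matter."""
    match = re.match(r"\A---[ \t]*\n(.*?)\n---[ \t]*(?:\n|\Z)", markdown_text, re.DOTALL)
    if not match:
        return []
    keys = []
    for line in match.group(1).splitlines():
        key, separator, _ = line.partition(":")
        if separator and key.strip():
            keys.append(key.strip())
    return keys

def keys_preserved(original_markdown, translated_markdown):
    """True if both documents carry exactly the same metadata keys in the same order."""
    return front_matter_keys(original_markdown) == front_matter_keys(translated_markdown)
```

A file for which `keys_preserved` returns False can be flagged or rebuilt from the original metadata instead of being passed to the HTML step.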