
Python: Compare HTML tags in the RO folder with their corresponding tags in the EN folder and write the unique tags from both files to the output


In short, I have two files: one in Romanian and its English translation. Some tags in the RO file were never translated into EN. I want an HTML output that contains every EN tag that has a corresponding RO tag, plus the RO tags that do not appear in EN.

I have these files:

   ro_file_path = r'd:\3\ro\incotro-vezi-tu-privire.html'
   en_file_path = r'd:\3\en\where-do-you-see-look.html'
   output_file_path = r'd:\3\Output\where-do-you-see-look.html'

TASK: Compare the 3 tags below, in both files.

<p class="text_obisnuit">(.*?)</p>
<p class="text_obisnuit2">(.*?)</p>
<p class="text_obisnuit"><span class="text_obisnuit2">(.*?)</span>(.*?)</p>

Requirements:

  • All tags are enclosed between <!-- ARTICOL START --> and <!-- ARTICOL FINAL -->
  • Count the tags in RO and count the tags in EN, and compare.
  • Then count the words in the tags in RO and compare with the number of words in the tags in EN.
  • Compare the HTML tags in RO with the HTML tags in EN, in order, and write the unique tags from both files to the output
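
The requirements above can be sketched roughly as follows. This is a minimal sketch, not the final solution: it assumes the markers are written exactly as `<!-- ARTICOL START -->` and `<!-- ARTICOL FINAL -->` (as in the sample files), and the helper names `article_section`, `paragraph_tags`, `word_count` and `summarize` are made up for illustration.

```python
import re

START_MARK = "<!-- ARTICOL START -->"
END_MARK = "<!-- ARTICOL FINAL -->"

# Matches <p class="text_obisnuit"> and <p class="text_obisnuit2"> paragraphs,
# with or without a nested <span>.
P_TAG = re.compile(r'<p class="text_obisnuit2?">.*?</p>', re.DOTALL)


def article_section(html):
    """Return only the text between the two article markers."""
    start = html.find(START_MARK)
    end = html.find(END_MARK)
    if start == -1 or end == -1 or end < start:
        raise ValueError("article markers are missing")
    return html[start + len(START_MARK):end]


def paragraph_tags(html):
    """List every matching <p> tag inside the article section, in order."""
    return P_TAG.findall(article_section(html))


def word_count(tag):
    """Count the words of a tag after stripping the markup."""
    text = re.sub(r"<[^>]+>", " ", tag)
    return len(text.split())


def summarize(html):
    """Return (number of tags, total number of words)."""
    tags = paragraph_tags(html)
    return len(tags), sum(word_count(t) for t in tags)


# Tiny hypothetical inputs built from lines of the sample files.
RO_SAMPLE = (
    START_MARK + "\n"
    '<p class="text_obisnuit">Iubesc sa conduc masina.</p>\n'
    '<p class="text_obisnuit2">BEE servesc o cafea 2 mai buna</p>\n'
    + END_MARK
)
EN_SAMPLE = (
    START_MARK + "\n"
    '<p class="text_obisnuit">I love driving.</p>\n'
    + END_MARK
)

print(summarize(RO_SAMPLE))  # (2, 11)
print(summarize(EN_SAMPLE))  # (1, 3)
```

Comparing the two tuples covers the tag count and the word count. It does not yet say which RO tag is the extra one.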

RO d:\3\ro\incotro-vezi-tu-privire.html

<!-- ARTICOL START --> 
<p class="text_obisnuit"><span class="text_obisnuit2">Stiu ca este dificil sa conduci la inceput, </span>dar dupa 4-5 luni inveti.</p> 
<p class="text_obisnuit2">Imi place sa merg la scoala si sa invat, mai ales in timpul saptamanii.</p> 
<p class="text_obisnuit">Sunt un bun conducator auto, dar am facut si greseli din care am invatat.</p> 
<p class="text_obisnuit">În fond, cele scrise de mine, sunt adevarate.</p> 
<p class="text_obisnuit">Iubesc sa conduc masina.</p> 
<p class="text_obisnuit"><span class="text_obisnuit2">Ma iubesti?</p> 
<p class="text_obisnuit"><span class="text_obisnuit2">Stiu ca este dificil sa conduci la inceput, </span>dar dupa 4-5 luni inveti.</p> 
<p class="text_obisnuit">Totul se repetă, chiar și ochii care nu se vad.</p> 
<p class="text_obisnuit2">BEE servesc o cafea 2 mai buna</p> 
<!-- ARTICOL FINAL -->

   

EN d:\3\en\where-do-you-see-look.html

<!-- ARTICOL START -->
<p class="text_obisnuit2">I like going to school and learning, especially during the week.</p>
<p class="text_obisnuit">I'm a good driver, but I've also made mistakes that I've learned from.</p>
<p class="text_obisnuit">Basically, what I wrote is true.</p>
<p class="text_obisnuit">I love driving.</p>
<p class="text_obisnuit"><span class="text_obisinuit2">I know it's difficult to drive at first, </span> but after 4-5 months you learn.</p>
<p class="text_obisnuit">Everything is repeated, even the eyes that can't see.</p>
<!-- ARTICOL FINAL -->

Expected OUTPUT: d:\3\Output\where-do-you-see-look.html

<!-- ARTICOL START -->
<p class="text_obisnuit"><span class="text_obisnuit2">Stiu ca este dificil sa conduci la inceput, </span> dar dupa 4-5 luni inveti.</p> 
<p class="text_obisnuit2">I like going to school and learning, especially during the week.</p>
<p class="text_obisnuit">I'm a good driver, but I've also made mistakes that I've learned from.</p>
<p class="text_obisnuit">Basically, what I wrote is true.</p>
<p class="text_obisnuit"><span class="text_obisnuit2">Ma iubesti?</p> 
<p class="text_obisnuit">I love driving.</p>
<p class="text_obisnuit"><span class="text_obisinuit2">I know it's difficult to drive at first, </span> but after 4-5 months you learn.</p>
<p class="text_obisnuit">Everything is repeated, even the eyes that can't see.</p>
<p class="text_obisnuit2">BEE servesc o cafea 2 mai buna</p> 
<!-- ARTICOL FINAL -->

The Python code must compare the HTML tags in RO with the HTML tags in EN and write the unique tags from both files to the output, bearing in mind that most RO tags have a corresponding translation among the EN tags. The point is that the code must also find the RO tags that were left untranslated in EN.

Here is how I approached the solution in Python. I followed a simple calculation.

First method:

First, count all the tags in RO, then all the tags in EN. Next, record the type of each tag in RO and in EN. Then count the words in each RO tag and in each EN tag. Keep in mind that the same tag can appear twice on different lines, as it does in RO. Finally, compute the difference: how many tags does RO have beyond those in EN?
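
As a rough illustration of this counting step, the sketch below tallies the tags per type and subtracts the EN tally from the RO tally. The function names `tag_type` and `surplus_by_type` and the three type labels are assumptions chosen for the example, and the inputs are a handful of tags copied from the sample files.

```python
from collections import Counter


def tag_type(tag):
    """Classify a <p> tag into one of the three kinds from the task."""
    if "<span" in tag:
        return "span"
    if tag.startswith('<p class="text_obisnuit2"'):
        return "text_obisnuit2"
    return "text_obisnuit"


def surplus_by_type(ro_tags, en_tags):
    """Return how many tags of each type RO has beyond EN."""
    ro_counts = Counter(tag_type(t) for t in ro_tags)
    en_counts = Counter(tag_type(t) for t in en_tags)
    # Counter subtraction keeps only the strictly positive differences.
    return ro_counts - en_counts


ro_tags = [
    '<p class="text_obisnuit"><span class="text_obisnuit2">Stiu ca este dificil sa conduci la inceput, </span>dar dupa 4-5 luni inveti.</p>',
    '<p class="text_obisnuit2">Imi place sa merg la scoala si sa invat, mai ales in timpul saptamanii.</p>',
    '<p class="text_obisnuit">Iubesc sa conduc masina.</p>',
    '<p class="text_obisnuit2">BEE servesc o cafea 2 mai buna</p>',
]
en_tags = [
    '<p class="text_obisnuit2">I like going to school and learning, especially during the week.</p>',
    '<p class="text_obisnuit">I love driving.</p>',
]

print(dict(surplus_by_type(ro_tags, en_tags)))
```

The tally says how many tags of each type are missing from EN, but not which ones, so on its own it cannot place them in the output.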

The second method, to verify the output, is to take a screenshot. Compare the whole RO part and the whole EN part separately through OCR, then go line by line to see which RO tags are extra compared with the EN tags.

PYTHON CODE:

import re

def extract_tags(content):
    start = content.find('<!-- ARTICOL START -->')
    end = content.find('<!-- ARTICOL FINAL -->')
    if start == -1 or end == -1:
        raise ValueError("Marcajele 'ARTICOL START' sau 'ARTICOL FINAL' lipsesc.")

    section_content = content[start:end]
    pattern = re.compile(r'<p class="text_obisnuit(?:2)?">(?:<span class="text_obisnuit2">)?.*?</p>', re.DOTALL)
    tags = []

    for idx, match in enumerate(pattern.finditer(section_content), 1):
        tag = match.group(0)
        text = re.sub(r'<[^>]+>', '', tag).strip()

        if '<span class="text_obisnuit2">' in tag or '<span class="text_obisinuit2">' in tag:
            tag_type = 'span'
        elif 'class="text_obisnuit2"' in tag:
            tag_type = 'text_obisnuit2'
        else:
            tag_type = 'text_obisnuit'

        tags.append({
            'index': idx,
            'tag': tag,
            'text': text,
            'type': tag_type,
            'word_count': len(text.split())
        })
    return tags

def find_matching_pairs(ro_tags, en_tags):
    matched_indices = set()
    used_en = set()

    for i, ro_tag in enumerate(ro_tags):
        for j, en_tag in enumerate(en_tags):
            if j in used_en:
                continue

            if ro_tag['type'] == en_tag['type']:
                word_diff = abs(ro_tag['word_count'] - en_tag['word_count'])
                if word_diff <= 3:
                    matched_indices.add(i)
                    used_en.add(j)
                    break
    return matched_indices

def fix_duplicates(output_content, ro_content):
    """Fix the position of duplicate tags."""
    ro_tags = extract_tags(ro_content)
    output_tags = extract_tags(output_content)

    # Find the tags that appear in both RO and the output
    for ro_idx, ro_tag in enumerate(ro_tags):
        for out_idx, out_tag in enumerate(output_tags):
            if ro_tag['tag'] == out_tag['tag'] and ro_idx != out_idx:
                # Found a tag that appears at different positions
                # Check whether this is a duplicate that has to be moved
                ro_lines = ro_content.split('\n')
                out_lines = output_content.split('\n')

                if ro_tag['tag'] in ro_lines[ro_idx+1] and out_tag['tag'] in out_lines[out_idx+1]:
                    # Move the tag to the correct position
                    out_lines.remove(out_tag['tag'])
                    out_lines.insert(ro_idx+1, out_tag['tag'])
                    output_content = '\n'.join(out_lines)
                    break

    return output_content

def generate_output(ro_tags, en_tags, original_content):
    start = original_content.find('<!-- ARTICOL START -->')
    end = original_content.find('<!-- ARTICOL FINAL -->')
    if start == -1 or end == -1:
        raise ValueError("Marcajele 'ARTICOL START' sau 'ARTICOL FINAL' lipsesc.")

    output_content = original_content[:start + len('<!-- ARTICOL START -->')] + "\n"
    matched_indices = find_matching_pairs(ro_tags, en_tags)
    en_index = 0

    for i, ro_tag in enumerate(ro_tags):
        if i in matched_indices:
            output_content += en_tags[en_index]['tag'] + "\n"
            en_index += 1
        else:
            output_content += ro_tag['tag'] + "\n"

    while en_index < len(en_tags):
        output_content += en_tags[en_index]['tag'] + "\n"
        en_index += 1

    output_content += original_content[end:]
    return output_content

def main():
    try:
        ro_file_path = r'd:\3\ro\incotro-vezi-tu-privire.html'
        en_file_path = r'd:\3\en\where-do-you-see-look.html'
        output_file_path = r'd:\3\Output\where-do-you-see-look.html'

        with open(ro_file_path, 'r', encoding='utf-8') as ro_file:
            ro_content = ro_file.read()
        with open(en_file_path, 'r', encoding='utf-8') as en_file:
            en_content = en_file.read()

        ro_tags = extract_tags(ro_content)
        en_tags = extract_tags(en_content)

        # Generate the initial output
        initial_output = generate_output(ro_tags, en_tags, en_content)

        # Fix the positions of duplicate tags
        final_output = fix_duplicates(initial_output, ro_content)

        with open(output_file_path, 'w', encoding='utf-8') as output_file:
            output_file.write(final_output)

        print(f"Output-ul a fost generat la {output_file_path}")

    except Exception as e:
        print(f"Eroare: {str(e)}")

if __name__ == "__main__":
    main()

My Python code almost works, but not quite. The problem appears when I add other tags to the RO file, such as:

<!-- ARTICOL START --> 
<p class="text_obisnuit">Laptopul meu este de culoare neagra.</p>
<p class="text_obisnuit2">Imi place sa merg la scoala si sa invat, mai ales in timpul saptamanii.</p> 
<p class="text_obisnuit">Sunt un bun conducator auto, dar am facut si greseli din care am invatat.</p> 
<p class="text_obisnuit"><span class="text_obisnuit2">Stiu ca este dificil sa conduci la inceput, </span>dar dupa 4-5 luni inveti.</p>
<p class="text_obisnuit">În fond, cele scrise de mine, sunt adevarate.</p> 
<p class="text_obisnuit">Iubesc sa conduc masina.</p> 

<p class="text_obisnuit"><span class="text_obisnuit2">Stiu ca este dificil sa conduci la inceput, </span>dar dupa 4-5 luni inveti.</p>
<p class="text_obisnuit">Totul se repetă, chiar și ochii care nu se vad.</p> 

<!-- ARTICOL FINAL -->

Solution

  • SECOND, and the BEST SOLUTION.

    I finally solved the problem, but not with ChatGPT or Claude. No other AI found the solution either, because none of them knew how to approach it.

    In fact, the key to this problem is to assign identifiers to each tag and then do multiple searches.

    ChatGPT or Claude, or other AIs, will have to seriously consider this type of solution for such problems.

    Here are the specifications, that is, the way I thought about solving the problem. It is a different way of thinking about parsing.

    https://pastebin.com/as2yw1UQ

    The Python code was written by a friend of mine. I thought out the solution, and he wrote the code:

    from bs4 import BeautifulSoup
    import re
    
    def count_words(text):
        """Count the words in a text."""
        return len(text.strip().split())
    
    def get_greek_identifier(word_count):
        """Return the Greek identifier based on the word count."""
        if word_count < 7:
            return 'α'
        elif word_count <= 14:
            return 'β'
        else:
            return 'γ'
    
    def get_tag_type(tag):
        """Return the type of the tag (A, B, or C)."""
        if tag.find('span'):
            return 'A'
        elif 'text_obisnuit2' in tag.get('class', []):
            return 'B'
        return 'C'
    
    def analyze_tags(content):
        """Analyze the tags and return information about each one."""
        soup = BeautifulSoup(content, 'html.parser')
        tags_info = []
    
        article_content = re.search(r'<!-- ARTICOL START -->(.*?)<!-- ARTICOL FINAL -->',
                                  content, re.DOTALL)
    
        if article_content:
            content = article_content.group(1)
            soup = BeautifulSoup(content, 'html.parser')
    
        for i, tag in enumerate(soup.find_all('p', recursive=False)):
            text_content = tag.get_text(strip=True)
            tag_type = get_tag_type(tag)
            word_count = count_words(text_content)
            greek_id = get_greek_identifier(word_count)
    
            tags_info.append({
                'number': i + 1,
                'type': tag_type,
                'greek': greek_id,
                'content': str(tag),
                'text': text_content
            })
    
        return tags_info
    
    def compare_tags(ro_tags, en_tags):
        """Compare the tags and find the differences."""
        wrong_tags = []
        i = 0
        j = 0
    
        while i < len(ro_tags):
            ro_tag = ro_tags[i]
            if j >= len(en_tags):
                wrong_tags.append(ro_tag)
                i += 1
                continue
    
            en_tag = en_tags[j]
    
            if ro_tag['type'] != en_tag['type']:
                wrong_tags.append(ro_tag)
                i += 1
                continue
    
            i += 1
            j += 1
    
        return wrong_tags
    
    def format_results(wrong_tags):
        """Format the results for display and saving."""
        type_counts = {'A': 0, 'B': 0, 'C': 0}
        type_content = {'A': [], 'B': [], 'C': []}
    
        for tag in wrong_tags:
            type_counts[tag['type']] += 1
            type_content[tag['type']].append(tag['content'])
    
        # Build the formatted result
        result = []
    
        # First line, with the summary
        summary_parts = []
        for tag_type in ['A', 'B', 'C']:
            if type_counts[tag_type] > 0:
                summary_parts.append(f"{type_counts[tag_type]} taguri de tipul ({tag_type})")
        result.append("In RO exista in plus fata de EN urmatoarele: " + " si ".join(summary_parts))
    
        # Details for each tag type
        for tag_type in ['A', 'B', 'C']:
            if type_counts[tag_type] > 0:
                result.append(f"\n{type_counts[tag_type]}({tag_type}) adica asta {'taguri' if type_counts[tag_type] > 1 else 'tag'}:")
                for content in type_content[tag_type]:
                    result.append(content)
                result.append("")  # Blank line as a separator
    
        return "\n".join(result)
    
    def merge_content(ro_tags, en_tags, wrong_tags):
        """Merge the RO and EN content, inserting the wrong tags at their original positions."""
        merged_tags = []
    
        # Build a dictionary of the wrong tags, keyed by their original number
        wrong_dict = {tag['number']: tag for tag in wrong_tags}
    
        # Walk through the positions and decide which tag goes in each one
        current_en_idx = 0
        for i in range(max(len(ro_tags), len(en_tags))):
            position = i + 1
    
            # Check whether this position belongs to a wrong tag
            if position in wrong_dict:
                merged_tags.append(wrong_dict[position]['content'])
            elif current_en_idx < len(en_tags):
                merged_tags.append(en_tags[current_en_idx]['content'])
                current_en_idx += 1
    
        return merged_tags
    
    def save_results(merged_content, results, output_path):
        """Save the merged content and the results to the output file."""
        final_content = '<!-- REZULTATE ANALIZA -->\n'
        final_content += '<!-- ARTICOL START -->\n'
    
        # Add the merged content
        for tag in merged_content:
            final_content += tag + '\n'
    
        final_content += '<!-- ARTICOL FINAL -->\n'
        final_content += '<!-- FINAL REZULTATE ANALIZA -->\n'
    
        # Add the analysis results
        final_content += results
    
        # Save to the file
        with open(output_path, 'w', encoding='utf-8') as file:
            file.write(final_content)
    
    # Read the files
    with open(r'd:/3/ro/incotro-vezi-tu-privire.html', 'r', encoding='utf-8') as file:
        ro_content = file.read()
    
    with open(r'd:/3/en/where-do-you-see-look.html', 'r', encoding='utf-8') as file:
        en_content = file.read()
    
    # Set the path of the output file
    output_path = r'd:/3/Output/where-do-you-see-look.html'
    
    # Analyze the tags
    ro_tags = analyze_tags(ro_content)
    en_tags = analyze_tags(en_content)
    
    # Find the differences
    wrong_tags = compare_tags(ro_tags, en_tags)
    
    # Format the results
    results = format_results(wrong_tags)
    
    # Build the merged content
    merged_content = merge_content(ro_tags, en_tags, wrong_tags)
    
    # Print the results to the console
    print(results)
    
    # Save the results to the output file
    save_results(merged_content, results, output_path)
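
The idea of giving each tag an identifier and then searching for matches can also be tried in isolation. The sketch below is not the code above: it swaps the hand-written loop for `difflib.SequenceMatcher` from the standard library, and it uses only the type letters A, B, C as identifiers. The function names `type_letter` and `unmatched_ro_positions` are assumptions made up for this example.

```python
from difflib import SequenceMatcher


def type_letter(tag_html):
    """A = paragraph containing a <span>, B = text_obisnuit2, C = text_obisnuit."""
    if "<span" in tag_html:
        return "A"
    if tag_html.startswith('<p class="text_obisnuit2"'):
        return "B"
    return "C"


def unmatched_ro_positions(ro_ids, en_ids):
    """Return the 0-based positions of RO identifiers with no EN partner."""
    matcher = SequenceMatcher(None, ro_ids, en_ids, autojunk=False)
    matched = set()
    for block in matcher.get_matching_blocks():
        # Each block says ro_ids[a:a+size] == en_ids[b:b+size].
        matched.update(range(block.a, block.a + block.size))
    return [i for i in range(len(ro_ids)) if i not in matched]


# Type letters of the first RO and EN samples from the question, in order.
ro_ids = list("ABCCCAACB")
en_ids = list("BCCCAC")

print(unmatched_ro_positions(ro_ids, en_ids))  # [0, 6, 8]
```

With type letters alone, the two neighbouring span paragraphs at positions 5 and 6 look identical, so the alignment reports position 6 (the repeated "Stiu ca..." line) rather than position 5 ("Ma iubesti?"). A finer identifier, for example the type letter combined with a word-count bucket such as α/β/γ, would be needed to tell them apart.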