Converting multi-language date formats to "%Y-%m-%d"


I'm scraping references from the bottom of Wikipedia pages. Each reference contains an OpenURL span that I can parse. Here's an example:

<span 
    title="ctx_ver=Z39.88-2004&amp;
    rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&amp;
    rft.genre=unknown&amp;
    rft.jtitle=The+Tennessean&amp;
    rft.atitle=Belmont+University+awarded+final+2020+presidential+debate&amp;
    rft.date=2019-10-11&amp;
    rft.aulast=Tamburin&amp;
    rft.aufirst=Adam&amp;
    rft_id=https%3A%2F%2Fwww.tennessean.com%2Fstory%2Fnews%2F2019%2F10%2F11%2Fbelmont-university-nashville-hosts-presidential-debate-2020%2F3941983002%2F&amp;
    rfr_id=info%3Asid%2Fen.wikipedia.org%3A2020+United+States+presidential+election" 

    class="Z3988">
</span>
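
For reference, pulling rft.date out of that title attribute could look roughly like the sketch below. It assumes the attribute text is an HTML-escaped query string, as in the example above; the helper name extract_rft_date is made up for illustration.

```python
from html import unescape
from urllib.parse import parse_qs


def extract_rft_date(title_attr):
    """Return the rft.date value from a span's title attribute, or None.

    Hypothetical helper: assumes the title is an HTML-escaped query string.
    """
    # Turn "&amp;" back into "&" and drop the line breaks and indentation
    # that the example uses for readability. Real values are percent-encoded,
    # so they contain no raw whitespace of their own.
    query = "".join(unescape(title_attr).split())
    # parse_qs decodes "+" and "%xx" sequences and maps each key to a list.
    values = parse_qs(query).get("rft.date")
    return values[0] if values else None


sample = "rft.jtitle=The+Tennessean&amp;rft.date=2019-10-11&amp;rft.aulast=Tamburin"
print(extract_rft_date(sample))  # 2019-10-11
```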

I'm successfully obtaining the rft.date value. However, the format of the value varies. I'm attempting to do two things:

  1. 'Guess' the language and translate it (if possible)
  2. Identify the format and reformat to "%Y-%m-%d"

Without the language issue, I would be able to use dateutil (see halfway down the page). However, the language issue stumps me completely.

Does anyone have any suggestions for how to deal with the translation in examples like these?

0 "մայիսի 8, 2019"
1 "մայիսի 6, 2019"
2 "մայիսի 10, 2019"
3 "June 20, 2019"
4 "January 16, 2019"
5 "Aug 8, 2019"
6 "Aug 4, 2019"
...
12 "9 August 2019"
13 "8 May 2019"
14 "8 July 2020"
15 "8 July 2019"
16 "8 January 2020"
17 "8 de enero de 2020"
18 "7 tháng 8 năm 2019"
19 "7 May 2020"
...
33 "31 de diciembre de 2019"
...
40 "28 December 2019"
41 "28 de diciembre de 2019"
42 "27 de septiembre de 2019"
43 "26 November 2019"
44 "25 tháng 6 năm 2019"
45 "25 May 2019"
46 "25 March 2020"
47 "25 June 2019"
48 "24 June 2019"
49 "23 July 2019"
50 "22 tháng 7 năm 2019"
51 "22 July 2020"
52 "22 de abril de 2019"
53 "21 August 2019"
54 "2020-10-18"
55 "2020-09-21"
56 "2020-09-19"
57 "2020-09-16"

Solution

  • You could use the googletrans Python library to achieve your goal. I tried it locally and it seems to work well.

    Here is the code:

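    import pandas as pd
    from googletrans import Translator
    
    # Assumes a googletrans release whose translate() call is synchronous.
    translator = Translator()
    
    # The sample file is space-separated: an index column, then the quoted date.
    df = pd.read_csv('input_file.tsv', sep=' ', header=None, index_col=0)
    df.columns = ['date']
    
    # Auto-detect each source language and translate to English (one request per row).
    df['translated'] = df['date'].map(lambda x: translator.translate(x, dest='en').text)
    print(df)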
    

    Output:

                            date         translated                                             
    0             մայիսի 8, 2019        May 8, 2019
    1             մայիսի 6, 2019        May 6, 2019
    2            մայիսի 10, 2019       May 10, 2019
    3              June 20, 2019      June 20, 2019
    4           January 16, 2019   January 16, 2019
    5                Aug 8, 2019        Aug 8, 2019
    6                Aug 4, 2019        Aug 4, 2019
    12             9 August 2019      9 August 2019
    13                8 May 2019         8 May 2019
    14               8 July 2020        8 July 2020
    15               8 July 2019        8 July 2019
    16            8 January 2020     8 January 2020
    17        8 de enero de 2020    January 8, 2020
    18        7 tháng 8 năm 2019     August 7, 2019
    19                7 May 2020         7 May 2020
    33   31 de diciembre de 2019  December 31, 2019
    40          28 December 2019   28 December 2019
    41   28 de diciembre de 2019       Dec 28, 2019
    42  27 de septiembre de 2019       Sep 27, 2019
    43          26 November 2019   26 November 2019
    44       25 tháng 6 năm 2019      June 25, 2019
    45               25 May 2019        25 May 2019
    46             25 March 2020      25 March 2020
    47              25 June 2019       25 June 2019
    48              24 June 2019       24 June 2019
    49              23 July 2019       23 July 2019
    50       22 tháng 7 năm 2019      July 22, 2019
    51              22 July 2020       22 July 2020
    52       22 de abril de 2019       Apr 22, 2019
    53            21 August 2019     21 August 2019
    54                2020-10-18         2020-10-18
    55                2020-09-21         2020-09-21
    56                2020-09-19         2020-09-19
    57                2020-09-16         2020-09-16
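
The translated column is now all English, but still in mixed layouts ("May 8, 2019", "8 May 2019", "2020-10-18"), so the second step from the question remains. A minimal sketch of that step, assuming python-dateutil is installed and every string carries a day, month, and year; the helper name to_iso_date is hypothetical:

```python
from dateutil import parser


def to_iso_date(text):
    """Parse an English date string and return it as YYYY-MM-DD, or None.

    Hypothetical helper: strings dateutil cannot parse (for example dates
    that were left untranslated) come back as None instead of raising.
    """
    try:
        return parser.parse(text).strftime("%Y-%m-%d")
    except (ValueError, OverflowError):
        # dateutil raises a ValueError subclass for unknown formats.
        return None


for raw in ["May 8, 2019", "9 August 2019", "Dec 28, 2019", "2020-10-18"]:
    print(raw, "->", to_iso_date(raw))
```

Applied to the DataFrame above, this would be something like df['translated'].map(to_iso_date). Rows that come back as None are worth inspecting by hand, since they point at translations that did not produce a recognizable English date.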