I'm scraping references from the bottom of Wikipedia pages. Each reference contains an OpenURL string that I can parse. Here's an example:
<span
title="ctx_ver=Z39.88-2004&
rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&
rft.genre=unknown&
rft.jtitle=The+Tennessean&
rft.atitle=Belmont+University+awarded+final+2020+presidential+debate&
rft.date=2019-10-11&
rft.aulast=Tamburin&
rft.aufirst=Adam&
rft_id=https%3A%2F%2Fwww.tennessean.com%2Fstory%2Fnews%2F2019%2F10%2F11%2Fbelmont-university-nashville-hosts-presidential-debate-2020%2F3941983002%2F&
rfr_id=info%3Asid%2Fen.wikipedia.org%3A2020+United+States+presidential+election"
class="Z3988">
</span>
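For context, the title attribute is an ordinary URL-encoded query string, so the standard library can split it. This is a minimal sketch, assuming the attribute value has already been pulled out of the HTML as a plain string with the line breaks removed; the helper name `openurl_field` is made up for illustration.

```python
from typing import Optional
from urllib.parse import parse_qs

# A shortened copy of the title attribute from the example above.
title = (
    "ctx_ver=Z39.88-2004"
    "&rft.genre=unknown"
    "&rft.jtitle=The+Tennessean"
    "&rft.date=2019-10-11"
    "&rft.aulast=Tamburin"
    "&rft.aufirst=Adam"
)

def openurl_field(title: str, key: str) -> Optional[str]:
    """Return the first value stored under `key`, or None if the key is absent."""
    # parse_qs decodes %XX escapes and turns '+' into spaces.
    values = parse_qs(title).get(key)
    return values[0] if values else None

print(openurl_field(title, "rft.date"))
```

If the string is taken from raw HTML rather than from a parser, the `&` separators may still be escaped as `&amp;` and would need unescaping first.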
I'm successfully obtaining the rft.date value. However, the format of the value varies. I'm attempting to do two things: translate the dates into English, and convert them to the "%Y-%m-%d" format.
Without the language issue, I would be able to use dateutil (see halfway down the page). However, the language issue stumps me completely.
Does anyone have any suggestions for how to deal with the translation on examples like this?
0 "մայիսի 8, 2019"
1 "մայիսի 6, 2019"
2 "մայիսի 10, 2019"
3 "June 20, 2019"
4 "January 16, 2019"
5 "Aug 8, 2019"
6 "Aug 4, 2019"
...
12 "9 August 2019"
13 "8 May 2019"
14 "8 July 2020"
15 "8 July 2019"
16 "8 January 2020"
17 "8 de enero de 2020"
18 "7 tháng 8 năm 2019"
19 "7 May 2020"
...
33 "31 de diciembre de 2019"
...
40 "28 December 2019"
41 "28 de diciembre de 2019"
42 "27 de septiembre de 2019"
43 "26 November 2019"
44 "25 tháng 6 năm 2019"
45 "25 May 2019"
46 "25 March 2020"
47 "25 June 2019"
48 "24 June 2019"
49 "23 July 2019"
50 "22 tháng 7 năm 2019"
51 "22 July 2020"
52 "22 de abril de 2019"
53 "21 August 2019"
54 "2020-10-18"
55 "2020-09-21"
56 "2020-09-19"
57 "2020-09-16"
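For the English and ISO rows in the sample above, the conversion to "%Y-%m-%d" is the easy part. Here is a stdlib-only sketch that uses `datetime.strptime` with an explicit format list rather than dateutil. The list `KNOWN_FORMATS` and the function name `normalize_date` are assumptions for illustration and cover only the shapes visible in the sample.

```python
from datetime import datetime

# Assumed format list: only the English and ISO shapes seen in the sample data.
# %B and %b match English month names under the default C/English locale.
KNOWN_FORMATS = ("%Y-%m-%d", "%d %B %Y", "%B %d, %Y", "%b %d, %Y")

def normalize_date(raw: str) -> str:
    """Try each known format in turn and return the date as YYYY-MM-DD."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue  # this format did not fit, try the next one
    raise ValueError(f"unrecognized date format: {raw!r}")

for sample in ("June 20, 2019", "Aug 8, 2019", "9 August 2019", "2020-10-18"):
    print(sample, "->", normalize_date(sample))
```

`dateutil.parser.parse` accepts these strings without an explicit format list. Either way, the non-English rows fail until the month names are dealt with.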
You could use the googletrans Python library to achieve your goal. I tried it locally on your sample data and it seems to work well.
Here is the code:
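import pandas as pd
from googletrans import Translator

translator = Translator()

# Read the index/date pairs; the file is tab-separated and has no header row.
df = pd.read_csv('input_file.tsv', sep='\t', header=None, index_col=0)
df.columns = ['date']

# Translate each date string to English (googletrans targets English by default).
df['translated'] = df['date'].map(lambda x: translator.translate(x).text)
print(df)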
Output:
    date                      translated
 0  մայիսի 8, 2019            May 8, 2019
 1  մայիսի 6, 2019            May 6, 2019
 2  մայիսի 10, 2019           May 10, 2019
 3  June 20, 2019             June 20, 2019
 4  January 16, 2019          January 16, 2019
 5  Aug 8, 2019               Aug 8, 2019
 6  Aug 4, 2019               Aug 4, 2019
12  9 August 2019             9 August 2019
13  8 May 2019                8 May 2019
14  8 July 2020               8 July 2020
15  8 July 2019               8 July 2019
16  8 January 2020            8 January 2020
17  8 de enero de 2020        January 8, 2020
18  7 tháng 8 năm 2019        August 7, 2019
19  7 May 2020                7 May 2020
33  31 de diciembre de 2019   December 31, 2019
40  28 December 2019          28 December 2019
41  28 de diciembre de 2019   Dec 28, 2019
42  27 de septiembre de 2019  Sep 27, 2019
43  26 November 2019          26 November 2019
44  25 tháng 6 năm 2019       June 25, 2019
45  25 May 2019               25 May 2019
46  25 March 2020             25 March 2020
47  25 June 2019              25 June 2019
48  24 June 2019              24 June 2019
49  23 July 2019              23 July 2019
50  22 tháng 7 năm 2019       July 22, 2019
51  22 July 2020              22 July 2020
52  22 de abril de 2019       Apr 22, 2019
53  21 August 2019            21 August 2019
54  2020-10-18                2020-10-18
55  2020-09-21                2020-09-21
56  2020-09-19                2020-09-19
57  2020-09-16                2020-09-16
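If one translation request per row is a concern, an alternative is a small local lookup table. To be clear, this is a different technique from the googletrans code above, not a description of what it does. The sketch below only knows the month words that appear in the sample rows; the names `MONTH_WORDS` and `to_english_date` are hypothetical, and any other language or month would need its own entries.

```python
import re
import unicodedata

# Month words taken only from the sample rows; anything else is left untouched.
MONTH_WORDS = {
    "մայիսի": "May",            # Armenian, in the form it appears in the data
    "enero": "January",         # Spanish
    "abril": "April",
    "septiembre": "September",
    "diciembre": "December",
}

# Vietnamese rows use a numeric month, e.g. "7 tháng 8 năm 2019".
VI_PATTERN = re.compile(
    unicodedata.normalize("NFC", r"(\d{1,2}) tháng (\d{1,2}) năm (\d{4})")
)

def to_english_date(raw: str) -> str:
    """Rewrite the known non-English shapes; return other strings unchanged."""
    # Normalize so composed and decomposed accents compare equal.
    text = unicodedata.normalize("NFC", raw.strip())
    match = VI_PATTERN.fullmatch(text)
    if match:
        day, month, year = (int(group) for group in match.groups())
        return f"{year:04d}-{month:02d}-{day:02d}"
    words = []
    for word in text.split():
        if word.lower() == "de":  # Spanish connector, carries no date information
            continue
        words.append(MONTH_WORDS.get(word.lower(), word))
    return " ".join(words)

for sample in ("8 de enero de 2020", "մայիսի 8, 2019", "7 tháng 8 năm 2019"):
    print(sample, "->", to_english_date(sample))
```

The Spanish and Armenian results are English date strings that a date parser can then handle, while the Vietnamese branch already returns the "%Y-%m-%d" form. The trade-off is maintenance: the table has to grow with every new language, whereas a translation service handles unseen languages at the cost of a network call.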