
Format text during Chinese to English conversion


I'm using Python 3.6.8 and cannot switch to another version. I need to convert a pandas column from Chinese to English; around 20% of the column is Chinese text. Due to client requirements, I cannot use a translation API or a library like Google Translate; instead, I must use the pinyin package.

So I wrote the following code:

import pinyin

df['Pinyin_Text'] = df['Chinese_Text'].apply(lambda text: pinyin.get(text, format="strip", delimiter=" "))

However, my Pinyin_Text field contains a phonetic transcription. I would like it to contain plain alphabetic text instead. Can you suggest how I can achieve that?


Solution

  • You can achieve this by passing style=Style.NORMAL to the pinyin() function of the pypinyin library. Note that pypinyin is a different package from the pinyin package used in the question, so you may need to adjust your setup if you have a different library or version installed. This example uses pypinyin 0.53:

    pip install pypinyin
    

    The following code produces the desired result, joining the syllables with the delimiter " ":

    # Tested with pypinyin 0.53
    from pypinyin import pinyin, Style
    
    chinese_text = "你好 你好,世界 你好"
    # pinyin() returns a list of lists, one inner list per syllable or non-Chinese chunk
    pinyin_syllables = pinyin(chinese_text, style=Style.NORMAL)
    # Take the first reading of each entry and join with spaces
    pinyin_plain = ' '.join([word[0] for word in pinyin_syllables])
    
    print(pinyin_plain)
    

    The result is the pronunciation written in plain Latin letters, without tone marks, instead of phonetic symbols. I did not test it with numbers, but this should do the task.
