python html file lua speech-synthesis

Loop through a page of voice samples, downloading each sample and putting the lines into a text file


The page I am working with is http://theportalwiki.com/wiki/GLaDOS_voice_lines, which lists the voice lines of GLaDOS from Portal. Each line is the inner text of an "i" HTML element and is shown in quotes on the page, with a direct download link labeled "Download" beside it. I want to feed the voice lines into the MARY TTS voice synthesizer in one of two formats: either every line in its own text file, named after the matching wav file, or all lines in one text file formatted as ( filename "insert line here" ).

I was trying to do this myself but I've already spent 4 hours on it and have gotten only a small piece of Python code that doesn't work.

from bs4 import BeautifulSoup
import re
import urllib.request
soup = BeautifulSoup(urllib.request.urlopen("http://theportalwiki.com/wiki/GLaDOS_voice_lines"), "html.parser")
tags = soup.find_all('i')
f = open('Lines.txt', 'w')
for t in range(len(tags)):
    f.write(tags[t] + '\n')

f.close()

It raises "TypeError: unsupported operand type(s) for +: 'Tag' and 'str'".

I also tried AutoHotkey:

^g::

IEGet(Name="")        ;Retrieve pointer to existing IE window/tab
{
    IfEqual, Name,, WinGetTitle, Name, ahk_class IEFrame
        Name := ( Name="New Tab - Windows Internet Explorer" ) ? "about:Tabs"
        : RegExReplace( Name, " - (Windows|Microsoft) Internet Explorer" )
    For wb in ComObjCreate( "Shell.Application" ).Windows
        If ( wb.LocationName = Name ) && InStr( wb.FullName, "iexplore.exe" )
            Return wb
} ;written by Jethrow

wb := IEGet()

IELoad(wb)    ;You need to send the IE handle to the function unless you define it as global.
{
    If !wb    ;If wb is not a valid pointer then quit
        Return False
    Loop    ;Otherwise sleep for .1 seconds until the page starts loading
        Sleep,100
    Until (wb.busy)
    Loop    ;Once it starts loading wait until completes
        Sleep,100
    Until (!wb.busy)
    Loop    ;optional check to wait for the page to completely load
        Sleep,100
    Until (wb.Document.Readystate = "Complete")
Return True
}

For IE in ComObjCreate("Shell.Application").Windows ; for each open window
If InStr(IE.FullName, "iexplore.exe") ; check if it's an ie window
break ; keep that window's handle
; this assumes an ie window is available. it won't work if not

IE.Navigate("http://theportalwiki.com/wiki/GLaDOS_voice_lines")
While IE.Busy
    Sleep, 100
Links := IE.Document.Links

Inner := FileOpen("C:\Users\Johnson\Desktop\GLaDOS Voice", "w")
Rows := IE.Document.All.Tags("table")[4].Rows
    Loop % Rows.Length
        Inner.Write(Row[A_Index].InnerText . "`r`n")

Inner.Close()
Return

As far as I can tell, the AutoHotkey script does absolutely nothing. I press the hotkey and nothing happens.

I'd prefer Lua because it's consistent and I understand it.


Solution

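  • Your Python code is very close to working. find_all returns Tag objects, which cannot be concatenated with a str, so write each tag's .text instead. The minor fix (plus a context manager for the file) is below: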

    from bs4 import BeautifulSoup
    import urllib.request
    soup = BeautifulSoup(urllib.request.urlopen("http://theportalwiki.com/wiki/GLaDOS_voice_lines"), "html.parser")
    tags = soup.find_all('i')
    with open('Lines.txt', 'w') as f:
        for tag in tags:
            # .text is the tag's inner string; strip the curly quotes around each line
            f.write(tag.text.strip('“”') + '\n')
    

    Lines.txt:

    You just have to look at things objectively, see what you don't need anymore, and trim out the fat.
    Portal
    Portal 2
    
    Hello and, again, welcome to the Aperture Science computer-aided enrichment center.
    ...
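
The sample output includes entries such as Portal and Portal 2 because find_all('i') returns every italic element, not only the voice lines. Below is a minimal sketch of one way to filter them. It assumes every voice line is wrapped in curly quotes on the page, which the strip('“”') call suggests but which has not been verified here. spoken_line and voice_lines are hypothetical helper names.

```python
def spoken_line(text):
    """Return the text inside curly quotes, or None if text is not a quoted line."""
    text = text.strip()
    # Assumption: a voice line starts with “ and ends with ”.
    if len(text) > 2 and text.startswith('“') and text.endswith('”'):
        return text[1:-1]
    return None


def voice_lines(texts):
    """Keep only the quoted entries from an iterable of <i> texts."""
    return [line for line in map(spoken_line, texts) if line is not None]
```

With the tags list from the code above, voice_lines(tag.text for tag in tags) would keep only the quoted entries.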
    

    EDIT

    To answer the question in the comment below, this should get the download links:

    from bs4 import BeautifulSoup
    import urllib.request
    soup = BeautifulSoup(urllib.request.urlopen("http://theportalwiki.com/wiki/GLaDOS_voice_lines"), "html.parser")
    tags = soup.find_all('a')
    with open('Downloads.txt', 'w') as f:
        for tag in tags:
            if tag.text == 'Download':
                f.write(tag['href'] + '\n')
    

    Downloads.txt:

    http://i1.theportalwiki.net/img/e/e5/GLaDOS_00_part1_entry-1.wav
    http://i1.theportalwiki.net/img/d/d7/GLaDOS_00_part1_entry-2.wav
    http://i1.theportalwiki.net/img/5/50/GLaDOS_00_part1_entry-3.wav
    ...
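
Neither listing pairs a line with its file, which is what the ( filename "insert line here" ) format from the question needs. The following is a rough sketch of one way to do that and to download the samples. It assumes each Download link sits in the same li element as the i element holding its line, which has not been checked against the live page. file_name, format_entry, pair_links_with_lines, and scrape_and_download are hypothetical names.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlparse
import urllib.request


def file_name(url):
    """Return the last path component of url, e.g. 'GLaDOS_00_part1_entry-1.wav'."""
    return PurePosixPath(unquote(urlparse(url).path)).name


def format_entry(url, line):
    """Format one entry as ( filename "line" ), using the file name without extension."""
    # Note: double quotes inside line are not escaped here.
    return '( {} "{}" )'.format(PurePosixPath(file_name(url)).stem, line)


def pair_links_with_lines(soup):
    """Yield (url, line) pairs from a parsed BeautifulSoup page.

    Assumption: each 'Download' link shares an <li> with the <i>
    element that holds the spoken line. Adjust if the markup differs.
    """
    for link in soup.find_all('a'):
        if link.get_text(strip=True) != 'Download' or not link.get('href'):
            continue
        item = link.find_parent('li')
        quote = item.find('i') if item else None
        if quote is not None:
            yield link['href'], quote.get_text().strip().strip('“”')


def scrape_and_download():
    """Write all entries to Lines.txt and save each sample in the current directory."""
    # Imported here so the helpers above work without bs4 installed.
    from bs4 import BeautifulSoup

    page = urllib.request.urlopen("http://theportalwiki.com/wiki/GLaDOS_voice_lines")
    soup = BeautifulSoup(page, "html.parser")
    with open('Lines.txt', 'w', encoding='utf-8') as f:
        for url, line in pair_links_with_lines(soup):
            f.write(format_entry(url, line) + '\n')
            # Save the sample under its original file name.
            urllib.request.urlretrieve(url, file_name(url))
```

Calling scrape_and_download() would fetch the page once and then download every linked sample, so it may take a while and should be run sparingly against the wiki.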