I'm trying to extract the GLaDOS voice lines from the Portal wiki page at http://theportalwiki.com/wiki/GLaDOS_voice_lines. Each line is the inner text of an "i" HTML element, shown between quotes on the page, and each has a direct download link beside it labeled "download". I want to feed the voice lines into the MARY TTS voice synthesizer in one of two formats: either every line in its own text file, with the file name matching the name of the corresponding wav file, or all lines in one text file formatted as ( filename "insert line here" ).
I was trying to do this myself but I've already spent 4 hours on it and have gotten only a small piece of Python code that doesn't work.
from bs4 import BeautifulSoup
import re
import urllib.request
soup = BeautifulSoup(urllib.request.urlopen("http://theportalwiki.com/wiki/GLaDOS_voice_lines"), "html.parser")
tags = soup.find_all('i')
f = open('Lines.txt', 'w')
for t in range(len(tags)):
    f.write(tags[t] + '\n')
f.close()
It returns "TypeError: unsupported operand type(s) for +: 'Tag' and 'str'."
I also tried AutoHotKey.
^g::
IEGet(Name="") ;Retrieve pointer to existing IE window/tab
{
IfEqual, Name,, WinGetTitle, Name, ahk_class IEFrame
Name := ( Name="New Tab - Windows Internet Explorer" ) ? "about:Tabs"
: RegExReplace( Name, " - (Windows|Microsoft) Internet Explorer" )
For wb in ComObjCreate( "Shell.Application" ).Windows
If ( wb.LocationName = Name ) && InStr( wb.FullName, "iexplore.exe" )
Return wb
} ;written by Jethrow
wb := IEGet()
IELoad(wb) ;You need to send the IE handle to the function unless you define it as global.
{
If !wb ;If wb is not a valid pointer then quit
Return False
Loop ;Otherwise sleep for .1 seconds untill the page starts loading
Sleep,100
Until (wb.busy)
Loop ;Once it starts loading wait until completes
Sleep,100
Until (!wb.busy)
Loop ;optional check to wait for the page to completely load
Sleep,100
Until (wb.Document.Readystate = "Complete")
Return True
}
For IE in ComObjCreate("Shell.Application").Windows ; for each open window
If InStr(IE.FullName, "iexplore.exe") ; check if it's an ie window
break ; keep that window's handle
; this assumes an ie window is available. it won't work if not
IE.Navigate("http://theportalwiki.com/wiki/GLaDOS_voice_lines")
While IE.Busy
Sleep, 100
Links := IE.Document.Links
Inner := FileOpen("C:\Users\Johnson\Desktop\GLaDOS Voice", "w")
Rows := IE.Document.All.Tags("table")[4].Rows
Loop % Rows.Length
Inner.Write(Row[A_Index].InnerText . "`r`n")
Inner.Close()
Return
As far as I can tell, the AutoHotKey script does absolutely nothing. I use the hotkey and nothing happens.
I'd prefer Lua because it's consistent and I understand it.
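Your Python code is very close to working. find_all returns Tag objects, not strings, which is why adding '\n' to one raises the TypeError; write each tag's .text instead. The fix (plus a context manager for the file) is below: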
from bs4 import BeautifulSoup
import urllib.request
soup = BeautifulSoup(urllib.request.urlopen("http://theportalwiki.com/wiki/GLaDOS_voice_lines"), "html.parser")
tags = soup.find_all('i')
with open('Lines.txt', 'w') as f:
    for tag in tags:
        # tag.text is the element's text as a str; strip the curly quotes around it
        f.write(tag.text.strip('“”') + '\n')
Lines.txt:
You just have to look at things objectively, see what you don't need anymore, and trim out the fat.
Portal
Portal 2
Hello and, again, welcome to the Aperture Science computer-aided enrichment center.
...
EDIT
To answer the question in the comment below, this should get the download links:
from bs4 import BeautifulSoup
import urllib.request
soup = BeautifulSoup(urllib.request.urlopen("http://theportalwiki.com/wiki/GLaDOS_voice_lines"), "html.parser")
tags = soup.find_all('a')
with open('Downloads.txt', 'w') as f:
    for tag in tags:
        if tag.text == 'Download':
            f.write(tag['href'] + '\n')
Downloads.txt:
http://i1.theportalwiki.net/img/e/e5/GLaDOS_00_part1_entry-1.wav
http://i1.theportalwiki.net/img/d/d7/GLaDOS_00_part1_entry-2.wav
http://i1.theportalwiki.net/img/5/50/GLaDOS_00_part1_entry-3.wav
...
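To get from these two lists to the single-file format asked for in the question, ( filename "insert line here" ), each line's text has to be paired with its download URL and the wav file name pulled out of that URL. Below is a minimal sketch of just the formatting step, using only the standard library. The helper names wav_stem and format_entry are made up for illustration, and the sketch assumes the text and URL have already been matched up. Note that soup.find_all('i') also returns entries such as "Portal" and "Portal 2" that have no download link, so zipping the two lists blindly would misalign them. The sample pair at the bottom is only assumed to belong together.

```python
import posixpath
from urllib.parse import unquote, urlparse


def wav_stem(url):
    """Return the file name from a download URL, without its extension."""
    path = unquote(urlparse(url).path)       # /img/e/e5/GLaDOS_00_part1_entry-1.wav
    filename = posixpath.basename(path)      # GLaDOS_00_part1_entry-1.wav
    return posixpath.splitext(filename)[0]   # GLaDOS_00_part1_entry-1


def format_entry(url, text):
    """Build one line in the ( filename "text" ) format."""
    # Strip surrounding curly or straight quotes so the text is quoted only once.
    return '( {} "{}" )'.format(wav_stem(url), text.strip('“”"'))


# Hypothetical usage: one (url, text) pair, assumed to be matched already.
pairs = [
    ("http://i1.theportalwiki.net/img/e/e5/GLaDOS_00_part1_entry-1.wav",
     "“Hello and, again, welcome to the Aperture Science "
     "computer-aided enrichment center.”"),
]
entries = [format_entry(url, text) for url, text in pairs]
```

Writing entries to a file works the same way as for Lines.txt above. For the one-file-per-line format, wav_stem(url) + '.txt' would give the matching file name. Double quotes inside a voice line are not escaped here, so check what the synthesizer expects before relying on this.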