python, python-2.7, urllib2

Search for a word on a webpage


I found some posts about this subject and tried them, but can't get them to work.

  • I need to create a script that takes two command-line arguments: inputfile and outputfile.
  • The inputfile is on the filesystem and contains one "url,word(s)" pair per line.
  • For each line, I want to open the url and search the page for the word(s) that follow the comma.
  • I then want to store the result in a list, appending 'YES' or 'NO' depending on whether the word(s) were found.
  • That list should be written to the outputfile.

My code is:

#!/usr/bin/python
# -*- coding: utf-8 -*-

# Tested Python version: 2.7.12
#
# Run "./script.py [inputfile.txt] [outputfile.txt]"
#
# Exit codes:
# 1 - Python version not tested
# 2 - Wrong number command-line arguments
# 3 - Input file, with this name, does not exist
# 4 - Output file, with this name, already exists
# 5 - Problem with input file
# 6 - Problem with output file

import os, sys
import urllib2, re

# Check python version
req_version = (2, 7)

if not sys.version_info[:2] == req_version:
     print '...'
     print 'Not tested Python version (2.7).'
     print 'Your Python version: ', sys.version_info[:2]
     print '...'
     sys.exit(1)

# Check command-line arguments
if len(sys.argv) < 3:
     print '...'
     print 'Missing command-line argument(s).'
     print 'Argument list:', str(sys.argv)
     print '...'
     sys.exit(2)

# Check if files exist
if not os.path.exists(sys.argv[1]):
     print '...'
     print 'Input file %s was not found.' % sys.argv[1]
     print '...'
     sys.exit(3)

if os.path.exists(sys.argv[2]):
     print '---'
     print 'Output file %s already exists.' % sys.argv[2]
     print '---'
     sys.exit(4)

# Read input file line by line, make a list of URL-s and write the
# results to output file
inputfile = sys.argv[1]
outputfile = sys.argv[2]

print '---'
print 'Reading input file %s ..'  % inputfile
print '---'

results = []

try:
     with open(inputfile, 'r') as in_f:

         for line in in_f:

             url = line.strip().split(',')[0]
             word = line.strip().split(',')[1]
             site = urllib2.urlopen(url).read()

             print 'Found "%s" on "%s" ->' % (word, url)

             # matches = re.search(word)
             # if re.search(word, url):
             # if len(matches) == 0:
             if site.find(word) != -1:
                 print 'YES'
                 results.append('.'.join(url, word + ' YES')))
             else:
                 print 'NO'
                 results.append('.'.join(url, word + ' NO')))
except:
     print 'Error reading the file'
     sys.exit(5)

#if not inputfile.closed:
#     inputfile.close()
print '>>>' + inputfile + ' closed: ' + inputfile.closed

print '...'
print 'Writing results to output file %s ..' % outputfile
print '...'

try:
     with open(outputfile, 'w'):
         for item in results:
             outputfile.write((results) + '\n')
             print '>>>' + outputfile.read()
except:
     print 'Error writing to file'
     sys.exit(6)

#if not outputfile.closed:
#     outputfile.close()
print '>>>' + outputfile + ' closed: ' + outputfile.closed

print ''
print '>>> End of script <<<'
print ''

When I run ./script.py inputfile_name.txt outputfile_name.txt, the terminal shows the message from the except branch of the block that reads the inputfile:

...
Reading input file inputfile_name.txt ..
...
Error reading the file

Could somebody please point out the possible fault in my code? I can't figure it out.

EDIT: I moved the variables (url, word, site) into the for block and added a print after them. The script prints the url and word from the first line, but does not print the "Found ..." message after that. If I remove the print of url and word, the script hits the except branch right away.
EDIT2: I made the changes suggested by user Oluwafemi Sule. The script works until a line in the inputfile has multiple words (a sentence) after the url; then it hits the except branch.


Solution

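  • The error in your code comes from calling append on the results list with two arguments. list.append accepts exactly one argument, so the call raises a TypeError, and the bare except then reports it as 'Error reading the file'.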

    results.append(url, word + ' YES')
    

    can be rewritten to append a single string that joins url, word and the verdict with commas:

    results.append(','.join((url, word, 'YES')))
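
A minimal sketch of the difference, using placeholder values for url and word (they are not taken from the question's input file). The first call shows the TypeError that list.append raises when given two arguments; the second appends one joined string.

```python
results = []
url, word = 'http://example.com', 'Example'  # placeholder values

try:
    # Two arguments: list.append accepts exactly one, so this raises.
    results.append(url, word + ' YES')
except TypeError as exc:
    error = str(exc)

# Join the three fields into one string and append that single value.
results.append(','.join((url, word, 'YES')))
```

Catching TypeError by name, rather than using a bare except, keeps the real cause visible while debugging.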
    

    BONUS:

    Other things worth changing in your code

    The following code block:

    url = line.strip().split(',')[0]
    word = line.strip().split(',')[1]
    

    can be rewritten as:

    url, word = line.strip().split(',') 
    

    so that the line is split only once.
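
One caveat with the unpacking form: it raises a ValueError when a line contains more than one comma. That may be what EDIT2 in the question runs into if the sentence itself contains a comma, though this is a guess since the input file is not shown. A sketch that splits on the first comma only, using a hypothetical helper named parse_line (stripping the phrase is also an assumption, not something the original code does):

```python
def parse_line(line):
    """Split 'url,phrase' on the first comma only (hypothetical helper)."""
    # maxsplit=1 keeps any later commas inside the phrase.
    url, word = line.strip().split(',', 1)
    return url.strip(), word.strip()

# Placeholder input line whose phrase contains a comma.
parsed = parse_line('http://example.com, more information, please\n')
```

A line with no comma at all still raises a ValueError, so the caller may want to catch that and skip or report the line.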

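    The following blocks can be removed, because the with statement closes each file automatically. In addition, inputfile and outputfile are filename strings rather than file objects, so reading .closed from them raises an AttributeError.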

    if not inputfile.closed:
         inputfile.close()
    print '>>>' + inputfile + ' closed: ' + inputfile.closed
    

    And

    if not outputfile.closed:
         outputfile.close()
    print '>>>' + outputfile + ' closed: ' + outputfile.closed
    

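    Lastly, nothing is written to the opened output file. The with statement does not bind the file to a name (for example, as out_f), so outputfile.write(...) is called on the filename string and raises an AttributeError. Bind the file with as out_f and write each item, not the whole results list, with out_f.write(item + '\n').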
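
A sketch of the output step with the file object bound to a name. The write_results helper, the sample results and the temporary path are made up for illustration; the point is only that write is called on out_f, and that each item is written on its own line.

```python
import os
import tempfile

def write_results(path, results):
    """Write one result per line; the with block closes the file."""
    with open(path, 'w') as out_f:
        for item in results:
            out_f.write(item + '\n')

# Usage with placeholder data and a temporary file.
results = ['http://example.com,Example,YES', 'http://example.com,foo,NO']
path = os.path.join(tempfile.mkdtemp(), 'outputfile.txt')
write_results(path, results)

# Reopen the file for reading to check what was written.
with open(path) as check_f:
    written = check_f.read()
```

Reading the file back needs a separate open call, because a file opened with mode 'w' cannot be read.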