python pandas web-scraping screen-scraping

Scrape web with a query

I am trying to scrape the impact factors of journals from a particular website or from the web in general. I have been searching for something similar but have had no luck.

This is the first time I am trying web scraping with Python, so I am looking for the simplest way to do it.

I have a list of ISSN numbers belonging to journals, and I want to retrieve their impact factor values from the web or from a particular site. The list has more than 50K values, so searching for them manually is not practical.

Input type

Index,JOURNALNAME,ISSN,Impact Factor 2015,URL,ABBV,SUBJECT
1,4OR-A Quarterly Journal of Operations Research,1619-4500,,,4OR Q J OPER RES,Management Science
2,Aaohn Journal,0891-0162,,,AAOHN J,
3,Aapg Bulletin,0149-1423,,,AAPG BULL,Engineering
4,AAPS Journal,1550-7416,,,AAPS J,Medicine
5,Aaps Pharmscitech,1530-9932,,,AAPS PHARMSCITECH,
6,Aatcc Review,1532-8813,,,AATCC REV,
7,Abdominal Imaging,0942-8925,,,ABDOM IMAGING,
8,Abhandlungen Aus Dem Mathematischen Seminar Der Universitat Hamburg,0025-5858,,,ABH MATH SEM HAMBURG,
9,Abstract and Applied Analysis,1085-3375,,,ABSTR APPL ANAL,Math
10,Academic Emergency Medicine,1069-6563,,,ACAD EMERG MED,Medicine

What is needed?

The input above has a column of ISSN numbers. Read each ISSN number and search for it on researchgate.net or on the web. When the individual web page is found, search it for Impact Factor 2015, retrieve the value, put it in the empty field beside the ISSN number, and place the retrieved URL next to it.

The web search can also be limited to one site and one keyword search for the value. Any value that cannot be found can be kept as "NAN".

Thanks in advance for the suggestions and help

Solution

Try this code using Beautiful Soup and urllib2 (urllib2 is the Python 2 module; on Python 3 the same urlopen function lives in urllib.request). It looks through the h2 tags for the text 'Journal Impact:', but I will let you decide on the exact algorithm for extracting the data. The HTML content is held in soup, and soup provides the API to extract it. What I provide is only an example that may work for you.

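#!/usr/bin/env python

# urllib2 exists only on Python 2; use urllib.request where it is available.
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen         # Python 2
from bs4 import BeautifulSoup

issn = '0219-5305'
url = 'https://www.researchgate.net/journal/%s_Analysis_and_Applications' % issn
html_doc = urlopen(url).read()
soup = BeautifulSoup(html_doc, 'html.parser')

# Look for the <h2> whose text contains 'Journal Impact:'.
for tag in soup.find_all('h2'):
    if 'Journal Impact:' in tag.text:
        value = tag.text.replace('Journal Impact:', '')
        value = value.strip(' *')  # drop surrounding spaces and asterisks
        print(value)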

Output:

   1.13
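To run this over the whole ISSN list, something along the following lines might work. It is only a rough sketch under a few assumptions: parse_impact, fill_impact_factors, and stub_lookup are names I made up for illustration, the lookup callable stands in for whatever page fetching you end up with (it does no network access here), and the second sample row is a stand-in row for the ISSN used above. It uses the standard csv module rather than pandas so that it runs without extra installs.

```python
import csv
import io
import re


def parse_impact(heading_text):
    """Return the number from text like 'Journal Impact: 1.13*', or None."""
    match = re.search(r'Journal Impact:\s*([0-9]+(?:\.[0-9]+)?)', heading_text)
    return float(match.group(1)) if match else None


def fill_impact_factors(csv_text, lookup):
    """Fill the 'Impact Factor 2015' and 'URL' columns of every row.

    lookup is a hypothetical callable: lookup(issn) returns a tuple
    (heading_text, url), or None when no page was found for that ISSN.
    Missing values are written as the string 'NAN'.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    for row in rows:
        found = lookup(row['ISSN'])
        value = parse_impact(found[0]) if found else None
        row['Impact Factor 2015'] = 'NAN' if value is None else value
        row['URL'] = found[1] if found else 'NAN'
    return rows


# Small offline demonstration with a stub in place of a real fetch.
sample = (
    "Index,JOURNALNAME,ISSN,Impact Factor 2015,URL,ABBV,SUBJECT\n"
    "2,Aaohn Journal,0891-0162,,,AAOHN J,\n"
    "11,Analysis and Applications,0219-5305,,,,\n"  # stand-in row
)


def stub_lookup(issn):
    """Pretend that only the ISSN from the example above has a page."""
    if issn == '0219-5305':
        url = 'https://www.researchgate.net/journal/%s_Analysis_and_Applications' % issn
        return ('Journal Impact: 1.13*', url)
    return None


filled = fill_impact_factors(sample, stub_lookup)
for row in filled:
    print(row['ISSN'], row['Impact Factor 2015'], row['URL'])
```

Keeping the lookup separate from the CSV loop means you can swap in a real fetch-and-parse function later without touching the rest, and rows with no result simply end up as "NAN" as you asked.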

I think the official documentation for Beautiful Soup is pretty good. If you are new to this, I suggest spending an hour on the documentation before you even try to write any code. That hour of reading will save you many more hours later.

https://www.crummy.com/software/BeautifulSoup/
https://www.crummy.com/software/BeautifulSoup/bs4/doc/