I would like to access the ScopusSearch API and obtain the EIDs for a list of 1400 article titles that are saved in an Excel spreadsheet. I tried to retrieve the EIDs via the following code:
import numpy as np
import pandas as pd
from pybliometrics.scopus import ScopusSearch
nan = pd.read_excel(r'C:\Users\Apples\Desktop\test\titles_nan.xlsx', sheet_name='nan')
error_index = {}
for i in range(0,len(nan)):
    scopus_title = nan.loc[i ,'Title']
    s = ScopusSearch('TITLE("{0}")'.format(scopus_title))
    print('TITLE("{0}")'.format(scopus_title))
    try:
        s = ScopusSearch(scopus_title)
        nan.at[i,'EID'] = s.results[0].eid
        print(str(i) + ' ' + s.results[0].eid)
    except:
        nan.loc[i,'EID'] = np.nan
        error_index[i] = scopus_title
        print(str(i) + 'error' )
However, I was never able to retrieve the EIDs beyond approximately 100 titles, because certain titles yield far too many search results and that stalls the entire process.
As such, I wanted to skip titles that return too many results and move on to the next title, all while keeping a record of the titles that were skipped.
I am just starting out with Python so I am not sure how to go about doing this. I have the following sequence in mind:
• If the title yields exactly 1 result, retrieve the EID and record it under the ‘EID’ column of file ‘nan’.
• If the title yields more than 1 result, record the title in the error index, print ‘Too many searches’ and move on to the next title.
• If the title does not yield any results, record the title in the error index, print ‘Error’ and move on to the next title.
Attempt 1
for i in range(0,len(nan)):
    scopus_title = nan.at[i ,'Title']
    print('TITLE("{0}")'.format(scopus_title))
    s = ScopusSearch('TITLE("{0}")'.format(scopus_title))
    print(type(s))
    if(s.count()== 1):
        nan.at[i,"EID"] = s.results[0].eid
        print(str(i) + " " + s.results[0].eid)
    elif(s.count()>1):
        continue
        print(str(i) + " " + "Too many searches")
    else:
        error_index[i] = scopus_title
        print(str(i) + "error")
Attempt 2
for i in range(0,len(nan)):
    scopus_title = nan.at[i ,'Title']
    print('TITLE("{0}")'.format(scopus_title))
    s = ScopusSearch('TITLE("{0}")'.format(scopus_title))
    if len(s.results)== 1:
        nan.at[i,"EID"] = s.results[0].eid
        print(str(i) + " " + s.results[0].eid)
    elif len(s.results)>1:
        continue
        print(str(i) + " " + "Too many searches")
    else:
        continue
        print(str(i) + " " + "Error")
I got errors stating that an object of type 'ScopusSearch' has no len() or count(), or that the search results are not a list themselves. I am unable to proceed from here. In addition, I am not sure whether skipping titles that return too many results is the right way to go about it. Are there more effective methods (e.g. timeouts, so that a title is skipped after a certain amount of time is spent on the search)?
Any help on this matter would be very much appreciated. Thank you!
Combine .get_results_size() with download=False:
from pybliometrics.scopus import ScopusSearch
scopus_title = "Editorial"
q = f'TITLE("{scopus_title}")' # this is f-string notation, btw
s = ScopusSearch(q, download=False)
s.get_results_size()
# 243142
If this number is below a certain threshold, simply do s = ScopusSearch(q) and proceed as in "Attempt 2":
for i, row in nan.iterrows():
    # use double quotes inside the single-quoted f-string
    q = f'TITLE("{row["Title"]}")'
    print(q)
    # download=False only fetches the number of matches
    s = ScopusSearch(q, download=False)
    n = s.get_results_size()
    if n == 1:
        s = ScopusSearch(q)
        nan.at[i,"EID"] = s.results[0].eid
        print(f"{i} {s.results[0].eid}")
    elif n > 1:
        print(f"{i} Too many results")
        continue  # must come last
    else:
        print(f"{i} Error")
        continue  # must come last
(I used .iterrows() here to get rid of the indexing. But i will be incorrect if the index is not a range sequence; in that case, enclose everything in enumerate().)
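The loop above only prints the skipped titles, while the question also asks to keep a record of them. A minimal sketch of one way to do that follows. The names lookup_eids, search_cls and max_results are made up for illustration and are not part of pybliometrics. The sketch assumes search_cls behaves like ScopusSearch, i.e. it accepts download=False and offers .get_results_size() and .results[0].eid as used above.

```python
def lookup_eids(titles, search_cls, max_results=1):
    """Look up one EID per title and record the titles that were skipped.

    titles: iterable of (label, title) pairs, e.g. nan["Title"].items()
    search_cls: hypothetical parameter; assumed to behave like ScopusSearch
    max_results: hypothetical threshold above which a title is skipped
    Returns ({label: eid}, {label: (title, reason)}).
    """
    eids, skipped = {}, {}
    for label, title in titles:
        query = f'TITLE("{title}")'
        # Cheap call: only asks for the number of matches, no download
        n = search_cls(query, download=False).get_results_size()
        if n == 0:
            skipped[label] = (title, "no results")
        elif n > max_results:
            skipped[label] = (title, "too many results")
        else:
            # Few enough matches: download them and keep the first EID
            eids[label] = search_cls(query).results[0].eid
    return eids, skipped
```

With the DataFrame from the question, this could be called as eids, error_index = lookup_eids(nan["Title"].items(), ScopusSearch), followed by nan["EID"] = pd.Series(eids). The assignment aligns on the index labels, so skipped rows stay NaN even if the index is not a range sequence.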