Search code examples
pythonpandasgoogle-cloud-storagegoogle-compute-enginegoogle-natural-language

Too many open files: '/home/USER/PATH/SERVICE_ACCOUNT.json' when calling Google's Natural Language API


I'm working on a Sentiment Analysis project using the Google Cloud Natural Language API and Python, this question might be similar to this other question, what I'm doing is the following:

  1. Reads a CSV file from Google Cloud Storage, file has approximately 7000 records.
  2. Converts the CSV into a Pandas DataFrame.
  3. Iterates over the dataframe and calls the Natural Language API to perform sentiment analysis on one of the dataframe's columns, on the same for loop I extract the score and magnitude from the result and add those values to a new column on the dataframe.
  4. Store the result dataframe back to GCS.

I'll put my code below, but prior to that, I just want to mention that I have tested it with a sample CSV with less than 100 records and it works well, I am also aware about the quota limit of 600 requests per minute, reason why I put a delay on each iteration, still, I'm getting the error I specify at the title. I'm also aware about the suggestion of increasing the ulimit, but I don't think that's a good solution.

Here's my code:

from google.cloud import language_v1
from google.cloud.language_v1 import enums
from google.cloud import storage
from time import sleep
import pandas
import sys

pandas.options.mode.chained_assignment = None

def parse_csv_from_gcs(csv_file):
    df = pandas.read_csv(f, encoding = "ISO-8859-1")

    return df

def analyze_sentiment(text_content):
    client = language_v1.LanguageServiceClient()
    type_ = enums.Document.Type.PLAIN_TEXT
    language = 'es'
    document = {"content": text_content, "type": type_, "language": language}
    encoding_type = enums.EncodingType.UTF8
    response = client.analyze_sentiment(document, encoding_type=encoding_type)

    return response

gcs_path = sys.argv[1]
output_bucket = sys.argv[2]
output_csv_file = sys.argv[3]

dataframe = parse_csv_from_gcs(gcs_path)

for i in dataframe.index:
    print(i)
    response = analyze_sentiment(dataframe.at[i, 'FieldOfInterest'])
    dataframe.at[i, 'Score'] = response.document_sentiment.score
    dataframe.at[i, 'Magnitude'] = response.document_sentiment.magnitude
    sleep(0.5)

print(dataframe)
dataframe.to_csv("results.csv", encoding = 'ISO-8859-1')

gcs = storage.Client()
gcs.get_bucket(output_bucket).blob(output_csv_file).upload_from_filename('results.csv', content_type='text/csv')

The 'analyze_sentiment' function is very similar to what we have in Google's documentation, I just modified it a little, but it does pretty much the same thing.

Now, the program is raising that error and crashes when it reaches a record between 550 and 700, but I don't see the correlation between the service account JSON and calling the Natural Language API, so I also think that when I call the the API, it opens the account credential JSON file but doesn't close it afterwards.

I'm currently stuck with this issue and ran out of ideas, so any help will be much appreciated, thanks in advance =)!

[UPDATE]

I've solved this issue by extracting the 'client' out of the 'analyze_sentiment' method and passing it as a parameter, as follows:

def analyze_sentiment(ext_content, client):
    <Code>    

Looks like every time it reaches this line:

client = language_v1.languageServiceClient()

It opens the account credential JSON file and it doesn't get closed, so extracting it to a global variable made this work =).


Solution

  • I've updated the original post with the solution for this, but in any case, thanks to everyone that saw this and tried to reply =)!