I have a small csv file that looks like that :
is_employee,candidate_id,gender,hesa_type,university
FALSE,b9bb80,Male,Mathematical sciences,Birmingham
FALSE,8e552d,Female,Computer science,Swansea
TRUE,2bc475,Male,Engineering & technology,Aston
TRUE,c3ac8d,Female,Mathematical sciences,Heriot-Watt
FALSE,ceb2fa,Female,Mathematical sciences,Imperial College London
The following lambda function is used to query from an s3bucket.
import boto3
import os
import json
def lambda_handler(event, context):
BUCKET_NAME = 'foo'
KEY = 'bar/data.csv'
s3 = boto3.client('s3','eu-west-1')
response = s3.select_object_content(
Bucket = BUCKET_NAME,
Key = KEY,
ExpressionType = 'SQL',
Expression = 'Select count(*) from s3object s where s.gender like \'%Female%\'',
InputSerialization = {'CSV': {"FileHeaderInfo": "Use"}},
OutputSerialization = {'JSON': {}},
)
for i in response['Payload']:
if 'Records' in i:
query_result = i['Records']['Payload'].decode('utf-8')
print(list(json.loads(query_result).values())[0])
Now, this works great as I get back a result of 3
.
But for some reason the same code does not work when changing the like
operator to =
, results drop down to 0
, so no match found. What's happening here ?
So I found the problem. The problem was that the items of the last column were followed by a newline character, which was not understood by the AWS S3 interpreter. So really, a university name was not Swansea
, but more Swansea\n
.
So s.university = \'Swansea\''
does not work; however, s.university LIKE \'Swansea%\''
does work, and is still a sargable expression.