I have a question about saving email data in batches using exchangelib. Processing currently takes a long time when there are many emails, and after a few minutes it throws this error:
ERROR: MemoryError:
Retry: 0
Waited: 10
Timeout: 120
Session: 25999
Thread: 28148
Auth type: <requests.auth.HTTPBasicAuth object at 0x1FBFF1F0>
URL: https://outlook.office365.com/EWS/Exchange.asmx
HTTP adapter: <requests.adapters.HTTPAdapter object at 0x1792CE68>
Allow redirects: False
Streaming: False
Response time: 411.93799999996554
Status code: 503
Request headers: {'X-AnchorMailbox': 'myworkemail@workdomain.com'}
Response headers: {}
Here is the code that I use for connecting and reading:
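from bs4 import BeautifulSoup as bs
from exchangelib import (
    DELEGATE, Account, Configuration, Credentials, EWSDateTime, EWSTimeZone,
)


def connect_mail():
    # Connect to the Office 365 EWS endpoint with basic credentials
    config = Configuration(
        server="outlook.office365.com",
        credentials=Credentials(
            username="myworkemail@workdomain.com", password="*******"
        ),
    )
    return Account(
        primary_smtp_address="myworkemail@workdomain.com",
        config=config,
        access_type=DELEGATE,
    )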
def import_email(account):
    tz = EWSTimeZone.localzone()
    start = EWSDateTime(2020, 10, 26, 22, 15, tzinfo=tz)
    for item in account.inbox.filter(
        datetime_received__gt=start, is_read=False
    ).order_by("-datetime_received"):
        email_body = item.body
        email_subject = item.subject
        soup = bs(email_body, "html.parser")
        tables = soup.find_all("table")
        item.is_read = True
        item.save()
        # Some code here for saving the email to a database
You're getting a MemoryError, which means that Python is not able to allocate any more memory on your machine.
There are a couple of things you can do to reduce the memory consumption of your script. One is to use .iterator(), which disables internal caching of your query results. Another is to fetch only the fields you actually need using .only().
When you use .only(), the other fields will be None, so you need to remember to save only the one field you actually changed: item.save(update_fields=['is_read'])
Here's an example of how to use the two improvements:
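for item in account.inbox.filter(
    datetime_received__gt=start, is_read=False,
).only(
    'is_read', 'subject', 'body',
).order_by('-datetime_received').iterator():
    # Process item.subject and item.body here, then mark the item as read
    item.is_read = True
    # Save only the changed field; the fields that were not fetched are None
    item.save(update_fields=['is_read'])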
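If the one-request-per-item save() calls turn out to be part of the slowdown, one possible variation is to process the query results in small chunks and mark each chunk as read with a single account.bulk_update() call. The following is only a sketch under some assumptions: chunked, mark_read_in_chunks, the handle callback (standing in for your database code) and the chunk size of 50 are all hypothetical names and values, not part of exchangelib, and I have not measured this against a large mailbox.

```python
from itertools import islice


def chunked(iterable, size):
    """Yield lists of at most `size` items without loading everything at once."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def mark_read_in_chunks(account, items, handle, chunk_size=50):
    """Process items lazily and mark each chunk as read in one bulk call.

    `handle` is a hypothetical callback that stores a single email,
    for example in a database. Returns the number of items processed.
    """
    processed = 0
    for chunk in chunked(items, chunk_size):
        updates = []
        for item in chunk:
            handle(item)
            item.is_read = True
            # bulk_update expects (item, fieldnames) pairs; list only the
            # field that changed, since unfetched fields are None
            updates.append((item, ("is_read",)))
        account.bulk_update(items=updates)
        processed += len(chunk)
    return processed
```

You would pass it the same query as above, for example mark_read_in_chunks(account, account.inbox.filter(...).only('is_read', 'subject', 'body'), save_to_db), where save_to_db is whatever function writes one email to your database. Only one chunk of items is held in memory at a time.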