I'm using SQLAlchemy and multiprocessing. I also use scoped_session sinse it avoids share the same session but I've found an error and their solution but I don't understand why does it happend.
You can see my code below:
db.py
engine = create_engine(connection_string)
Session = sessionmaker(bind=engine)
DBSession = scoped_session(Session)
script.py
from multiprocessing import Pool, current_process
from db import DBSession
def process_feed(test):
session = DBSession()
print(current_process().name, session)
def run():
session = DBSession()
pool = Pool()
print(current_process().name, session)
pool.map_async(process_feed, [1, 2]).get()
if __name__ == "__main__":
run()
When I run script.py
The output is:
MainProcess <sqlalchemy.orm.session.Session object at 0xb707b14c>
ForkPoolWorker-1 <sqlalchemy.orm.session.Session object at 0xb707b14c>
ForkPoolWorker-2 <sqlalchemy.orm.session.Session object at 0xb707b14c>
Note that session object is the same 0xb707b14c
in the main process and their workers (child process)
BUT If I change the order of first two lines run():
def run():
pool = Pool() # <--- Now pool is instanced in the first line
session = DBSession() # <--- Now session is instanced in the second line
print(current_process().name, session)
pool.map_async(process_feed, [1, 2]).get()
And the I run script.py
again the output is:
MainProcess <sqlalchemy.orm.session.Session object at 0xb66907cc>
ForkPoolWorker-1 <sqlalchemy.orm.session.Session object at 0xb669046c>
ForkPoolWorker-2 <sqlalchemy.orm.session.Session object at 0xb66905ec>
Now the session instances are different.
To understand why this happens, you need to understand what scoped_session
and Pool
actually does. scoped_session
keeps a registry of sessions so that the following happens
DBSession
, it creates a Session
object for you in the registrySession
object and instead returns you the previously created Session
object backWhen you create a Pool
, it creates the workers in the __init__
method. (Note that there's nothing fundamental about starting the worker processes in __init__
. An equally valid implementation could wait until workers are first needed before it starts them, which would exhibit different behavior in your example.) When this happens (on Unix), the parent process forks itself for every worker process, which involves the operating system copying the memory of the current running process into a new process, so you will literally get the exact same objects in the exact same places.
Putting these two together, in the first example you are creating a Session
before forking, which gets copied over to all worker processes during the creation of the Pool
, resulting in the same identity, while in the second example you delay the creation of the Session
object until after the worker processes have started, resulting in different identities.
It's important to note that while the Session
objects share the same id
, they are not the same object, in the sense that if you change anything about the Session
in the parent process, they will not be reflected in the child processes. They just happen to all share the same memory address due to the fork. However, OS-level resources like connections are shared, so if you had run a query on session
before Pool()
, a connection would have been created for you in the connection pool and subsequently forked into the child processes. If you then attempt to perform queries in the child processes you will run into weird errors because your processes are clobbering over each other over the same exact connection!
The above is moot for Windows because Windows does not have fork()
.