I am working on a Django app, part of which transcribes audio with timestamps. When a user clicks a button in the web interface, the Django server launches a Python script that performs the transcription.
Here is what I have tried so far. I have a separate transcribe.py file. When a user clicks the transcribe button on the web page, the request hits a view in the project app, which runs the script. However, partway through the script, the Django server terminates in the terminal.
Since the Python script is a long-running process, I figured I should run it in the background so the Django server doesn't terminate. So I implemented Celery and Redis. The transcribe.py script runs perfectly well when I run it from the Django shell. However, it terminates once again when I try to execute it from the view/web page.
python manage.py shell
Since I implemented the Celery worker, the server no longer terminates, but the worker throws the following error:
[tasks]
. transcribeApp.tasks.run_transcription
[2024-11-25 03:26:04,500: INFO/MainProcess] Connected to redis://localhost:6379/0
[2024-11-25 03:26:04,514: INFO/MainProcess] mingle: searching for neighbors
[2024-11-25 03:26:05,520: INFO/MainProcess] mingle: all alone
[2024-11-25 03:26:05,544: INFO/MainProcess] [email protected] ready.
[2024-11-25 03:26:16,253: INFO/MainProcess] Task searchApp.tasks.run_transcription[c684bdfa-ec21-4b4e-9542-0ca1f7729682] received
[2024-11-25 03:26:16,255: INFO/ForkPoolWorker-15] Starting transcription process.
[2024-11-25 03:26:16,509: WARNING/ForkPoolWorker-15] /Users/user/Desktop/project/django_app/django_venv/lib/python3.12/site-packages/whisper/__init__.py:150: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
checkpoint = torch.load(fp, map_location=device)
[2024-11-25 03:26:16,670: ERROR/MainProcess] Process 'ForkPoolWorker-15' pid:38956 exited with 'signal 11 (SIGSEGV)'
[2024-11-25 03:26:16,683: ERROR/MainProcess] Task handler raised error: WorkerLostError('Worker exited prematurely: signal 11 (SIGSEGV) Job: 0.')
Traceback (most recent call last):
File "/Users/user/Desktop/project/django_app/django_venv/lib/python3.12/site-packages/billiard/pool.py", line 1265, in mark_as_worker_lost
raise WorkerLostError(
billiard.einfo.ExceptionWithTraceback:
"""
Traceback (most recent call last):
File "/Users/user/Desktop/project/django_app/django_venv/lib/python3.12/site-packages/billiard/pool.py", line 1265, in mark_as_worker_lost
raise WorkerLostError(
billiard.exceptions.WorkerLostError: Worker exited prematurely: signal 11 (SIGSEGV) Job: 0.
"""
The implementation looks like this:
# views.py
from . import tasks
from django.shortcuts import render
from django.http import HttpResponse, JsonResponse

def trainVideos(request):
    try:
        # Queue the transcription on the Celery worker instead of running it in the request.
        tasks.run_transcription.delay()
        return JsonResponse({"status": "success", "message": "Transcription has started check back later."})
        # return render(request, 'embed.html', {'data': data})
    except Exception as e:
        # A view must return a response; without `return` this branch yields None.
        return JsonResponse({"status": "error", "message": str(e)})
Here is the transcribe function, which is where the Celery worker throws the "Worker exited prematurely" error.
# Add one or two audio files (e.g. .wav or .mp3) to a folder
# and provide the folder path here.
# transcribe.py
import whisper_timestamped as whisper
import os

def transcribeTexts(model_id, filePath):
    result = []
    # List the files in the audio folder, not in the current working directory.
    fileNames = os.listdir(filePath)
    model = whisper.load_model(model_id)
    for files in fileNames:
        audioPath = filePath + "/" + files
        audio = whisper.load_audio(audioPath)
        result.append(model.transcribe(audio, language="en"))
    return result

model_id = "tiny"
audioFilePath = "path/to/audio"  # placeholder: replace with your audio folder
transcribeTexts(model_id, audioFilePath)
Install the following libraries to reproduce the problem:
pip install openai-whisper
pip3 install whisper-timestamped
pip install Django
pip install celery redis
pip install redis-server
The Celery implementation:
# celery.py from the project's main_app directory
from __future__ import absolute_import, unicode_literals
import os
from celery import Celery

os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'main_app.settings')
app = Celery('main_app')
app.config_from_object('django.conf:settings', namespace='CELERY')
app.autodiscover_tasks()

# bind=True passes the task instance as `self`, which provides self.request.
@app.task(bind=True)
def debug_tasks(self):
    print(f"Request: {self.request!r}")
tasks.py from the transcribe_app directory:
from __future__ import absolute_import, unicode_literals
from . import transcribe
from celery import shared_task
@shared_task
def run_transcription():
    transcribe.transcribe()
    return "Transcription Completed..."
The settings.py is also updated with the following:
CELERY_BROKER_URL = 'redis://localhost:6379/0'
CELERY_BROKER_CONNECTION_RETRY_ON_STARTUP = True
I also modified the __init__.py file in django_app:
from __future__ import absolute_import, unicode_literals
from .celery import app as celery_app
__all__ = ('celery_app',)
For this application, some of the libraries depend on particular versions. All libraries and packages are listed below:
Package Version
-------------------- -----------
amqp 5.3.1
asgiref 3.8.1
billiard 4.2.1
celery 5.4.0
certifi 2024.8.30
charset-normalizer 3.3.2
click 8.1.7
click-didyoumean 0.3.1
click-plugins 1.1.1
click-repl 0.3.0
Cython 3.0.11
Django 5.1.2
django-widget-tweaks 1.5.0
dtw-python 1.5.3
faiss-cpu 1.9.0
ffmpeg 1.4
filelock 3.16.1
fsspec 2024.9.0
huggingface-hub 0.25.2
idna 3.10
Jinja2 3.1.4
kombu 5.4.2
lfs 0.2
llvmlite 0.43.0
MarkupSafe 3.0.1
more-itertools 10.5.0
mpmath 1.3.0
msgpack 1.1.0
networkx 3.3
numba 0.60.0
numpy 2.0.2
packaging 24.1
panda 0.3.1
pillow 10.4.0
pip 24.3.1
prompt_toolkit 3.0.48
pydub 0.25.1
python-dateutil 2.9.0.post0
PyYAML 6.0.2
redis 5.2.0
regex 2024.9.11
requests 2.32.3
safetensors 0.4.5
scipy 1.14.1
semantic-version 2.10.0
setuptools 75.1.0
setuptools-rust 1.10.2
six 1.16.0
sqlparse 0.5.1
sympy 1.13.3
tiktoken 0.8.0
tokenizers 0.20.1
torch 2.4.1
torchaudio 2.4.1
torchvision 0.19.1
tqdm 4.66.5
transformers 4.45.2
txtai 7.4.0
typing_extensions 4.12.2
tzdata 2024.2
urllib3 2.2.3
vine 5.1.0
wcwidth 0.2.13
whisper-timestamped 1.15.4
Overall, when I run the program independently, it works perfectly fine. But within Django, it terminates however I execute it. I thought the cause might be that I am loading long audio files, so I chunked them and ran transcribe.py from the user interface again; the result was the same: "Worker exited prematurely: signal 11 (SIGSEGV) Job: 0." I also tried increasing the worker's memory pool size, which didn't help. I am unsure what needs to be done to run transcribe.py within Django, since the usual methods are not working for me. I may have missed something, so please help me figure this out. Thank you for your time.
SIGSEGV is usually raised when a program tries to access memory it is not allowed to access. I re-created your code and it worked completely fine on my end. Here are the probable reasons why this happened to you:
I'll walk you through how I re-created your code; you may have made a typo or a small mistake that caused the error you mentioned.
django-admin startproject project101
cd project101
python3 manage.py startapp app101
project101/urls.py:
from django.contrib import admin
from django.urls import path, include
urlpatterns = [
    path('admin/', admin.site.urls),
    path('', include("app101.urls"))
]
project101/settings.py:
INSTALLED_APPS = [
    # ...
    'app101'
]
# put this at the end of settings.py
CELERY_BROKER_URL = 'redis://localhost:6379/0'
CELERY_BROKER_CONNECTION_RETRY_ON_STARTUP = True
project101/celery.py:
from __future__ import absolute_import, unicode_literals
import os
from celery import Celery

os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'project101.settings')
app = Celery('project101')
app.config_from_object('django.conf:settings', namespace='CELERY')
app.autodiscover_tasks()

# bind=True passes the task instance as `self`, which provides self.request.
@app.task(bind=True)
def debug_tasks(self):
    print(f"Request: {self.request!r}")
project101/__init__.py:
from __future__ import absolute_import, unicode_literals
from .celery import app as celery_app
__all__ = ('celery_app',)
app101/views.py:
from . import tasks
from django.shortcuts import render
from django.http import HttpResponse, JsonResponse

def trainVideos(request):
    try:
        # Queue the transcription on the Celery worker instead of running it in the request.
        tasks.run_transcription.delay()
        return JsonResponse({"status": "success", "message": "Transcription has started check back later."})
        # return render(request, 'embed.html', {'data': data})
    except Exception as e:
        # A view must return a response; without `return` this branch yields None.
        return JsonResponse({"status": "error", "message": str(e)})
app101/urls.py:
from django.urls import path, include
from . import views
urlpatterns = [
    path('transcribe', views.trainVideos)
]
app101/tasks.py:
from __future__ import absolute_import, unicode_literals
from . import transcribe
from celery import shared_task
@shared_task
def run_transcription():
    transcribe.transcribe()
    return "Transcription Completed..."
app101/transcribe.py:
import whisper_timestamped as whisper
import os

def transcribeTexts(model_id, audio_directory_path):
    result = []
    # List the files inside the audio directory itself.
    fileNames = os.listdir(audio_directory_path)
    model = whisper.load_model(model_id)
    for files in fileNames:
        print(files)
        audioPath = os.path.join(audio_directory_path, files)
        audio = whisper.load_audio(audioPath)
        result.append(model.transcribe(audio, language="en"))
    print(result)
    return result

def transcribe():
    model_id = "tiny"
    # Relative path: resolved against the directory the worker is started from.
    audio_directory_path = 'audio_sample'
    transcribeTexts(model_id, audio_directory_path)
Note that audio_sample is a folder outside app101, at the same level as app101 and project101. You could put it somewhere else, but make sure to specify the correct directory path. I've added the directory structure below.
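Since a wrong directory path is an easy mistake to make here, below is a minimal sketch of how the audio files could be collected more defensively. The helper name list_audio_files and the AUDIO_EXTENSIONS set are my own assumptions for illustration, not part of the original code; the idea is only to fail early on a missing folder and to skip non-audio files such as .DS_Store.

```python
from pathlib import Path

# Assumption: only these extensions are treated as audio; adjust as needed.
AUDIO_EXTENSIONS = {".wav", ".mp3"}


def list_audio_files(directory):
    """Return the audio files in `directory`, sorted by path.

    Hypothetical helper: raises early if the folder does not exist,
    and ignores sub-folders and files with other extensions.
    """
    directory = Path(directory).resolve()
    if not directory.is_dir():
        raise FileNotFoundError(f"Audio directory not found: {directory}")
    return sorted(
        path
        for path in directory.iterdir()
        if path.is_file() and path.suffix.lower() in AUDIO_EXTENSIONS
    )
```

Inside transcribe.py you could then loop over list_audio_files(audio_directory_path) and pass str(path) to whisper.load_audio. If you want the path to be independent of the directory the worker is started from, one option is to build it from the module location, e.g. Path(__file__).resolve().parent.parent / "audio_sample".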
.
├── app101
│   ├── admin.py
│   ├── apps.py
│   ├── __init__.py
│   ├── migrations
│   ├── models.py
│   ├── __pycache__
│   ├── tasks.py
│   ├── tests.py
│   ├── transcribe.py
│   ├── urls.py
│   └── views.py
├── audio_sample
│   └── some_audio.mp3
├── db.sqlite3
├── manage.py
└── project101
    ├── asgi.py
    ├── celery.py
    ├── __init__.py
    ├── __pycache__
    ├── settings.py
    ├── urls.py
    └── wsgi.py
After this, run the following commands on separate terminals:
python3 manage.py runserver
celery -A project101 worker --pool=solo -l info
This should get your project up and running. To test, send a GET request to http://localhost:8000/transcribe or simply open it in your browser.
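If you prefer to test from a script instead of the browser, here is a small sketch using only the standard library. The function name trigger_transcription is hypothetical, and it assumes the development server is listening on localhost:8000 and that the view returns JSON as in views.py above.

```python
import json
from urllib.request import urlopen


def trigger_transcription(url="http://localhost:8000/transcribe", timeout=10):
    """Send a GET request to the transcribe view and return the decoded JSON body."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling trigger_transcription() while both the Django server and the Celery worker are running should return the status dictionary from the view, and the worker terminal should then log that the task was received.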
Note the following: