Deadlock in IvasModeRunner.py
Commit SHA to replicate: ca31cb52 (for example)
While developing the CI system, and the smoke tests in particular, we found that IvasModeRunner.py can end up in a deadlock. This has probably been a very rare occurrence in previous use, but the short inputs of the smoke tests combined with high concurrency (over 50 threads) on some runners have exposed it.
The problem can be reproduced at least on Ubuntu 18.04 with Python 3.8.10, using the following edited command from smoke_test.sh:
./scripts/runIvasCodec.py -p ./scripts/config/ci_linux.json -m $list -U 1 -t 200 | tee smoke_test_output.txt
Raising the thread count above 200 would probably trigger the deadlock even more reliably, but 200 already ended in a deadlock almost every time on our platform. With our default thread count of 16, it took somewhere between 50 and 100 runs to hit the deadlock.
The problem seems to be limited to the normal encoder-decoder run, as a run with the --decoder_only option did not deadlock.
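For low thread counts, where many runs are needed, the repetition can be automated. The following is only a sketch: the function name first_hanging_run, the timeout value and the stand-in command are assumptions for illustration and are not part of the repository. The real runIvasCodec.py invocation would be substituted for demo_cmd, with a timeout well above the normal run time. Note that subprocess.run only kills the direct child on timeout, so any encoder or decoder processes it spawned may need separate cleanup.

```python
import subprocess
import sys


def first_hanging_run(cmd, max_runs=100, timeout_s=600):
    """Run cmd repeatedly; return the 1-based index of the first run
    that exceeds timeout_s seconds, or None if every run finished."""
    for attempt in range(1, max_runs + 1):
        try:
            subprocess.run(
                cmd,
                timeout=timeout_s,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            )
        except subprocess.TimeoutExpired:
            # The child was killed after the timeout: treat it as a hang.
            return attempt
    return None


# Hypothetical stand-in that always outlives the timeout; replace it with
# the real command line, split into a list of arguments.
demo_cmd = [sys.executable, "-c", "import time; time.sleep(5)"]
print(first_hanging_run(demo_cmd, max_runs=3, timeout_s=0.5))
```

A timeout only shows that a run took too long, so a hit should still be confirmed with a stack trace as done below.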
A GDB trace from the deadlocked process gives this:
(gdb) py-bt
Traceback (most recent call first):
File "/usr/lib/python3.8/threading.py", line 1027, in _wait_for_tstate_lock
elif lock.acquire(block, timeout):
File "/usr/lib/python3.8/threading.py", line 1011, in join
self._wait_for_tstate_lock()
File "/usr/lib/python3.8/concurrent/futures/thread.py", line 236, in shutdown
t.join()
File "/usr/lib/python3.8/concurrent/futures/_base.py", line 644, in __exit__
self.shutdown(wait=True)
File "/home/gitlab-runner/debugtemp/ivas-pc-repo/scripts/pyivastest/IvasModeRunner.py", line 2632, in run_enc_dec_threads
File "/home/gitlab-runner/debugtemp/ivas-pc-repo/scripts/pyivastest/IvasModeRunner.py", line 1513, in run
self.run_enc_dec_threads()
File "./scripts/runIvasCodec.py", line 908, in run
File "./scripts/runIvasCodec.py", line 661, in <module>
Interrupting the process gives this trace:
File "./scripts/runIvasCodec.py", line 149, in <module>
script.run()
File "./scripts/runIvasCodec.py", line 140, in run
runner.run()
File "/home/gitlab-runner/debugtemp/ivas-pc-repo/scripts/pyivastest/IvasModeRunner.py", line 1513, in run
self.run_enc_dec_threads()
File "/home/gitlab-runner/debugtemp/ivas-pc-repo/scripts/pyivastest/IvasModeRunner.py", line 1608, in run_enc_dec_threads
self.dec_queue["condition"].release()
File "/usr/lib/python3.8/concurrent/futures/_base.py", line 644, in __exit__
self.shutdown(wait=True)
File "/usr/lib/python3.8/concurrent/futures/thread.py", line 236, in shutdown
t.join()
File "/usr/lib/python3.8/threading.py", line 1011, in join
self._wait_for_tstate_lock()
File "/usr/lib/python3.8/threading.py", line 1027, in _wait_for_tstate_lock
elif lock.acquire(block, timeout):
Looking at the code, the deadlock appears to be related to the lock self.dec_queue["condition"], which is acquired on line 1596 of IvasModeRunner.py, and the break on line 1600 of the same file. My guess is that the lock should at least be released before that break. However, when I tested adding the release, the progress output came out wrong, so more work is needed to fix this properly.
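The suspected pattern can be illustrated in isolation. This is a minimal sketch, not the actual code of IvasModeRunner.py: make_queue, drain_leaky, drain_safe and lock_is_free_after are hypothetical names, and the queue is reduced to a dict holding a threading.Condition and a list. It only shows that a break taken while the condition is held leaves the lock locked for every other thread, and that a with block releases it on every exit path. It does not address the progress output problem mentioned above.

```python
import threading


def make_queue(items):
    # Hypothetical stand-in for a dict-based queue guarded by a Condition.
    return {"condition": threading.Condition(), "items": list(items)}


def drain_leaky(queue):
    # Suspected pattern: the early exit skips the release().
    while True:
        queue["condition"].acquire()
        if not queue["items"]:
            break  # the lock is still held when the loop ends
        queue["items"].pop()
        queue["condition"].release()


def drain_safe(queue):
    # 'with' releases the lock on every exit path, including break.
    while True:
        with queue["condition"]:
            if not queue["items"]:
                break
            queue["items"].pop()


def lock_is_free_after(drain, items):
    # Run drain in a worker thread, then probe the lock from this thread.
    queue = make_queue(items)
    worker = threading.Thread(target=drain, args=(queue,))
    worker.start()
    worker.join()
    free = queue["condition"].acquire(blocking=False)
    if free:
        queue["condition"].release()
    return free


print("leaky:", lock_is_free_after(drain_leaky, [1, 2, 3]))
print("safe: ", lock_is_free_after(drain_safe, [1, 2, 3]))
```

In the leaky variant, any other thread that later calls acquire() on the same condition without a timeout blocks forever, which would be consistent with the executor's shutdown hanging in t.join() in the traces above.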