Make killable threads more robust to race condition#740
Conversation
There is a very small chance that a killable thread will end on it's own before it is killed. Make us robust to this race condition by not raising an error if the thread is no longer alive.
|
Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please visit https://cla.developers.google.com/ to sign. Once you've signed (or fixed any issues), please reply here (e.g. What to do if you already signed the CLAIndividual signers
Corporate signers
|
|
I signed it! |
openhtf/util/threads.py
Outdated
| @@ -95,7 +95,7 @@ def async_raise(self, exc_type): | |||
| assert self.is_alive(), 'Only running threads have a thread identity' | |||
There was a problem hiding this comment.
Makes me wonder if this assert belongs here. Maybe assert self.ident is not None is appropriate, seems like that's actually guaranteed to be true, and not subject to a race condition.
assert self.ident is not None, 'Thread is missing identifier since it was never started.'
| ctypes.c_long(self.ident), ctypes.py_object(exc_type)) | ||
| if result == 0: | ||
| if result == 0 and self.is_alive(): | ||
| raise ValueError('Thread ID was invalid.', self.ident) |
There was a problem hiding this comment.
Some interesting notes in the documentation of PyThreadState_SetAsyncExc:
To prevent naive misuse, you must write your own C extension to call this. Must be called with the GIL held.
I wonder what's the worst that could happen? Your change already seems like an improvement--I can't tell in what situation this ValueError would ever be raised anymore. The only loophole I've thought of is that since thread identifiers can be “recycled,” I wonder if there's a very small chance that we end up killing some arbitrary other thread that was not meant to be killed.
There was a problem hiding this comment.
Yeah, I think the C extension that holds the GIL is the only way to completely eliminate possibility of a race condition here. I don't want to take that on now though but did add a note in the class docstring.
From documentation of threading.Thread on ident:
The ‘thread identifier’ of this thread or None if the thread has not been started. This is a nonzero integer. See the thread.get_ident() function. Thread identifiers may be recycled when a thread exits and another thread is created. The identifier is available even after the thread has exited.
So the assertion message is a little misleading. Since it's not running threads but started threads that have an ident. I changed the message.
I also put a check on is_alive before calling PyThreadState_SetAsyncExc. This might be a bit of overkill considering we also check on result == 0 and self.is_alive(), but it seems to me that it would be better to not call at all if the thread is not alive.
There was a problem hiding this comment.
The GIL is held when running Python code, if you want to ensure that it's held throughout a function, you can temporarily set the check interval (sys.setcheckinterval) to a very large number. That will ensure that Python doesn't even attempt to switch out the GIL, since the check interval is how often the interpreter checks things, like if other threads are ready to run.
https://docs.python.org/2/c-api/init.html#thread-state-and-the-global-interpreter-lock
There was a problem hiding this comment.
As far as I can tell, @fahhem's solution should work. But I know there are also some concerns about using sys.setcheckinterval in this way. This article, A Neat Python Hack? No, Broken Code, points out that code in the critical section may be releasing the GIL unbeknownst to you. This doesn't seem to be an issue in our case, if we're completely sure about what ctypes.pythonapi.PyThreadState_SetAsyncExc is doing (which I'm not).
Pending an airtight solution, I'm fine with @kdsudac's existing improvement, which I believe leaves only the remote, probably-never-gonna-happen possibility of killing the wrong thread
| if result == 0 and self.is_alive(): | ||
| raise ValueError('Thread ID was invalid.', self.ident) | ||
| elif result != 1: | ||
| # Something bad happened, call with a NULL exception to undo. |
There was a problem hiding this comment.
From the PyThreadState_SetAsyncExc documentation:
Returns the number of thread states modified; this is normally one, but will be zero if the thread id isn’t found.
Not sure when this would happen. But SystemError doesn't seem correct. Maybe RuntimeError?
There was a problem hiding this comment.
Sure changed it to a RuntimeError.
Also noticed that result=0 would go into the elif, so change to greater than.
|
CLAs look good, thanks! |
openhtf/util/threads.py
Outdated
| Based on recipe available at http://tomerfiliba.com/recipes/Thread2/ | ||
|
|
||
| Important Note: | ||
| From PyThreadState_SetAsyncExc documentation: "To prevent naive misuse, |
There was a problem hiding this comment.
The naive misuse comment here was likely from before the ctypes.pythonapi module was created, or it was an intentional red herring. The GIL is held when running Python, so what they really should have said is that the GIL must be held from when the thread is checked to when it's async-raised, which setcheckinterval can fix.
There was a problem hiding this comment.
Let's be more specific in this note, because as Farz points out, it's not all relevant/correct. I'd say something like:
Note: To fully address race conditions involved with the use of PyThreadState_SetAsyncExc, the GIL must be held from when the thread is checked to when it's async-raised. In this case, we're not doing that, and there remains the remote possibility that a thread identifier is reused and we accidentally kill the wrong thread.
openhtf/util/threads.py
Outdated
| """Raise the exception.""" | ||
| assert self.is_alive(), 'Only running threads have a thread identity' | ||
| # Should only be called on a started thread so raise otherwise | ||
| assert self.ident is not None, 'Only started threads have a thread identity' |
There was a problem hiding this comment.
“identity” --> “identifier”
| # If the thread has died we don't want to raise an exception so log. | ||
| if not self.is_alive(): | ||
| _LOG.debug('Not raising %s because thread %s (%s) is not alive', | ||
| exc_type, self.name, self.ident) |
There was a problem hiding this comment.
needs return here
I don't think we should print self.ident since as far as the documentation indicates it's a Python-specific magic cookie to be used e.g. to index a dictionary of thread-specific data. I notice this is the only file where it is used. self.name should be sufficient.
There was a problem hiding this comment.
it's printed out in case it is special or odd. I think it can be None for threads that aren't started yet
There was a problem hiding this comment.
The none-check is on the previous line, I'm not sure if there are other odd cases
There was a problem hiding this comment.
Added return.
As farz pointed out, including ident could potentially be useful debugging information and there is no harm to including it at debug level.
| # Should only be called on a started thread so raise otherwise | ||
| assert self.ident is not None, 'Only started threads have a thread identity' | ||
|
|
||
| # If the thread has died we don't want to raise an exception so log. |
There was a problem hiding this comment.
I don't think calling PyThreadState_SetAsyncExc on a thread that is already dead is an issue—it seems to happily return 0 in that case.
The reason I can see why we might want to make this check is to minimize the chance of accidentally killing some other thread, so that's what I would mention in the comment.
There was a problem hiding this comment.
FYI, idents are large (64 bit ints on 64-bit machines) so rollover/reuse is unlikely. You'd have to spawn 2^64 threads, and kill 2^(50+) before it rolls over since you can't have that many running anyway
There was a problem hiding this comment.
for i in xrange(10):
t = threading.Thread(target=infinite_loop)
t.setDaemon(1)
t.start()
print t.ident
Returns monotonically increasing IDs, spanning a range of 38 million, for just 10 threads. I agree it's unlikely, hopefully virtually impossible, but not sure about the implementation details.
| result = ctypes.pythonapi.PyThreadState_SetAsyncExc( | ||
| ctypes.c_long(self.ident), ctypes.py_object(exc_type)) | ||
| if result == 0: | ||
| if result == 0 and self.is_alive(): |
There was a problem hiding this comment.
Seems like you actually got bit by this before so I'd leave a comment here so it's emphasized that this isn't redundant with the earlier check: # Do not raise an error unnecessarily if the thread is dead.
| ctypes.c_long(self.ident), ctypes.py_object(exc_type)) | ||
| if result == 0: | ||
| if result == 0 and self.is_alive(): | ||
| raise ValueError('Thread ID was invalid.', self.ident) |
There was a problem hiding this comment.
As far as I can tell, @fahhem's solution should work. But I know there are also some concerns about using sys.setcheckinterval in this way. This article, A Neat Python Hack? No, Broken Code, points out that code in the critical section may be releasing the GIL unbeknownst to you. This doesn't seem to be an issue in our case, if we're completely sure about what ctypes.pythonapi.PyThreadState_SetAsyncExc is doing (which I'm not).
Pending an airtight solution, I'm fine with @kdsudac's existing improvement, which I believe leaves only the remote, probably-never-gonna-happen possibility of killing the wrong thread
openhtf/util/threads.py
Outdated
| ctypes.pythonapi.PyThreadState_SetAsyncExc(self.ident, None) | ||
| raise SystemError('PyThreadState_SetAsyncExc failed.', self.ident) | ||
| raise RuntimeError('PyThreadState_SetAsyncExc %s failed: %s', | ||
| self.ident, result) |
There was a problem hiding this comment.
Technically should be something like below, since the command didn't “fail”
'Error: PyThreadState_SetAsyncExc on %s affected %d threads.' % (self.name, result)No one really knows when this case is reached.
As noted above I don't think we should print self.ident
There was a problem hiding this comment.
Used your text, I'm including all the information we have, thread name, exc_type, thread name and result. These failure conditions are rare so somewhat hard to replicate so we should acquire as much debugging info as possible.
openhtf/util/threads.py
Outdated
| Based on recipe available at http://tomerfiliba.com/recipes/Thread2/ | ||
|
|
||
| Important Note: | ||
| From PyThreadState_SetAsyncExc documentation: "To prevent naive misuse, |
There was a problem hiding this comment.
Let's be more specific in this note, because as Farz points out, it's not all relevant/correct. I'd say something like:
Note: To fully address race conditions involved with the use of PyThreadState_SetAsyncExc, the GIL must be held from when the thread is checked to when it's async-raised. In this case, we're not doing that, and there remains the remote possibility that a thread identifier is reused and we accidentally kill the wrong thread.
|
Believe, I addressed all of your concerns. Main goal of this PR is to 1) make KillableThreads more robust to case of an exception being raised in a dead thread; 2) Get additional debugging information for those rare cases where a problem is detected. Follow-up work to improve or document limitations on KillableThreads may be necessary. |
There is a very small chance that a killable thread will end on it's own before it is killed. Make us robust to this race condition by not raising an error if the thread is no longer alive.
This change is