-
-
Notifications
You must be signed in to change notification settings - Fork 198
Feature/faster ad tls v4 #1171
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature/faster ad tls v4 #1171
Conversation
This reverts commit 883d781.
…ying on compiler optimizations
|
To recap, there's a bit of a discrepancy between the microbenchmark and the end-to-end results. We still don't know the cause of that and it would make sense for us to have some explanation as to why that could possibly happen. It would be ideal if we had some automated process to run the benchmarks across the 3 OSes (2 for this particular PR) to make sure that it is indeed faster. At this level of optimization of code, benchmarking results don't translate across OSes, across different compilers, and even across different compiler versions (including minor). Since the current proposed implementation uses C preprocessor macros, we should try to make it as clear as possible prior to merging. I believe we've all come to the agreement that given the potential improvement to speed, we can deal with the non-standard (relative to Math) code, but we should take maintenance into serious consideration. If you haven't seen our style guide, it's here: Coding Style and Idioms (github wiki). Just FYI, we're doing what the Google Style Guide explicitly says not to do. This is from the Preprocessor Macros section:
And further down:
This code is a lot better than the first and second versions, but I think there are ways we can make it so it's really ok to pull in. If possible, we should try to get the code designed that way. These things tend not to be fixed once in. And that in turn leads to it being difficult for it to be fixed in the future. I'm trying to prevent this bit of code from locking us in and slowing down future development. Especially as we're moving towards parallelization and requiring more fine-grained control over the autodiff stack. |
|
Are we really going down the route of quoting SOPs and all that? I don't hope so! We are all skilled enough and have a good enough sense for what's good and what not. Wrt. to the ifdefs: My understanding was we are fine with the ifdefs we use right now. These should be used sparingly, of course. However, we have in our own codebase a few exceptions to that ( Did you have anything specific in mind for the ifdefs? A simplification I could think of is like Wrt. to the benchmarks: We have to my knowledge a single signal which is the MacPro microbenchmark only. It would be good to understand this case. The information missing so far for me is what OS and what exact CPU is in that machine. I do think that we should try to test only on the latest OS on Mac, Mojave. The rationale is that Mojave should have the best fixes for those Intel bugs Spectre and Meltdown. Moreover, we need a common base to compare things if possible to further reduce the combinatorics. Also: I can't debug this at all! I just don't have such a machine. For reference, the What would ease testing a little bit is to add the warfarin The only design which I haven't yet discussed to my knowledge is the way the life-time of the AD tape is managed. Right now the first instance ever created of Now, what this PR also shows is that the Windows gcc 4.9.3 struggles with a non-pointer design. The behavior I see suggests that not using pointers tickles bugs with this compiler. I don't have a better explanation. Thus going to all-pointer solves this. So here is what is left from my perspective:
@syclik There is one thing weird in the result you report: When you go on linux from 1 to 2 threads the time increases... is that a mistake? It does not make sense to me. Also: What time do you report? We should use the wall-clock time as reported by @seantalts Right now I am getting a test failure (in the |
|
@rok-cesnovar NP... waiting a bit and then merging develop again is something I can handle. |
|
You could also merge this branch in if you wanted to get it testing on Jenkins sooner, shouldn't trigger any merge conflicts or anything. branch is |
|
Thanks! Now I feel like we’re actually moving the discussion to the
technical things we need to iron out.
I’ll respond when I have a chance later today.
…On Fri, Mar 22, 2019 at 5:41 AM wds15 ***@***.***> wrote:
Are we really going down the route of quoting SOPs and all that? I don't
hope so! We are all skilled enough and have a good enough sense for what's
good and what not.
Wrt. to the ifdefs: My understanding was we are fine with the ifdefs we
use right now. These should be used sparingly, of course. However, we have
in our own codebase a few exceptions to that (__WIN32__ or __GNUC__, for
example). All ifdefs we use right have been agreed by us - we anyway do
STAN_THREADS and __GNUC__ was waived as well for good reasons.
Did you have anything specific in mind for the ifdefs? A simplification I
could think of is like
// at the top of autodiffstackstorage we do:
#ifdef STAN_THREADS
#ifdef __GNUC__
#define STAN_THREADS_DEF __thread
#else
#define STAN_THREADS_DEF thread_local
#endif
#else
// empty def for no-threading case
#define STAN_THREADS_DEF
#endif
// then the definitions are done with
static STAN_THREADS_DEF AutodiffStackStorage *instance_;
Wrt. to the benchmarks: We have to my knowledge a single signal which is
the MacPro microbenchmark only. It would be good to understand this case.
The information missing so far for me is what OS and what exact CPU is in
that machine. I do think that we should try to test only on the latest OS
on Mac, Mojave. The rationale is that Mojave should have the best fixes for
those Intel bugs Spectre and Meltdown. Moreover, we need a common base to
compare things if possible to further reduce the combinatorics. Also: I
can't debug this at all! I just don't have such a machine. For reference,
the compare-pointer.sh script gives on my machine with clang++-6.0 from
macports (30 replications):
// develop, no STAN_THREADS
benchmark_autodiff_stack_mean 10349 ns 10346 ns 65401
benchmark_autodiff_stack_median 10305 ns 10304 ns 65401
benchmark_autodiff_stack_stddev 160 ns 157 ns 65401
// faster ad tls v4, no STAN_THREADS
benchmark_autodiff_stack_mean 10852 ns 10847 ns 63979
benchmark_autodiff_stack_median 10756 ns 10753 ns 63979
benchmark_autodiff_stack_stddev 320 ns 313 ns 63979
What would ease testing a little bit is to add the warfarin map_rect
example to the performance benchmark suite, I think. This is a good
end-to-end test for a map_rect program which we don't have yet. @seantalts
<https://github.com/seantalts> Is that an option? Would this help us?
The only design which I haven't yet discussed to my knowledge is the way
the life-time of the AD tape is managed. Right now the first instance ever
created of ChainableStack will own the instance. This will be for the
main process most likely the instance created in
stan/math/rev/core/init_chainablestack.hpp. For child-threads the first
instance of this class created will own the AD tape and thus tear it down.
An alternative solution could be the use of a smart-pointer type of thing.
So one would introduce a ref-count to the AD tape and only delete the
instance once the ref count goes to zero. I went with the simpler approach
as it does the job just fine and we can adapt that at a later stage if
needed as I would see this as a very internal design choice irrelevant to
any user (given we only support threading right now through map_rect).
Now, what this PR also shows is that the Windows gcc 4.9.3 struggles with
a non-pointer design. The behavior I see suggests that not using pointers
tickles bugs with this compiler. I don't have a better explanation. Thus
going to all-pointer solves this.
So here is what is left from my perspective:
1. Understand the MacPro microbenchmark thing - who can handle this? I
don't have one.
2. Potentially improve ifdef logic...one suggestion is above outlined
- comments on the suggestion are welcome? Other ideas?
3. Settle on the design of first ownership vs smart pointer thing -
opinions?
4. Run end-to-end benchmark on the final thing once more - maybe via
performance cmdstan suite?
@syclik <https://github.com/syclik> There is one thing weird in the
result you report: When you go on linux from 1 to 2 threads the time
*increases*... is that a mistake? It does not make sense to me. Also:
What time do you report? We should use the wall-clock time as reported by
time, which is the first line which starts with real from time. The
reported run-time from cmdstan is the total CPU time which is not what we
want.
@seantalts <https://github.com/seantalts> Right now I am getting a test
failure (in the Full unit with GPU) which should not happen and is not
related to this PR, ideas/suggestions?
clang++-6.0 -Werror -std=c++1y -Wno-sign-compare -I lib/opencl_1.2.8 -O3 -I . -I lib/eigen_3.3.3 -I lib/boost_1.69.0 -I lib/sundials_4.1.0/include -I lib/gtest_1.8.1/include -I lib/gtest_1.8.1 -I lib/gtest_1.8.1/include -I lib/gtest_1.8.1 -DBOOST_RESULT_OF_USE_TR1 -DBOOST_NO_DECLTYPE -DBOOST_DISABLE_ASSERTS -DBOOST_PHOENIX_NO_VARIADIC_EXPRESSION -DSTAN_OPENCL -DOPENCL_DEVICE_ID=0 -DOPENCL_PLATFORM_ID=0 -DCL_USE_DEPRECATED_OPENCL_1_2_APIS -DGTEST_USE_OWN_TR1_TUPLE -DGTEST_USE_OWN_TR1_TUPLE -c -o test/unit/math/opencl/sub_block_test.o test/unit/math/opencl/sub_block_test.cpp
In file included from test/unit/math/opencl/opencl_context_test.cpp:2:
./stan/math/opencl/opencl_context.hpp:43:3: error: use of undeclared identifier 'assert'
assert(values.size() == len);
^
1 error generated.
<builtin>: recipe for target 'test/unit/math/opencl/opencl_context_test.o' failed
make: *** [test/unit/math/opencl/opencl_context_test.o] Error 1
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#1171 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AAZ_F8Blp70_WUVww_Ds8-vo1WHeje2tks5vZKUxgaJpZM4cChJ9>
.
|
|
On Mar 22, 2019, at 12:52 AM, Daniel Lee ***@***.***> wrote:
To recap, there's a bit of a discrepancy between the microbenchmark and the end-to-end results. We still don't know the cause of that and it would make sense for us to have some explanation as to why that could possibly happen.
The end-to-end results are the ones that matter. This just indicates to me that there's a conceptual error in the way we're microbenchmarking or we're microbenchmarking something that doesn't take up a lot of CPU time to start with.
It would be ideal if we had some automated process to run the benchmarks across the 3 OSes (2 for this particular PR) to make sure that it is indeed faster. At this level of optimization of code, benchmarking results don't translate across OSes, across different compilers, and even across different compiler versions (including minor).
I agree. We can't cover all possibilities, but it'd be nice to have some sanity checks in place so PRs like this one can be expedited.
Since the current proposed implementation uses C preprocessor macros, we should try to make it as clear as possible prior to merging. I believe we've all come to the agreement that given the potential improvement to speed, we can deal with the non-standard (relative to Math) code, but we should take maintenance into serious consideration.
Sorry, I missed what's non-standard here. We use preprocessor macros now for thread safety and for assembler hints with (un)likely.
If you haven't seen our style guide, it's here: Coding Style and Idioms (github wiki). Just FYI, we're doing what the Google Style Guide explicitly says not to do. This is from the Preprocessor Macros section:
Instead of using a macro to conditionally compile code ... well, don't do that at all (except, of course, for the #define guards to prevent double inclusion of header files). It makes testing much more difficult.
Yet you see Eigen and Boost do it all over the place, so I don't think it should be completely banned.
And further down:
But before using a macro, consider carefully whether there's a non-macro way to achieve the same result. If you need to use a macro to define an interface, contact your project leads to request a waiver of this rule.
Is there a non-macro alternative?
This code is a lot better than the first and second versions, but I think there are ways we can make it so it's really ok to pull in.
Specifics would help here to lay out requirements for being "OK to pull in". I think part of the problem we're having is that the requirements are vague. I know it's impossible to have an automated way to do this, but some more guidance would be really helpful for developers.
If possible, we should try to get the code designed that way.
Sorry, I didn't catch the reference of "that way". What way are you suggesting?
These things tend not to be fixed once in.
Saying "fixed" implies there's something broken to begin with. What specifically do you think is broken? Is it just the use of macros or something deeper?
And that in turn leads to it being difficult for it to be fixed in the future.
I think we have to guard against headaches in the future, but the future is uncertain, so there's only so much we can do to guard against problems.
I'm trying to prevent this bit of code from locking us in and slowing down future development. Especially as we're moving towards parallelization and requiring more fine-grained control over the autodiff stack.
What needs to happen to move you from preventing the code from getting in to accepting it?
|
|
Just wanted to leave some notes quickly, there are non-macro alternatives:
But honestly, I'm ok with the macro approach if we cleaned it up a bit. By that, I mean minimizing the number of places where we're switching between the different branches, so it's really an API and two different implementations that are switched based on the preprocessor variable value. We should really doc what that API is. One easy thing to do within autodiffstackstorage is to actually move all the things that are conditional on the I don't know about anyone else, but I find this harder to read than |
…ndence of STAN_THREADS & compiler extension supported
|
I tried to address your points, @syclik... I think this improves the readability (I have seen such a pattern in other libraries as well, e.g. boost or the Intel TBB do these kind of moves if that's worth anything). |
|
Thanks! That looks a lot cleaner!
…On Fri, Mar 22, 2019 at 2:37 PM wds15 ***@***.***> wrote:
I tried to address your points, @syclik <https://github.com/syclik>... I
think this improves the readability (I have seen such a pattern in other
libraries as well, e.g. boost or the Intel TBB do these kind of moves if
that's worth anything).
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#1171 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AAZ_F9PgWWdSS4WwaaByOyX_r6wE3cRdks5vZSLZgaJpZM4cChJ9>
.
|
|
I like what @wds15 did a lot! re: the examples in your post, @syclik - I find it easier to read when the differences are highlighted for me. But perhaps the @wds15 approach combines the best of both worlds - just give everything a preprocessor def and then use those in the code wherever there would be a difference. Just like regular programming but with ifdefs :) |
|
Complete agree. I think that makes it a lot easier to read. (If I were to
be nit picky, I'd pick a different variable name that doesn't have "THREAD"
in the name, but I'm ok with this.)
…On Fri, Mar 22, 2019 at 4:13 PM seantalts ***@***.***> wrote:
I like what @wds15 <https://github.com/wds15> did a lot! re: the examples
in your post, @syclik <https://github.com/syclik> - I find it easier to
read when the differences are highlighted for me. But perhaps the @wds15
<https://github.com/wds15> approach combines the best of both worlds -
just give everything a preprocessor def and then use those in the code
wherever there would be a difference. Just like regular programming but
with ifdefs :)
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#1171 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AAZ_F0Wlmw14uaJoss16SZ9Cpvvhy4mqks5vZTlkgaJpZM4cChJ9>
.
|
|
@rok-cesnovar Weird...I haven't done anything wrt to the assert, but now it just passes. I like if problems go away without any effort, but that is odd. |
|
Yeah, its pretty random, since it passed both the PR tests as well as the tests on develop and only showed up on this PR. #1172 is finished, it just needs a re-approval after the last changes, so we wont have to deal with that anymore. |
| #ifndef STAN_MATH_PRIM_MAT_FUNCTOR_MAP_RECT_HPP | ||
| #define STAN_MATH_PRIM_MAT_FUNCTOR_MAP_RECT_HPP | ||
|
|
||
| #include <stan/math/prim/scal/meta/return_type.hpp> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[minor]
Alphabetize includes. Move this down a few lines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure where to move given your comments. I order these things by dependency scal/arr/mat and then alphabetically.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In prim, these should be strictly alphabetical. I know our codebase hasn't caught up yet, but that's what it should be. For rev, you'll want to bring in the core.hpp first, then everything else in alphabetical order (irrespective of scal/arr/mat).
The files that shouldn't follow this: the big includes like stan/math/prim/mat.hpp. These aren't because they need to include traits in the right order.
| namespace math { | ||
|
|
||
| /** | ||
| * Provides a thread_local singleton if needed. Read warnings below! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Description line should be updated. This struct now always provides access to the stack using the singleton pattern.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hope what I changed is fine.
| namespace math { | ||
| namespace { | ||
|
|
||
| ChainableStack global_stack_instance_init; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add some doc. Just enough to let the user know that this is here for a reason.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
| * thread_local. If there is no loss in performance, we can remove | ||
| * this ifdef. | ||
| * The initialzation of the AD instance is handled by the lifetime of | ||
| * a AutodiffStackSingleton object. More specifically, the first |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This ownership point is important, but it's not as clearly linked to the code (no mention of own_instance_ here) as I think it should be.
If I understand this correctly, the "first instance" if there are no threads is the global one defined in init_chainablestack.hpp?
And if there are threads, this doc doesn't seem right... we don't need to call init() anymore?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really need to link to private members of the class in the doc if the behavior is described? Well, I added some refs now.
Your understanding is correct... but the one in init_chainablestack.hpp does not have to be the first one if the C++ user wants more control over things and links things together so that what is execute first changes, then the first instance will be another one.
I corrected the ill references to init.
|
@wds15, I'll get us set up so the benchmark runs on 3 platforms appropriately. |
@wds15, thanks for pointing that out. I'm reporting the ( Although the seeds across Mac and Linux were the same, the actual execution is different, so it's hard to compare across the two. What is important is that fixing the seed on a platform, we're computing exactly the same thing and so it's down to how fast we can compute that. You're right: going from 1 to 2 threads takes longer on Linux. That's fine... it happens under both develop and that version. It might be something we look into, but we don't have to do it here. But this is why we need to run it on all platforms with the compilers we've set. Although reason would say that confirming on one platform should be good enough, it turns out it's not. |
|
@syclik I still think that relative statements should hold most of the time and I would be good with knowing that things speedup only on Mac, for example. This should by all means translate to the other platforms as well. Since we do not have an automated platform I would rather save our time for doing all this manually. Now, if we would have an automated facility for running things on all platforms we want to look into ... that is preferable, of course; but until that does not exist we can really make decisions without that from my perspective. ... but you seem to be very into the benchmarks for this one - so I assume you run your code once more and do an update with the v4 version of this PR on the hash 13cf353, right? |
|
And I am actually relatively sure that in your Linux 2-thread runs something went wrong. I did run exactly this example with threading and MPI under Linux, see on page 10. There I am getting a ~1.2x speedup for threading. |
| #include <gtest/gtest.h> | ||
| #include <stan/math/fwd/core/fvar.hpp> | ||
| #include <stan/math/rev/core/var.hpp> | ||
| #include <stan/math/rev/core/init_chainablestack.hpp> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wds15, do you know why this test needs this when the rest don't? It looks like the includes for the test don't actually have a mix header in there; maybe that would fix it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mix headers probably fix it, yes (but I am not familiar with the mix stuff). It looks to me as if the test does not include the right includes, yes. I do not know mix stuff at all which is why I plugged in the init_chainablestack.hpp.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
�Ok. I made an educated guess what the includes should be... but have a look - I am not familiar with mix.
|
Thanks! I just sent you an email off-line. |
|
I can't believe it... we are again stopped at the very same test under Windows. This one breaks now on develop:
Any idea? Do we unmerge again? I have enabled to my best knowledge exactly this test and it did pass. |
|
That's not actually where it's stopping. I ran it from the windows box:
C:\Users\daniel\math>test\unit\math\rev\arr\functor\coupled_ode_system_test.exe
Running main() from lib/gtest_1.8.1/src/gtest_main.cc
[==========] Running 18 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 18 tests from StanAgradRevOde
[ RUN ] StanAgradRevOde.coupled_ode_system_dv
[ OK ] StanAgradRevOde.coupled_ode_system_dv (0 ms)
[ RUN ] StanAgradRevOde.decouple_states_dv
[ OK ] StanAgradRevOde.decouple_states_dv (0 ms)
[ RUN ] StanAgradRevOde.initial_state_dv
[ OK ] StanAgradRevOde.initial_state_dv (0 ms)
[ RUN ] StanAgradRevOde.size_dv
[ OK ] StanAgradRevOde.size_dv (0 ms)
[ RUN ] StanAgradRevOde.memory_recovery_dv
[ OK ] StanAgradRevOde.memory_recovery_dv (0 ms)
[ RUN ] StanAgradRevOde.memory_recovery_exception_dv
It stops there. There's definitely an error there. I don't know what it is
yet.
…On Mon, Apr 1, 2019 at 10:46 AM wds15 ***@***.***> wrote:
I can't believe it... we are again stopped at the very same test under
Windows. This one breaks now on develop:
StanAgradRevOde.memory_recovery_dv
Any idea? Do we unmerge again?
I have enabled to my best knowledge exactly this test and it did pass.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#1171 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AAZ_F1Afrsb-jAjKlTVPGZ5nGEpKVekHks5vchusgaJpZM4cChJ9>
.
|
|
There is something very odd going on... and I almost doubt that this PR is the root cause ... but as there is no other change it definitely looks like, sure. What really confuses me is that exactly this test was OK on the commit 1031685 . You can have a look here: The logs there do include exactly this test and show that it did run just fine... and this is what totally confuses me. Is there any difference in how the tests are run on develop vs the one I linked in above. Maybe @seantalts has an idea? |
|
There is definitely something very odd going on. I agree. If I run it from the python script, it runs fine. If I just run the executable, it stops. I have no idea what could cause that: python script: Running the exectuable: and it stops there. So far, this is repeatable. |
|
@wds15, bad news: this PR changes that behavior. |
|
I just went back and ran that exact test using those two commands (runTests and running the executable). No problems running both. |
|
Mind reverting the PR? We'll try to get to what's happening, but for now, let's get develop back into a working state. (I won't get back to it for at least a few hours.) |
|
Can we just push revert in this PR? Or do we need a new issue, new PR & another v5 PR for this one? What I still do not understand is that the test did pass our pipeline in the git hash I quoted above. Super weird. Maybe @ariddell or @ahartikainen have an idea? |
|
We have seen some exotic TLS bugs with specific versions of Mingw-w64. Is that the compiler that Windows is using? |
|
The tests which break are not using STAN_THREADS, but yes... this is the compiler mingw-w64 gcc 4.9.3.
(I am not saying this should be tried in develop which need to revert; just some ideas I would follow). |
|
Only for the record: The test failing now on Windows develop did pass here:
|
|
We've only encountered these unusual bugs with STAN_THREADS turned on,
so perhaps our experience isn't useful.
One thing that might be helpful: if you can get the Windows exit code
associated with the crash you can often learn something about the cause
of the crash. (This was new to me since you can't do this with Linux.)
…On 4/1/19 2:06 PM, wds15 wrote:
The tests which break are not using STAN_THREADS, but yes... this is the
compiler mingw-w64 gcc 4.9.3.
@syclik <https://github.com/syclik>
* Do we have multiple windows test machines and this tests fails on
one, but not the other maybe?
* Are u aware of differences in shells being used by python and what
you use on the command line?
* Could we try using the powershell?
* Could we try using mingw32-make to execute the tests (which uses a
different shell)?
(I am not saying this should be tried in develop which need to revert;
just some ideas I would follow).
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#1171 (comment)>, or
mute the thread
<https://github.com/notifications/unsubscribe-auth/AmFA1sZekxVbpBvsnQaxleZ4TBrhOWIpks5vckq-gaJpZM4cChJ9>.
|
|
I just reverted the merge and created a PR. |
|
I'll try to grab the windows exit code.
@wds15,
- I've only tried on one machine.
- sure. Are you thinking it's a permissions thing? Maybe Python subprocess
has a different shell environment?
- where is the other make from?
On Mon, Apr 1, 2019 at 2:20 PM riddell-stan <[email protected]>
wrote:
… We've only encountered these unusual bugs with STAN_THREADS turned on,
so perhaps our experience isn't useful.
One thing that might be helpful: if you can get the Windows exit code
associated with the crash you can often learn something about the cause
of the crash. (This was new to me since you can't do this with Linux.)
On 4/1/19 2:06 PM, wds15 wrote:
> The tests which break are not using STAN_THREADS, but yes... this is the
> compiler mingw-w64 gcc 4.9.3.
>
> @syclik <https://github.com/syclik>
>
> * Do we have multiple windows test machines and this tests fails on
> one, but not the other maybe?
> * Are u aware of differences in shells being used by python and what
> you use on the command line?
> * Could we try using the powershell?
> * Could we try using mingw32-make to execute the tests (which uses a
> different shell)?
>
> (I am not saying this should be tried in develop which need to revert;
> just some ideas I would follow).
>
> —
> You are receiving this because you commented.
> Reply to this email directly, view it on GitHub
> <#1171 (comment)>, or
> mute the thread
> <
https://github.com/notifications/unsubscribe-auth/AmFA1sZekxVbpBvsnQaxleZ4TBrhOWIpks5vckq-gaJpZM4cChJ9
>.
>
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#1171 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AAZ_F_h51d2hl9wnT2tvxPaagCd6EDteks5vck3WgaJpZM4cChJ9>
.
|
|
Also, we can build without make at all if you think it’s a build problem.
…On Tue, Apr 2, 2019 at 8:11 AM Daniel Lee ***@***.***> wrote:
I'll try to grab the windows exit code.
@wds15,
- I've only tried on one machine.
- sure. Are you thinking it's a permissions thing? Maybe Python subprocess
has a different shell environment?
- where is the other make from?
On Mon, Apr 1, 2019 at 2:20 PM riddell-stan ***@***.***>
wrote:
> We've only encountered these unusual bugs with STAN_THREADS turned on,
> so perhaps our experience isn't useful.
>
> One thing that might be helpful: if you can get the Windows exit code
> associated with the crash you can often learn something about the cause
> of the crash. (This was new to me since you can't do this with Linux.)
>
> On 4/1/19 2:06 PM, wds15 wrote:
> > The tests which break are not using STAN_THREADS, but yes... this is
> the
> > compiler mingw-w64 gcc 4.9.3.
> >
> > @syclik <https://github.com/syclik>
> >
> > * Do we have multiple windows test machines and this tests fails on
> > one, but not the other maybe?
> > * Are u aware of differences in shells being used by python and what
> > you use on the command line?
> > * Could we try using the powershell?
> > * Could we try using mingw32-make to execute the tests (which uses a
> > different shell)?
> >
> > (I am not saying this should be tried in develop which need to revert;
> > just some ideas I would follow).
> >
> > —
> > You are receiving this because you commented.
> > Reply to this email directly, view it on GitHub
> > <#1171 (comment)>,
> or
> > mute the thread
> > <
> https://github.com/notifications/unsubscribe-auth/AmFA1sZekxVbpBvsnQaxleZ4TBrhOWIpks5vckq-gaJpZM4cChJ9
> >.
> >
>
> —
> You are receiving this because you were mentioned.
> Reply to this email directly, view it on GitHub
> <#1171 (comment)>, or mute
> the thread
> <https://github.com/notifications/unsubscribe-auth/AAZ_F_h51d2hl9wnT2tvxPaagCd6EDteks5vck3WgaJpZM4cChJ9>
> .
>
|
|
That other make is part of RTools and is already on the machine. Everything builds on windows with it ... I am using it already in the TBB PR which I am still working on, but the TBB Windows build works great with mingw32-make (and one of the differences to the vanilla make is the shell being used). |
|
Maybe this is useful: http://mail.openjdk.java.net/pipermail/build-dev/2016-March/016783.html Which means to set these option during compilation with gcc: -fno-delete-null-pointer-checks and -fno-lifetime-dse ... this is wild guess. |
|
@wds15 and @ariddell: here's what I get: It looks like it's a permissions thing, but I'm not sure what that means. And why it's just for this test.
I'll use docker and spin one up this week.
No idea. I'm not that great with Python's internals, so I don't know what subprocess actually does at this level.
Short: it runs ok from powershell. It still does not work from the normal command prompt. I git cleaned from the powershell. I rebuilt the test. I ran the test from the powershell and it was ok. I ran the same executable from a normal terminal and it was not ok.
I don't understand what that means.
This didn't help. I added this to |
|
Exit code -1073741819 (0xC0000005) is, from what I have read, a memory access violation error. |
|
I think you can debug this sort of error with windbg or cdb attached to the faulting process. |
|
@syclik @wds15 is there a single command one can run to get the crash? If there is, perhaps we could find someone with some experience using the Windows debugger to find out more information? @ahartikainen have you used windgb or cdb before? |
|
Well, a unit test is failing under the cmd shell while the test works when called from python or from the PowerShell on Windows. On the other platforms the test works. I suspect a bug in some software, but I doubt that it is Stan related - gcc or googletest are good candidates here. We should try what happens without optimizations from the compiler or even change the test a bit. The test uses a recover memory in the tear down command. Maybe moving that to the startup command fixes the issue already. This is a really weird thing...and I hope very much we get some fix for it, but I am not sure if we will understand the root cause (would be nice if we get there, of course). If someone has a windows debugger at hand and working...that would be great. |
|
Yes, there is an example that can be run. If you want to run it directly: That will build first then run. If you want to use the script: This one is tricky. It might be as @wds15 suggested. Or not. My guess would be that assumptions about when objects are created and destroyed may not be valid, but that's just a guess at this moment until I can verify. |
|
I suspect it's a mingw specific error. But we are stuck with mingw, so I hope there is a workaround. And I suspect there is a workaround, if it's a mingw issue; many people are using TLS now. Supporting TLS on Windows is just about the most important thing for PyStan going forward. I'm tempted to find a Windows VPS (Azure?) and debug this myself. (Should this discussion be moved to the new PR?) |
|
I don't think it is. I think this has uncovered a bug. I'm still trying to
pin down the exact condition; I've spent more than a few hours debugging.
I'll ended a few more to get to something repeatable.
…On Wed, Apr 17, 2019 at 11:21 AM riddell-stan ***@***.***> wrote:
I suspect it's a mingw specific error. But we are stuck with mingw, so I
hope there is a workaround. And I suspect there *is* a workaround, if
it's a mingw issue; many people are using TLS now.
Supporting TLS on Windows is just about the most important thing for
PyStan going forward. I'm tempted to find a Windows VPS (Azure?) and debug
this myself.
(Should this discussion be moved to the new PR?)
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#1171 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AAZ_FwQM4SpX4FSYXQGjBzXAc_EdPL2Pks5vhzwNgaJpZM4cChJ9>
.
|
This PR is a continuation of the reverted one from this merged PR: #1135
This PR is derived from develop which I reverted to 883d781 as suggested by @seantalts
This PR speeds up the access to the AD tape whenever
thread_localstorage (TLS) is needed for the AD tape. In addition programs with TLS now work on Windows with gcc 4.9.3.Summary
This PR has the main goal to reduce the performance penalty implied by turning on thread capability of stan-math. As it turns out, the proposed implementation works out of the box on Windows with the gcc 4.9.3 compiler which is not the case with the current threading implementation in develop. So we fix a bug which affects many users given the popularity of Windows. The Windows bug is tickled in the commit a7bf778 on Jenkins and it is fixed in the current form of the PR.
The performance benefits of this PR are documented in the previous PR which served as a first prototype. In summary, we loose ~20% on develop when turning on threading while with this PR we reduce this to less than 3% on average.
The key change of this PR is to change how the global AD tape instance is declared and accessed. Currently in develop the global AD tape is declared
This PR changes this to a global pointer to the AD tape, which is a TLS for threading, and which is initialized to the constant
nullptrexpression at compile-time. Initialization to a constant expression at compile-time (as opposed to a dynamic expression right now in develop) allows the compiler to use the TLS global pointer more efficiently. Furthermore, thethread_localkeyword used to declare TLS can be swapped for the compiler-specific__threadkeyword whenever the expression is constant on gnu extension compatible compilers. The__threadkeyword has been around much longer than the C++11thread_localkeyword and only the__threadkeyword guarantees to the compiler the the variable is (in fact must be) initialized with a constant expression which allows the use of fast TLS code.The runtime initialisation of the AD tape for the main process is automatically achieved through the instantiation of a global object which manages the life-time of the AD stack (as performed in the new file
math/stan/math/rev/core/init_chainablestack.hpp).Side-effects
There is no change in behavior for programs whenever
STAN_THREADSis not defined. IfSTAN_THREADSis defined, then any C++ client library which uses the stan-math functions (map_rectin particular) will continue to work unchanged. However, C++ clients which use AD in child threads will stop working. The reason is that the AD tape requires initialization in child threads. The initialization of the AD tape can be done by constructing aChainableStackinstance which will trigger AD tape initialization for the thread. The first instantiation ofChainableStackin a given child thread will take ownership of the AD tape. Thus only the first created instance will free the AD tape resource upon destruction.Tests
test/unit/math/rev/core/thread_stack_instance_test.cppis a new test and is written to test matters with and without threading. The test is setup to test different things depending on enabling threading or not.STAN_THREADS NOT defined (no threading). In this case we only have a global AD tape. All child-threads created by a given program will see the same AD tape. Thus initializing the AD tape on the main thread implies that it is initialized on the child-thread, etc.
STAN_THREADS IS defined (threading case). Now all new threads created must see their own TLS of the AD tape pointer which is not initialized at the beginning and is independent of the main thread AD tape when performing AD operations.
In addition I have added to the
map_rect_concurrenttests code which enforces that these tests are always run with 4 threads regardless of theSTAN_NUM_THREADSdefinition (which is often not set and this things do not run threaded during testing).Checklist
Math issue Thread performance penalty #1102 & map_rect with threading broken on windows #1134
Copyright holder: Sebastian Weber
The copyright holder is typically you or your assignee, such as a university or company. By submitting this pull request, the copyright holder is agreeing to the license the submitted work under the following licenses:
- Code: BSD 3-clause (https://opensource.org/licenses/BSD-3-Clause)
- Documentation: CC-BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
the basic tests are passing
./runTests.py test/unit)make test-headers)make doxygen)make cpplint)the code is written in idiomatic C++ and changes are documented in the doxygen
the new changes are tested
Remaining TODO
End-to-end performance results
@syclik ran @wds15's benchmark on Mac with clang++ 6 and Linux with g++ 5 using these scripts: syclik/benchmark-stan. These results were generated on the git hash 95844ac using the all pointer design (with an instance() method abstraction).
Summary
This PR is definitely faster for threading (compared to develop with threading)! These are results for 13cf353.
Data (time in seconds per run)
(I still wouldn't suggest using this for single-threaded mode. I haven't benchmarked as thoroughly since it's not what this is about, but on Linux, it's about break even, but on Mac, it's about 10% slower.)