Skip to content

Conversation

@wds15
Copy link
Contributor

@wds15 wds15 commented Mar 21, 2019

This PR is a continuation of the reverted one from this merged PR: #1135

This PR is derived from develop which I reverted to 883d781 as suggested by @seantalts

This PR speeds up the access to the AD tape whenever thread_local storage (TLS) is needed for the AD tape. In addition programs with TLS now work on Windows with gcc 4.9.3.

Summary

This PR has the main goal to reduce the performance penalty implied by turning on thread capability of stan-math. As it turns out, the proposed implementation works out of the box on Windows with the gcc 4.9.3 compiler which is not the case with the current threading implementation in develop. So we fix a bug which affects many users given the popularity of Windows. The Windows bug is tickled in the commit a7bf778 on Jenkins and it is fixed in the current form of the PR.

The performance benefits of this PR are documented in the previous PR which served as a first prototype. In summary, we loose ~20% on develop when turning on threading while with this PR we reduce this to less than 3% on average.

The key change of this PR is to change how the global AD tape instance is declared and accessed. Currently in develop the global AD tape is declared

  • for the non-threading case as a static global instance.
  • for the threading case as a TLS which is declared as a static instance of a static function (Meyer singleton design).

This PR changes this to a global pointer to the AD tape, which is a TLS for threading, and which is initialized to the constant nullptr expression at compile-time. Initialization to a constant expression at compile-time (as opposed to a dynamic expression right now in develop) allows the compiler to use the TLS global pointer more efficiently. Furthermore, the thread_local keyword used to declare TLS can be swapped for the compiler-specific __thread keyword whenever the expression is constant on gnu extension compatible compilers. The __thread keyword has been around much longer than the C++11 thread_local keyword and only the __thread keyword guarantees to the compiler the the variable is (in fact must be) initialized with a constant expression which allows the use of fast TLS code.

The runtime initialisation of the AD tape for the main process is automatically achieved through the instantiation of a global object which manages the life-time of the AD stack (as performed in the new file math/stan/math/rev/core/init_chainablestack.hpp).

Side-effects

There is no change in behavior for programs whenever STAN_THREADS is not defined. If STAN_THREADS is defined, then any C++ client library which uses the stan-math functions (map_rect in particular) will continue to work unchanged. However, C++ clients which use AD in child threads will stop working. The reason is that the AD tape requires initialization in child threads. The initialization of the AD tape can be done by constructing a ChainableStack instance which will trigger AD tape initialization for the thread. The first instantiation of ChainableStack in a given child thread will take ownership of the AD tape. Thus only the first created instance will free the AD tape resource upon destruction.

Tests

test/unit/math/rev/core/thread_stack_instance_test.cpp is a new test and is written to test matters with and without threading. The test is setup to test different things depending on enabling threading or not.

  • STAN_THREADS NOT defined (no threading). In this case we only have a global AD tape. All child-threads created by a given program will see the same AD tape. Thus initializing the AD tape on the main thread implies that it is initialized on the child-thread, etc.

  • STAN_THREADS IS defined (threading case). Now all new threads created must see their own TLS of the AD tape pointer which is not initialized at the beginning and is independent of the main thread AD tape when performing AD operations.

In addition I have added to the map_rect_concurrent tests code which enforces that these tests are always run with 4 threads regardless of the STAN_NUM_THREADS definition (which is often not set and this things do not run threaded during testing).

Checklist

Remaining TODO

  • @seantalts will review the code at the implementation level.
  • We will run these benchmarks after we've caught up to develop (and hopefully after we've made any additional code changes):
    • microbenchmark for develop vs this branch without any threading. This will be run on Window, Linux, Mac for the compilers we've said we'd support (one per OS).
    • @wds15's non-stiff integrator example for develop vs this branch with threading. This will run on Linux and Mac for the compilers we've said we'd support. (Windows does not work currently on develop.)
  • If all of the benchmarks show no more than 5% speed regression, then it's acceptable performance-wise. (We have every reason to believe that we'll get a large speed benefit for threading.)
  • @syclik will run the benchmarks for the non-stiff integrator example by Monday. If he's unable to run the benchmarks by Monday, we will use the microbenchmark results only.

End-to-end performance results

@syclik ran @wds15's benchmark on Mac with clang++ 6 and Linux with g++ 5 using these scripts: syclik/benchmark-stan. These results were generated on the git hash 95844ac using the all pointer design (with an instance() method abstraction).

Summary

This PR is definitely faster for threading (compared to develop with threading)! These are results for 13cf353.

  • Linux, g++ 5:
    • 1 thread: 18% faster
    • 2 threads: 11% faster
    • 4 threads: 13% faster
  • Mac, clang++ 6:
    • 1 thread: 27% faster
    • 2 threads: 25% faster
    • 4 threads: 26% faster

Data (time in seconds per run)

os threads branch n mean sd min max
linux 1 new 206.9587 2.2625566 203.053 210.831
linux 1 old 255.1863 2.6885338 251.687 260.186
linux 2 new 228.5295 0.4830041 227.657 229.365
linux 2 old 257.9766 1.2591310 256.384 260.209
linux 4 new 156.5684 0.4185725 155.836 157.302
linux 4 old 180.3396 0.1977036 180.090 180.629
mac 1 new 314.1805 11.3612116 301.316 340.606
mac 1 old 427.6057 28.8460120 405.472 506.568
mac 2 new 175.2279 3.5078966 167.983 179.516
mac 2 old 234.5244 6.4574878 226.599 246.992
mac 4 new 102.7276 2.9450846 98.632 107.473
mac 4 old 138.6252 3.7742022 132.455 145.227

(I still wouldn't suggest using this for single-threaded mode. I haven't benchmarked as thoroughly since it's not what this is about, but on Linux, it's about break even, but on Mac, it's about 10% slower.)

@syclik
Copy link
Member

syclik commented Mar 22, 2019

To recap, there's a bit of a discrepancy between the microbenchmark and the end-to-end results. We still don't know the cause of that and it would make sense for us to have some explanation as to why that could possibly happen.

It would be ideal if we had some automated process to run the benchmarks across the 3 OSes (2 for this particular PR) to make sure that it is indeed faster. At this level of optimization of code, benchmarking results don't translate across OSes, across different compilers, and even across different compiler versions (including minor).

Since the current proposed implementation uses C preprocessor macros, we should try to make it as clear as possible prior to merging. I believe we've all come to the agreement that given the potential improvement to speed, we can deal with the non-standard (relative to Math) code, but we should take maintenance into serious consideration.

If you haven't seen our style guide, it's here: Coding Style and Idioms (github wiki). Just FYI, we're doing what the Google Style Guide explicitly says not to do. This is from the Preprocessor Macros section:

Instead of using a macro to conditionally compile code ... well, don't do that at all (except, of course, for the #define guards to prevent double inclusion of header files). It makes testing much more difficult.

And further down:

But before using a macro, consider carefully whether there's a non-macro way to achieve the same result. If you need to use a macro to define an interface, contact your project leads to request a waiver of this rule.

This code is a lot better than the first and second versions, but I think there are ways we can make it so it's really ok to pull in. If possible, we should try to get the code designed that way. These things tend not to be fixed once in. And that in turn leads to it being difficult for it to be fixed in the future. I'm trying to prevent this bit of code from locking us in and slowing down future development. Especially as we're moving towards parallelization and requiring more fine-grained control over the autodiff stack.

@wds15
Copy link
Contributor Author

wds15 commented Mar 22, 2019

Are we really going down the route of quoting SOPs and all that? I don't hope so! We are all skilled enough and have a good enough sense for what's good and what not.

Wrt. to the ifdefs: My understanding was we are fine with the ifdefs we use right now. These should be used sparingly, of course. However, we have in our own codebase a few exceptions to that (__WIN32__ or __GNUC__, for example). All ifdefs we use right have been agreed by us - we anyway do STAN_THREADS and __GNUC__ was waived as well for good reasons.

Did you have anything specific in mind for the ifdefs? A simplification I could think of is like

// at the top of autodiffstackstorage we do:
#ifdef STAN_THREADS
#ifdef __GNUC__
      #define STAN_THREADS_DEF __thread
#else
      #define STAN_THREADS_DEF thread_local
#endif
#else
  // empty def for no-threading case
   #define STAN_THREADS_DEF
#endif

// then the definitions are done with

static STAN_THREADS_DEF AutodiffStackStorage *instance_;

Wrt. to the benchmarks: We have to my knowledge a single signal which is the MacPro microbenchmark only. It would be good to understand this case. The information missing so far for me is what OS and what exact CPU is in that machine. I do think that we should try to test only on the latest OS on Mac, Mojave. The rationale is that Mojave should have the best fixes for those Intel bugs Spectre and Meltdown. Moreover, we need a common base to compare things if possible to further reduce the combinatorics. Also: I can't debug this at all! I just don't have such a machine. For reference, the compare-pointer.sh script gives on my machine with clang++-6.0 from macports (30 replications):

// develop, no STAN_THREADS
benchmark_autodiff_stack_mean        10349 ns      10346 ns      65401
benchmark_autodiff_stack_median      10305 ns      10304 ns      65401
benchmark_autodiff_stack_stddev        160 ns        157 ns      65401
// faster ad tls v4, no STAN_THREADS
benchmark_autodiff_stack_mean        10852 ns      10847 ns      63979
benchmark_autodiff_stack_median      10756 ns      10753 ns      63979
benchmark_autodiff_stack_stddev        320 ns        313 ns      63979

What would ease testing a little bit is to add the warfarin map_rect example to the performance benchmark suite, I think. This is a good end-to-end test for a map_rect program which we don't have yet. @seantalts Is that an option? Would this help us?

The only design which I haven't yet discussed to my knowledge is the way the life-time of the AD tape is managed. Right now the first instance ever created of ChainableStack will own the instance. This will be for the main process most likely the instance created in stan/math/rev/core/init_chainablestack.hpp. For child-threads the first instance of this class created will own the AD tape and thus tear it down. An alternative solution could be the use of a smart-pointer type of thing. So one would introduce a ref-count to the AD tape and only delete the instance once the ref count goes to zero. I went with the simpler approach as it does the job just fine and we can adapt that at a later stage if needed as I would see this as a very internal design choice irrelevant to any user (given we only support threading right now through map_rect).

Now, what this PR also shows is that the Windows gcc 4.9.3 struggles with a non-pointer design. The behavior I see suggests that not using pointers tickles bugs with this compiler. I don't have a better explanation. Thus going to all-pointer solves this.

So here is what is left from my perspective:

  1. Understand the MacPro microbenchmark thing - who can handle this? I don't have one.
  2. Potentially improve ifdef logic...one suggestion is above outlined - comments on the suggestion are welcome? Other ideas?
  3. Settle on the design of first ownership vs smart pointer thing - opinions?
  4. Run end-to-end benchmark on the final thing once more - maybe via performance cmdstan suite?

@syclik There is one thing weird in the result you report: When you go on linux from 1 to 2 threads the time increases... is that a mistake? It does not make sense to me. Also: What time do you report? We should use the wall-clock time as reported by time, which is the first line which starts with real from time. The reported run-time from cmdstan is the total CPU time which is not what we want.

@seantalts Right now I am getting a test failure (in the Full unit with GPU) which should not happen and is not related to this PR, ideas/suggestions?

clang++-6.0 -Werror -std=c++1y  -Wno-sign-compare   -I lib/opencl_1.2.8  -O3  -I . -I lib/eigen_3.3.3 -I lib/boost_1.69.0 -I lib/sundials_4.1.0/include -I lib/gtest_1.8.1/include -I lib/gtest_1.8.1 -I lib/gtest_1.8.1/include -I lib/gtest_1.8.1      -DBOOST_RESULT_OF_USE_TR1 -DBOOST_NO_DECLTYPE -DBOOST_DISABLE_ASSERTS -DBOOST_PHOENIX_NO_VARIADIC_EXPRESSION  -DSTAN_OPENCL -DOPENCL_DEVICE_ID=0 -DOPENCL_PLATFORM_ID=0 -DCL_USE_DEPRECATED_OPENCL_1_2_APIS  -DGTEST_USE_OWN_TR1_TUPLE -DGTEST_USE_OWN_TR1_TUPLE  -c -o test/unit/math/opencl/sub_block_test.o test/unit/math/opencl/sub_block_test.cpp
In file included from test/unit/math/opencl/opencl_context_test.cpp:2:
./stan/math/opencl/opencl_context.hpp:43:3: error: use of undeclared identifier 'assert'
  assert(values.size() == len);
  ^
1 error generated.
<builtin>: recipe for target 'test/unit/math/opencl/opencl_context_test.o' failed
make: *** [test/unit/math/opencl/opencl_context_test.o] Error 1

@wds15 wds15 changed the title WIP Feature/faster ad tls v4 Feature/faster ad tls v4 Mar 22, 2019
@rok-cesnovar
Copy link
Member

@wds15 the last one (assert error) is my fault, I apologize for the inconvenience. Sean mentioned that he will have a fix up on a PR.

@wds15
Copy link
Contributor Author

wds15 commented Mar 22, 2019

@rok-cesnovar NP... waiting a bit and then merging develop again is something I can handle.

@seantalts
Copy link
Member

You could also merge this branch in if you wanted to get it testing on Jenkins sooner, shouldn't trigger any merge conflicts or anything. branch is internal/no-assert

@syclik
Copy link
Member

syclik commented Mar 22, 2019 via email

@bob-carpenter
Copy link
Member

bob-carpenter commented Mar 22, 2019 via email

@syclik
Copy link
Member

syclik commented Mar 22, 2019

Just wanted to leave some notes quickly, there are non-macro alternatives:

  • template specialization
  • object oriented

But honestly, I'm ok with the macro approach if we cleaned it up a bit. By that, I mean minimizing the number of places where we're switching between the different branches, so it's really an API and two different implementations that are switched based on the preprocessor variable value. We should really doc what that API is.

One easy thing to do within autodiffstackstorage is to actually move all the things that are conditional on the STAN_THREADS variable into one block. Right now, it's spread in two places in the same file and it's hard to scan for the changes.

I don't know about anyone else, but I find this


  static
#ifdef STAN_THREADS
#ifdef __GNUC__
      __thread
#else
      thread_local
#endif
#endif
      AutodiffStackStorage *instance_;

harder to read than

#ifndef STAN_THREADS
    static AutodiffStackStorage *instance_;
#else
#ifdef __GNUC__
      static __thread AutodiffStackStorage *instance_;
#else
      static thread_local AutodiffStackStorage *instance_;
#endif
#endif

…ndence of STAN_THREADS & compiler extension supported
@wds15
Copy link
Contributor Author

wds15 commented Mar 22, 2019

I tried to address your points, @syclik... I think this improves the readability (I have seen such a pattern in other libraries as well, e.g. boost or the Intel TBB do these kind of moves if that's worth anything).

@syclik
Copy link
Member

syclik commented Mar 22, 2019 via email

@seantalts
Copy link
Member

I like what @wds15 did a lot! re: the examples in your post, @syclik - I find it easier to read when the differences are highlighted for me. But perhaps the @wds15 approach combines the best of both worlds - just give everything a preprocessor def and then use those in the code wherever there would be a difference. Just like regular programming but with ifdefs :)

@syclik
Copy link
Member

syclik commented Mar 22, 2019 via email

@wds15
Copy link
Contributor Author

wds15 commented Mar 23, 2019

@rok-cesnovar Weird...I haven't done anything wrt to the assert, but now it just passes. I like if problems go away without any effort, but that is odd.

@rok-cesnovar
Copy link
Member

Yeah, its pretty random, since it passed both the PR tests as well as the tests on develop and only showed up on this PR. #1172 is finished, it just needs a re-approval after the last changes, so we wont have to deal with that anymore.

#ifndef STAN_MATH_PRIM_MAT_FUNCTOR_MAP_RECT_HPP
#define STAN_MATH_PRIM_MAT_FUNCTOR_MAP_RECT_HPP

#include <stan/math/prim/scal/meta/return_type.hpp>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[minor]
Alphabetize includes. Move this down a few lines.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure where to move given your comments. I order these things by dependency scal/arr/mat and then alphabetically.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In prim, these should be strictly alphabetical. I know our codebase hasn't caught up yet, but that's what it should be. For rev, you'll want to bring in the core.hpp first, then everything else in alphabetical order (irrespective of scal/arr/mat).

The files that shouldn't follow this: the big includes like stan/math/prim/mat.hpp. These aren't because they need to include traits in the right order.

namespace math {

/**
* Provides a thread_local singleton if needed. Read warnings below!
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Description line should be updated. This struct now always provides access to the stack using the singleton pattern.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hope what I changed is fine.

namespace math {
namespace {

ChainableStack global_stack_instance_init;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add some doc. Just enough to let the user know that this is here for a reason.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

* thread_local. If there is no loss in performance, we can remove
* this ifdef.
* The initialzation of the AD instance is handled by the lifetime of
* a AutodiffStackSingleton object. More specifically, the first
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This ownership point is important, but it's not as clearly linked to the code (no mention of own_instance_ here) as I think it should be.

If I understand this correctly, the "first instance" if there are no threads is the global one defined in init_chainablestack.hpp?

And if there are threads, this doc doesn't seem right... we don't need to call init() anymore?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need to link to private members of the class in the doc if the behavior is described? Well, I added some refs now.

Your understanding is correct... but the one in init_chainablestack.hpp does not have to be the first one if the C++ user wants more control over things and links things together so that what is execute first changes, then the first instance will be another one.

I corrected the ill references to init.

@syclik
Copy link
Member

syclik commented Mar 24, 2019

@wds15, I'll get us set up so the benchmark runs on 3 platforms appropriately.

@syclik
Copy link
Member

syclik commented Mar 24, 2019

@syclik There is one thing weird in the result you report: When you go on linux from 1 to 2 threads the time increases... is that a mistake? It does not make sense to me. Also: What time do you report? We should use the wall-clock time as reported by time, which is the first line which starts with real from time. The reported run-time from cmdstan is the total CPU time which is not what we want.

@wds15, thanks for pointing that out. I'm reporting the real time. It is weird. I'll run it on the new code and see if that is still there. If you want to see all the numbers: https://github.com/syclik/benchmark-stan/blob/0b12f814bdc9dbd52ca8b38c87faba613c27738d/time.txt

(old = develop, new = feature/faster-ad-tls-v3)

Although the seeds across Mac and Linux were the same, the actual execution is different, so it's hard to compare across the two. What is important is that fixing the seed on a platform, we're computing exactly the same thing and so it's down to how fast we can compute that. You're right: going from 1 to 2 threads takes longer on Linux. That's fine... it happens under both develop and that version. It might be something we look into, but we don't have to do it here.

But this is why we need to run it on all platforms with the compilers we've set. Although reason would say that confirming on one platform should be good enough, it turns out it's not.

@wds15
Copy link
Contributor Author

wds15 commented Mar 24, 2019

@syclik I still think that relative statements should hold most of the time and I would be good with knowing that things speedup only on Mac, for example. This should by all means translate to the other platforms as well. Since we do not have an automated platform I would rather save our time for doing all this manually.

Now, if we would have an automated facility for running things on all platforms we want to look into ... that is preferable, of course; but until that does not exist we can really make decisions without that from my perspective.

... but you seem to be very into the benchmarks for this one - so I assume you run your code once more and do an update with the v4 version of this PR on the hash 13cf353, right?

@wds15
Copy link
Contributor Author

wds15 commented Mar 24, 2019

And I am actually relatively sure that in your Linux 2-thread runs something went wrong. I did run exactly this example with threading and MPI under Linux, see

https://github.com/stan-dev/stancon_talks/blob/master/2018-helsinki/Contributed-Talks/weber/stancon18-master/stancon18_odeWild.pdf

on page 10. There I am getting a ~1.2x speedup for threading.

#include <gtest/gtest.h>
#include <stan/math/fwd/core/fvar.hpp>
#include <stan/math/rev/core/var.hpp>
#include <stan/math/rev/core/init_chainablestack.hpp>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wds15, do you know why this test needs this when the rest don't? It looks like the includes for the test don't actually have a mix header in there; maybe that would fix it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mix headers probably fix it, yes (but I am not familiar with the mix stuff). It looks to me as if the test does not include the right includes, yes. I do not know mix stuff at all which is why I plugged in the init_chainablestack.hpp.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

�Ok. I made an educated guess what the includes should be... but have a look - I am not familiar with mix.

@syclik
Copy link
Member

syclik commented Apr 1, 2019

Thanks! I just sent you an email off-line.

@wds15
Copy link
Contributor Author

wds15 commented Apr 1, 2019

I can't believe it... we are again stopped at the very same test under Windows. This one breaks now on develop:

StanAgradRevOde.memory_recovery_dv

Any idea? Do we unmerge again?

I have enabled to my best knowledge exactly this test and it did pass.

@syclik
Copy link
Member

syclik commented Apr 1, 2019 via email

@wds15
Copy link
Contributor Author

wds15 commented Apr 1, 2019

There is something very odd going on... and I almost doubt that this PR is the root cause ... but as there is no other change it definitely looks like, sure.

What really confuses me is that exactly this test was OK on the commit 1031685 . You can have a look here:

http://d1m1s1b1.stat.columbia.edu:8080/blue/organizations/jenkins/Math%20Pipeline/detail/PR-1171/13/pipeline/90

The logs there do include exactly this test and show that it did run just fine... and this is what totally confuses me.

Is there any difference in how the tests are run on develop vs the one I linked in above. Maybe @seantalts has an idea?

@syclik
Copy link
Member

syclik commented Apr 1, 2019

There is definitely something very odd going on. I agree.

If I run it from the python script, it runs fine. If I just run the executable, it stops. I have no idea what could cause that:

python script:

>.\runTests.py test/unit/math/rev/arr/functor/coupled_ode_system_test.cpp
------------------------------------------------------------
make -j1 test/unit/math/rev/arr/functor/coupled_ode_system_test.exe
make: 'test/unit/math/rev/arr/functor/coupled_ode_system_test.exe' is up to date.
------------------------------------------------------------
test\unit\math\rev\arr\functor\coupled_ode_system_test.exe --gtest_output="xml:test/unit/math/rev/arr/functor/coupled_ode_system_test.xml"
Running main() from lib/gtest_1.8.1/src/gtest_main.cc
[==========] Running 18 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 18 tests from StanAgradRevOde
[ RUN      ] StanAgradRevOde.coupled_ode_system_dv
[       OK ] StanAgradRevOde.coupled_ode_system_dv (0 ms)
[ RUN      ] StanAgradRevOde.decouple_states_dv
[       OK ] StanAgradRevOde.decouple_states_dv (0 ms)
[ RUN      ] StanAgradRevOde.initial_state_dv
[       OK ] StanAgradRevOde.initial_state_dv (0 ms)
[ RUN      ] StanAgradRevOde.size_dv
[       OK ] StanAgradRevOde.size_dv (0 ms)
[ RUN      ] StanAgradRevOde.memory_recovery_dv
[       OK ] StanAgradRevOde.memory_recovery_dv (0 ms)
[ RUN      ] StanAgradRevOde.memory_recovery_exception_dv
[       OK ] StanAgradRevOde.memory_recovery_exception_dv (0 ms)
[ RUN      ] StanAgradRevOde.coupled_ode_system_vd
[       OK ] StanAgradRevOde.coupled_ode_system_vd (0 ms)
[ RUN      ] StanAgradRevOde.decouple_states_vd
[       OK ] StanAgradRevOde.decouple_states_vd (0 ms)
[ RUN      ] StanAgradRevOde.initial_state_vd
[       OK ] StanAgradRevOde.initial_state_vd (0 ms)
[ RUN      ] StanAgradRevOde.size_vd
[       OK ] StanAgradRevOde.size_vd (0 ms)
[ RUN      ] StanAgradRevOde.memory_recovery_vd
[       OK ] StanAgradRevOde.memory_recovery_vd (0 ms)
[ RUN      ] StanAgradRevOde.memory_recovery_exception_vd
[       OK ] StanAgradRevOde.memory_recovery_exception_vd (0 ms)
[ RUN      ] StanAgradRevOde.coupled_ode_system_vv
[       OK ] StanAgradRevOde.coupled_ode_system_vv (0 ms)
[ RUN      ] StanAgradRevOde.decouple_states_vv
[       OK ] StanAgradRevOde.decouple_states_vv (0 ms)
[ RUN      ] StanAgradRevOde.initial_state_vv
[       OK ] StanAgradRevOde.initial_state_vv (0 ms)
[ RUN      ] StanAgradRevOde.size_vv
[       OK ] StanAgradRevOde.size_vv (0 ms)
[ RUN      ] StanAgradRevOde.memory_recovery_vv
[       OK ] StanAgradRevOde.memory_recovery_vv (0 ms)
[ RUN      ] StanAgradRevOde.memory_recovery_exception_vv
[       OK ] StanAgradRevOde.memory_recovery_exception_vv (0 ms)
[----------] 18 tests from StanAgradRevOde (135 ms total)

[----------] Global test environment tear-down
[==========] 18 tests from 1 test case ran. (157 ms total)
[  PASSED  ] 18 tests.

Running the exectuable:

>test\unit\math\rev\arr\functor\coupled_ode_system_test.exe
Running main() from lib/gtest_1.8.1/src/gtest_main.cc
[==========] Running 18 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 18 tests from StanAgradRevOde
[ RUN      ] StanAgradRevOde.coupled_ode_system_dv
[       OK ] StanAgradRevOde.coupled_ode_system_dv (0 ms)
[ RUN      ] StanAgradRevOde.decouple_states_dv
[       OK ] StanAgradRevOde.decouple_states_dv (0 ms)
[ RUN      ] StanAgradRevOde.initial_state_dv
[       OK ] StanAgradRevOde.initial_state_dv (0 ms)
[ RUN      ] StanAgradRevOde.size_dv
[       OK ] StanAgradRevOde.size_dv (0 ms)
[ RUN      ] StanAgradRevOde.memory_recovery_dv
[       OK ] StanAgradRevOde.memory_recovery_dv (0 ms)
[ RUN      ] StanAgradRevOde.memory_recovery_exception_dv

and it stops there.

So far, this is repeatable.

@syclik
Copy link
Member

syclik commented Apr 1, 2019

@wds15, bad news: this PR changes that behavior.

@syclik
Copy link
Member

syclik commented Apr 1, 2019

I just went back and ran that exact test using those two commands (runTests and running the executable). No problems running both.

@syclik
Copy link
Member

syclik commented Apr 1, 2019

Mind reverting the PR? We'll try to get to what's happening, but for now, let's get develop back into a working state. (I won't get back to it for at least a few hours.)

@wds15
Copy link
Contributor Author

wds15 commented Apr 1, 2019

Can we just push revert in this PR? Or do we need a new issue, new PR & another v5 PR for this one?

What I still do not understand is that the test did pass our pipeline in the git hash I quoted above.

Super weird. Maybe @ariddell or @ahartikainen have an idea?

@riddell-stan
Copy link

We have seen some exotic TLS bugs with specific versions of Mingw-w64. Is that the compiler that Windows is using?

@wds15
Copy link
Contributor Author

wds15 commented Apr 1, 2019

The tests which break are not using STAN_THREADS, but yes... this is the compiler mingw-w64 gcc 4.9.3.

@syclik

  • Do we have multiple windows test machines and this tests fails on one, but not the other maybe?
  • Are u aware of differences in shells being used by python and what you use on the command line?
  • Could we try using the powershell?
  • Could we try using mingw32-make to execute the tests (which uses a different shell)?

(I am not saying this should be tried in develop which need to revert; just some ideas I would follow).

@wds15
Copy link
Contributor Author

wds15 commented Apr 1, 2019

Only for the record: The test failing now on Windows develop did pass here:

http://d1m1s1b1.stat.columbia.edu:8080/blue/rest/organizations/jenkins/pipelines/Math%20Pipeline/branches/PR-1171/runs/13/nodes/90/steps/125/log/?start=0

StanAgradRevOde.memory_recovery_exception_dv passes just fine there.

@riddell-stan
Copy link

riddell-stan commented Apr 1, 2019 via email

@syclik syclik mentioned this pull request Apr 1, 2019
5 tasks
@syclik
Copy link
Member

syclik commented Apr 1, 2019

I just reverted the merge and created a PR.

@syclik
Copy link
Member

syclik commented Apr 2, 2019 via email

@syclik
Copy link
Member

syclik commented Apr 2, 2019 via email

@wds15
Copy link
Contributor Author

wds15 commented Apr 2, 2019

That other make is part of RTools and is already on the machine. Everything builds on windows with it ... I am using it already in the TBB PR which I am still working on, but the TBB Windows build works great with mingw32-make (and one of the differences to the vanilla make is the shell being used).

@wds15
Copy link
Contributor Author

wds15 commented Apr 3, 2019

Maybe this is useful:

http://mail.openjdk.java.net/pipermail/build-dev/2016-March/016783.html

Which means to set these option during compilation with gcc:

-fno-delete-null-pointer-checks and -fno-lifetime-dse

... this is wild guess.

@syclik
Copy link
Member

syclik commented Apr 8, 2019

@wds15 and @ariddell: here's what I get:

C:\Users\daniel\math>echo Error code is %errorlevel%
Error code is -1073741819

It looks like it's a permissions thing, but I'm not sure what that means. And why it's just for this test.

@syclik

  • Do we have multiple windows test machines and this tests fails on one, but not the other maybe?

I'll use docker and spin one up this week.

  • Are u aware of differences in shells being used by python and what you use on the command line?

No idea. I'm not that great with Python's internals, so I don't know what subprocess actually does at this level.

  • Could we try using the powershell?

Short: it runs ok from powershell. It still does not work from the normal command prompt.

I git cleaned from the powershell.

I rebuilt the test.

I ran the test from the powershell and it was ok.

I ran the same executable from a normal terminal and it was not ok.

  • Could we try using mingw32-make to execute the tests (which uses a different shell)?

I don't understand what that means.

Which means to set these option during compilation with gcc:

-fno-delete-null-pointer-checks and -fno-lifetime-dse

This didn't help. I added this to make/local and completely rebuilt that test: CXX=g++ -fno-delete-null-pointer-checks -fno-lifetime-dse

@riddell-stan
Copy link

Exit code -1073741819 (0xC0000005) is, from what I have read, a memory access violation error.

@riddell-stan
Copy link

I think you can debug this sort of error with windbg or cdb attached to the faulting process.

@riddell-stan
Copy link

@syclik @wds15 is there a single command one can run to get the crash? If there is, perhaps we could find someone with some experience using the Windows debugger to find out more information? @ahartikainen have you used windgb or cdb before?

@wds15
Copy link
Contributor Author

wds15 commented Apr 13, 2019

Well, a unit test is failing under the cmd shell while the test works when called from python or from the PowerShell on Windows. On the other platforms the test works. I suspect a bug in some software, but I doubt that it is Stan related - gcc or googletest are good candidates here. We should try what happens without optimizations from the compiler or even change the test a bit. The test uses a recover memory in the tear down command. Maybe moving that to the startup command fixes the issue already. This is a really weird thing...and I hope very much we get some fix for it, but I am not sure if we will understand the root cause (would be nice if we get there, of course).

If someone has a windows debugger at hand and working...that would be great.

@syclik
Copy link
Member

syclik commented Apr 13, 2019

Yes, there is an example that can be run.

If you want to run it directly:

make test/unit/math/rev/arr/functor/coupled_ode_system_test.exe
test\unit\math\rev\arr\functor\coupled_ode_system_test.exe

That will build first then run.

If you want to use the script:

.\runTests.py test/unit/math/rev/arr/functor/coupled_ode_system_test.cpp

This one is tricky. It might be as @wds15 suggested. Or not. My guess would be that assumptions about when objects are created and destroyed may not be valid, but that's just a guess at this moment until I can verify.

@riddell-stan
Copy link

I suspect it's a mingw specific error. But we are stuck with mingw, so I hope there is a workaround. And I suspect there is a workaround, if it's a mingw issue; many people are using TLS now.

Supporting TLS on Windows is just about the most important thing for PyStan going forward. I'm tempted to find a Windows VPS (Azure?) and debug this myself.

(Should this discussion be moved to the new PR?)

@syclik
Copy link
Member

syclik commented Apr 17, 2019 via email

@wds15 wds15 mentioned this pull request Apr 22, 2019
5 tasks
@serban-nicusor-toptal serban-nicusor-toptal added this to the 2.19.2 milestone Jul 18, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.