Feature/faster ad tls v4 #1171

wds15 · 2019-03-21T22:48:49Z

This PR is a continuation of the reverted one from this merged PR: #1135

This PR is derived from develop which I reverted to 883d781 as suggested by @seantalts

This PR speeds up the access to the AD tape whenever thread_local storage (TLS) is needed for the AD tape. In addition programs with TLS now work on Windows with gcc 4.9.3.

Summary

This PR has the main goal to reduce the performance penalty implied by turning on thread capability of stan-math. As it turns out, the proposed implementation works out of the box on Windows with the gcc 4.9.3 compiler which is not the case with the current threading implementation in develop. So we fix a bug which affects many users given the popularity of Windows. The Windows bug is tickled in the commit a7bf778 on Jenkins and it is fixed in the current form of the PR.

The performance benefits of this PR are documented in the previous PR which served as a first prototype. In summary, we loose ~20% on develop when turning on threading while with this PR we reduce this to less than 3% on average.

The key change of this PR is to change how the global AD tape instance is declared and accessed. Currently in develop the global AD tape is declared

for the non-threading case as a static global instance.
for the threading case as a TLS which is declared as a static instance of a static function (Meyer singleton design).

This PR changes this to a global pointer to the AD tape, which is a TLS for threading, and which is initialized to the constant nullptr expression at compile-time. Initialization to a constant expression at compile-time (as opposed to a dynamic expression right now in develop) allows the compiler to use the TLS global pointer more efficiently. Furthermore, the thread_local keyword used to declare TLS can be swapped for the compiler-specific __thread keyword whenever the expression is constant on gnu extension compatible compilers. The __thread keyword has been around much longer than the C++11 thread_local keyword and only the __thread keyword guarantees to the compiler the the variable is (in fact must be) initialized with a constant expression which allows the use of fast TLS code.

The runtime initialisation of the AD tape for the main process is automatically achieved through the instantiation of a global object which manages the life-time of the AD stack (as performed in the new file math/stan/math/rev/core/init_chainablestack.hpp).

Side-effects

There is no change in behavior for programs whenever STAN_THREADS is not defined. If STAN_THREADS is defined, then any C++ client library which uses the stan-math functions (map_rect in particular) will continue to work unchanged. However, C++ clients which use AD in child threads will stop working. The reason is that the AD tape requires initialization in child threads. The initialization of the AD tape can be done by constructing a ChainableStack instance which will trigger AD tape initialization for the thread. The first instantiation of ChainableStack in a given child thread will take ownership of the AD tape. Thus only the first created instance will free the AD tape resource upon destruction.

Tests

test/unit/math/rev/core/thread_stack_instance_test.cpp is a new test and is written to test matters with and without threading. The test is setup to test different things depending on enabling threading or not.

STAN_THREADS NOT defined (no threading). In this case we only have a global AD tape. All child-threads created by a given program will see the same AD tape. Thus initializing the AD tape on the main thread implies that it is initialized on the child-thread, etc.
STAN_THREADS IS defined (threading case). Now all new threads created must see their own TLS of the AD tape pointer which is not initialized at the beginning and is independent of the main thread AD tape when performing AD operations.

In addition I have added to the map_rect_concurrent tests code which enforces that these tests are always run with 4 threads regardless of the STAN_NUM_THREADS definition (which is often not set and this things do not run threaded during testing).

Checklist

Math issue Thread performance penalty #1102 & map_rect with threading broken on windows #1134
Copyright holder: Sebastian Weber

The copyright holder is typically you or your assignee, such as a university or company. By submitting this pull request, the copyright holder is agreeing to the license the submitted work under the following licenses:
- Code: BSD 3-clause (https://opensource.org/licenses/BSD-3-Clause)
- Documentation: CC-BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
the basic tests are passing
- unit tests pass (to run, use: ./runTests.py test/unit)
- header checks pass, (make test-headers)
- docs build, (make doxygen)
- code passes the built in C++ standards checks (make cpplint)
the code is written in idiomatic C++ and changes are documented in the doxygen
the new changes are tested

Remaining TODO

@seantalts will review the code at the implementation level.
We will run these benchmarks after we've caught up to develop (and hopefully after we've made any additional code changes):
- microbenchmark for develop vs this branch without any threading. This will be run on Window, Linux, Mac for the compilers we've said we'd support (one per OS).
- @wds15's non-stiff integrator example for develop vs this branch with threading. This will run on Linux and Mac for the compilers we've said we'd support. (Windows does not work currently on develop.)
If all of the benchmarks show no more than 5% speed regression, then it's acceptable performance-wise. (We have every reason to believe that we'll get a large speed benefit for threading.)
@syclik will run the benchmarks for the non-stiff integrator example by Monday. If he's unable to run the benchmarks by Monday, we will use the microbenchmark results only.

End-to-end performance results

@syclik ran @wds15's benchmark on Mac with clang++ 6 and Linux with g++ 5 using these scripts: syclik/benchmark-stan. These results were generated on the git hash 95844ac using the all pointer design (with an instance() method abstraction).

Summary

This PR is definitely faster for threading (compared to develop with threading)! These are results for 13cf353.

Linux, g++ 5:
- 1 thread: 18% faster
- 2 threads: 11% faster
- 4 threads: 13% faster
Mac, clang++ 6:
- 1 thread: 27% faster
- 2 threads: 25% faster
- 4 threads: 26% faster

Data (time in seconds per run)

os	threads	branch	n	mean	sd	min
linux	1	new	206.9587	2.2625566	203.053	210.831
linux	1	old	255.1863	2.6885338	251.687	260.186
linux	2	new	228.5295	0.4830041	227.657	229.365
linux	2	old	257.9766	1.2591310	256.384	260.209
linux	4	new	156.5684	0.4185725	155.836	157.302
linux	4	old	180.3396	0.1977036	180.090	180.629
mac	1	new	314.1805	11.3612116	301.316	340.606
mac	1	old	427.6057	28.8460120	405.472	506.568
mac	2	new	175.2279	3.5078966	167.983	179.516
mac	2	old	234.5244	6.4574878	226.599	246.992
mac	4	new	102.7276	2.9450846	98.632	107.473
mac	4	old	138.6252	3.7742022	132.455	145.227

(I still wouldn't suggest using this for single-threaded mode. I haven't benchmarked as thoroughly since it's not what this is about, but on Linux, it's about break even, but on Mac, it's about 10% slower.)

This reverts commit 883d781.

…ying on compiler optimizations

syclik · 2019-03-22T04:52:49Z

To recap, there's a bit of a discrepancy between the microbenchmark and the end-to-end results. We still don't know the cause of that and it would make sense for us to have some explanation as to why that could possibly happen.

It would be ideal if we had some automated process to run the benchmarks across the 3 OSes (2 for this particular PR) to make sure that it is indeed faster. At this level of optimization of code, benchmarking results don't translate across OSes, across different compilers, and even across different compiler versions (including minor).

Since the current proposed implementation uses C preprocessor macros, we should try to make it as clear as possible prior to merging. I believe we've all come to the agreement that given the potential improvement to speed, we can deal with the non-standard (relative to Math) code, but we should take maintenance into serious consideration.

If you haven't seen our style guide, it's here: Coding Style and Idioms (github wiki). Just FYI, we're doing what the Google Style Guide explicitly says not to do. This is from the Preprocessor Macros section:

Instead of using a macro to conditionally compile code ... well, don't do that at all (except, of course, for the #define guards to prevent double inclusion of header files). It makes testing much more difficult.

And further down:

But before using a macro, consider carefully whether there's a non-macro way to achieve the same result. If you need to use a macro to define an interface, contact your project leads to request a waiver of this rule.

This code is a lot better than the first and second versions, but I think there are ways we can make it so it's really ok to pull in. If possible, we should try to get the code designed that way. These things tend not to be fixed once in. And that in turn leads to it being difficult for it to be fixed in the future. I'm trying to prevent this bit of code from locking us in and slowing down future development. Especially as we're moving towards parallelization and requiring more fine-grained control over the autodiff stack.

wds15 · 2019-03-22T09:41:04Z

Are we really going down the route of quoting SOPs and all that? I don't hope so! We are all skilled enough and have a good enough sense for what's good and what not.

Wrt. to the ifdefs: My understanding was we are fine with the ifdefs we use right now. These should be used sparingly, of course. However, we have in our own codebase a few exceptions to that (__WIN32__ or __GNUC__, for example). All ifdefs we use right have been agreed by us - we anyway do STAN_THREADS and __GNUC__ was waived as well for good reasons.

Did you have anything specific in mind for the ifdefs? A simplification I could think of is like

// at the top of autodiffstackstorage we do:
#ifdef STAN_THREADS
#ifdef __GNUC__
      #define STAN_THREADS_DEF __thread
#else
      #define STAN_THREADS_DEF thread_local
#endif
#else
  // empty def for no-threading case
   #define STAN_THREADS_DEF
#endif

// then the definitions are done with

static STAN_THREADS_DEF AutodiffStackStorage *instance_;

Wrt. to the benchmarks: We have to my knowledge a single signal which is the MacPro microbenchmark only. It would be good to understand this case. The information missing so far for me is what OS and what exact CPU is in that machine. I do think that we should try to test only on the latest OS on Mac, Mojave. The rationale is that Mojave should have the best fixes for those Intel bugs Spectre and Meltdown. Moreover, we need a common base to compare things if possible to further reduce the combinatorics. Also: I can't debug this at all! I just don't have such a machine. For reference, the compare-pointer.sh script gives on my machine with clang++-6.0 from macports (30 replications):

// develop, no STAN_THREADS
benchmark_autodiff_stack_mean        10349 ns      10346 ns      65401
benchmark_autodiff_stack_median      10305 ns      10304 ns      65401
benchmark_autodiff_stack_stddev        160 ns        157 ns      65401
// faster ad tls v4, no STAN_THREADS
benchmark_autodiff_stack_mean        10852 ns      10847 ns      63979
benchmark_autodiff_stack_median      10756 ns      10753 ns      63979
benchmark_autodiff_stack_stddev        320 ns        313 ns      63979

What would ease testing a little bit is to add the warfarin map_rect example to the performance benchmark suite, I think. This is a good end-to-end test for a map_rect program which we don't have yet. @seantalts Is that an option? Would this help us?

The only design which I haven't yet discussed to my knowledge is the way the life-time of the AD tape is managed. Right now the first instance ever created of ChainableStack will own the instance. This will be for the main process most likely the instance created in stan/math/rev/core/init_chainablestack.hpp. For child-threads the first instance of this class created will own the AD tape and thus tear it down. An alternative solution could be the use of a smart-pointer type of thing. So one would introduce a ref-count to the AD tape and only delete the instance once the ref count goes to zero. I went with the simpler approach as it does the job just fine and we can adapt that at a later stage if needed as I would see this as a very internal design choice irrelevant to any user (given we only support threading right now through map_rect).

Now, what this PR also shows is that the Windows gcc 4.9.3 struggles with a non-pointer design. The behavior I see suggests that not using pointers tickles bugs with this compiler. I don't have a better explanation. Thus going to all-pointer solves this.

So here is what is left from my perspective:

Understand the MacPro microbenchmark thing - who can handle this? I don't have one.
Potentially improve ifdef logic...one suggestion is above outlined - comments on the suggestion are welcome? Other ideas?
Settle on the design of first ownership vs smart pointer thing - opinions?
Run end-to-end benchmark on the final thing once more - maybe via performance cmdstan suite?

@syclik There is one thing weird in the result you report: When you go on linux from 1 to 2 threads the time increases... is that a mistake? It does not make sense to me. Also: What time do you report? We should use the wall-clock time as reported by time, which is the first line which starts with real from time. The reported run-time from cmdstan is the total CPU time which is not what we want.

@seantalts Right now I am getting a test failure (in the Full unit with GPU) which should not happen and is not related to this PR, ideas/suggestions?

clang++-6.0 -Werror -std=c++1y  -Wno-sign-compare   -I lib/opencl_1.2.8  -O3  -I . -I lib/eigen_3.3.3 -I lib/boost_1.69.0 -I lib/sundials_4.1.0/include -I lib/gtest_1.8.1/include -I lib/gtest_1.8.1 -I lib/gtest_1.8.1/include -I lib/gtest_1.8.1      -DBOOST_RESULT_OF_USE_TR1 -DBOOST_NO_DECLTYPE -DBOOST_DISABLE_ASSERTS -DBOOST_PHOENIX_NO_VARIADIC_EXPRESSION  -DSTAN_OPENCL -DOPENCL_DEVICE_ID=0 -DOPENCL_PLATFORM_ID=0 -DCL_USE_DEPRECATED_OPENCL_1_2_APIS  -DGTEST_USE_OWN_TR1_TUPLE -DGTEST_USE_OWN_TR1_TUPLE  -c -o test/unit/math/opencl/sub_block_test.o test/unit/math/opencl/sub_block_test.cpp
In file included from test/unit/math/opencl/opencl_context_test.cpp:2:
./stan/math/opencl/opencl_context.hpp:43:3: error: use of undeclared identifier 'assert'
  assert(values.size() == len);
  ^
1 error generated.
<builtin>: recipe for target 'test/unit/math/opencl/opencl_context_test.o' failed
make: *** [test/unit/math/opencl/opencl_context_test.o] Error 1

rok-cesnovar · 2019-03-22T10:07:45Z

@wds15 the last one (assert error) is my fault, I apologize for the inconvenience. Sean mentioned that he will have a fix up on a PR.

wds15 · 2019-03-22T10:17:48Z

@rok-cesnovar NP... waiting a bit and then merging develop again is something I can handle.

seantalts · 2019-03-22T11:08:37Z

You could also merge this branch in if you wanted to get it testing on Jenkins sooner, shouldn't trigger any merge conflicts or anything. branch is internal/no-assert

syclik · 2019-03-22T12:44:33Z

Thanks! Now I feel like we’re actually moving the discussion to the technical things we need to iron out. I’ll respond when I have a chance later today.

…

On Fri, Mar 22, 2019 at 5:41 AM wds15 ***@***.***> wrote: Are we really going down the route of quoting SOPs and all that? I don't hope so! We are all skilled enough and have a good enough sense for what's good and what not. Wrt. to the ifdefs: My understanding was we are fine with the ifdefs we use right now. These should be used sparingly, of course. However, we have in our own codebase a few exceptions to that (__WIN32__ or __GNUC__, for example). All ifdefs we use right have been agreed by us - we anyway do STAN_THREADS and __GNUC__ was waived as well for good reasons. Did you have anything specific in mind for the ifdefs? A simplification I could think of is like // at the top of autodiffstackstorage we do: #ifdef STAN_THREADS #ifdef __GNUC__ #define STAN_THREADS_DEF __thread #else #define STAN_THREADS_DEF thread_local #endif #else // empty def for no-threading case #define STAN_THREADS_DEF #endif // then the definitions are done with static STAN_THREADS_DEF AutodiffStackStorage *instance_; Wrt. to the benchmarks: We have to my knowledge a single signal which is the MacPro microbenchmark only. It would be good to understand this case. The information missing so far for me is what OS and what exact CPU is in that machine. I do think that we should try to test only on the latest OS on Mac, Mojave. The rationale is that Mojave should have the best fixes for those Intel bugs Spectre and Meltdown. Moreover, we need a common base to compare things if possible to further reduce the combinatorics. Also: I can't debug this at all! I just don't have such a machine. For reference, the compare-pointer.sh script gives on my machine with clang++-6.0 from macports (30 replications): // develop, no STAN_THREADS benchmark_autodiff_stack_mean 10349 ns 10346 ns 65401 benchmark_autodiff_stack_median 10305 ns 10304 ns 65401 benchmark_autodiff_stack_stddev 160 ns 157 ns 65401 // faster ad tls v4, no STAN_THREADS benchmark_autodiff_stack_mean 10852 ns 10847 ns 63979 benchmark_autodiff_stack_median 10756 ns 10753 ns 63979 benchmark_autodiff_stack_stddev 320 ns 313 ns 63979 What would ease testing a little bit is to add the warfarin map_rect example to the performance benchmark suite, I think. This is a good end-to-end test for a map_rect program which we don't have yet. @seantalts <https://github.com/seantalts> Is that an option? Would this help us? The only design which I haven't yet discussed to my knowledge is the way the life-time of the AD tape is managed. Right now the first instance ever created of ChainableStack will own the instance. This will be for the main process most likely the instance created in stan/math/rev/core/init_chainablestack.hpp. For child-threads the first instance of this class created will own the AD tape and thus tear it down. An alternative solution could be the use of a smart-pointer type of thing. So one would introduce a ref-count to the AD tape and only delete the instance once the ref count goes to zero. I went with the simpler approach as it does the job just fine and we can adapt that at a later stage if needed as I would see this as a very internal design choice irrelevant to any user (given we only support threading right now through map_rect). Now, what this PR also shows is that the Windows gcc 4.9.3 struggles with a non-pointer design. The behavior I see suggests that not using pointers tickles bugs with this compiler. I don't have a better explanation. Thus going to all-pointer solves this. So here is what is left from my perspective: 1. Understand the MacPro microbenchmark thing - who can handle this? I don't have one. 2. Potentially improve ifdef logic...one suggestion is above outlined - comments on the suggestion are welcome? Other ideas? 3. Settle on the design of first ownership vs smart pointer thing - opinions? 4. Run end-to-end benchmark on the final thing once more - maybe via performance cmdstan suite? @syclik <https://github.com/syclik> There is one thing weird in the result you report: When you go on linux from 1 to 2 threads the time *increases*... is that a mistake? It does not make sense to me. Also: What time do you report? We should use the wall-clock time as reported by time, which is the first line which starts with real from time. The reported run-time from cmdstan is the total CPU time which is not what we want. @seantalts <https://github.com/seantalts> Right now I am getting a test failure (in the Full unit with GPU) which should not happen and is not related to this PR, ideas/suggestions? clang++-6.0 -Werror -std=c++1y -Wno-sign-compare -I lib/opencl_1.2.8 -O3 -I . -I lib/eigen_3.3.3 -I lib/boost_1.69.0 -I lib/sundials_4.1.0/include -I lib/gtest_1.8.1/include -I lib/gtest_1.8.1 -I lib/gtest_1.8.1/include -I lib/gtest_1.8.1 -DBOOST_RESULT_OF_USE_TR1 -DBOOST_NO_DECLTYPE -DBOOST_DISABLE_ASSERTS -DBOOST_PHOENIX_NO_VARIADIC_EXPRESSION -DSTAN_OPENCL -DOPENCL_DEVICE_ID=0 -DOPENCL_PLATFORM_ID=0 -DCL_USE_DEPRECATED_OPENCL_1_2_APIS -DGTEST_USE_OWN_TR1_TUPLE -DGTEST_USE_OWN_TR1_TUPLE -c -o test/unit/math/opencl/sub_block_test.o test/unit/math/opencl/sub_block_test.cpp In file included from test/unit/math/opencl/opencl_context_test.cpp:2: ./stan/math/opencl/opencl_context.hpp:43:3: error: use of undeclared identifier 'assert' assert(values.size() == len); ^ 1 error generated. <builtin>: recipe for target 'test/unit/math/opencl/opencl_context_test.o' failed make: *** [test/unit/math/opencl/opencl_context_test.o] Error 1 — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#1171 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAZ_F8Blp70_WUVww_Ds8-vo1WHeje2tks5vZKUxgaJpZM4cChJ9> .

bob-carpenter · 2019-03-22T15:52:57Z

On Mar 22, 2019, at 12:52 AM, Daniel Lee ***@***.***> wrote: To recap, there's a bit of a discrepancy between the microbenchmark and the end-to-end results. We still don't know the cause of that and it would make sense for us to have some explanation as to why that could possibly happen.

The end-to-end results are the ones that matter. This just indicates to me that there's a conceptual error in the way we're microbenchmarking or we're microbenchmarking something that doesn't take up a lot of CPU time to start with.

It would be ideal if we had some automated process to run the benchmarks across the 3 OSes (2 for this particular PR) to make sure that it is indeed faster. At this level of optimization of code, benchmarking results don't translate across OSes, across different compilers, and even across different compiler versions (including minor).

I agree. We can't cover all possibilities, but it'd be nice to have some sanity checks in place so PRs like this one can be expedited.

Since the current proposed implementation uses C preprocessor macros, we should try to make it as clear as possible prior to merging. I believe we've all come to the agreement that given the potential improvement to speed, we can deal with the non-standard (relative to Math) code, but we should take maintenance into serious consideration.

Sorry, I missed what's non-standard here. We use preprocessor macros now for thread safety and for assembler hints with (un)likely.

If you haven't seen our style guide, it's here: Coding Style and Idioms (github wiki). Just FYI, we're doing what the Google Style Guide explicitly says not to do. This is from the Preprocessor Macros section: Instead of using a macro to conditionally compile code ... well, don't do that at all (except, of course, for the #define guards to prevent double inclusion of header files). It makes testing much more difficult.

Yet you see Eigen and Boost do it all over the place, so I don't think it should be completely banned.

And further down: But before using a macro, consider carefully whether there's a non-macro way to achieve the same result. If you need to use a macro to define an interface, contact your project leads to request a waiver of this rule.

Is there a non-macro alternative?

This code is a lot better than the first and second versions, but I think there are ways we can make it so it's really ok to pull in.

Specifics would help here to lay out requirements for being "OK to pull in". I think part of the problem we're having is that the requirements are vague. I know it's impossible to have an automated way to do this, but some more guidance would be really helpful for developers.

If possible, we should try to get the code designed that way.

Sorry, I didn't catch the reference of "that way". What way are you suggesting?

These things tend not to be fixed once in.

Saying "fixed" implies there's something broken to begin with. What specifically do you think is broken? Is it just the use of macros or something deeper?

And that in turn leads to it being difficult for it to be fixed in the future.

I think we have to guard against headaches in the future, but the future is uncertain, so there's only so much we can do to guard against problems.

I'm trying to prevent this bit of code from locking us in and slowing down future development. Especially as we're moving towards parallelization and requiring more fine-grained control over the autodiff stack.

What needs to happen to move you from preventing the code from getting in to accepting it?

syclik · 2019-03-22T17:57:22Z

Just wanted to leave some notes quickly, there are non-macro alternatives:

template specialization
object oriented

But honestly, I'm ok with the macro approach if we cleaned it up a bit. By that, I mean minimizing the number of places where we're switching between the different branches, so it's really an API and two different implementations that are switched based on the preprocessor variable value. We should really doc what that API is.

One easy thing to do within autodiffstackstorage is to actually move all the things that are conditional on the STAN_THREADS variable into one block. Right now, it's spread in two places in the same file and it's hard to scan for the changes.

I don't know about anyone else, but I find this


  static
#ifdef STAN_THREADS
#ifdef __GNUC__
      __thread
#else
      thread_local
#endif
#endif
      AutodiffStackStorage *instance_;

harder to read than

#ifndef STAN_THREADS
    static AutodiffStackStorage *instance_;
#else
#ifdef __GNUC__
      static __thread AutodiffStackStorage *instance_;
#else
      static thread_local AutodiffStackStorage *instance_;
#endif
#endif

…ndence of STAN_THREADS & compiler extension supported

wds15 · 2019-03-22T18:37:12Z

I tried to address your points, @syclik... I think this improves the readability (I have seen such a pattern in other libraries as well, e.g. boost or the Intel TBB do these kind of moves if that's worth anything).

syclik · 2019-03-22T19:56:21Z

Thanks! That looks a lot cleaner!

…

On Fri, Mar 22, 2019 at 2:37 PM wds15 ***@***.***> wrote: I tried to address your points, @syclik <https://github.com/syclik>... I think this improves the readability (I have seen such a pattern in other libraries as well, e.g. boost or the Intel TBB do these kind of moves if that's worth anything). — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#1171 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAZ_F9PgWWdSS4WwaaByOyX_r6wE3cRdks5vZSLZgaJpZM4cChJ9> .

seantalts · 2019-03-22T20:13:23Z

I like what @wds15 did a lot! re: the examples in your post, @syclik - I find it easier to read when the differences are highlighted for me. But perhaps the @wds15 approach combines the best of both worlds - just give everything a preprocessor def and then use those in the code wherever there would be a difference. Just like regular programming but with ifdefs :)

syclik · 2019-03-22T20:43:43Z

Complete agree. I think that makes it a lot easier to read. (If I were to be nit picky, I'd pick a different variable name that doesn't have "THREAD" in the name, but I'm ok with this.)

…

On Fri, Mar 22, 2019 at 4:13 PM seantalts ***@***.***> wrote: I like what @wds15 <https://github.com/wds15> did a lot! re: the examples in your post, @syclik <https://github.com/syclik> - I find it easier to read when the differences are highlighted for me. But perhaps the @wds15 <https://github.com/wds15> approach combines the best of both worlds - just give everything a preprocessor def and then use those in the code wherever there would be a difference. Just like regular programming but with ifdefs :) — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#1171 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAZ_F0Wlmw14uaJoss16SZ9Cpvvhy4mqks5vZTlkgaJpZM4cChJ9> .

wds15 · 2019-03-23T07:45:56Z

@rok-cesnovar Weird...I haven't done anything wrt to the assert, but now it just passes. I like if problems go away without any effort, but that is odd.

rok-cesnovar · 2019-03-23T09:16:23Z

Yeah, its pretty random, since it passed both the PR tests as well as the tests on develop and only showed up on this PR. #1172 is finished, it just needs a re-approval after the last changes, so we wont have to deal with that anymore.

…tls-v4

syclik · 2019-03-24T03:57:18Z

stan/math/prim/mat/functor/map_rect.hpp

 #ifndef STAN_MATH_PRIM_MAT_FUNCTOR_MAP_RECT_HPP
 #define STAN_MATH_PRIM_MAT_FUNCTOR_MAP_RECT_HPP

+#include <stan/math/prim/scal/meta/return_type.hpp>


[minor]
Alphabetize includes. Move this down a few lines.

I am not sure where to move given your comments. I order these things by dependency scal/arr/mat and then alphabetically.

In prim, these should be strictly alphabetical. I know our codebase hasn't caught up yet, but that's what it should be. For rev, you'll want to bring in the core.hpp first, then everything else in alphabetical order (irrespective of scal/arr/mat).

The files that shouldn't follow this: the big includes like stan/math/prim/mat.hpp. These aren't because they need to include traits in the right order.

syclik · 2019-03-24T04:01:30Z

stan/math/rev/core/autodiffstackstorage.hpp

 namespace math {

 /**
 * Provides a thread_local singleton if needed. Read warnings below!


Description line should be updated. This struct now always provides access to the stack using the singleton pattern.

I hope what I changed is fine.

syclik · 2019-03-24T04:46:26Z

stan/math/rev/core/init_chainablestack.hpp

+namespace math {
+namespace {
+
+ChainableStack global_stack_instance_init;


Please add some doc. Just enough to let the user know that this is here for a reason.

stan/math/rev/arr/fun/sum.hpp

stan/math/rev/core/autodiffstackstorage.hpp

syclik · 2019-03-24T05:11:36Z

stan/math/rev/core/autodiffstackstorage.hpp

- * thread_local.  If there is no loss in performance, we can remove
- * this ifdef.
+ * The initialzation of the AD instance is handled by the lifetime of
+ * a AutodiffStackSingleton object. More specifically, the first


This ownership point is important, but it's not as clearly linked to the code (no mention of own_instance_ here) as I think it should be.

If I understand this correctly, the "first instance" if there are no threads is the global one defined in init_chainablestack.hpp?

And if there are threads, this doc doesn't seem right... we don't need to call init() anymore?

Do we really need to link to private members of the class in the doc if the behavior is described? Well, I added some refs now.

Your understanding is correct... but the one in init_chainablestack.hpp does not have to be the first one if the C++ user wants more control over things and links things together so that what is execute first changes, then the first instance will be another one.

I corrected the ill references to init.

syclik · 2019-03-24T05:13:17Z

@wds15, I'll get us set up so the benchmark runs on 3 platforms appropriately.

syclik · 2019-03-24T17:29:52Z

@syclik There is one thing weird in the result you report: When you go on linux from 1 to 2 threads the time increases... is that a mistake? It does not make sense to me. Also: What time do you report? We should use the wall-clock time as reported by time, which is the first line which starts with real from time. The reported run-time from cmdstan is the total CPU time which is not what we want.

@wds15, thanks for pointing that out. I'm reporting the real time. It is weird. I'll run it on the new code and see if that is still there. If you want to see all the numbers: https://github.com/syclik/benchmark-stan/blob/0b12f814bdc9dbd52ca8b38c87faba613c27738d/time.txt

(old = develop, new = feature/faster-ad-tls-v3)

Although the seeds across Mac and Linux were the same, the actual execution is different, so it's hard to compare across the two. What is important is that fixing the seed on a platform, we're computing exactly the same thing and so it's down to how fast we can compute that. You're right: going from 1 to 2 threads takes longer on Linux. That's fine... it happens under both develop and that version. It might be something we look into, but we don't have to do it here.

But this is why we need to run it on all platforms with the compilers we've set. Although reason would say that confirming on one platform should be good enough, it turns out it's not.

wds15 · 2019-03-24T18:04:16Z

@syclik I still think that relative statements should hold most of the time and I would be good with knowing that things speedup only on Mac, for example. This should by all means translate to the other platforms as well. Since we do not have an automated platform I would rather save our time for doing all this manually.

Now, if we would have an automated facility for running things on all platforms we want to look into ... that is preferable, of course; but until that does not exist we can really make decisions without that from my perspective.

... but you seem to be very into the benchmarks for this one - so I assume you run your code once more and do an update with the v4 version of this PR on the hash 13cf353, right?

wds15 · 2019-03-24T18:12:25Z

And I am actually relatively sure that in your Linux 2-thread runs something went wrong. I did run exactly this example with threading and MPI under Linux, see

https://github.com/stan-dev/stancon_talks/blob/master/2018-helsinki/Contributed-Talks/weber/stancon18-master/stancon18_odeWild.pdf

on page 10. There I am getting a ~1.2x speedup for threading.

syclik · 2019-03-25T21:45:44Z

test/unit/math/mix/scal/meta/broadcast_array_test.cpp

 #include <gtest/gtest.h>
 #include <stan/math/fwd/core/fvar.hpp>
 #include <stan/math/rev/core/var.hpp>
+#include <stan/math/rev/core/init_chainablestack.hpp>


@wds15, do you know why this test needs this when the rest don't? It looks like the includes for the test don't actually have a mix header in there; maybe that would fix it?

mix headers probably fix it, yes (but I am not familiar with the mix stuff). It looks to me as if the test does not include the right includes, yes. I do not know mix stuff at all which is why I plugged in the init_chainablestack.hpp.

�Ok. I made an educated guess what the includes should be... but have a look - I am not familiar with mix.

syclik · 2019-04-01T12:30:03Z

Thanks! I just sent you an email off-line.

wds15 · 2019-04-01T14:46:04Z

I can't believe it... we are again stopped at the very same test under Windows. This one breaks now on develop:

StanAgradRevOde.memory_recovery_dv

Any idea? Do we unmerge again?

I have enabled to my best knowledge exactly this test and it did pass.

syclik · 2019-04-01T15:53:54Z

That's not actually where it's stopping. I ran it from the windows box: C:\Users\daniel\math>test\unit\math\rev\arr\functor\coupled_ode_system_test.exe Running main() from lib/gtest_1.8.1/src/gtest_main.cc [==========] Running 18 tests from 1 test case. [----------] Global test environment set-up. [----------] 18 tests from StanAgradRevOde [ RUN ] StanAgradRevOde.coupled_ode_system_dv [ OK ] StanAgradRevOde.coupled_ode_system_dv (0 ms) [ RUN ] StanAgradRevOde.decouple_states_dv [ OK ] StanAgradRevOde.decouple_states_dv (0 ms) [ RUN ] StanAgradRevOde.initial_state_dv [ OK ] StanAgradRevOde.initial_state_dv (0 ms) [ RUN ] StanAgradRevOde.size_dv [ OK ] StanAgradRevOde.size_dv (0 ms) [ RUN ] StanAgradRevOde.memory_recovery_dv [ OK ] StanAgradRevOde.memory_recovery_dv (0 ms) [ RUN ] StanAgradRevOde.memory_recovery_exception_dv It stops there. There's definitely an error there. I don't know what it is yet.

…

On Mon, Apr 1, 2019 at 10:46 AM wds15 ***@***.***> wrote: I can't believe it... we are again stopped at the very same test under Windows. This one breaks now on develop: StanAgradRevOde.memory_recovery_dv Any idea? Do we unmerge again? I have enabled to my best knowledge exactly this test and it did pass. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#1171 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAZ_F1Afrsb-jAjKlTVPGZ5nGEpKVekHks5vchusgaJpZM4cChJ9> .

wds15 · 2019-04-01T16:10:08Z

There is something very odd going on... and I almost doubt that this PR is the root cause ... but as there is no other change it definitely looks like, sure.

What really confuses me is that exactly this test was OK on the commit 1031685 . You can have a look here:

http://d1m1s1b1.stat.columbia.edu:8080/blue/organizations/jenkins/Math%20Pipeline/detail/PR-1171/13/pipeline/90

The logs there do include exactly this test and show that it did run just fine... and this is what totally confuses me.

Is there any difference in how the tests are run on develop vs the one I linked in above. Maybe @seantalts has an idea?

syclik · 2019-04-01T16:16:15Z

There is definitely something very odd going on. I agree.

If I run it from the python script, it runs fine. If I just run the executable, it stops. I have no idea what could cause that:

python script:

>.\runTests.py test/unit/math/rev/arr/functor/coupled_ode_system_test.cpp
------------------------------------------------------------
make -j1 test/unit/math/rev/arr/functor/coupled_ode_system_test.exe
make: 'test/unit/math/rev/arr/functor/coupled_ode_system_test.exe' is up to date.
------------------------------------------------------------
test\unit\math\rev\arr\functor\coupled_ode_system_test.exe --gtest_output="xml:test/unit/math/rev/arr/functor/coupled_ode_system_test.xml"
Running main() from lib/gtest_1.8.1/src/gtest_main.cc
[==========] Running 18 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 18 tests from StanAgradRevOde
[ RUN      ] StanAgradRevOde.coupled_ode_system_dv
[       OK ] StanAgradRevOde.coupled_ode_system_dv (0 ms)
[ RUN      ] StanAgradRevOde.decouple_states_dv
[       OK ] StanAgradRevOde.decouple_states_dv (0 ms)
[ RUN      ] StanAgradRevOde.initial_state_dv
[       OK ] StanAgradRevOde.initial_state_dv (0 ms)
[ RUN      ] StanAgradRevOde.size_dv
[       OK ] StanAgradRevOde.size_dv (0 ms)
[ RUN      ] StanAgradRevOde.memory_recovery_dv
[       OK ] StanAgradRevOde.memory_recovery_dv (0 ms)
[ RUN      ] StanAgradRevOde.memory_recovery_exception_dv
[       OK ] StanAgradRevOde.memory_recovery_exception_dv (0 ms)
[ RUN      ] StanAgradRevOde.coupled_ode_system_vd
[       OK ] StanAgradRevOde.coupled_ode_system_vd (0 ms)
[ RUN      ] StanAgradRevOde.decouple_states_vd
[       OK ] StanAgradRevOde.decouple_states_vd (0 ms)
[ RUN      ] StanAgradRevOde.initial_state_vd
[       OK ] StanAgradRevOde.initial_state_vd (0 ms)
[ RUN      ] StanAgradRevOde.size_vd
[       OK ] StanAgradRevOde.size_vd (0 ms)
[ RUN      ] StanAgradRevOde.memory_recovery_vd
[       OK ] StanAgradRevOde.memory_recovery_vd (0 ms)
[ RUN      ] StanAgradRevOde.memory_recovery_exception_vd
[       OK ] StanAgradRevOde.memory_recovery_exception_vd (0 ms)
[ RUN      ] StanAgradRevOde.coupled_ode_system_vv
[       OK ] StanAgradRevOde.coupled_ode_system_vv (0 ms)
[ RUN      ] StanAgradRevOde.decouple_states_vv
[       OK ] StanAgradRevOde.decouple_states_vv (0 ms)
[ RUN      ] StanAgradRevOde.initial_state_vv
[       OK ] StanAgradRevOde.initial_state_vv (0 ms)
[ RUN      ] StanAgradRevOde.size_vv
[       OK ] StanAgradRevOde.size_vv (0 ms)
[ RUN      ] StanAgradRevOde.memory_recovery_vv
[       OK ] StanAgradRevOde.memory_recovery_vv (0 ms)
[ RUN      ] StanAgradRevOde.memory_recovery_exception_vv
[       OK ] StanAgradRevOde.memory_recovery_exception_vv (0 ms)
[----------] 18 tests from StanAgradRevOde (135 ms total)

[----------] Global test environment tear-down
[==========] 18 tests from 1 test case ran. (157 ms total)
[  PASSED  ] 18 tests.

Running the exectuable:

>test\unit\math\rev\arr\functor\coupled_ode_system_test.exe
Running main() from lib/gtest_1.8.1/src/gtest_main.cc
[==========] Running 18 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 18 tests from StanAgradRevOde
[ RUN      ] StanAgradRevOde.coupled_ode_system_dv
[       OK ] StanAgradRevOde.coupled_ode_system_dv (0 ms)
[ RUN      ] StanAgradRevOde.decouple_states_dv
[       OK ] StanAgradRevOde.decouple_states_dv (0 ms)
[ RUN      ] StanAgradRevOde.initial_state_dv
[       OK ] StanAgradRevOde.initial_state_dv (0 ms)
[ RUN      ] StanAgradRevOde.size_dv
[       OK ] StanAgradRevOde.size_dv (0 ms)
[ RUN      ] StanAgradRevOde.memory_recovery_dv
[       OK ] StanAgradRevOde.memory_recovery_dv (0 ms)
[ RUN      ] StanAgradRevOde.memory_recovery_exception_dv

and it stops there.

So far, this is repeatable.

syclik · 2019-04-01T16:22:40Z

@wds15, bad news: this PR changes that behavior.

syclik · 2019-04-01T16:23:35Z

I just went back and ran that exact test using those two commands (runTests and running the executable). No problems running both.

syclik · 2019-04-01T16:41:29Z

Mind reverting the PR? We'll try to get to what's happening, but for now, let's get develop back into a working state. (I won't get back to it for at least a few hours.)

wds15 · 2019-04-01T17:03:35Z

Can we just push revert in this PR? Or do we need a new issue, new PR & another v5 PR for this one?

What I still do not understand is that the test did pass our pipeline in the git hash I quoted above.

Super weird. Maybe @ariddell or @ahartikainen have an idea?

riddell-stan · 2019-04-01T17:26:37Z

We have seen some exotic TLS bugs with specific versions of Mingw-w64. Is that the compiler that Windows is using?

wds15 · 2019-04-01T18:06:52Z

The tests which break are not using STAN_THREADS, but yes... this is the compiler mingw-w64 gcc 4.9.3.

@syclik

Do we have multiple windows test machines and this tests fails on one, but not the other maybe?
Are u aware of differences in shells being used by python and what you use on the command line?
Could we try using the powershell?
Could we try using mingw32-make to execute the tests (which uses a different shell)?

(I am not saying this should be tried in develop which need to revert; just some ideas I would follow).

wds15 · 2019-04-01T18:10:16Z

Only for the record: The test failing now on Windows develop did pass here:

http://d1m1s1b1.stat.columbia.edu:8080/blue/rest/organizations/jenkins/pipelines/Math%20Pipeline/branches/PR-1171/runs/13/nodes/90/steps/125/log/?start=0

StanAgradRevOde.memory_recovery_exception_dv passes just fine there.

riddell-stan · 2019-04-01T18:20:05Z

We've only encountered these unusual bugs with STAN_THREADS turned on, so perhaps our experience isn't useful. One thing that might be helpful: if you can get the Windows exit code associated with the crash you can often learn something about the cause of the crash. (This was new to me since you can't do this with Linux.)

…

On 4/1/19 2:06 PM, wds15 wrote: The tests which break are not using STAN_THREADS, but yes... this is the compiler mingw-w64 gcc 4.9.3. @syclik <https://github.com/syclik> * Do we have multiple windows test machines and this tests fails on one, but not the other maybe? * Are u aware of differences in shells being used by python and what you use on the command line? * Could we try using the powershell? * Could we try using mingw32-make to execute the tests (which uses a different shell)? (I am not saying this should be tried in develop which need to revert; just some ideas I would follow). — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#1171 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AmFA1sZekxVbpBvsnQaxleZ4TBrhOWIpks5vckq-gaJpZM4cChJ9>.

syclik · 2019-04-01T18:37:27Z

I just reverted the merge and created a PR.

syclik · 2019-04-02T12:13:43Z

I'll try to grab the windows exit code. @wds15, - I've only tried on one machine. - sure. Are you thinking it's a permissions thing? Maybe Python subprocess has a different shell environment? - where is the other make from? On Mon, Apr 1, 2019 at 2:20 PM riddell-stan <[email protected]> wrote:

…

We've only encountered these unusual bugs with STAN_THREADS turned on, so perhaps our experience isn't useful. One thing that might be helpful: if you can get the Windows exit code associated with the crash you can often learn something about the cause of the crash. (This was new to me since you can't do this with Linux.) On 4/1/19 2:06 PM, wds15 wrote: > The tests which break are not using STAN_THREADS, but yes... this is the > compiler mingw-w64 gcc 4.9.3. > > @syclik <https://github.com/syclik> > > * Do we have multiple windows test machines and this tests fails on > one, but not the other maybe? > * Are u aware of differences in shells being used by python and what > you use on the command line? > * Could we try using the powershell? > * Could we try using mingw32-make to execute the tests (which uses a > different shell)? > > (I am not saying this should be tried in develop which need to revert; > just some ideas I would follow). > > — > You are receiving this because you commented. > Reply to this email directly, view it on GitHub > <#1171 (comment)>, or > mute the thread > < https://github.com/notifications/unsubscribe-auth/AmFA1sZekxVbpBvsnQaxleZ4TBrhOWIpks5vckq-gaJpZM4cChJ9 >. > — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#1171 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAZ_F_h51d2hl9wnT2tvxPaagCd6EDteks5vck3WgaJpZM4cChJ9> .

syclik · 2019-04-02T12:13:43Z

Also, we can build without make at all if you think it’s a build problem.

…

On Tue, Apr 2, 2019 at 8:11 AM Daniel Lee ***@***.***> wrote: I'll try to grab the windows exit code. @wds15, - I've only tried on one machine. - sure. Are you thinking it's a permissions thing? Maybe Python subprocess has a different shell environment? - where is the other make from? On Mon, Apr 1, 2019 at 2:20 PM riddell-stan ***@***.***> wrote: > We've only encountered these unusual bugs with STAN_THREADS turned on, > so perhaps our experience isn't useful. > > One thing that might be helpful: if you can get the Windows exit code > associated with the crash you can often learn something about the cause > of the crash. (This was new to me since you can't do this with Linux.) > > On 4/1/19 2:06 PM, wds15 wrote: > > The tests which break are not using STAN_THREADS, but yes... this is > the > > compiler mingw-w64 gcc 4.9.3. > > > > @syclik <https://github.com/syclik> > > > > * Do we have multiple windows test machines and this tests fails on > > one, but not the other maybe? > > * Are u aware of differences in shells being used by python and what > > you use on the command line? > > * Could we try using the powershell? > > * Could we try using mingw32-make to execute the tests (which uses a > > different shell)? > > > > (I am not saying this should be tried in develop which need to revert; > > just some ideas I would follow). > > > > — > > You are receiving this because you commented. > > Reply to this email directly, view it on GitHub > > <#1171 (comment)>, > or > > mute the thread > > < > https://github.com/notifications/unsubscribe-auth/AmFA1sZekxVbpBvsnQaxleZ4TBrhOWIpks5vckq-gaJpZM4cChJ9 > >. > > > > — > You are receiving this because you were mentioned. > Reply to this email directly, view it on GitHub > <#1171 (comment)>, or mute > the thread > <https://github.com/notifications/unsubscribe-auth/AAZ_F_h51d2hl9wnT2tvxPaagCd6EDteks5vck3WgaJpZM4cChJ9> > . >

wds15 · 2019-04-02T13:31:39Z

That other make is part of RTools and is already on the machine. Everything builds on windows with it ... I am using it already in the TBB PR which I am still working on, but the TBB Windows build works great with mingw32-make (and one of the differences to the vanilla make is the shell being used).

wds15 · 2019-04-03T12:37:58Z

Maybe this is useful:

http://mail.openjdk.java.net/pipermail/build-dev/2016-March/016783.html

Which means to set these option during compilation with gcc:

-fno-delete-null-pointer-checks and -fno-lifetime-dse

... this is wild guess.

syclik · 2019-04-08T13:06:23Z

@wds15 and @ariddell: here's what I get:

C:\Users\daniel\math>echo Error code is %errorlevel%
Error code is -1073741819

It looks like it's a permissions thing, but I'm not sure what that means. And why it's just for this test.

@syclik

Do we have multiple windows test machines and this tests fails on one, but not the other maybe?

I'll use docker and spin one up this week.

Are u aware of differences in shells being used by python and what you use on the command line?

No idea. I'm not that great with Python's internals, so I don't know what subprocess actually does at this level.

Could we try using the powershell?

Short: it runs ok from powershell. It still does not work from the normal command prompt.

I git cleaned from the powershell.

I rebuilt the test.

I ran the test from the powershell and it was ok.

I ran the same executable from a normal terminal and it was not ok.

Could we try using mingw32-make to execute the tests (which uses a different shell)?

I don't understand what that means.

Which means to set these option during compilation with gcc:

-fno-delete-null-pointer-checks and -fno-lifetime-dse

This didn't help. I added this to make/local and completely rebuilt that test: CXX=g++ -fno-delete-null-pointer-checks -fno-lifetime-dse

riddell-stan · 2019-04-08T15:22:01Z

Exit code -1073741819 (0xC0000005) is, from what I have read, a memory access violation error.

riddell-stan · 2019-04-09T13:06:57Z

I think you can debug this sort of error with windbg or cdb attached to the faulting process.

riddell-stan · 2019-04-13T14:44:39Z

@syclik @wds15 is there a single command one can run to get the crash? If there is, perhaps we could find someone with some experience using the Windows debugger to find out more information? @ahartikainen have you used windgb or cdb before?

wds15 · 2019-04-13T15:46:50Z

Well, a unit test is failing under the cmd shell while the test works when called from python or from the PowerShell on Windows. On the other platforms the test works. I suspect a bug in some software, but I doubt that it is Stan related - gcc or googletest are good candidates here. We should try what happens without optimizations from the compiler or even change the test a bit. The test uses a recover memory in the tear down command. Maybe moving that to the startup command fixes the issue already. This is a really weird thing...and I hope very much we get some fix for it, but I am not sure if we will understand the root cause (would be nice if we get there, of course).

If someone has a windows debugger at hand and working...that would be great.

syclik · 2019-04-13T16:39:14Z

Yes, there is an example that can be run.

If you want to run it directly:

make test/unit/math/rev/arr/functor/coupled_ode_system_test.exe
test\unit\math\rev\arr\functor\coupled_ode_system_test.exe

That will build first then run.

If you want to use the script:

.\runTests.py test/unit/math/rev/arr/functor/coupled_ode_system_test.cpp

This one is tricky. It might be as @wds15 suggested. Or not. My guess would be that assumptions about when objects are created and destroyed may not be valid, but that's just a guess at this moment until I can verify.

riddell-stan · 2019-04-17T15:21:48Z

I suspect it's a mingw specific error. But we are stuck with mingw, so I hope there is a workaround. And I suspect there is a workaround, if it's a mingw issue; many people are using TLS now.

Supporting TLS on Windows is just about the most important thing for PyStan going forward. I'm tempted to find a Windows VPS (Azure?) and debug this myself.

(Should this discussion be moved to the new PR?)

syclik · 2019-04-17T15:24:02Z

I don't think it is. I think this has uncovered a bug. I'm still trying to pin down the exact condition; I've spent more than a few hours debugging. I'll ended a few more to get to something repeatable.

…

On Wed, Apr 17, 2019 at 11:21 AM riddell-stan ***@***.***> wrote: I suspect it's a mingw specific error. But we are stuck with mingw, so I hope there is a workaround. And I suspect there *is* a workaround, if it's a mingw issue; many people are using TLS now. Supporting TLS on Windows is just about the most important thing for PyStan going forward. I'm tempted to find a Windows VPS (Azure?) and debug this myself. (Should this discussion be moved to the new PR?) — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#1171 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAZ_FwQM4SpX4FSYXQGjBzXAc_EdPL2Pks5vhzwNgaJpZM4cChJ9> .

wds15 added 3 commits March 21, 2019 23:40

Revert "Revert "Feature/faster ad tls v3""

7d521fd

This reverts commit 883d781.

make explicit use of pointer to AD tape for faster access without rel…

236b6d9

…ying on compiler optimizations

make windows test part of the PR test pipeline

98e09e3

fix left instance access by method

0bf46b0

wds15 changed the title ~~WIP Feature/faster ad tls v4~~ Feature/faster ad tls v4 Mar 22, 2019

reduce ifdef-branching through definition of STAN_THREADS_DEF in depe…

49df857

…ndence of STAN_THREADS & compiler extension supported

better doc the branching ifdef logic by rearranging

0839904

Merge remote-tracking branch 'origin/develop' into feature/faster-ad-…

d9007f5

…tls-v4

syclik reviewed Mar 24, 2019

View reviewed changes

address reviewer comments

13cf353

syclik reviewed Mar 25, 2019

View reviewed changes

uses standard includes

339d989

syclik mentioned this pull request Apr 1, 2019

Revert "Feature/faster ad tls v4" #1193

Merged

5 tasks

syclik mentioned this pull request Apr 19, 2019

Problem with coupled_ode_system #1210

Closed

wds15 mentioned this pull request Apr 22, 2019

Faster TLS v5 #1212

Merged

5 tasks

serban-nicusor-toptal added this to the 2.19.2 milestone Jul 18, 2019

Uh oh!

Feature/faster ad tls v4 #1171

Feature/faster ad tls v4 #1171

Uh oh!

Conversation

wds15 commented Mar 21, 2019 • edited by syclik Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Side-effects

Tests

Checklist

Remaining TODO

End-to-end performance results

Summary

Data (time in seconds per run)

Uh oh!

syclik commented Mar 22, 2019

Uh oh!

wds15 commented Mar 22, 2019

Uh oh!

rok-cesnovar commented Mar 22, 2019

Uh oh!

wds15 commented Mar 22, 2019

Uh oh!

seantalts commented Mar 22, 2019

Uh oh!

syclik commented Mar 22, 2019 via email

Uh oh!

bob-carpenter commented Mar 22, 2019 via email

Uh oh!

syclik commented Mar 22, 2019

Uh oh!

wds15 commented Mar 22, 2019

Uh oh!

syclik commented Mar 22, 2019 via email

Uh oh!

seantalts commented Mar 22, 2019

Uh oh!

syclik commented Mar 22, 2019 via email

Uh oh!

wds15 commented Mar 23, 2019

Uh oh!

rok-cesnovar commented Mar 23, 2019

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

syclik commented Mar 24, 2019

Uh oh!

syclik commented Mar 24, 2019

Uh oh!

wds15 commented Mar 24, 2019

Uh oh!

wds15 commented Mar 24, 2019

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

syclik commented Apr 1, 2019

Uh oh!

wds15 commented Mar 21, 2019 •

edited by syclik

Loading