fix: temporarily use unary_transform instead of segmented_reduce #3814
Conversation
Temporary replacement until NVIDIA/cccl#6171 is fixed. Also, a relevant PR with discussion: #3763
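For context on the two primitives involved: a segmented reduce collapses each sublist to a single value, while a unary transform applies a function independently per element; the workaround computes each segment's reduction as one independent "transform" over the segment index. A CPU-side NumPy sketch of that idea (illustrative only; `argmax_per_segment` is a hypothetical name, not the PR's actual kernel):

```python
import numpy as np

def argmax_per_segment(offsets, values):
    """Emulate a segmented argmax with a per-segment transform:
    output element i depends only on segment i, so each segment
    can be processed by an independent element-wise operation."""
    n_segments = len(offsets) - 1
    out = np.full(n_segments, -1, dtype=np.int64)  # -1 marks empty segments
    for i in range(n_segments):  # one "transform" per segment
        start, stop = offsets[i], offsets[i + 1]
        if stop > start:
            out[i] = int(np.argmax(values[start:stop]))  # index local to the segment
    return out

# Same jagged data as the benchmark below: [[1], [2, 3], [4, 5], [6, 7, 1, 8], [], [9]]
offsets = np.array([0, 1, 3, 5, 9, 9, 10])
values = np.array([1, 2, 3, 4, 5, 6, 7, 1, 8, 9])
print(argmax_per_segment(offsets, values).tolist())  # [0, 1, 1, 3, -1, 0]
```

On the GPU the loop body would run as one thread per segment; the sketch only shows why the element-wise formulation is equivalent for this reduction.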
The documentation preview is ready to be viewed at http://preview.awkward-array.org.s3-website.us-east-1.amazonaws.com/PR3814
Hello @shwina! I'm running into another error related to the NVIDIA/cccl#7121 fix. This time it's in the
The easiest way to reproduce this:
What is interesting is that I can't reproduce this error directly on the main branch of cccl. Has it already been fixed? Please check it out. Here is the full error:
Fix in NVIDIA/cccl#7321. We'll push out a release today with this fix so that you don't have to work off of
And thanks to you too!
Hello @shwina! I'm getting another error :( Do you know what might cause it? I'm using
Full error code if you need it:
@maxymnaumchyk -- thanks, could you tell me how to reproduce what you're seeing? I tried the following (on your branch):
and it seemed to complete without errors.
I can see I'm clearly missing a step since CI is failing :)
Oh - I see the problem. CI is pulling in
Can you try with the constraint
edit: In the meantime I'll update our own constraints.
Yes, thanks Ashwin! It was indeed a problem with the versions of my packages. I'll take a deeper look into it tomorrow~
@maxymnaumchyk for now, can you bypass
It's cheating, but unfortunately RAPIDS won't be relaxing their numba-cuda pins until their next release :(
Alternatively, if you think this is more appropriate, that's fine with me too:
Right now, this implementation works, but very slowly. Running this script:
shows:
The issue here is that the
One fix would be to pass the same
We have a better solution for this in
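The (truncated) diagnosis above points at the same object being rebuilt on every call instead of reused. A generic way to avoid that kind of repeated setup cost is to memoize the expensive build step, sketched here with the standard library only (`get_reducer` and the "build" are hypothetical stand-ins, not Awkward's or cccl's actual API):

```python
import functools

@functools.lru_cache(maxsize=None)
def get_reducer(dtype_name):
    # Hypothetical stand-in for an expensive per-call build step
    # (e.g. JIT-compiling a kernel). We count how often it really runs.
    get_reducer.builds += 1
    return f"reducer<{dtype_name}>"

get_reducer.builds = 0

# Repeated calls with the same key reuse the cached object:
for _ in range(10):
    reducer = get_reducer("int64")

print(get_reducer.builds)  # 1 -- the build ran once despite 10 calls
```

The cache key would be whatever determines the compiled artifact (dtype, operation), so distinct inputs still get their own build.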
Thanks @shwina! That's good to know. Meanwhile, I'll try to figure out how to pass
With the latest

```python
import awkward as ak
import cupy as cp
import timeit

awkward_array = ak.Array([[1], [2, 3], [4, 5], [6, 7, 1, 8], [], [9]], backend='cuda')

# first time, ak.argmax:
_ = ak.argmax(awkward_array, axis=1)  # warmup
start_time = timeit.default_timer()
for i in range(10):
    expect = ak.argmax(awkward_array, axis=1)
cp.cuda.Device().synchronize()
end_time = timeit.default_timer()
print(f"Time taken for ak.argmax: {(end_time - start_time) / 10} seconds")
```
Awesome! Do you have a planned release?
Yes, should be available on pip/conda now! Thanks! |
Hello @shwina, there is currently a bug(?) with how argmin will use the precompiled
Thanks @maxymnaumchyk. Yes, it's definitely a bug. I'm looking into it.
FYI, this should be resolved now with the latest release (0.5.1). Regarding the benchmark you're running, can you try:

```python
import awkward as ak
import cupy as cp
import timeit

awkward_array = ak.Array([[1], [2, 3], [4, 5], [6, 7, 1, 8], [], [9]], backend='cuda')

# first time, ak.argmax:
_ = ak.argmax(awkward_array, axis=1)  # warmup
cp.cuda.Device().synchronize()  # <--------------------- insert this synchronize
start_time = timeit.default_timer()
for i in range(10):
    expect = ak.argmax(awkward_array, axis=1)
cp.cuda.Device().synchronize()
end_time = timeit.default_timer()
print(f"Time taken for ak.argmax: {(end_time - start_time) / 10} seconds")
```

Does that change the numbers you're seeing?
Thanks! Unfortunately not :( I still see the same kind of performance.
Is the result consistent for you? I installed the newest
Hmm - yes, I consistently see about ~1ms per iteration (tested just now). What does the nsys profile look like if you do:

```python
import nvtx
import awkward as ak
import cupy as cp

awkward_array = ak.Array([[1], [2, 3], [4, 5], [6, 7, 1, 8], [], [9]], backend='cuda')
_ = ak.argmax(awkward_array, axis=1)  # warmup
cp.cuda.Device().synchronize()

with nvtx.annotate("running argmax..."):
    ak.argmax(awkward_array, axis=1)
    cp.cuda.Device().synchronize()
```
Which version of
I'm using
Hmm, it's hard for me to say what's going on. At such tiny data sizes, the GPU is hardly even going to be utilized; the runtime is dominated by CPU overhead. I'll benchmark (tomorrow) for much larger data sizes, which should make things more apparent. Hopefully for those data sizes we'll see comparable results.
Thanks! For reference, these are the results I'm getting:
Script code
@maxymnaumchyk can you point me to where I can get
You could also take a look at what's happening on the CPU side with

```python
import awkward as ak
import cupy as cp
import timeit
import cProfile

awkward_array = ak.to_backend(ak.from_parquet("random_listoffset.parquet"), 'cuda')
_ = ak.argmax(awkward_array, axis=1)  # warmup
cp.cuda.Device().synchronize()

pr = cProfile.Profile()
pr.enable()
start_time = timeit.default_timer()
for i in range(10):
    expect = ak.argmax(awkward_array, axis=1)
cp.cuda.Device().synchronize()
end_time = timeit.default_timer()
pr.disable()
pr.dump_stats("argmax.prof")
print(f"Time taken for ak.argmax: {(end_time - start_time) / 10} seconds (~2500 Mb array)")
```

And then in the command-line (you'll have to
Here's what I get for randomly generated data:

Script

```python
import nvtx
import numpy as np
import cupy as cp
import awkward as ak
import timeit

def create_random_listoffsest_array(num_lists):
    # Create random jagged data
    np.random.seed(42)
    mean_len = 10
    max_len = 40
    # Generate random lengths and build offsets
    lengths = np.random.poisson(lam=mean_len, size=num_lists).astype(np.int64)
    lengths = np.clip(lengths, 0, max_len)
    offsets = np.empty(num_lists + 1, dtype=np.int64)
    offsets[0] = 0
    np.cumsum(lengths, out=offsets[1:])
    total_values = int(offsets[-1])
    values = np.arange(total_values, dtype=np.int64)
    # Build awkward array
    layout = ak.contents.ListOffsetArray(
        ak.index.Index64(offsets),
        ak.contents.NumpyArray(values)
    )
    arr = ak.Array(layout)
    return ak.to_backend(arr, "cuda")

def sizeof_fmt(num, suffix="B"):
    # https://stackoverflow.com/a/1094933
    for unit in ("", "Ki", "Mi", "Gi", "Ti", "Pi", "Ei", "Zi"):
        if abs(num) < 1024.0:
            return f"{num:3.1f}{unit}{suffix}"
        num /= 1024.0
    return f"{num:.1f}Yi{suffix}"

for num_lists in [100_000, 1_000_000, 10_000_000, 100_000_000]:
    gpu_arr = create_random_listoffsest_array(num_lists)
    # warmup
    expect = ak.argmax(gpu_arr, axis=1)
    cp.cuda.Device().synchronize()
    start_time = timeit.default_timer()
    for i in range(10):
        expect = ak.argmax(gpu_arr, axis=1)
    cp.cuda.Device().synchronize()
    end_time = timeit.default_timer()
    print(f"Time taken for ak.argmax: {(end_time - start_time) / 10} seconds (array size: {sizeof_fmt(gpu_arr.nbytes)})")
```
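Besides the command-line viewer mentioned above, the `argmax.prof` dump can also be inspected directly from Python with the standard library's `pstats`. A self-contained sketch (it profiles a dummy function instead of `ak.argmax`, so it runs without a GPU; the filename is illustrative):

```python
import cProfile
import pstats
import io

def work():
    # Dummy workload standing in for the profiled argmax loop
    return sum(i * i for i in range(100_000))

pr = cProfile.Profile()
pr.enable()
work()
pr.disable()
pr.dump_stats("example.prof")  # same format as argmax.prof above

# Load the dump and list the biggest contributors
stream = io.StringIO()
stats = pstats.Stats("example.prof", stream=stream)
stats.sort_stats("cumulative").print_stats(5)  # top 5 by cumulative time
print(stream.getvalue())
```

Sorting by `"tottime"` instead of `"cumulative"` highlights functions that are expensive themselves rather than those that merely call expensive children, which is often more useful when hunting per-call CPU overhead.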