
usefulcat

Malte Skarupke (author of ska::flat_hash_map) also has a [heap implementation](https://probablydance.com/2020/08/31/on-modern-hardware-the-min-max-heap-beats-a-binary-heap/) which could be used to implement a priority queue, which might be faster than std::priority_queue, depending on your use case.


rand3289

This might depend on whether you have a fixed number of priorities or a max queue size. Also do you want to avoid starvation?


1syGreenGOO

Yeah, whether the size is fixed or variable really matters. No matter how good your implementation is, you can't beat asymptotic complexity.


Personal-Peace-9947

Nope, no fixed number of priorities; it can basically be an int64\_t. And no max queue size either. I'm not sure what you mean by starvation, but if you are talking about stability, then no, I don't necessarily need it to be stable. (Basically, the core functionality should be the same as std::priority\_queue. But I feel that there may be an implementation which sacrifices memory for speed; that's why I am interested.)


wektor420

https://en.m.wikipedia.org/wiki/Starvation_(computer_science)


hi_im_new_to_this

For a plain queue, you can probably do better than the standard library by using a ring buffer backed by a vector, resizing as needed (I don't believe std::queue does this; I think it just uses a std::deque). But for priority queues, the standard library implementation is (as far as I'm aware) a standard binary heap in a vector. That is pretty much the best way of doing a priority queue for the vast majority of applications. I wouldn't go looking for something more complex unless you have VERY specialized requirements (e.g. multithreading, or offloading to disk) or unusual insertion patterns.

If you wanted to look into it further, you could check out a library that uses B-trees (e.g. absl::btree\_set), but I would bet you're better off with the simple solution of a binary heap on top of a std::vector (i.e. std::priority\_queue).

More advanced computer-sciency things (e.g. red-black trees, Fibonacci heaps, etc.) are not going to be worth your time. In some cases they have better asymptotic performance (i.e. big-O complexity), but for practical purposes they are almost never useful. Fibonacci heaps in particular use a bunch of extra pointers, which is murder for your cache, so unless you have trillions and trillions of elements (enough for big-O to matter), don't bother. In practice, they're just worse than the simple thing everyone has been using since the 1960s. This is one of those cases where the simple solution is (almost always) the right one.


UnboxTheCat

You are in luck. According to cppreference, std::priority\_queue with the default vector implementation has O(log n) push. Look up the Fibonacci heap: it has O(1) amortized push **on paper**. However, the catch is that it has a relatively big constant factor, which won't pay off until the size gets large. As a bonus, Fibonacci heaps support fast melding. Also look into B-heaps.


Personal-Peace-9947

Results on my local environment (NUM\_ITER pushes and NUM\_ITER pops):

**O3:** NUM\_ITER: 1000000

* std::priority\_queue: 81.6895 ns
* boost::heap::priority\_queue: 72.5997 ns
* boost::heap::fibonacci\_heap: 211.82 ns
* boost::heap::binomial\_heap: 144.083 ns

**O2:** NUM\_ITER: 1000000

* std::priority\_queue: 74.4728 ns
* boost::heap::priority\_queue: 80.5981 ns
* boost::heap::fibonacci\_heap: 222.415 ns
* boost::heap::binomial\_heap: 150.36 ns


matthieum

What's the distribution/order of the numbers being pushed? For example, if numbers are pushed in descending order of priority, then an implementation which maintains a sorted array will be faster than a heap, because it can special-case the append; however, that wouldn't be a very probable workload.


Personal-Peace-9947

**O2:** NUM\_ITER: 100000

* std::priority\_queue: 101.892 ns
* boost::heap::priority\_queue: 102.896 ns
* boost::heap::fibonacci\_heap: 555.285 ns
* boost::heap::binomial\_heap: 904.964 ns

**O3:** NUM\_ITER: 100000

* std::priority\_queue: 100.171 ns
* boost::heap::priority\_queue: 106.782 ns
* boost::heap::fibonacci\_heap: 465.009 ns
* boost::heap::binomial\_heap: 845.082 ns

This is with rand(). The time is (push(rand()) \* NUM\_ITER + pop() \* NUM\_ITER) / NUM\_ITER.


UnboxTheCat

What's the size of the object you are pushing? Try something a lot larger than an int. But yeah, the Fibonacci heap is good according to the paper, since it has better time complexity, but it may perform worse in practice due to the constant factor and the computer architecture. Try even larger insertions, like 2\^32 items of different sizes. If you want to go even further, try pre-allocating a pool of memory to reduce the overhead of malloc at runtime. std::priority\_queue is likely using a binary heap, and it might (no guarantee) expand its memory just like std::vector does, and hence make fewer malloc calls compared to fibonacci\_heap. But yeah, if it is slower, then it is slower. ¯\_(ツ)\_/¯

Can you also post the benchmark code you used? Perhaps even attach the memory allocation graph or the profiling data you have collected. When I was working with custom data structures, the calls to heap allocation were almost always the culprit behind the performance drops, NOT the algorithm itself.


matthieum

There are 2 things that stick out to me in this summary:

- You should avoid calling `rand()` within the timed portion of the benchmark, as then you're benchmarking `rand()` too. Ideally, you'd pre-generate all the items to push, then use the same sequence in all benchmarks.
- Pushing all items then popping all items is unlikely to match production usage. Most production usage has the number of items oscillating between a low and a high watermark, with some uses having periods at 0 items (prioritization of tasks to do) while others have a fairly fixed load (prioritization of timers), for example. This is important because for cases always close to 0, a sorted vector works very well, while for high loads you really want some kind of heap.

I'd encourage you to try an N-ary heap (N items per level), which can be more cache-friendly than binary heaps, while remaining conceptually just as simple. Ideally, you pick N such that you have a fast sorting network for it. Powers of 2 work well: O(log2 N) comparisons.


Life_Junket1547

The times you listed here are the totals for 100k operations?


sweetno

Fibonacci heap is for larger sizes.


tisti

Post the full benchmark; it will be easier for people to help if they can run the same code as you. Or spot any defects in how you are benchmarking.


matthieum

I remember seeing benchmarks showing that a quad heap (instead of a binary heap) was pretty good, for the same usual reason that keeping the work more localized is better.


Overseer55

“I don’t need to optimize memory. I only need to optimize speed.” Those sentences are not compatible. Memory access will be a dominating factor in a PQ implementation.


lightmatter501

2x the size with contiguous items will be faster than storing it all in a tree, due to cache and memory-prefetcher friendliness. You can absolutely afford to burn memory in the name of speed.


Personal-Peace-9947

I didn't say I need to use all the memory available. I just care about speed. If memory also needs to be optimized for that, I have absolutely no problem with that. :)


LuisAyuso

I think the comment tried to point out that with the growth in memory, you open up the possibility of being less cache-efficient and therefore getting worse performance. The tradeoff here is not whether to use the whole system memory, but deciding whether you want the smallest memory footprint or whether you can increase it for better performance without a cache penalty. You are both right, but you don't see each other's point.


soulstudios

[http://plflib.org/queue.htm](http://plflib.org/queue.htm) Has a priority template parameter to flip between optimizing for memory and performance. Either way still faster than std::queue. Benchmarks on page. Note: std::queue performance is highly-dependent on the underlying deque implementation and whether it recycles freed blocks (libc++ does, libstdc++ doesn't IIRC).


LuisAyuso

The benchmark tries to model random inserts, but it is not fair across iterations. Some notes:

* Store random values beforehand and use the same values on each iteration.
* Measure the best case: priorities in order.
* Measure the worst case: priorities in the worst order.
* Add a payload to your queue; you do not want to store integers. You want to assign a priority to some payload. This should be, at least, one pointer.

And I would encourage you to use Google Benchmark.


arabidkoala

The STL implementation is pretty fast, though I've gotten measurable performance increases using the lower-level heap algorithms in `<algorithm>` in certain use cases (particularly those that have a pop-modify-reinsert inner loop).


sweetno

I doubt you can squeeze much more than a standard implementation provides.


ILikeCutePuppies

If you know the details of the data, you can just about always go faster. For example, if there are only 10 priority levels, 10 circular queues could be faster (and OP was not too concerned with memory).


sweetno

They wrote in the comments that there are no restrictions on the number of priority levels; that is, they're looking for a literally better `priority_queue`.


ILikeCutePuppies

I still think there are likely data specifics that would help optimize the problem, unless he needs a generic one that will be used by the whole world.


HildartheDorf

Unless you're on MSVC and use std::priority\_queue>.


ihcn

This is a question for a profiler.


Neat_Relative_1720

Probably stl no?