Kap performance
Performance, let's talk about it.
Making the language highly performant has never been a primary concern. Rather, the language started as an experiment to see whether it was possible to create a version of APL where results were not computed immediately, but only when they were needed.
This came out of the observation that many simple APL algorithms are very expensive in terms of complexity. Often, some permutation of the input is computed in full, and then only a small subset of the result is actually used. This leads to situations where the most straightforward solution is \( O(n^2) \) when it should be \( O(n) \). One such example is using ↑⍷ to match the beginning of a string.
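The idea behind the ↑⍷ idiom can be sketched in Python for illustration (the function names here are mine, not Kap's). An eager ⍷ computes a match flag for every position in the string, even when ↑ then keeps only the first one; a lazy producer only computes as many flags as are demanded:

```python
def find_all(haystack, needle):
    # Eager ⍷-style search: test every position up front.
    m = len(needle)
    return [haystack[i:i + m] == needle
            for i in range(len(haystack) - m + 1)]

def find_lazy(haystack, needle):
    # Lazy search: produce one match flag at a time, on demand.
    m = len(needle)
    for i in range(len(haystack) - m + 1):
        yield haystack[i:i + m] == needle

# "Does the string start with the needle?" — the ↑⍷ idiom.
# The eager version scans the whole haystack even though only the
# first flag is used; the lazy one stops after a single position.
starts_with = next(find_lazy("abc" * 100000, "abc"))
```

The eager version is \( O(n \cdot m) \) regardless of how much of its result is consumed; the lazy version does \( O(m) \) work for the prefix test.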
Kap avoids this by never computing the values that would later be thrown away.
Additionally, the use of this form of lazy evaluation allows the runtime to perform other types of optimisations based on what is known about the values and the functions being applied to them. For example, monadic ≡ does not have to evaluate all the values if it knows that the argument only contains scalars.
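A hypothetical sketch, again in Python, of how such a metadata shortcut could work. `LazyArray` and its fields are illustrative stand-ins, not Kap's actual internals:

```python
class LazyArray:
    """Illustrative stand-in for a deferred array (not Kap's real internals)."""
    def __init__(self, elements, all_scalars=False):
        self._elements = elements        # thunk producing the elements
        self.all_scalars = all_scalars   # metadata known without evaluating

def depth(arr):
    # Monadic ≡-style depth: when the metadata already says every
    # element is a scalar, answer 1 without forcing a single element.
    if arr.all_scalars:
        return 1
    return 1 + max((depth(e) if isinstance(e, LazyArray) else 0)
                   for e in arr._elements())
```

Here `depth(LazyArray(expensive_thunk, all_scalars=True))` returns 1 immediately, without the thunk ever running.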
Another benefit of lazy evaluation is that often there is no need to materialise an array to memory. This can help performance since memory accesses tend to be slow.
This approach has a few drawbacks though:
- Primitive operations on arrays that immediately materialise their result are slower than in highly optimised non-lazy interpreters (BQN is the shining star here)
- Performance characteristics of complex expressions can be difficult to understand
The point about avoiding allocation of intermediate results does help to offset the performance impact, but the results can often be surprising (both positively and negatively).
Code_report recently posted a video titled Perf wars: episode 1, in which he compares various languages performing certain low-level operations. Kap fared pretty poorly here, even after a small optimisation that improved its performance more than fourfold.
Specifically, I want to talk about the performance of 1⌽, which was used in the video. This expression simply rotates a one-dimensional array by one element. This post is in no way intended to claim that the test was bad. Maybe it was for some languages, but in the case of Kap it was fair: Kap most certainly is significantly slower than, say, BQN here.
Let's write a version of this program and benchmark it to see how slow it is:
a ← 1000000 ⍴ 1 0 0
time:runtime { ({1⌽a}⍣1000) 0 }
Total time: 1.601
(Note: to reproduce the numbers in this blog post, use the JVM implementation of Kap, and restart the runtime between each test. In some cases, especially in these tests, the JVM compiler can decide not to optimise certain code paths, giving worse performance for the second test. There should be some way to tune the JVM to handle this better, but the default settings exhibit this problem.)
This allocates an array of 1 million elements, and then rotates it 1 step. This is repeated 1000 times in order to get it to take enough time to properly measure.
The time taken for each iteration is taken up by two things:
- Reading every element in the array
- Writing every element to a new array
This happens because ⍣ does not know that the result won't be used, so it simply assumes that the result is needed (to be fair, it could know this, and the optimisation wouldn't be that hard, but generally speaking there is no need to do this, because no one would write such a no-op function in real code).
Not only does every element need to be read and then written into another array; the destination index also has to be recomputed for every cell, as opposed to computing the offset once and then iterating over all the values. Even though this algorithm has been optimised (which is what gave the fourfold speedup mentioned earlier), it's still no match for the highly optimised vector operations performed by BQN.
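The difference between the two strategies can be sketched in Python (the function names are mine). Both compute a 1⌽-style left rotation; the block-copy variant is roughly what a vectorised runtime can do when the whole array is materialised up front:

```python
def rotate_per_cell(a, k):
    # Recompute the source index (a modulo) for every single cell.
    n = len(a)
    return [a[(i + k) % n] for i in range(n)]

def rotate_two_blocks(a, k):
    # Compute the split point once, then copy two contiguous blocks.
    k %= len(a)
    return a[k:] + a[:k]

# Both agree with 1⌽ 1 0 0, which gives 0 0 1.
assert rotate_per_cell([1, 0, 0], 1) == rotate_two_blocks([1, 0, 0], 1) == [0, 0, 1]
```

The per-cell version pays for an index computation on every element, which is the cost a lazy cell-at-a-time evaluator incurs.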
So let's make a small change to the expression:
a ← 1000000 ⍴ 1 0 0
time:runtime { ({+/1⌽a}⍣1000) 0 }
Total time: 0.622
The only change here is that instead of simply returning the rotated array, we're computing the sum of all the numbers as well.
Note what happened: We added another operation to sum the results, but it now became twice as fast. This means that the performance of one operation can be affected by what operations come after.
This is of course because the rotation of the values and the summation can now be done at the same time, and there is no longer any need to write the intermediate array to memory. This shows just how difficult it can be to predict the performance of Kap compared to many other languages.
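The fusion can be illustrated in Python with a generator (again an analogy, not Kap's implementation): the rotation produces values on demand, and the sum consumes them as they appear, so no rotated array is ever written to memory.

```python
def rotated(a, k):
    # A lazy 1⌽-style rotation: values are produced on demand.
    n = len(a)
    for i in range(n):
        yield a[(i + k) % n]

data = [1, 0, 0] * 5
# sum() pulls the rotated values one at a time, so the rotation and
# the summation run together and no intermediate array is allocated.
total = sum(rotated(data, 1))
```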
Putting all of this together, we can really take advantage of lazy evaluation by using the fastest lazy value, the result of monadic ⍳, and sum the results. This avoids both creating the initial array and also avoids creating an output array:
time:runtime { +/⍳1000000000 }
Total time: 0.456
499999999500000000
The result is not computed using a closed-form formula: Kap loops over all the values and sums them. Here, Kap is slightly faster than Dyalog, presumably because it can avoid writing intermediate results.
However, what makes it more interesting is that if you change the program to the following, the time barely changes:
time:runtime { +/1+⍳1000000000 }
Total time: 0.571
500000000500000000
But when trying the same in Dyalog, this version doubles the evaluation time, presumably because Dyalog has to write the result of the addition to an intermediate array, which Kap can avoid.
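Both printed results can be checked against the closed forms \( \sum_{k=0}^{n-1} k = n(n-1)/2 \) and \( \sum_{k=1}^{n} k = n(n+1)/2 \), since monadic ⍳ counts from zero:

```python
n = 1_000_000_000
# +/⍳n sums 0 .. n-1 (⍳ counts from zero); +/1+⍳n sums 1 .. n.
sum_iota = n * (n - 1) // 2           # matches 499999999500000000
sum_one_plus_iota = n * (n + 1) // 2  # matches 500000000500000000
```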
There are more interesting performance analyses that can be done when comparing Kap and other languages. Feel free to reach out on: @loke@functional.cafe