Elias Mårtenson

@loke@functional.cafe

Some time ago, I read this article: Why pandas feels clunky when coming from R. In it, the author explains why they feel that R is a much smoother tool than Pandas.

I'm not familiar with Pandas, but I do know a bit of R, so when I recently implemented some new features in Kap, I decided that reimplementing the examples from the blog post in Kap might be a good way to demonstrate the differences between the languages.

Spoiler: the Kap solutions are shorter, but R has some nice defaults that have to be specified explicitly in Kap. At the end of the day, it all comes down to individual preference.

Loading the dataset

In R, the function read_csv is used to load CSV data. It automatically parses things that look like numeric values as numbers, while the corresponding function in Kap, io:readCsv, returns every cell as a string. The Kap function also makes no attempt to process the column headers.

    purchases ← io:readCsv "purchases.csv"
┌→──────────────────────────────┐
↓  "country" "amount" "discount"│
│      "USA"   "2000"       "10"│
│      "USA"   "3500"       "15"│
│      "USA"   "3000"       "20"│
│   "Canada"    "120"       "12"│
│   "Canada"    "180"       "18"│
│   "Canada"   "3100"       "21"│
...
└───────────────────────────────┘

So, the first thing we want to do is to remove the first row and use it as column labels. The simplest way to do this is to combine these using a fork:

    purchases ← (>1↑)«labels»(1↓) purchases
┌───────────┬──────┬────────┐
│    country│amount│discount│
├→──────────┴──────┴────────┤
↓      "USA" "2000"     "10"│
│      "USA" "3500"     "15"│
│      "USA" "3000"     "20"│
│   "Canada"  "120"     "12"│
│   "Canada"  "180"     "18"│
│   "Canada" "3100"     "21"│
...
└───────────────────────────┘    

All the above does is to take the first row (1↑) and turn that into a 1-dimensional array of strings (using >), then drop the first row (using 1↓) and finally pass these two arrays to labels which constructs the final result.
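For readers who do not read APL-style glyphs, the shape of the fork can be sketched in Python. This is only an analogy: the helper name `fork`, the lambdas, and the dict used as a stand-in for a labelled array are assumptions made for illustration, not Kap's implementation.

```python
def fork(f, g, h):
    """Return a function computing g(f(x), h(x)), the shape of f«g»h."""
    return lambda x: g(f(x), h(x))

# A few of the rows shown above, as a CSV reader returns them:
# header first, every cell a string.
raw = [
    ["country", "amount", "discount"],
    ["USA", "2000", "10"],
    ["USA", "3500", "15"],
    ["Canada", "120", "12"],
]

first_row = lambda rows: rows[0]          # plays the role of >1↑
drop_first_row = lambda rows: rows[1:]    # plays the role of 1↓
# Hypothetical stand-in for labels: pair the names with the data rows.
attach_labels = lambda names, rows: {"labels": names, "rows": rows}

split_header = fork(first_row, attach_labels, drop_first_row)
table = split_header(raw)
print(table["labels"])   # ['country', 'amount', 'discount']
print(table["rows"][0])  # ['USA', '2000', '10']
```

The point of the fork is that the argument is written once and fed to both outer functions, whose results are then combined by the middle one.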

We still have to convert the strings into numbers. The function to do that is ⍎, but we don't want to call it on the first column. This is achieved by running the parsing with under (⍢) applied on a drop of the first column:

purchases ← ⍎¨⍢(0 1↓) purchases

Now we have the data in the correct format. Perhaps there should be a variant of readCsv that can do all of this automatically. After all, it's a common enough operation that R does it automatically.
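The effect of parsing under a drop of the first column can be sketched in Python as follows. The function name is hypothetical, and `int` is a simplification standing in for the general parse function; the sketch only shows the idea of transforming part of a structure and putting the result back in place.

```python
def under_drop_first_column(f, rows):
    """Apply f to every cell except the first column, which is kept as-is."""
    return [[row[0]] + [f(cell) for cell in row[1:]] for row in rows]

rows = [
    ["USA", "2000", "10"],
    ["Canada", "120", "12"],
]

parsed = under_drop_first_column(int, rows)
print(parsed)  # [['USA', 2000, 10], ['Canada', 120, 12]]
```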

Taking the total sum

This is simple enough. Just take the values in the amount column and do a reduction over add:

+/ purchases.amount

Grouping

Kap provides the group function and the key operator when grouping. Here we can use key:

    purchases.country +/⌸ purchases.amount
┌→───────────────┐
↓      "USA" 8500│
│   "Canada" 3400│
│       "UK"  480│
│   "France"  500│
│  "Germany"  570│
│"Australia"  600│
│    "Italy"  630│
│    "Spain"  660│
│    "Japan"  690│
│    "India"  720│
│   "Brazil"  460│
└────────────────┘
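As a rough Python analogy, key can be thought of as "group the right-hand values by the left-hand keys, then apply the function to each group". The helper name `key` below is an assumption for illustration, and the data is limited to the six rows visible in the listing above.

```python
def key(reduce_fn, keys, values):
    """Group values by key, in order of first appearance, and reduce each group."""
    groups = {}
    for k, v in zip(keys, values):
        groups.setdefault(k, []).append(v)
    return [(k, reduce_fn(vs)) for k, vs in groups.items()]

countries = ["USA", "USA", "USA", "Canada", "Canada", "Canada"]
amounts = [2000, 3500, 3000, 120, 180, 3100]

totals = key(sum, countries, amounts)
print(totals)  # [('USA', 8500), ('Canada', 3400)]
```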

Deducting the discount is just a calculation prior to the grouping:

purchases.country +/⌸ -/purchases[;1 2]

The above takes the second and third columns (amount and discount) and performs a reduction over minus along each row. Since each row holds only two values, the reduction is simply the amount minus the discount.
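A small Python sketch of the same row-wise reduction, using amount and discount pairs taken from the rows shown earlier. With only two values per row, the direction of the fold does not matter, so `functools.reduce` gives the same answer.

```python
from functools import reduce
from operator import sub

# (amount, discount) pairs from three of the rows listed earlier.
pairs = [[2000, 10], [3500, 15], [120, 12]]

# Reduce each row with subtraction: amount minus discount.
net = [reduce(sub, row) for row in pairs]
print(net)  # [1990, 3485, 108]
```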

Removing outliers

When removing outliers, we will rely on selection. This just means that we'll create a bitmap of the elements we want to keep, and filter out the rest using ⌿.

Here's how we create the bitmap:

    (10×stat:median)⍛> purchases.amount
┌→──────────────────────────────────────────────────────────────┐
│1 0 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1│
└───────────────────────────────────────────────────────────────┘
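The same two steps, building a mask and then selecting with it, can be sketched in Python. The function names are hypothetical and the amounts are made-up values chosen so that one of them is an obvious outlier; they are not taken from the dataset.

```python
from statistics import median

def keep_mask(amounts, factor=10):
    """True where factor × median is greater than the amount."""
    threshold = factor * median(amounts)
    return [threshold > a for a in amounts]

def compress(mask, items):
    """Keep the items whose mask entry is true."""
    return [item for keep, item in zip(mask, items) if keep]

amounts = [10, 12, 11, 500]  # hypothetical values, not from purchases.csv

mask = keep_mask(amounts)
print(mask)                     # [True, True, True, False]
print(compress(mask, amounts))  # [10, 12, 11]
```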

And putting it all together:

{⍵.country +/⌸ -/⍵[;1 2]} ((10×stat:median)⍛> purchases.amount)⌿purchases

The final example is where we are supposed to take the median within each country. This just moves the filter inside the grouping function:

purchases.country {+/ ((10×stat:median)⍛> ⍵.amount) ⌿ -/⍵}⌸ purchases[;1 2]
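To make the per-group version concrete, here is a Python sketch in which the median is computed inside each group rather than over the whole table. All names and all numbers are hypothetical; group "B" is there to show that large amounts are not outliers when the rest of their own group is equally large.

```python
from statistics import median

def net_total_without_outliers(rows, factor=10):
    """rows holds the (amount, discount) pairs of a single group."""
    threshold = factor * median([amount for amount, _ in rows])
    return sum(amount - discount
               for amount, discount in rows
               if threshold > amount)

def key(fn, keys, rows):
    """Group rows by key, in order of first appearance, and apply fn per group."""
    groups = {}
    for k, row in zip(keys, rows):
        groups.setdefault(k, []).append(row)
    return [(k, fn(group)) for k, group in groups.items()]

# Hypothetical data: (amount, discount) pairs with their group keys.
countries = ["A", "A", "A", "A", "B", "B"]
rows = [(10, 1), (12, 2), (11, 1), (500, 50), (1000, 100), (1200, 100)]

result = key(net_total_without_outliers, countries, rows)
print(result)  # [('A', 29), ('B', 2000)]
```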

Performance, let's talk about it.

Making the language highly performant has never been a primary concern. Rather, the language started as an experiment to see whether it was possible to create a version of APL where results were not computed immediately, but only when they were needed.

This came out of an observation that many simple APL algorithms are very expensive in terms of complexity. Often, some permutation is computed, and then only a small subset of the result is actually used. This then leads to situations where the most straightforward solution is \( O(n^2) \) when it should be \( O(n) \). One such example is using ↑⍷ to match the beginning of a string.

Kap can avoid this by never computing the values that are later thrown away.
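The prefix-matching example can be illustrated with a Python generator. This is only an analogy for the idea of computing results on demand and says nothing about how Kap implements it; the function name and the counter argument are assumptions added so the amount of work can be observed.

```python
def find_mask(needle, haystack, counter):
    """Yield, for each position, whether needle occurs starting there.

    counter is a one-element list used to count how many positions
    were actually examined.
    """
    for i in range(len(haystack)):
        counter[0] += 1
        yield haystack[i:i + len(needle)] == needle

text = "purchases.csv"

# Eager: compute the result for every position, then keep only the first.
eager_count = [0]
eager = list(find_mask("pur", text, eager_count))[0]

# Lazy: ask only for the first result; later positions are never examined.
lazy_count = [0]
lazy = next(find_mask("pur", text, lazy_count))

print(eager, eager_count[0])  # True 13
print(lazy, lazy_count[0])    # True 1
```

Both versions give the same answer, but the eager one examines every position in the string while the lazy one stops after the first.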

Additionally, the use of this form of lazy evaluation allows the runtime to perform other types of optimisations based on what is known about the values and the functions being applied on them. For example, monadic does not have to evaluate all the values if it knows that the argument only contains scalars.

Another benefit of lazy evaluation is that often there is no need to materialise an array to memory. This can help performance since memory accesses tend to be slow.

This approach has a few drawbacks though:

  • Primitive operations on arrays that immediately materialise the result are slower than in highly optimised non-lazy interpreters (BQN is the shining star here)
  • Performance characteristics of complex expressions can be difficult to understand

The point about avoiding allocation of intermediary results does help to offset the performance impact, but the results can often be surprising (both positively and negatively).

Code_report recently posted a video labelled Perf wars: episode 1. In t