A little comparison between R and Kap
Some time ago, I read this article: Why pandas feels clunky when coming from R. In it, the author explains why they feel that R is a much smoother tool than Pandas.
I'm not familiar with Pandas, but I do know a bit of R, so when I recently implemented some new features in Kap, I decided that reimplementing the examples from the blog post in Kap might be a good way to demonstrate the differences between the languages.
Spoiler: the Kap solutions are shorter, but R has some nice defaults that have to be specified explicitly in Kap. At the end of the day, it all comes down to individual preference.
Loading the dataset
In R, the function read_csv is used to load CSV data. This function automatically parses things that look like numeric values as numbers, while the corresponding function in Kap returns every cell as a string. The Kap function also makes no attempt to process the column headers.
purchases ← io:readCsv "purchases.csv"
┌→──────────────────────────────┐
↓ "country" "amount" "discount"│
│ "USA" "2000" "10"│
│ "USA" "3500" "15"│
│ "USA" "3000" "20"│
│ "Canada" "120" "12"│
│ "Canada" "180" "18"│
│ "Canada" "3100" "21"│
...
└───────────────────────────────┘
So, the first thing we want to do is to remove the first row and use it as column labels. The simplest way to do this is to compute both pieces and combine them using a fork:
purchases ← (>1↑)«labels»(1↓) purchases
┌───────────┬──────┬────────┐
│ country│amount│discount│
├→──────────┴──────┴────────┤
↓ "USA" "2000" "10"│
│ "USA" "3500" "15"│
│ "USA" "3000" "20"│
│ "Canada" "120" "12"│
│ "Canada" "180" "18"│
│ "Canada" "3100" "21"│
...
└───────────────────────────┘
All the above does is take the first row (1↑) and turn it into a 1-dimensional array of strings (using >), then drop the first row (using 1↓), and finally pass these two arrays to labels, which constructs the final result.
We still have to convert the strings into numbers. The function to do that is ⍎, but we don't want to call it on the first column. This is achieved by applying the parsing under (⍢) a drop of the first column, so that only the remaining columns are parsed and the country column is left untouched:
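For readers who have not met forks before, the data flow of the expression above can be sketched in plain Python. This only illustrates the shape of the computation and is not how Kap implements it; the helper names fork, first_row, rest and attach_labels are made up for this sketch, and attach_labels merely pairs the header with the rows.

```python
def fork(f, g, h):
    """Return a function computing g(f(x), h(x)), the shape of a one-argument fork."""
    return lambda x: g(f(x), h(x))

# Hypothetical stand-ins for the three pieces of the Kap expression.
def first_row(table):
    # Plays the role of >1↑ : the header row as a flat list of strings.
    return list(table[0])

def rest(table):
    # Plays the role of 1↓ : every row except the header.
    return [list(row) for row in table[1:]]

def attach_labels(names, rows):
    # Stand-in for labels: keep the column names next to the data rows.
    return {"columns": names, "rows": rows}

raw = [
    ["country", "amount", "discount"],
    ["USA", "2000", "10"],
    ["Canada", "120", "12"],
]

labelled = fork(first_row, attach_labels, rest)(raw)
print(labelled["columns"])
print(labelled["rows"])
```

Both outer functions receive the same input, and the middle function combines their results, which is why the header and the body can be split off in a single expression.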
purchases ← ⍎¨⍢(0 1↓) purchases
Now we have the data in the correct format. Perhaps there should be a variant of readCsv that can do all of this automatically. After all, it's a common enough operation that R does it automatically.
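As a rough idea of what such an all-in-one loader would have to do, here is a plain Python sketch of the same two steps: split off the header, then parse every column except the first. The function name read_labelled_csv is hypothetical, and the sketch assumes that the first column is text and all remaining columns hold integers, which is true for this dataset but not in general.

```python
import csv
import io

def read_labelled_csv(text):
    """Return (header, rows), with every column except the first parsed as int."""
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    # Leave the first cell of each row alone and convert the rest,
    # which is the same idea as parsing "under" a drop of the first column.
    converted = [[row[0]] + [int(cell) for cell in row[1:]] for row in body]
    return header, converted

# A few rows taken from the output shown earlier, inlined so the sketch is self-contained.
sample = "country,amount,discount\nUSA,2000,10\nUSA,3500,15\nCanada,120,12\n"

header, parsed = read_labelled_csv(sample)
print(header)
print(parsed)
```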
Taking the total sum
This is simple enough. Just take the values in the amount column and do a reduction over add:
+/ purchases.amount
Grouping
Kap provides the group function and the key operator for grouping. Here we can use key:
purchases.country +/⌸ purchases.amount
┌→───────────────┐
↓ "USA" 8500│
│ "Canada" 3400│
│ "UK" 480│
│ "France" 500│
│ "Germany" 570│
│"Australia" 600│
│ "Italy" 630│
│ "Spain" 660│
│ "Japan" 690│
│ "India" 720│
│ "Brazil" 460│
└────────────────┘
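What the key operator computes here can be sketched in plain Python: collect the values that share a key, in order of first appearance, and reduce each group. The function name key_reduce is made up for this sketch, and only the six rows visible in the table output earlier are used.

```python
def key_reduce(keys, values, reducer):
    """Group values by key (in order of first appearance) and reduce each group."""
    groups = {}
    for k, v in zip(keys, values):
        groups.setdefault(k, []).append(v)
    # Dictionaries preserve insertion order, so groups come out in first-seen order.
    return [(k, reducer(vs)) for k, vs in groups.items()]

countries = ["USA", "USA", "USA", "Canada", "Canada", "Canada"]
amounts = [2000, 3500, 3000, 120, 180, 3100]

totals = key_reduce(countries, amounts, sum)
print(totals)  # [('USA', 8500), ('Canada', 3400)]
```

The two totals agree with the first two rows of the Kap result above.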
Deducting the discount is just a calculation prior to the grouping:
purchases.country +/⌸ -/purchases[;1 2]
The above takes the second and third columns (amount and discount) and performs a reduction over minus along each row. Since each row holds just two values, the reduction is simply the first value minus the second, that is, amount minus discount.
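The same calculation, written out in plain Python for the six rows shown earlier, may make the order of operations clearer: subtract first, then group and sum. The variable names are chosen for this sketch only.

```python
countries = ["USA", "USA", "USA", "Canada", "Canada", "Canada"]
amounts = [2000, 3500, 3000, 120, 180, 3100]
discounts = [10, 15, 20, 12, 18, 21]

# Row-wise "amount minus discount", the two-element reduction over minus.
net = [a - d for a, d in zip(amounts, discounts)]

# Sum the net values per country, keeping first-appearance order.
net_totals = {}
for country, value in zip(countries, net):
    net_totals[country] = net_totals.get(country, 0) + value

print(net)         # [1990, 3485, 2980, 108, 162, 3079]
print(net_totals)  # {'USA': 8455, 'Canada': 3349}
```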
Removing outliers
When removing outliers, we will rely on selection. This just means that we'll create a bitmap of the elements we want to keep, and filter out the rest using ⌿.
Here's how we create the bitmap:
(10×stat:median)⍛> purchases.amount
┌→──────────────────────────────────────────────────────────────┐
│1 0 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1│
└───────────────────────────────────────────────────────────────┘
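In plain Python, building the bitmap and then filtering with it might look like the sketch below. The function name keep_mask is hypothetical, and the amounts are made-up illustrative values rather than the article's dataset, since the median depends on rows that are not shown here.

```python
from itertools import compress
from statistics import median

def keep_mask(amounts, factor=10):
    """Return 1 for each amount below factor times the median, else 0."""
    threshold = factor * median(amounts)
    return [1 if threshold > a else 0 for a in amounts]

amounts = [100, 200, 300, 5000]  # illustrative values only

mask = keep_mask(amounts)             # median is 250, so the threshold is 2500
kept = list(compress(amounts, mask))  # keep the elements where the mask is 1

print(mask)  # [1, 1, 1, 0]
print(kept)  # [100, 200, 300]
```

itertools.compress does the job of the selection step: it keeps exactly the elements whose mask entry is truthy.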
And putting it all together:
{⍵.country +/⌸ -/⍵[;1 2]} ((10×stat:median)⍛> purchases.amount)⌿purchases
In the final example, the median is supposed to be taken within each country rather than over the whole dataset. This just moves the filter inside the grouping function:
purchases.country {+/ ((10×stat:median)⍛> ⍵.amount) ⌿ -/⍵}⌸ purchases[;1 2]
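To spell out the per-country variant, here is a plain Python sketch that groups first and then applies the median filter inside each group. The function name net_total_without_outliers is made up, and only the six rows visible in the earlier output are used, so the result covers just those two countries.

```python
from statistics import median

def net_total_without_outliers(rows, factor=10):
    """Per country: drop amounts at or above factor times that country's median,
    then sum amount minus discount over the remaining rows."""
    groups = {}
    for country, amount, discount in rows:
        groups.setdefault(country, []).append((amount, discount))

    totals = {}
    for country, pairs in groups.items():
        threshold = factor * median([a for a, _ in pairs])
        totals[country] = sum(a - d for a, d in pairs if threshold > a)
    return totals

rows = [
    ("USA", 2000, 10), ("USA", 3500, 15), ("USA", 3000, 20),
    ("Canada", 120, 12), ("Canada", 180, 18), ("Canada", 3100, 21),
]

result = net_total_without_outliers(rows)
print(result)  # {'USA': 8455, 'Canada': 270}
```

With the median taken per country, the 3100 purchase in Canada is dropped (the Canadian median is 180, so the threshold is 1800), while all three USA purchases are kept.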