Tamas K Papp's blog

Tuesday, January 7, 2014

Simple scripting in Common Lisp

I know enough Bash to read simple scripts, but I always have a hard time writing a script that does anything more complicated than a one-liner that substitutes arguments into a template. I can eventually figure things out, but I have never bothered to develop my Bash skills, so programming in it is error-prone and time-consuming for me.

So recently I have started experimenting with scripting in Common Lisp. It turns out to be very, very easy and convenient if you already know CL. I usually start scripting with a skeleton that looks like this:

(cl:defpackage #:script
  (:use #:cl)
  (:export #:run))

(cl:in-package :script)

;;; code comes here

I run things from SLIME and experiment with the REPL. When I think that I am approaching a solution, I write the main function run:

(defun run ()
  ...)

Then I use the following Makefile to compile it with cl-launch:

myscript: myscript.lisp
        cl-launch --lisp sbcl --file myscript.lisp --dump '!'   \
        --output myscript -i '(script:run)'

You can of course also include ASDF libraries; see the help of cl-launch for details.

For interfacing with the environment (pathnames, files, OS, etc.), I find the UIOP compatibility library really helpful. It is included with ASDF, so most likely it is already on your computer.
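As an illustration of how the skeleton and UIOP fit together, here is a minimal sketch of a script that prints the line count of each file named on the command line. The task and the helper line-count are hypothetical, chosen only for the example. The UIOP functions used (command-line-arguments, file-exists-p, read-file-lines, quit) are the ones shipped with ASDF.

```lisp
;;; A sketch only: LINE-COUNT and the script's task are hypothetical.
;;; Make sure ASDF (and with it UIOP) is present at compile and load time.
(eval-when (:compile-toplevel :load-toplevel :execute)
  (require :asdf))

(cl:defpackage #:script
  (:use #:cl)
  (:export #:run))

(cl:in-package :script)

(defun line-count (pathname)
  "Return the number of lines in the text file PATHNAME."
  (length (uiop:read-file-lines pathname)))

(defun run ()
  "Print the line count of each file named on the command line."
  (let ((files (uiop:command-line-arguments)))
    ;; Without arguments, print a usage message and exit with status 1.
    (when (null files)
      (format *error-output* "usage: myscript FILE...~%")
      (uiop:quit 1))
    (dolist (file files)
      (if (uiop:file-exists-p file)
          (format t "~a: ~d~%" file (line-count file))
          (format *error-output* "~a: no such file~%" file)))))
```

While developing, line-count can be tried from the REPL on any text file. run only makes sense once the script is started with arguments.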

Tuesday, December 10, 2013

cl-data-frame pretty printed column summaries

I have been refining printed summaries of data frames for my Common Lisp library cl-data-frame. I found that the following approach works best for me for quick eyeballing of data before any processing or analysis:

Real numbers should be summarized by their range and the three quartiles (25%, 50%, 75%). This provides enough information to assess the variation and the "typical" values of the data.
All other values should be summarized by their count and frequency. This is ideal for categorical data (called "factor" in R), and also for various encodings of missing data.
When the column has both numbers and non-numbers, print both of the above. However, when it has very few distinct numbers, don't use quartiles for numbers, just print the frequencies.
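The three rules above can be sketched roughly as follows. This is not the code in cl-data-frame: all names here (quantile, frequencies, summarize-column, *distinct-threshold*) are hypothetical, the cutoff of 5 for "very few distinct numbers" is an assumption, and so is the quantile rule (linear interpolation between neighboring order statistics). The sketch returns a plist instead of pretty printing.

```lisp
;;; Illustrative sketch, NOT the cl-data-frame implementation.

(defparameter *distinct-threshold* 5
  "Assumed cutoff: with fewer distinct reals than this, use frequencies only.")

(defun quantile (sorted q)
  "Quantile Q (in [0,1]) of the sorted vector SORTED, by linear interpolation."
  (let* ((position (* q (1- (length sorted))))
         (lower (floor position))
         (upper (min (1+ lower) (1- (length sorted))))
         (fraction (- position lower)))
    (+ (* (- 1 fraction) (aref sorted lower))
       (* fraction (aref sorted upper)))))

(defun frequencies (values)
  "Return an alist of (VALUE . COUNT) for VALUES, most frequent first."
  (let ((table (make-hash-table :test #'equal))
        (alist '()))
    (map nil (lambda (v) (incf (gethash v table 0))) values)
    (maphash (lambda (value count) (push (cons value count) alist)) table)
    (sort alist #'> :key #'cdr)))

(defun summarize-column (column)
  "Return a plist summarizing the vector COLUMN."
  ;; COPY-SEQ protects COLUMN, since SORT is destructive.
  (let* ((reals (sort (copy-seq (remove-if-not #'realp column)) #'<))
         (others (remove-if #'realp column))
         (distinct-reals (length (remove-duplicates reals))))
    (if (< distinct-reals *distinct-threshold*)
        ;; Few (or no) distinct numbers: count every value, numbers included.
        (list :frequencies (frequencies column))
        ;; Otherwise: range and quartiles for reals, counts for the rest.
        (list :count (length reals)
              :min (aref reals 0)
              :q25 (quantile reals 1/4)
              :q50 (quantile reals 1/2)
              :q75 (quantile reals 3/4)
              :max (aref reals (1- (length reals)))
              :frequencies (frequencies others)))))
```

Calling (summarize-column (vector nil nil nil 1 1 2 3)) takes the first branch, because the column has only three distinct numbers.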

For example,

(dframe:df :a #(nil nil nil 1 1 2 3))

prints as

#<CL-DATA-FRAME:DATA-FRAME (1 x 7)
  :A 3 (43%) x NIL, 2 (29%) x 1, 1 (14%) x 2, 1 (14%) x 3>

while

(dframe:df :a (concatenate 'vector
                           #(nil nil nil "missing" "missing")
                           (clnu:numseq 0 100 :by 1/100)))

prints as

#<CL-DATA-FRAME:DATA-FRAME (1 x 10006)
  :A 10001 reals, min=0, q25=24.9975, q50=50, q75=75.0025, max=100;
     3 (0%) x NIL, 2 (0%) x "missing">

Here is a more realistic example, from a dataset I am currently working on that has numeric, categorical, and missing ("") data:

#<CL-DATA-FRAME:DATA-FRAME (20 x 1082)
  :IDHH 1082 reals, min=7, q25=1351, q50=2548, q75=4073, max=5434
  :IDPERS bits, ones: 1082 (100%)
  :AGE 1082 reals, min=25, q25=41.17526, q50=49.492752, q75=59.814816, max=63
  :GENDER 832 (77%) x "weiblich", 250 (23%) x "maennlich"
  :MARITAL 515 (48%) x "geschieden",
           331 (31%) x "ledig",
           153 (14%) x "verwitwet",
           83 (8%) x "dauernd getrennt lebend"
  :SCHOOL 419 (39%) x "mittlere reife, realschulabschluss",
          338 (31%) x "volksschul-/hauptschulabschluss",
          217 (20%) x "abitur (hochschulreife)",
          87 (8%) x "fachoberschule, fachabitur",
          13 (1%) x "keine angabe",
          8 (1%) x "schule ohne abschluss verlassen"
  :WTTYPN 756 (70%) x "montag bis freitag",
          172 (16%) x "samstag",
          154 (14%) x "sonntag"
  :TMW 1082 reals, min=0, q25=0, q50=4.151436, q75=105, max=740
  :HOURS_MAINJOB 1082 reals, min=0, q25=0, q50=3.0288463, q75=9.53125, max=690
  :HOURS_ADDJOB 1082 reals, min=0, q25=0, q50=1.388621, q75=17.039537, max=450
  :THP 1082 reals, min=0, q25=141.15384, q50=263.13727, q75=386.55173, max=760
  :JUKIGR 664 (61%) x "anderer wert/trifft nicht zu oder kein wert vorhanden",
          165 (15%) x "10 bis unter 15",
          104 (10%) x "6 bis unter 10",
          80 (7%) x "18 bis unter 27",
          53 (5%) x "15 bis unter 18",
          16 (1%) x "27 und älter"
  :LEISURE 1082 reals, min=70, q25=437.74194, q50=601.8919, q75=752.069, max=960
  :USUAL_HOURS 1082 reals, min=300, q25=25138.87, q50=82259.11, q75=99997.82,
               max=99999
  :HHTYPE 664 (61%) x 1, 418 (39%) x 2
  :WORK bits, ones: 285 (26%)
  :MAINWAGE 1078 reals, min=0, q25=0, q50=36.24454, q75=153.93013, max=4100;
            4 (0%) x ""
  :ADDWAGE 1063 reals, min=0, q25=0, q50=6.7724867, q75=34.89418, max=1250;
           19 (2%) x ""
  :WAGE 1059 reals, min=0, q25=0, q50=16.293531, q75=49.222637, max=4100;
        23 (2%) x ""
  :NONWAGE_INCOME 1082 reals, min=0, q25=619.7917, q50=935.9649, q75=1293.9656,
                  max=4800>

This is the first time I have used CL's pretty printer, so there might be a few bugs in there. The code is in the repository; you also need to update cl-num-utils.

Sunday, March 31, 2013

Deprecated libraries removed

I had some Common Lisp libraries on GitHub that I abandoned a while ago, mostly because I have either

found a better library by someone else,
decided to follow alternative approaches, or
written a better version with an incompatible API.

These libraries have been undergoing some quiet bit-rot for a while, which makes them unusable without minor or not-so-minor fixes, for which I have no time. Also, there are better libraries out there for the same purpose.

Having abandoned Common Lisp libraries out there is confusing and goes against the principle of consolidating Common Lisp libraries, so I decided to remove some repositories. In particular, I have removed

cl-2d, which has not seen any new development in the last two years. My recommended replacement is cl-flexplot, which uses PGF (a LaTeX package) as a backend.
cl-text-tables, which was simply abandoned. I recommend the excellent cl-csv instead.
cl-numlib, which is superseded by cl-num-utils.

The latter two suggested replacements are in Quicklisp.

If, for some strange reason, anyone is interested in the code of the dead libraries, just write me an e-mail. But I would prefer if these libraries stayed dead, as there are much better replacements out there.

It is very likely that I will remove other deprecated libraries in the future.