I have been refining printed summaries of data frames for my Common Lisp library cl-data-frame. I found that the following approach works best for me for quick eyeballing of data before any processing or analysis:
- Real numbers should be summarized by their range and the three quartiles (25%, 50%, 75%). This provides enough information to assess the variation and the "typical" values of the data.
- All other values should be summarized by their count and frequency. This is ideal for categorical data (called "factor" in R), and also for various encodings of missing data.
- When the column has both numbers and non-numbers, print both of the above. However, when it has very few distinct numbers, don't use quartiles for numbers, just print the frequencies.
For example,
(dframe:df :a #(nil nil nil 1 1 2 3 "missing" "missing"))
prints as
#<CL-DATA-FRAME:DATA-FRAME (1 x 7)
:A 3 (43%) x NIL, 2 (29%) x 1, 1 (14%) x 2, 1 (14%) x 3>
while
(dframe:df :a (concatenate 'vector
#(nil nil nil "missing" "missing")
(clnu:numseq 0 100 :by 1/100)))
prints as
#<CL-DATA-FRAME:DATA-FRAME (1 x 10006)
:A 10001 reals, min=0, q25=24.9975, q50=50, q75=75.0025, max=100;
3 (0%) x NIL, 2 (0%) x "missing">
A more realistic example with a dataset I am currently working on that has both numeric, categorical, and missing ("") data:
#<CL-DATA-FRAME:DATA-FRAME (20 x 1082)
:IDHH 1082 reals, min=7, q25=1351, q50=2548, q75=4073, max=5434
:IDPERS bits, ones: 1082 (100%)
:AGE 1082 reals, min=25, q25=41.17526, q50=49.492752, q75=59.814816, max=63
:GENDER 832 (77%) x "weiblich", 250 (23%) x "maennlich"
:MARITAL 515 (48%) x "geschieden",
331 (31%) x "ledig",
153 (14%) x "verwitwet",
83 (8%) x "dauernd getrennt lebend"
:SCHOOL 419 (39%) x "mittlere reife, realschulabschluss",
338 (31%) x "volksschul-/hauptschulabschluss",
217 (20%) x "abitur (hochschulreife)",
87 (8%) x "fachoberschule, fachabitur",
13 (1%) x "keine angabe",
8 (1%) x "schule ohne abschluss verlassen"
:WTTYPN 756 (70%) x "montag bis freitag",
172 (16%) x "samstag",
154 (14%) x "sonntag"
:TMW 1082 reals, min=0, q25=0, q50=4.151436, q75=105, max=740
:HOURS_MAINJOB 1082 reals, min=0, q25=0, q50=3.0288463, q75=9.53125, max=690
:HOURS_ADDJOB 1082 reals, min=0, q25=0, q50=1.388621, q75=17.039537, max=450
:THP 1082 reals, min=0, q25=141.15384, q50=263.13727, q75=386.55173, max=760
:JUKIGR 664 (61%) x "anderer wert/trifft nicht zu oder kein wert vorhanden",
165 (15%) x "10 bis unter 15",
104 (10%) x "6 bis unter 10",
80 (7%) x "18 bis unter 27",
53 (5%) x "15 bis unter 18",
16 (1%) x "27 und älter"
:LEISURE 1082 reals, min=70, q25=437.74194, q50=601.8919, q75=752.069, max=960
:USUAL_HOURS 1082 reals, min=300, q25=25138.87, q50=82259.11, q75=99997.82,
max=99999
:HHTYPE 664 (61%) x 1, 418 (39%) x 2
:WORK bits, ones: 285 (26%)
:MAINWAGE 1078 reals, min=0, q25=0, q50=36.24454, q75=153.93013, max=4100;
4 (0%) x ""
:ADDWAGE 1063 reals, min=0, q25=0, q50=6.7724867, q75=34.89418, max=1250;
19 (2%) x ""
:WAGE 1059 reals, min=0, q25=0, q50=16.293531, q75=49.222637, max=4100;
23 (2%) x ""
:NONWAGE_INCOME 1082 reals, min=0, q25=619.7917, q50=935.9649, q75=1293.9656,
max=4800>
This is the first time I used CL's pretty printer, so there might be a few bugs in there. The code is in the repository, you also need to update cl-num-utils.