I have been refining printed summaries of data frames for my Common Lisp library cl-data-frame. I found that the following approach works best for me for quick eyeballing of data before any processing or analysis:
- Real numbers should be summarized by their range and the three quartiles (25%, 50%, 75%). This provides enough information to assess the variation and the "typical" values of the data.
- All other values should be summarized by their count and frequency. This is ideal for categorical data (called "factor" in R), and also for various encodings of missing data.
- When the column has both numbers and non-numbers, print both of the above. However, when it has very few distinct numbers, don't use quartiles for numbers, just print the frequencies.
For example,
(dframe:df :a #(nil nil nil 1 1 2 3 "missing" "missing"))
prints as
#<CL-DATA-FRAME:DATA-FRAME (1 x 7) :A 3 (43%) x NIL, 2 (29%) x 1, 1 (14%) x 2, 1 (14%) x 3>
while
(dframe:df :a (concatenate 'vector #(nil nil nil "missing" "missing") (clnu:numseq 0 100 :by 1/100)))
prints as
#<CL-DATA-FRAME:DATA-FRAME (1 x 10006) :A 10001 reals, min=0, q25=24.9975, q50=50, q75=75.0025, max=100; 3 (0%) x NIL, 2 (0%) x "missing">
A more realistic example with a dataset I am currently working on that has both numeric, categorical, and missing ("") data:
#<CL-DATA-FRAME:DATA-FRAME (20 x 1082) :IDHH 1082 reals, min=7, q25=1351, q50=2548, q75=4073, max=5434 :IDPERS bits, ones: 1082 (100%) :AGE 1082 reals, min=25, q25=41.17526, q50=49.492752, q75=59.814816, max=63 :GENDER 832 (77%) x "weiblich", 250 (23%) x "maennlich" :MARITAL 515 (48%) x "geschieden", 331 (31%) x "ledig", 153 (14%) x "verwitwet", 83 (8%) x "dauernd getrennt lebend" :SCHOOL 419 (39%) x "mittlere reife, realschulabschluss", 338 (31%) x "volksschul-/hauptschulabschluss", 217 (20%) x "abitur (hochschulreife)", 87 (8%) x "fachoberschule, fachabitur", 13 (1%) x "keine angabe", 8 (1%) x "schule ohne abschluss verlassen" :WTTYPN 756 (70%) x "montag bis freitag", 172 (16%) x "samstag", 154 (14%) x "sonntag" :TMW 1082 reals, min=0, q25=0, q50=4.151436, q75=105, max=740 :HOURS_MAINJOB 1082 reals, min=0, q25=0, q50=3.0288463, q75=9.53125, max=690 :HOURS_ADDJOB 1082 reals, min=0, q25=0, q50=1.388621, q75=17.039537, max=450 :THP 1082 reals, min=0, q25=141.15384, q50=263.13727, q75=386.55173, max=760 :JUKIGR 664 (61%) x "anderer wert/trifft nicht zu oder kein wert vorhanden", 165 (15%) x "10 bis unter 15", 104 (10%) x "6 bis unter 10", 80 (7%) x "18 bis unter 27", 53 (5%) x "15 bis unter 18", 16 (1%) x "27 und älter" :LEISURE 1082 reals, min=70, q25=437.74194, q50=601.8919, q75=752.069, max=960 :USUAL_HOURS 1082 reals, min=300, q25=25138.87, q50=82259.11, q75=99997.82, max=99999 :HHTYPE 664 (61%) x 1, 418 (39%) x 2 :WORK bits, ones: 285 (26%) :MAINWAGE 1078 reals, min=0, q25=0, q50=36.24454, q75=153.93013, max=4100; 4 (0%) x "" :ADDWAGE 1063 reals, min=0, q25=0, q50=6.7724867, q75=34.89418, max=1250; 19 (2%) x "" :WAGE 1059 reals, min=0, q25=0, q50=16.293531, q75=49.222637, max=4100; 23 (2%) x "" :NONWAGE_INCOME 1082 reals, min=0, q25=619.7917, q50=935.9649, q75=1293.9656, max=4800>
This is the first time I used CL's pretty printer, so there might be a few bugs in there. The code is in the repository, you also need to update cl-num-utils.
No comments:
Post a Comment