dok/paper/dot-types

Petricek, Tomas. “Data Exploration through Dot-Driven Development,” 2017, 27.

Type providers make it possible to map an expression to a type. They are useful for exposing data sources to statically typed OOP and FP languages.

This paper extends the approach to queries as well, supporting the usual composition of data transformation operations such as map, filter, group, and so on.

From Examples to Dok

For example, consider the following Python program (using the pandas library), which reads a list of all Olympic medals awarded (see Appendix A) and finds top 8 athletes by the number of gold medals they won in Rio 2016.

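This is the Python pandas code:

import pandas as pd

olympics = pd.read_csv("olympics.csv")

# Parentheses let the method chain span several lines.
(olympics[olympics["Games"] == "Rio (2016)"]
  .groupby("Athlete")
  .agg({"Gold": "sum"})  # total gold medals per athlete
  .sort_values(by="Gold", ascending=False)
  .head(8))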

The same query written with the tool from the paper, where the operations for retrieving data are explicit members of the provided types:

olympics .«filter data»
  .«Games is».«Rio (2016)»
  .then .«group data».«by Athlete».«sum Gold»
  .then .«sort data».«by Gold descending»
  .then .«paging».take(8)

The schema of the CSV file is analyzed and the appropriate types are derived. Types are provided lazily because they depend on the expression that computes them.
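A rough dynamic analogue of this idea can be sketched in Python. Python has no type providers, so this is not the paper's mechanism: the members are resolved at run time rather than checked statically. The names `Rows`, `load`, `members`, and the `filter_<column>_is` convention are assumptions made for illustration, and the sample rows are made up.

```python
import csv
import io

# Made-up sample rows; only the column names follow the examples above.
SAMPLE = """Games,Athlete,Gold
Rio (2016),Athlete A,1
Rio (2016),Athlete B,1
London (2012),Athlete A,1
"""


class Rows:
    """A row set whose operations are derived on demand from its columns."""

    def __init__(self, rows, columns):
        self.rows = rows
        self.columns = list(columns)

    def members(self):
        # Computed only when asked for, from the columns of this value.
        return [f"filter_{c}_is" for c in self.columns]

    def __getattr__(self, name):
        # Reached only when normal attribute lookup fails, so each
        # "filter_<column>_is" member is materialized when it is used.
        for column in self.__dict__.get("columns", []):
            if name == f"filter_{column}_is":
                return lambda value: Rows(
                    [r for r in self.rows if r[column] == value], self.columns
                )
        raise AttributeError(name)


def load(text):
    """Read CSV text and derive the column list from its header row."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    return Rows(rows, reader.fieldnames)


olympics = load(SAMPLE)
rio = olympics.filter_Games_is("Rio (2016)")
print(olympics.members())
print(len(rio.rows))
```

Nothing is generated up front: the set of members is recomputed from whatever columns the current value has, which is the sense in which the types depend on the expression.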

In Dok, an initial type is derived, and the remaining types are then derived by static analysis of the expression. This gives pluggable type inference providers, which the IDE can call for code completion.
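One possible shape for such pluggable providers, sketched in Python under simplifying assumptions: a type is reduced to a tuple of column names, and each provider is a function from the current type to the members it offers. `TableType`, `Provider`, `PROVIDERS`, and `complete` are hypothetical names, not part of Dok.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class TableType:
    """The statically known shape of a table: just its column names."""
    columns: Tuple[str, ...]


# A provider maps the type of the expression so far to the members it
# offers, each paired with the type of the resulting expression.
Provider = Callable[[TableType], Dict[str, TableType]]


def filter_provider(t: TableType) -> Dict[str, TableType]:
    # Filtering keeps the same columns.
    return {f"filter-on-{c}-is": t for c in t.columns}


def group_provider(t: TableType) -> Dict[str, TableType]:
    # Grouping keeps the same columns in this simplified model.
    return {f"group-by-{c}": t for c in t.columns}


# The registry is the plug-in point: adding a provider adds completions.
PROVIDERS: List[Provider] = [filter_provider, group_provider]


def complete(t: TableType, prefix: str = "") -> List[str]:
    """Ask every registered provider for members matching the prefix."""
    names = []
    for provider in PROVIDERS:
        names.extend(n for n in provider(t) if n.startswith(prefix))
    return sorted(names)


olympics_type = TableType(("Games", "Athlete", "Gold"))
suggestions = complete(olympics_type, "group-by-")
print(suggestions)
```

An IDE would call something like `complete` with the inferred type of the expression to the left of the cursor and the text typed so far.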

class CSV/Olympics -schema-from("olympics.csv")

var result file("olympics.csv").to(CSV/Olympics)
  .filter-on-Games-is("Rio (2016)")
  .group-by-Athlete
  .sum-of-Gold
  .sort-by-Gold(with descending True)
  .take(8)

Note that this approach is good for quickly editing simple examples of data transformations. Very complex examples may require a different API, or perhaps a different SQL-like DSL could be used inside Dok for complex queries.

Note that the OOP view of produced data can be mapped to a DBMS runtime, so the logical view of data does not affect its processing.
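As a minimal sketch of that separation, the Python example below records a method chain as a logical description and only turns it into SQL when it is run, here against an in-memory SQLite database. The `Query` class and its method names are hypothetical, the table contents are made up, and a real mapping would need to handle far more operations.

```python
import sqlite3


class Query:
    """Records a logical pipeline; nothing runs until `run` is called."""

    def __init__(self, table):
        self.table = table
        self.where = None   # (column, value)
        self.group = None   # (key column, summed column)
        self.order = None   # column to sort by, descending
        self.limit = None

    def filter_is(self, column, value):
        self.where = (column, value)
        return self

    def group_sum(self, key, column):
        self.group = (key, column)
        return self

    def sort_desc(self, column):
        self.order = column
        return self

    def take(self, n):
        self.limit = n
        return self

    def to_sql(self):
        # Column names are interpolated directly, which is acceptable only
        # because this sketch uses fixed, trusted names; values are bound.
        select, params = "*", []
        if self.group:
            key, col = self.group
            select = f'"{key}", SUM("{col}") AS "{col}"'
        sql = f'SELECT {select} FROM "{self.table}"'
        if self.where:
            sql += f' WHERE "{self.where[0]}" = ?'
            params.append(self.where[1])
        if self.group:
            sql += f' GROUP BY "{self.group[0]}"'
        if self.order:
            sql += f' ORDER BY "{self.order}" DESC'
        if self.limit is not None:
            sql += " LIMIT ?"
            params.append(self.limit)
        return sql, params

    def run(self, conn):
        sql, params = self.to_sql()
        return conn.execute(sql, params).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE olympics (Games TEXT, Athlete TEXT, Gold INTEGER)")
conn.executemany(
    "INSERT INTO olympics VALUES (?, ?, ?)",
    [  # made-up rows, one per medal
        ("Rio (2016)", "A", 1), ("Rio (2016)", "A", 1), ("Rio (2016)", "B", 1),
        ("London (2012)", "B", 1), ("London (2012)", "B", 1),
    ],
)

result = (
    Query("olympics")
    .filter_is("Games", "Rio (2016)")
    .group_sum("Athlete", "Gold")
    .sort_desc("Gold")
    .take(8)
    .run(conn)
)
print(result)
```

The chain reads like the object-oriented pipeline, while filtering, grouping, and sorting are all executed by the database engine.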