
dok/compiler

Overview of compilation techniques

  • attribute grammars (AG): from language S to language D
  • term-rewriting (TR), equivalent transformation: from language S to language S
  • supercompilation (SC) / superanalysis: apply a series of term-rewriting transformations in an optimal way
  • code-analysis (CA): RML/Crocopat/BDD queries on code
  • specialization (SP) / partial evaluation
  • abstract interpretation
  • constraints, pattern matching, logical languages: select the parts of the code to transform
  • static typing / design by contract: check for errors at compile time
  • code quality testing (CC = code-contracts): queries on the code searching for known defects (SQL injection, changes to some parts without effects on others). Probably they can be coded using an RML/CrocoPat approach

Optimizations

  • fusion / deforestation
  • strictification
  • from second order to first order
  • equivalent transformations

Many of these optimizations can be done in an efficient way using AG.

Technologies

  • BDD, Crocopat
  • Rete
  • Liquid Haskell

AG are very good for (see the sketch after this list):

  • translating from a language S to a language D,
  • applying and propagating local changes/rewrites
  • strictification and deforestation of FP code
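As a minimal sketch (Haskell, not Dok) of how the AG style encodes an S-to-D translation — inherited attributes become function arguments, synthesized attributes become results — here is a hypothetical toy source language S with named variables compiled to a name-free target D:

import Data.List (elemIndex)

-- Source language S: arithmetic with let-bound names.
data S = SNum Int | SVar String | SAdd S S | SLet String S S

-- Target language D: the same arithmetic, but name-free (de Bruijn indices).
data D = DNum Int | DVar Int | DAdd D D | DLet D D
  deriving Show

-- Inherited attribute: the list of names in scope (innermost first).
-- Synthesized attribute: the translated D term.
translate :: [String] -> S -> D
translate _   (SNum n)     = DNum n
translate env (SVar x)     = case elemIndex x env of
                               Just i  -> DVar i
                               Nothing -> error ("unbound name: " ++ x)
translate env (SAdd a b)   = DAdd (translate env a) (translate env b)
translate env (SLet x e b) = DLet (translate env e) (translate (x : env) b)

main :: IO ()
main = print (translate [] (SLet "x" (SNum 2) (SAdd (SVar "x") (SNum 3))))

The same shape scales to real AG systems, where the attribute plumbing is generated instead of written by hand.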

TR and SC are good for optimizing a language D into an equivalent but more efficient form of D, because they accept complex conditions on the code, expressed like queries, rather than only directly calculated attributes.

CA is very good for checking code properties, understanding the code and navigating it, because it answers queries quickly.

SP is very good for inlining and specialization of generic code and libraries, according to their real usage in an application.

Transforming a lazy language into an equivalent AG form, and then applying the techniques of some of the papers below, is a good way to obtain strictification. TODO check for fusion
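A sketch of what strictification means on the classic repmin example (plain Haskell, not the exact derivation of the papers below): the circular, lazy version corresponds to an AG with a synthesized minimum and an inherited global minimum, and scheduling those attributes yields the strict two-pass version:

data Tree = Leaf Int | Fork Tree Tree
  deriving Show

-- Lazy, circular version: one traversal, needs lazy evaluation.
repminLazy :: Tree -> Tree
repminLazy t = t'
  where
    (m, t') = go t
    go (Leaf n)   = (n, Leaf m)              -- uses the final minimum m
    go (Fork l r) = let (ml, l') = go l
                        (mr, r') = go r
                    in (min ml mr, Fork l' r')

-- Strictified version: two passes, no circularity, works under strict evaluation.
repminStrict :: Tree -> Tree
repminStrict t = replaceAll (minTree t) t
  where
    minTree (Leaf n)      = n                 -- synthesized attribute
    minTree (Fork l r)    = min (minTree l) (minTree r)
    replaceAll m (Leaf _)   = Leaf m          -- inherited attribute m
    replaceAll m (Fork l r) = Fork (replaceAll m l) (replaceAll m r)

main :: IO ()
main = print (repminStrict (Fork (Leaf 3) (Fork (Leaf 1) (Leaf 2))))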

SC and TR are good for function inlining, and for SP.

SC and inlining tend to:

  • be slow
  • increase code size

SP is used for combining application code with generic libraries, producing an ad-hoc, unique application containing only static code (see the sketch after this list):

  • the build is a lot slower than with fully dynamic libraries
  • the resulting application usually is not so big (surprisingly small)
  • the resulting application is fast
  • unikernel approach
  • shared dynamic libraries can use the same RAM for different applications (container approach)
  • the JVM by default uses an SP approach for many libraries at run-time
  • in the JVM case part of this work is repeated at every run
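A minimal sketch of the SP idea on a toy expression language (hypothetical names, not a real partial evaluator): statically known "library configuration" values are folded away, and only the residual code that depends on run-time inputs survives:

import qualified Data.Map as M

-- Variables may be statically known (the "library configuration") or
-- known only at run time (the "application input").
data Expr = Lit Int | Var String | Add Expr Expr | Mul Expr Expr
  deriving Show

-- Partial evaluation: reduce everything that depends only on static
-- values, residualize the rest.
pe :: M.Map String Int -> Expr -> Expr
pe _   e@(Lit _) = e
pe env (Var x)   = maybe (Var x) Lit (M.lookup x env)
pe env (Add a b) = case (pe env a, pe env b) of
                     (Lit x, Lit y) -> Lit (x + y)
                     (Lit 0, b')    -> b'
                     (a', Lit 0)    -> a'
                     (a', b')       -> Add a' b'
pe env (Mul a b) = case (pe env a, pe env b) of
                     (Lit x, Lit y) -> Lit (x * y)
                     (Lit 1, b')    -> b'
                     (a', Lit 1)    -> a'
                     (a', b')       -> Mul a' b'

-- Specializing "scale * x + offset" with scale=1, offset=0 known
-- statically leaves just "Var x": the generic library code disappears.
main :: IO ()
main = print (pe (M.fromList [("scale", 1), ("offset", 0)])
                 (Add (Mul (Var "scale") (Var "x")) (Var "offset")))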

TR: it is risky to apply all possible transformations without a plan. SC allows applying TR in a controlled but automatic way:

  • slow for big code
  • suited to the optimization of small modules

AG can be:

  • compiled to strictly evaluating forms that are rather efficient.
  • optimized (fusion etc..)
  • compiled to incrementally updating forms

Using AG for optimizing FP code

From functional to AG for more efficient compiled code:

  • S. D. Swierstra, «Strictification of lazy functions», pp. 1–12.
  • J. Saraiva, D. Swierstra, and M. Kuiper, «Functional Incremental Attribute Evaluation», in Compiler Construction, vol. 1781, D. A. Watt, Ed. Berlin, Heidelberg: Springer, 2000, pp. 279–294.
  • P. Parigot, «Attribute grammars: a functional declarative language», Bulletin of Sociological Methodology/Bulletin de Méthodologie Sociologique, vol. 37, no. 1, pp. 55–57, 1995.
  • M. F. Kuiper and S. D. Swierstra, «Using attribute grammars to derive efficient functional programs», p. 15.
  • others saved on Zotero

N. Mitchell, «Transformation and Analysis of Functional Programs», p. 225. It converts second-order functional programs into first-order ones and then enables further optimizations. It extends supercompilation techniques. It performs inlining and specialization of functions.

“Stream Fusion: From Lists to Streams to Nothing at All” by Coutts, Leshchinskiy, and Stewart: a powerful stream fusion technique.
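A minimal sketch of the stream fusion idea from that paper, written as ordinary Haskell (GHC's real implementation additionally relies on rewrite rules to remove the stream/unstream conversions):

{-# LANGUAGE ExistentialQuantification #-}
{-# LANGUAGE BangPatterns #-}

-- Lists are converted to a non-recursive Stream type; the pipeline is
-- expressed on streams, so map and sum fuse into a single loop with no
-- intermediate list.
data Step a s = Done | Yield a s | Skip s

data Stream a = forall s. Stream (s -> Step a s) s

stream :: [a] -> Stream a
stream xs0 = Stream next xs0
  where
    next []     = Done
    next (x:xs) = Yield x xs

mapS :: (a -> b) -> Stream a -> Stream b
mapS f (Stream next s0) = Stream next' s0
  where
    next' s = case next s of
                Done       -> Done
                Skip s'    -> Skip s'
                Yield x s' -> Yield (f x) s'

sumS :: Num a => Stream a -> a
sumS (Stream next s0) = go 0 s0
  where
    go !acc s = case next s of
                  Done       -> acc
                  Skip s'    -> go acc s'
                  Yield x s' -> go (acc + x) s'

-- sum (map (+1) xs), written so that no intermediate list is built.
sumMapPlusOne :: [Int] -> Int
sumMapPlusOne = sumS . mapS (+ 1) . stream

main :: IO ()
main = print (sumMapPlusOne [1 .. 10])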

“Recycle Your Arrays!” by Leshchinskiy: transforms functional code into imperative code with in-place array modification.

“Two for the Price of One: A Model for Parallel and Incremental Computation” by Sadowski: a method that combines memoization, parallelism, and incremental updates, with speedups of up to 37x on CSS processing and other algorithms; not by Swierstra.

“Secrets of the Glasgow Haskell Compiler Inliner” by Simon Peyton Jones: practical tricks used for inlining functions.

“Flow-Directed Lightweight Closure Conversion” by Siskind: escape analysis.

RACR: an attribute grammar library for Racket.

https://github.com/christoff-buerger/racr

FACT: RACR implements Reference Attribute Grammars with Rewriting and is released under the permissive MIT license.

Graph rewriting is used for transforming the data, and Reference AG for querying it, and they are intertwined, i.e. the analysis suggests what to rewrite, and the rewrite triggers incremental re-evaluation. The algorithm has to be dynamic because in these cases static analysis is not effective/usable.

FACT: due to the usage of references, there are abstract syntax graphs instead of abstract syntax trees.

FACT: rewrite functions also include explicit operations like deleteElement or replaceSubtree. With value-based AG one has to simulate this by copying all elements except the element to delete, or by creating a Dok !replaceSubtree-like function that creates a clone of a value without the element.

FACT: attribute definitions are similar to Dok's:

  • automatic copy propagation
  • OO-like inheritance support of common definitions of attributes

FACT: there is an example using Racket and the Racket GUI, where RACR dynamically calculates which widgets to display or not, acting like an MVC but using only RACR. Very similar to what I want to obtain with the DokMelody GUI. https://github.com/christoff-buerger/racr/blob/master/examples/examples-overview.md

TODO check if this approach needs widgets as references and not as values

TODO RACR and JastAdd, which are reference AG systems, use name analysis as an example. So try to express the same thing using Dok-AG, to see if values can be used instead.
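Since Dok-AG syntax is not fixed here, a Haskell sketch of the value-based version of name analysis (hypothetical AST): the environment is an inherited attribute holding plain values, unresolved names are a synthesized attribute, and no node references are involved:

import qualified Data.Map as M

-- Name analysis with value-based attributes: the inherited attribute
-- maps names to the depth where they were declared (a plain value),
-- the synthesized attribute is the list of unresolved uses.
data Ast
  = Decl String Ast        -- declare a name, visible in the body
  | Use String             -- use of a name
  | Seq Ast Ast
  deriving Show

type Env = M.Map String Int  -- name -> declaration depth

unresolved :: Int -> Env -> Ast -> [String]
unresolved depth env (Decl x body) = unresolved (depth + 1) (M.insert x depth env) body
unresolved _     env (Use x)       = if M.member x env then [] else [x]
unresolved depth env (Seq a b)     = unresolved depth env a ++ unresolved depth env b

main :: IO ()
main = print (unresolved 0 M.empty
               (Decl "a" (Seq (Use "a") (Use "b"))))  -- ["b"]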

Type + Design by Contracts (static-typing, design-by-contract, formal-specification, formal-verification, use)

J. Gronski and C. Flanagan, «Unifying Hybrid Types and Contracts», p. 16.

Persistent Saving of Compilation Passes

Maybe it is better to save the compilation passes (AST and so on) in a fast memory-mapped database like LMDB. The benefits are:

  • it can compile code bigger than RAM
  • it can store all intermediate results, directly accessible, for incremental development and compilation
  • it is compatible with an image-based IDE
  • I can create different indexes (BDD, and so on…) used by the IDE for searching in the code
  • the index approach is similar to a DBMS
  • this is feasible only if the overhead of LMDB is acceptable with respect to storing the data structures directly in RAM

TODO check the difference in speed between accessing LMDB and RAM; if it is too slow, store on LMDB only the data structures accessed by the IDE
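A sketch of the key-value layout of compilation passes: the put/get store interface below is hypothetical and backed here by an in-memory Map so the example is self-contained; a real LMDB binding would sit behind the same interface:

import Data.IORef
import qualified Data.Map as M

-- Hypothetical key-value store interface (stand-in for LMDB).
data KVStore = KVStore (IORef (M.Map String String))

newStore :: IO KVStore
newStore = KVStore <$> newIORef M.empty

put :: KVStore -> String -> String -> IO ()
put (KVStore r) k v = modifyIORef' r (M.insert k v)

get :: KVStore -> String -> IO (Maybe String)
get (KVStore r) k = M.lookup k <$> readIORef r

-- Each compilation pass stores its result under a composite key, so the
-- IDE (or an incremental rebuild) can fetch any intermediate form directly.
data Ast = Ast String deriving (Show, Read)

storePass :: KVStore -> String -> String -> Ast -> IO ()
storePass db moduleName passName ast =
  put db (moduleName ++ "/" ++ passName) (show ast)

loadPass :: KVStore -> String -> String -> IO (Maybe Ast)
loadPass db moduleName passName =
  fmap read <$> get db (moduleName ++ "/" ++ passName)

main :: IO ()
main = do
  db <- newStore
  storePass db "Main.dok" "parsed" (Ast "module Main ...")
  loadPass db "Main.dok" "parsed" >>= print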

Compiler-plugins

P. Surana, «Meta-Compilation of Language Abstractions», p. 212.

A Scheme compiler written using AG, extensible by users with new language constructs. A useful paper because it shows a lot of AG rules implementing rewriting optimizations (sketched after the list below):

  • dead-code elimination
  • constant propagation
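As a sketch of those two rewrites on a toy expression language (Haskell, not Surana's AG notation; names are hypothetical):

import qualified Data.Map as M

data E = ENum Int | EVar String | EAdd E E | ELet String E E
  deriving Show

-- Constant propagation and folding: the inherited attribute is the map
-- of variables known to be constant.
cprop :: M.Map String Int -> E -> E
cprop _   (ENum n)     = ENum n
cprop env (EVar x)     = maybe (EVar x) ENum (M.lookup x env)
cprop env (EAdd a b)   = case (cprop env a, cprop env b) of
                           (ENum x, ENum y) -> ENum (x + y)
                           (a', b')         -> EAdd a' b'
cprop env (ELet x e b) = case cprop env e of
                           ENum n -> cprop (M.insert x n env) b    -- binding folded away
                           e'     -> ELet x e' (cprop (M.delete x env) b)

-- Dead-code elimination: a let whose variable is never used can be
-- dropped (safe here because expressions in E have no side effects).
dce :: E -> E
dce (ELet x e b) | x `notElem` freeVars b' = b'
                 | otherwise               = ELet x (dce e) b'
  where b' = dce b
dce (EAdd a b) = EAdd (dce a) (dce b)
dce e          = e

freeVars :: E -> [String]
freeVars (ENum _)     = []
freeVars (EVar x)     = [x]
freeVars (EAdd a b)   = freeVars a ++ freeVars b
freeVars (ELet x e b) = freeVars e ++ filter (/= x) (freeVars b)

main :: IO ()
main = print (dce (cprop M.empty
                (ELet "k" (ENum 2)
                  (ELet "unused" (EVar "y")
                    (EAdd (EVar "k") (ENum 3))))))   -- ENum 5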

Probably it is a validation of the point that AG are composable in the compiler domain, and I hope also in other domains.

FACT: for the UHC team the type rules are tedious to implement using AG, so they use Ruler, a declarative language for typing rules; Ruler specifications are then converted to AG by the Ruler compiler. But this approach is compatible with DSLs built on top of other DSLs and macros.

A. Chlipala, «The Bedrock structured programming system: combining generative metaprogramming and Hoare logic in an extensible program verifier», 2013, p. 391.

Macro system with combinable macros:

  • The bedrock structured programming system: combining generative metaprogramming and hoare logic in an extensible program verifier
  • macros are applied if some preconditions are met

Term Rewriting and Super-compilation

T. L. Veldhuizen, «Active Libraries and Universal Languages», p. 206.

It describes super-analysis, a method for combining two or more rewriting rules in a correct way, since a bad ordering of rewriting rules can generate slower code. It solves the phase-ordering problem.

It can also be used for injecting into the code proof-based/domain-specific high-level optimizations, like numeric tricks, numerical theorems, etc.

The system in the paper is generative, but the rewriting rules can be based on simple term-rewriting.

Super-analysis can be too slow.

Super-analysis and super-compilation (see the sketch after this list):

  • rewrite within the same language L
  • transformation rules are logical (CrocoPat-like)
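A minimal term-rewriting sketch in Haskell (not the paper's super-analysis): local algebraic rules applied bottom-up to a fixed point. The phase-ordering problem appears as soon as one rule enables or disables another, which is exactly what super-analysis tries to schedule well:

data T = K Int | V String | T :+: T | T :*: T
  deriving (Eq, Show)

-- One pass of local rewrite rules.
rule :: T -> T
rule (a :*: K 1)   = a                    -- x * 1 -> x
rule (K 1 :*: a)   = a                    -- 1 * x -> x
rule (a :+: K 0)   = a                    -- x + 0 -> x
rule (K 0 :+: a)   = a                    -- 0 + x -> x
rule (K a :+: K b) = K (a + b)            -- constant folding
rule (K a :*: K b) = K (a * b)
rule t             = t

-- Apply the rules bottom-up, then repeat until nothing changes.
rewrite :: T -> T
rewrite t = let t' = bottomUp t in if t' == t then t else rewrite t'
  where
    bottomUp (a :+: b) = rule (bottomUp a :+: bottomUp b)
    bottomUp (a :*: b) = rule (bottomUp a :*: bottomUp b)
    bottomUp leaf      = rule leaf

main :: IO ()
main = print (rewrite ((V "x" :*: K 1) :+: (K 2 :+: K 0)))  -- V "x" :+: K 2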

Transformation and Analysis of Functional Programs: N. Mitchell, «Transformation and Analysis of Functional Programs», p. 225.

S. Lerner, D. Grove, and C. Chambers, «Composing Dataflow Analyses and Transformations», Univ. of Washington and IBM T.J. Watson Research Center.

Efficient super-analysis used by GHC, for simulating supercompilation in an efficient way.

Crocopat/RML

CrocoPat relies on a Prolog-like language (RML), on relations, and on BDDs to solve them efficiently.

D. Beyer, A. Noack, and C. Lewerentz, «Efficient Relational Calculation for Software Analysis».

M. S. Lam, J. Whaley, V. B. Livshits, M. C. Martin, D. Avots, M. Carbin, and C. Unkel, «Context-Sensitive Program Analysis as Database Queries».

RML is based on BDD data structures.

MAYBE BDD-based static analyses can take a long time (minutes), so they are more suited to refactoring than to compilation.

MAYBE BDDs are good for pointer aliasing, but maybe this is not needed in Dok because it is value-based.

TODO these CrocoPat rules can be used for adding management contracts to the code. Rules like:

  • if you change class A, then check also whether the constraint is added to class B
  • if you change this, then clippy should suggest that
  • it should analyze not only the static code, but also the dynamic transformations introduced by patches
  • etc…
  • generic rules can be added for checking an application against common bugs

LGPL license.

It is written in C/C++, while bddbddb is written in Java and seems perhaps a bit slower (20 minutes instead of 10 minutes for a big analysis).

Here is an example of an RML (CrocoPat) program that detects degenerate inheritance, assuming the database contains all the relations linking the various packages.

DegeneratingInheritance(super, sub, degsub) := Inherit(degsub, sub) & Inherit(degsub, super) & TC(Inherit(sub, super));

where `TC` is a built-in combinator that computes the transitive closure of binary relations.
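To make the query concrete without CrocoPat, here is the same logic sketched in Haskell over an explicit set of pairs (this does not scale like the BDD-backed version; it only shows the relational reading):

import qualified Data.Set as S

type Rel = S.Set (String, String)

-- Transitive closure of a binary relation, computed to a fixed point.
tc :: Rel -> Rel
tc r = let r' = S.union r (compose r r) in if r' == r then r else tc r'
  where
    compose a b = S.fromList [ (x, z) | (x, y)  <- S.toList a
                                      , (y', z) <- S.toList b, y == y' ]

-- DegeneratingInheritance(super, sub, degsub) :=
--   Inherit(degsub, sub) & Inherit(degsub, super) & TC(Inherit)(sub, super)
degeneratingInheritance :: Rel -> [(String, String, String)]
degeneratingInheritance inherit =
  [ (super, sub, degsub)
  | (degsub,  sub)   <- S.toList inherit
  , (degsub', super) <- S.toList inherit
  , degsub == degsub'
  , (sub, super) `S.member` tc inherit ]

main :: IO ()
main = mapM_ print (degeneratingInheritance
          (S.fromList [("C", "B"), ("C", "A"), ("B", "A")]))  -- ("A","B","C")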

Refinement types

Refinement types are types with predicates. I prefer refinement types over dependent types because simple types + contracts are more readable, and because they can be seen as a compile-time extension of design-by-contract, which is instead checked at run-time.
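A small example of the style, written roughly in Liquid Haskell syntax (the {-@ ... @-} annotations are checked by the external LiquidHaskell tool; plain GHC only sees the ordinary Int types):

{-@ type NonZero = {v:Int | v /= 0} @-}

{-@ safeDiv :: Int -> NonZero -> Int @-}
safeDiv :: Int -> Int -> Int
safeDiv x y = x `div` y

ok :: Int
ok = safeDiv 10 2      -- accepted: 2 is provably non-zero

-- bad = safeDiv 10 0  -- rejected at compile time by the refinement checker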

TODO study:

TODO compare with abstract interpretation, where the type of an expression is derived also from its usage; but probably in that case we are not comparing a type specification (requirements/specifications) with its correct usage, so it is more a form of optimization, because we restrict the run-time complexity of the code.

TODO in Haskell's GRIN (for UHC) there is an optimization that annotates the GRIN code with constraints on the types. Check whether I can use it for adding design by contract.

Type Checking

Ruler Approach

A. Dijkstra and S. D. Swierstra, «Ruler: Programming Type Rules», in Functional and Logic Programming, vol. 3945, M. Hagiya and P. Wadler, Eds. Berlin, Heidelberg: Springer, 2006, pp. 30–46.

Ruler is a type-checking language, used in the EHC project. Using Ruler you specify the typing rules once, and it then derives:

  • AG code
  • LaTeX academic formalism

They say that using only AG can produce code that is a little too complex in some cases. This is their first version, using only AG: “Efficient relational calculation for software analysis” by Beyer

OCaml Type Inference (static-typing)

http://okmij.org/ftp/ML/generalization.html

Correctness by Design/Construction

“From the 1970s, Dijkstra’s chief interest was formal verification. The prevailing opinion at the time was that one should first write a program and then provide a mathematical proof of correctness. Dijkstra objected noting that the resulting proofs are long and cumbersome, and that the proof gives no insight on how the program was developed. An alternative method is program derivation, to “develop proof and program hand in hand”. One starts with a mathematical specification of what a program is supposed to do and applies mathematical transformations to the specification until it is turned into a program that can be executed. The resulting program is then known to be correct by construction.”

Persistent data-structures vs mutable data-structures

Dok should automatically derive when data structures can be mutated in place. But note that in Clojure and in many persistent data structures it is possible to open a transient phase, where a mutable subset (the transient) is managed by a function and only at the end merged back into the persistent data structure. So we get the best of both worlds, when in some parts of the code the persistent data structure is managed in a mutable way.
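The Haskell analogue of this transient phase, as a sketch assuming the vector package: V.modify runs a local mutable update inside ST on a copy, and the result is frozen back into an immutable value, so callers only ever see the persistent view:

import qualified Data.Vector as V
import qualified Data.Vector.Mutable as MV

-- Open a local mutable phase, mutate in place, freeze the result.
bumpFirst :: V.Vector Int -> V.Vector Int
bumpFirst v = V.modify (\mv -> do
                          x <- MV.read mv 0
                          MV.write mv 0 (x + 1)) v

main :: IO ()
main = print (bumpFirst (V.fromList [10, 20, 30]))  -- [11,20,30]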

Slow backend Compiler

I can have a slow backend compilation-process (also on distinct servers) executing:

  • slow/non interactive code checks
  • static analysis of code for common errors
  • compiling
  • advanced optimizations
  • execute slow regression tests

Optimization Levels vs Flexibility Levels (optimization)

Whole-program compilation analysis:

  • faster code
  • slow code generation
  • resulting code is not extensible at run time

It can be done only for the most important parts:

  • detected by instrumenting and profiling the code
  • parts without run-time settings

Other parts can use normal code:

  • fast compilation
  • development mode
  • can be integrated with external libraries
  • like GHCi approach

Other parts can use run-time code:

  • GUI
  • fully user definable at run-time
  • interpreted
  • slow but flexible

Or there can be a merge of these techniques in a JIT-like approach:

  • the compiler is part of the run-time environment
  • compiled code is cached
  • code is profiled
  • the most frequently executed code paths are merged and optimized

https://github.com/facebookincubator/SPARTA#sparta

Libraries for abstract interpretation.

Abstract Interpretation

Abstract Interpretation can be used for a lot of things: https://www.di.ens.fr/~cousot/AI/#CITECominiDamianiVrech08-SAS

  • type checking
  • call by need without using thunks
  • gradual typing

Execute code at compile-time using types instead of values (when the values are not known), to derive many useful properties of the code.
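A minimal abstract interpretation sketch over a sign domain (hypothetical toy language): expressions are "executed" on abstract values, so properties like "this result is positive" can be derived even when some inputs are unknown:

data Sign = Neg | Zero | Pos | AnySign
  deriving (Eq, Show)

data Expr = Lit Int | Input | Add Expr Expr | Mul Expr Expr

absLit :: Int -> Sign
absLit n | n < 0     = Neg
         | n == 0    = Zero
         | otherwise = Pos

addS :: Sign -> Sign -> Sign
addS Zero s    = s
addS s    Zero = s
addS Pos  Pos  = Pos
addS Neg  Neg  = Neg
addS _    _    = AnySign          -- e.g. Pos + Neg: unknown sign

mulS :: Sign -> Sign -> Sign
mulS Zero _    = Zero
mulS _    Zero = Zero
mulS Pos  Pos  = Pos
mulS Neg  Neg  = Pos
mulS Pos  Neg  = Neg
mulS Neg  Pos  = Neg
mulS _    _    = AnySign

-- Abstract interpretation: Input is a run-time value we know nothing about.
absEval :: Expr -> Sign
absEval (Lit n)   = absLit n
absEval Input     = AnySign
absEval (Add a b) = addS (absEval a) (absEval b)
absEval (Mul a b) = mulS (absEval a) (absEval b)

main :: IO ()
main = print (absEval (Add (Mul Input (Lit 0)) (Lit 5)))  -- Pos, even though Input is unknown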

Type errors become compile-time exceptions. This works well with gradual typing.

Requirements that cannot be proved can be signaled.

Code can be optimized if it is used only in a certain way.

Code that cannot be checked statically is checked at run-time, and it generates an exception of type “error in the code”.

Some invariants that are heavy to test are checked at run-time, but only during the debugging/test phase.

It is a form of partial evaluation.

It is a form of global program analysis, because it simulates the real usage of a certain piece of code, and so it can apply global optimizations.

The idea is that compilation with global analysis is often needed in complex DSLs for applying powerful transformations. These are difficult to express using an interpreter. So in Dok I can:

  • apply abstract interpretation for simplifying the code
  • applying global analysis rules on simplified code
  • iterate

Typing and optimizations are the same idea here:

  • abstract interpretation is execution of the code at compile-time
  • the code to compile becomes the result of the abstract interpretation
  • @Require and @Ensure clauses that are not provable at compile-time are checked at run-time and generate error-in-the-code exceptions
  • unikernel-like approach
  • similar to the SmartEiffel idea of global compilation of code, deriving compact tags for polymorphic calls and simplifying the code
  • during development one can execute unoptimized code
  • high-level analysis compilation can be seen as the interpreter first calling a “global” analysis of certain types of code

See for example waitAll in the Asterisell approach, where the Haskell type system was not powerful enough for expressing a rather simple type constraint.

MAYBE I can view “f := 2” as “f::Int, f = 2” and not only “f::Int”, because in this case I have the correct info that “f is 2”. So I have a series of contracts added to a variable, with the most precise value inferred using abstract interpretation. Type errors can be raised at compile-time (when I am sure of an error) or at run-time, with an “error in the code” exception, when they have to be tested at run-time.

TODO see also S. Collin, D. Colnet, and O. Zendra, «Type Inference for Late Binding. The SmallEiffel Compiler.», and O. Zendra, D. Colnet, and S. Collin, «Efficient Dynamic Dispatch without Virtual Function Tables. The SmallEiffel Compiler.», p. 17, because they can be used for transforming abstract, metaprogramming-based Dok code into efficient code with a lot of static binding.

NOTE: in any case types (and also refinement types) are useful, because types capture the code specification, while abstract interpretation captures the code implementation.

Partial Evaluation

TODO study the difference between tracing (probably RPython), partial evaluation (probably Graal), and abstract interpretation

http://graphit-lang.org/

Parallel language:

  • it synthesizes the algorithm from a generic language, according to the size and structure of the data set
  • a lot faster than other graph libraries

TODO see the language format for inspiration. But probably there are other graph-management libraries to use.

Rewriting rules

TODO study supercompilation techniques because they are based on rewriting rules

OO Semantics (object-oriented, formal-specification)

FACT: Dok is mainly value-based, while OO is mainly reference-based, but this is maybe still useful.

B. Meyer, «Towards a Calculus of Object Programs», in Patterns, Programming and Everything, K. K. Breitman and R. N. Horspool, Eds. London: Springer, 2012, pp. 91–127.

«Coping with aliasing in the GNU Eiffel compiler», no. 0.

B. Meyer and A. Kogtenkov, «Negative Variables and the Essence of Object-Oriented Programming», in Specification, Algebra, and Software, vol. 8373, S. Iida, J. Meseguer, and K. Ogata, Eds. Berlin, Heidelberg: Springer, 2014, pp. 171–187.

O. Zendra, D. Colnet, and S. Collin, «Efficient Dynamic Dispatch without Virtual Function Tables. The SmallEiffel Compiler.», p. 17.

P. Ribet, C. Adrian, O. Zendra, and D. Colnet, «Conformance of agents in the Eiffel language.», The Journal of Object Technology, vol. 3, no. 4, p. 125, 2004.

Instruction Cache Miss

It seems that instruction cache misses are very, very expensive: for MySQL, reducing the instruction cache misses only from 10% to 8% made a roughly 40% performance difference!

Use LLVM tools and similar that can analyze the code at run-time and then recompile it in a way that reduces cache misses. Probably a JIT does this automatically.