LISC v2: Learning Instruction Semantics from Code Generators
Lifting assembly into a higher-level intermediate language is an essential step for any binary analysis and instrumentation system. Existing systems do not scale well because they often require a great deal of manual effort to support new architectures. Moreover, full instruction sets may not be captured even for well-known architectures such as x86 or x86_64. For example, mature systems such as Valgrind still lack support for AVX, FMA4, and SSE4.1 on x86 processors.
To overcome these difficulties, we have developed a novel approach that extracts knowledge of instruction set semantics embedded in modern compilers. In particular, we choose GCC's RTL because it can capture semantics of instruction sets on hardware registers, which is required for lifting assembly. Our experimental evaluation demonstrates LISC's ability to support diverse architectures, as well as its correctness and completeness.
What does the IR look like?
The IR (intermediate representation) makes the semantics of instructions explicit. Below is an example that maps a div instruction to its IR. Note that neither ax nor dx is an operand of the assembly instruction, but the IR shows that the quotient is placed in the ax register and the remainder in the dx register.
divl %r8d;
(parallel [(set (reg:SI ax) (udiv:SI (reg:SI ax) (reg:SI r8)))
           (set (reg:SI dx) (umod:SI (reg:SI ax) (reg:SI r8)))
           (clobber (reg:CC flags))])
Note also the limitation of the IR: it does not indicate how the flags are changed, but it does capture the fact that the flags are modified ("clobbered"). For this reason, the IR is useful for (sound) static analysis, but not for faithful execution.
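To illustrate how an analysis might consume this IR, here is a minimal sketch in Python that parses the RTL for the div example and lists which registers are written by a set and which are only marked as clobbered. The helper names (tokenize, parse, effects) are hypothetical and are not part of LISC, whose own code is written in OCaml; the sketch also assumes the simplified register syntax shown on this page, not the full RTL grammar.

```python
import re

# RTL for "divl %r8d", copied from the example on this page.
RTL = ("(parallel [(set (reg:SI ax) (udiv:SI (reg:SI ax) (reg:SI r8))) "
       "(set (reg:SI dx) (umod:SI (reg:SI ax) (reg:SI r8))) "
       "(clobber (reg:CC flags))])")

def tokenize(text):
    # Parentheses and brackets are single tokens; anything else is an atom.
    return re.findall(r"[()\[\]]|[^\s()\[\]]+", text)

def parse(tokens):
    # Parse one expression from the front of the token list (consumed in place).
    tok = tokens.pop(0)
    if tok in ("(", "["):
        close = ")" if tok == "(" else "]"
        items = []
        while tokens[0] != close:
            items.append(parse(tokens))
        tokens.pop(0)  # drop the closing delimiter
        return items
    return tok

def is_reg(node):
    # Matches the simplified form (reg:MODE name) used in the example.
    return (isinstance(node, list) and len(node) == 2
            and isinstance(node[0], str) and node[0].startswith("reg:"))

def effects(rtx):
    # Return (defined, clobbered): registers assigned by a set, and
    # registers whose new value the IR leaves unspecified.
    defined, clobbered = [], []

    def walk(node):
        if not isinstance(node, list) or not node:
            return
        head = node[0]
        if head == "set" and is_reg(node[1]):
            defined.append(node[1][1])
        elif head == "clobber" and is_reg(node[1]):
            clobbered.append(node[1][1])
        else:
            for child in node:
                walk(child)

    walk(rtx)
    return defined, clobbered

defined, clobbered = effects(parse(tokenize(RTL)))
print(defined, clobbered)  # ['ax', 'dx'] ['flags']
```

A sound static analysis built along these lines would treat every clobbered register as holding an unknown value after the instruction, which is why the IR suffices for analysis but not for faithful execution.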
What is the target audience?
LISC targets the developers of binary analysis and instrumentation systems, and provides them a unique capability: it can support any processor that is supported by GCC. Moreover, it can capture the semantics of all instructions used by the compiler, including floating point instructions and the latest extensions to the instruction set.
To support a new processor, a developer needs to modify the included assembly parser. Typically, changes are limited to a few tens of lines of OCaml code. (If you are interested in the ubiquitous x86_64 architecture or ARM, all needed code is already included; but keep in mind that the former is better debugged.)
If you are interested in a specific architecture, then some of the LISC code (namely, its learning component) is not necessary. In the near future, we expect to make releases that include complete maps for some of the most popular architectures. (You can still benefit from our tool that lifts an entire binary to IR.)
Version 2 Software Release
LISC (Learning Instruction-set Semantics using Code Generator) is a learning-based system that automatically builds assembly-to-IR translators using the code generators of modern compilers. Specifically, this release contains software for:
- learning x86_64 assembly to GCC RTL translation, and
- lifting x86_64 assembly snippets to our IR, which is GCC RTL.
Note that the generated GCC RTL is architecture-independent, except for the fact that it uses hardware registers that are defined for a specific architecture (x86_64 in this case).
This version 2 release aims to
- improve on the stability of our previous release, and
- support the x86_64 architecture.
Also included is code for ARM and AVR architectures, but this code has not been updated since the previous release of this software.
The previous version of LISC is no longer current, but may still be accessed here.
This work was supported by ONR grant N00014-17-1-2891.