Introduction

lawk is a JIT compiler for awk that targets LLVM (http://www.llvm.org). It was born as a way of experimenting with the LLVM infrastructure in a setup more complex than the tutorials. As such, its main purpose is didactic, and I do not foresee it becoming a real alternative to the existing implementations of awk. It is built on top of GNU awk.

State

A significant part of awk is already implemented, but there is still a long way to go in order to cover the most common cases.

Compilation of functions (the begin and end blocks, the processing blocks and user-defined functions) is already in place, as are most of the control structures (if, for, while, do). References to global variables and to field variables ($0, $1, etc.) are supported.

The following operators/functions are implemented:

The missing list:

Internals

The JIT compiler generates LLVM-style assembler. The begin block, the end block and each user-defined function get a function of their own, while all the other processing blocks are generated within a single function. All variables are converted to an internal representation that uses pointer tagging to distinguish among the different types. Integers are inlined in the value, while for strings and doubles the value held by the variable is a pointer to the corresponding object.
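The tagging scheme can be sketched in C as follows. This is only an illustration, not lawk's actual code: the three-bit tag width is inferred from the examples further down (the constant 2 is emitted as 16, and 0 as 0), while the tag values for strings and doubles, the type name value_t and all helper names are assumptions made for the sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: the three low bits hold the type tag (inferred from
   2 being emitted as 16). The concrete tag values are hypothetical. */
#define TAG_BITS 3
#define TAG_MASK ((uint64_t)((1u << TAG_BITS) - 1))

enum tag { TAG_INT = 0, TAG_STRING = 1, TAG_DOUBLE = 2 };

typedef uint64_t value_t;

/* Which kind of value is stored in v. */
enum tag tag_of(value_t v) {
    return (enum tag)(v & TAG_MASK);
}

/* Integers are inlined: the payload sits above the tag bits. */
value_t tag_int(int64_t n) {
    return ((uint64_t)n << TAG_BITS) | TAG_INT;
}

value_t untag_int_check(value_t v) {
    assert(tag_of(v) == TAG_INT);
    return v;
}

int64_t untag_int(value_t v) {
    /* Dividing by 2^TAG_BITS undoes the shift and keeps the sign. */
    return (int64_t)untag_int_check(v) / (1 << TAG_BITS);
}

/* Strings and doubles are stored as a pointer to the object, with the
   tag in the low bits. This requires the object to be 8-byte aligned. */
value_t tag_ptr(const void *p, enum tag t) {
    uintptr_t bits = (uintptr_t)p;
    assert((bits & TAG_MASK) == 0);
    return (value_t)bits | (value_t)t;
}

void *untag_ptr(value_t v) {
    return (void *)(uintptr_t)(v & ~TAG_MASK);
}
```

With these definitions tag_int(2) yields 16, and an integer tag of zero means that a zero-initialised global already holds the integer 0.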

Examples

Let's start with a simple script that prints a number:

$> ./gawk -c 'BEGIN { print 2; }'                ;(1)

will generate:

@iconst = constant i64 16                        ;(2)

define void @main_fun() {                        ;(3)
entry:
  ret void
}

define void @begin_fun() {
entry:
  br label %then_block

then_block:
  call void @print_ppointer( i64* @iconst )      ;(4)
  br label %if_end_block

if_end_block:
  ret void
}

define void @end_fun() {
entry:
  ret void
}

declare void @print_ppointer(i64*)               ;(5)

(1) the -c flag tells gawk to use the compiler. This is needed because lawk coexists with the GNU awk implementation.

(2) the constant number 2 is already tagged here, which is why it shows up as 16.

(3) main_fun contains all the processing blocks. There are none in this case, but it is always defined. The same goes for the begin and end functions: even if they are not part of the script, they are defined with an empty body.

(4) the call to print. Note that the function name is mangled according to its arguments: in this case we are passing a pointer to a 'pointer' object, so the suffix becomes ppointer.

(5) the extern declaration generated by LLVM.
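The name mangling mentioned in (4) can be illustrated with a small C sketch. The helper mangle and the enum arg_kind are hypothetical names introduced here; the only facts taken from the generated code are the resulting names, such as print_ppointer or plus_ppointer_pointer, where each argument contributes a _pointer or _ppointer suffix.

```c
#include <assert.h>
#include <string.h>

/* The two argument kinds that appear in the generated code:
   a tagged 'pointer' value, or a pointer to one. */
enum arg_kind { ARG_POINTER, ARG_PPOINTER };

/* Hypothetical helper: builds base + one suffix per argument into out.
   Returns 0 on success, -1 if the buffer is too small. */
int mangle(char *out, size_t cap, const char *base,
           const enum arg_kind *args, size_t nargs) {
    assert(out != NULL && base != NULL);

    size_t used = strlen(base);
    if (used >= cap)
        return -1;
    memcpy(out, base, used + 1);

    for (size_t i = 0; i < nargs; i++) {
        const char *suffix =
            args[i] == ARG_PPOINTER ? "_ppointer" : "_pointer";
        size_t len = strlen(suffix);
        if (used + len >= cap)
            return -1;
        memcpy(out + used, suffix, len + 1); /* copies the terminator too */
        used += len;
    }
    return 0;
}
```

Calling mangle with the base name "print" and a single ARG_PPOINTER argument produces "print_ppointer", the name seen in the listing above.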

This second example sums up all the numbers in a file and prints out the total (only the important parts of the generated code are shown):

$> ./gawk -c '{ sum += $0; } END { print sum; }' < numbers

Generates:

@sum = global i64 0                                          ;(1)
@iconst = constant i64 0

define void @main_fun() {
entry:
  br label %then_block

then_block:
  call i64 @retrieve_field_ppointer( i64* @iconst )          ;%0 (2)
  call i64 @plus_ppointer_pointer( i64* @sum, i64 %0 )       ;%1 (3)
  call void @assign_ppointer_pointer( i64* @sum, i64 %1 )    ;(4)
  br label %if_end_block

if_end_block:
  ret void
}

define void @end_fun() {
entry:
  br label %then_block

then_block:
  call void @print_ppointer( i64* @sum )                     ;(5)
  br label %if_end_block

if_end_block:
  ret void
}

(1) declaration of the global values and constants used in the functions. In this case there are two: one for 'sum' and the other for the field index in '$0'.

(2) calling retrieve_field to get the value of $0. The argument is the constant previously defined. Note that the constant is also a tagged 'pointer'.

(3) call to plus with sum and the result of the previous function call (denoted by %0). This line is a partial translation of sum+=$0 to sum+$0.

(4) the assignment part of sum+=$0, taking as arguments 'sum' and the result of the previous function call.

(5) calling the print function with 'sum' as parameter.
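To make the data flow of steps (2) to (4) concrete, here is a C sketch that mirrors the body of main_fun. The three runtime functions carry the names seen in the listing, but their bodies are stand-ins written for this illustration: they handle integers only, assume the three-bit tag inferred from the first example, and read the record from a string set by the hypothetical helper set_record instead of real input.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint64_t value_t;
#define TAG_BITS 3 /* assumption: integers carry tag 0 in the low 3 bits */

value_t make_int(int64_t n) { return (uint64_t)n << TAG_BITS; }
int64_t int_of(value_t v)   { return (int64_t)v / (1 << TAG_BITS); }

/* Stand-in for the current input record. */
static const char *current_record = "0";
void set_record(const char *line) { current_record = line; }

/* (2) fetch a field; this sketch only knows about $0. */
value_t retrieve_field_ppointer(const value_t *index) {
    assert(int_of(*index) == 0);
    return make_int(strtoll(current_record, NULL, 10));
}

/* (3) sum + $0: the left operand is passed by address, the right by value. */
value_t plus_ppointer_pointer(const value_t *lhs, value_t rhs) {
    return make_int(int_of(*lhs) + int_of(rhs));
}

/* (4) store the result back into the variable. */
void assign_ppointer_pointer(value_t *dst, value_t src) {
    *dst = src;
}

value_t sum = 0;          /* @sum = global i64 0 */
const value_t iconst = 0; /* @iconst = constant i64 0 */

/* Mirrors main_fun from the listing; runs once per input record. */
void main_fun(void) {
    value_t t0 = retrieve_field_ppointer(&iconst);  /* %0 */
    value_t t1 = plus_ppointer_pointer(&sum, t0);   /* %1 */
    assign_ppointer_pointer(&sum, t1);
}
```

Feeding the records "3" and "4" through main_fun leaves sum holding the tagged value of 7.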

To see the generated assembler, pass the -a flag to lawk.