SPAWK - SQL Powered AWK

AWK is a convenient and expressive programming language that can be used in a wide variety of computing and data manipulation tasks. MySQL is an open-source relational database management system (RDBMS), while MariaDB is a community-developed fork of the MySQL RDBMS, created after Oracle corporation acquired Sun Micorsystems (and MySQL) at 2008. MariaDB intends to maintain high compatibility with MySQL, ensuring a drop-in replacement capability with library binary equivalency and exact matching with MySQL APIs and commands. SPAWK enables AWK to submit SQL queries to MySQL/MariaDB relational databases; query results are returned back to AWK for further processing.

Introduction to AWK

AWK is a well known programmable pattern matching and data manipulation software tool found in every UNIX-like operating system. Every AWK program is in the form of:

pattern { action }
pattern { action }
...

AWK reads textual input on a line by line basis. Each line is checked against every one of the given patterns and whenever a line matches a pattern, the corresponding action is taken. The keyword next can be used in an action to skip the remaining patterns. There is a special pattern named BEGIN with an action to be taken before any input line has been read. Another special pattern named END can be used for actions to be taken after all input lines has been read.

AWK splits each input line in fields. Every single word of the input line is considered as a field, but that can be altered using the special variable FS (field separator). The total number of fields in a line is stored in the NF variable, while the fields can be accessed through the $1, $2… special variables. The whole input line itself can be accessed through the special variable $0.

Given a list of integer numbers, the following program will count the multiples of 2, 3, 5 and 7:

BEGIN { mul2 = 0; mul3 = 0; mul5 = 0; mul7 = 0 }
($0 % 2) == 0 { mul2++ }
($0 % 3) == 0 { mul3++ }
($0 % 5) == 0 { mul5++ }
($0 % 7) == 0 { mul7++ }
END { print mul2, mul3, mul5, mul7 }

Given the input number 8 only one action is taken, because the remainder of the division of 8 by 2 is zero, while the divisions of 8 by 3, 5 and 7 are non-zero. However, for the input number 42 three actions are taken because the remainder of division of 42 by 2, 3 and 7 are zero. For the input number 97 no action is taken because 97 is a prime number.

Using SQL to produce primary data

Without using SPAWK, the only way to process data stored in a MySQL/MariaDB database with AWK is to extract the desired data from the database and then pass the extracted data to AWK. Assume that we have a chat website based on a MySQL/MariaDB database. There exists a user table in the database, holding a record for each user registered user. There also exists a relation table holding the relations between users, that is users marked as friends or blocked by other users.

We want to produce a report with all registered users, along with the count of friends and blocked users for every one of them. The report must be sorted by login names. We may use plain vanilla SQL to produce such a report, but maybe it's not so easy to do so. In order to avoid COUNT syntax complexities which may lead to erroneus results, we decide to use a couple of SQL scripts to extract the desired data from the database and then pass the results to AWK to produce the final report. First thing to do is to produce a list of all registered users, even those users that are not yet related with other users:

SELECT `login` FROM `user`;

Use the mysql standard client program to run the above query and store the result rows in a file. Each row consists of just one column, namely the login name of the user:

panos
maria
smith
chris
arnold
brian
alfred
peter
...

Next thing to do is to produce relations' counts for all users having related users:

SELECT `user`, `relationship`, COUNT(*) FROM `relation` GROUP BY `user`, `relationship`;

The output of the above query contains three columns, namely the login name, the relation kind (FRIEND or BLOCKED) and the corresponding count:

alfred		FRIEND		22
alfred		BLOCKED		13
arnold		BLOCKED		 3
brian		FRIEND		18
brian		BLOCKED		 2
maria		FRIEND		 6
panos		FRIEND		10
panos		BLOCKED		 7
...

We can now sort all of the above output and pass the sorted data to AWK. Lines read by AWK will have either one field (output from the first query), or three fields (output from the second query):

alfred
alfred		FRIEND		22
alfred		BLOCKED		13
arnold
arnold		BLOCKED		 3
brian
brian		FRIEND		18
brian		BLOCKED		 2
chris
maria
maria		FRIEND		 6
panos
panos		FRIEND		10
panos		BLOCKED		 7
...

Using AWK to process primary data

The fact is that because the data are sorted, there exists one line with the user login name for all registered users and after each name may follow at most two lines with relations' counts. To form the desired report we pass the sorted data to AWK:

NF == 1 {
	# New user encountered. Print previous user data.

	if ($1 != user)
	print_user()

	# Now keep new user name in mind and reset counters.

	user = $1
	delete count

	next
}

# This line comes from the second SQL query. First column is
# the user name, second column is the relationship (FRIEND or
# BLOCKED) and third column is the the relevant count.

NF == 3 {
	count[$2] = $3
	next
}

{
	print $0 ": syntax error" >"/dev/stderr"
}

# After all input has been read, the last user statistics must
# be printed.

END {
	print_user()
}

function print_user() {
	if (user)
	print user, count["FRIEND"] + 0, count["BLOCKED"] + 0
}

The output of the above AWK script will be:

alfred		22	13
arnold		 0	 3
brian		18	 2
chris		 0	 0
maria		 6	 0
panos		10	 7
...

Assuming that the above scripts were stored as relcnt1.sql, relcnt2.sql and relcnt.awk, we can run the following one-liner to produce the desired report:

sort <$(mysql <relcnt1.sql) <$(mysql <relcnt2.sql) | awk -f relcnt.awk

Of course, mysql must run with the appropriate options for user authentication and to produce raw results free of any kind of headers.

Using SPAWK for elegance and simplicity

This was a pretty cumbershome and tedious process to carry out a simple report. Moreover, all of the above lack any elegance at all and this is usually a sign of bad software engineering. But there is no reason to dispair, here comes SPAWK to our rescue:

@load "spawk"

BEGIN {
	spawk_sesami["dbuser"] = "chat"
	spawk_sesami["dbpassword"] = "xxx"
	spawk_sesami["dbname"] = "chat"

	spawk_submit("SELECT `login` FROM `user` ORDER BY `login`")

	while (spawk_fetchrow(user))
	process_user(user[1])
}

function process_user(login,		relation, count) {
	spawk_submit("SELECT `relationship` FROM `relation` WHERE `user` = " spawk_escape(user[1]))

	while (spawk_fetchrow(relation))
	count[relation[1]]++

	print user[1], count["FRIEND"] + 0, count["BLOCKED"] + 0
}

In the BEGIN section we provide the credentials for SPAWK to open the chat database and then submit the main SQL query to select the login names of all the registered users in the desired order. For each user selected we submit another SQL query to select the user's relations. For each selected relation we increase the count of the correspondig relationship (FRIEND or BLOCKED). After processing all the relations for each user, we print the user's login name and the relations' counts of interest.

Loading the SPAWK dynamic extension

In order to run SQL from within AWK scripts, AWK must be supplied with an API of AWK functions. This API is the spawk.so dynamic extension library and it's just a collection of less than ten AWK functions, along with some gobal variables, all of them named as spawk_something, e.g. spawk_submit() (function), spawk_fetchrow() (function), spawk_sesami (array), spawk_verbose (variable) etc.

In order for AWK to be equipped with the SPAWK API functions and variables, the spawk.so shared library must be loaded. This can be achieved either by including the spawk.so shared library in the command line:

awk --load spawk -f relcnt.awk

or by loading the spawk.so shared library inside the main AWK script using the @load directive:

@load "spawk"

BEGIN {
...

In order to avoid using full pathnames it's advisable to set AWKLIBPATH accordingly, e.g. by adding a line in the /etc/environment file:

export AWKLIBPATH="/home/panos/lib/gawk:/usr/local/lib/gawk"

It's even better for spawk.so file to be located in the /usr/local/lib/gawk directory because this directory is included to the default AWKLIBPATH, so there's no need to set or change anything; however, you may need administrative permissions to locate the spawk.so file in the /usr/local/lib/gawk directory.