SPAWK - SQL Powered AWK
AWK is a convenient and expressive programming language that can be used in a wide variety of computing and data manipulation tasks. MySQL is an open-source relational database management system (RDBMS), while MariaDB is a community-developed fork of the MySQL RDBMS, created after Oracle corporation acquired Sun Micorsystems (and MySQL) at 2008. MariaDB intends to maintain high compatibility with MySQL, ensuring a drop-in replacement capability with library binary equivalency and exact matching with MySQL APIs and commands. SPAWK enables AWK to submit SQL queries to MySQL/MariaDB relational databases; query results are returned back to AWK for further processing.
Introduction to AWK
AWK is a well known programmable pattern matching and data manipulation software tool found in every UNIX-like operating system. Every AWK program is in the form of:
pattern { action } pattern { action } ...
AWK reads textual input on a line by line basis.
Each line is checked against every one of the given patterns
and whenever a line matches a pattern, the corresponding action is taken.
The keyword next
can be used in an action to skip the remaining patterns.
There is a special pattern named BEGIN
with an action to be
taken before any input line has been read.
Another special pattern named END
can be used for actions to be taken
after all input lines has been read.
AWK splits each input line in fields. Every single word of the input
line is considered as a field, but that can be altered using the special
variable FS
(field separator).
The total number of fields in a line is stored in the NF
variable, while the fields can be accessed through the $1
,
$2
… special variables. The whole input line itself can be
accessed through the special variable $0
.
Given a list of integer numbers, the following program will count the multiples of 2, 3, 5 and 7:
BEGIN { mul2 = 0; mul3 = 0; mul5 = 0; mul7 = 0 } ($0 % 2) == 0 { mul2++ } ($0 % 3) == 0 { mul3++ } ($0 % 5) == 0 { mul5++ } ($0 % 7) == 0 { mul7++ } END { print mul2, mul3, mul5, mul7 }
Given the input number 8 only one action is taken, because the remainder of the division of 8 by 2 is zero, while the divisions of 8 by 3, 5 and 7 are non-zero. However, for the input number 42 three actions are taken because the remainder of division of 42 by 2, 3 and 7 are zero. For the input number 97 no action is taken because 97 is a prime number.
Using SQL to produce primary data
Without using SPAWK, the only way to process data stored in a MySQL/MariaDB database with AWK is to extract the desired data from the database and then pass the extracted data to AWK. Assume that we have a chat website based on a MySQL/MariaDB database. There exists a user table in the database, holding a record for each user registered user. There also exists a relation table holding the relations between users, that is users marked as friends or blocked by other users.
We want to produce a report with all registered users, along with the count of
friends and blocked users for every one of them.
The report must be sorted by login names.
We may use plain vanilla SQL to produce such a report, but maybe it's not so easy to
do so.
In order to avoid COUNT
syntax complexities which may lead to
erroneus results, we decide to use a couple of SQL scripts to extract the desired
data from the database and then pass the results to AWK to produce the final report.
First thing to do is to produce a list of all registered users,
even those users that are not yet related with other users:
SELECT `login` FROM `user`;
Use the mysql
standard client program to run the above query and
store the result rows in a file.
Each row consists of just one column, namely the login name of the user:
panos maria smith chris arnold brian alfred peter ...Next thing to do is to produce relations' counts for all users having related users:
SELECT `user`, `relationship`, COUNT(*) FROM `relation` GROUP BY `user`, `relationship`;
The output of the above query contains three columns, namely the login name, the relation kind (FRIEND or BLOCKED) and the corresponding count:
alfred FRIEND 22 alfred BLOCKED 13 arnold BLOCKED 3 brian FRIEND 18 brian BLOCKED 2 maria FRIEND 6 panos FRIEND 10 panos BLOCKED 7 ...
We can now sort all of the above output and pass the sorted data to AWK. Lines read by AWK will have either one field (output from the first query), or three fields (output from the second query):
alfred alfred FRIEND 22 alfred BLOCKED 13 arnold arnold BLOCKED 3 brian brian FRIEND 18 brian BLOCKED 2 chris maria maria FRIEND 6 panos panos FRIEND 10 panos BLOCKED 7 ...
Using AWK to process primary data
The fact is that because the data are sorted, there exists one line with the user login name for all registered users and after each name may follow at most two lines with relations' counts. To form the desired report we pass the sorted data to AWK:
NF == 1 { # New user encountered. Print previous user data. if ($1 != user) print_user() # Now keep new user name in mind and reset counters. user = $1 delete count next } # This line comes from the second SQL query. First column is # the user name, second column is the relationship (FRIEND or # BLOCKED) and third column is the the relevant count. NF == 3 { count[$2] = $3 next } { print $0 ": syntax error" >"/dev/stderr" } # After all input has been read, the last user statistics must # be printed. END { print_user() } function print_user() { if (user) print user, count["FRIEND"] + 0, count["BLOCKED"] + 0 }
The output of the above AWK script will be:
alfred 22 13 arnold 0 3 brian 18 2 chris 0 0 maria 6 0 panos 10 7 ...
Assuming that the above scripts were stored as relcnt1.sql
,
relcnt2.sql
and relcnt.awk
, we can run
the following one-liner to produce the desired report:
sort <$(mysql <relcnt1.sql) <$(mysql <relcnt2.sql) | awk -f relcnt.awk
Of course, mysql
must run with the appropriate options for
user authentication and to produce raw results free of any kind of headers.
Using SPAWK for elegance and simplicity
This was a pretty cumbershome and tedious process to carry out a simple report. Moreover, all of the above lack any elegance at all and this is usually a sign of bad software engineering. But there is no reason to dispair, here comes SPAWK to our rescue:
@load "spawk" BEGIN { spawk_sesami["dbuser"] = "chat" spawk_sesami["dbpassword"] = "xxx" spawk_sesami["dbname"] = "chat" spawk_submit("SELECT `login` FROM `user` ORDER BY `login`") while (spawk_fetchrow(user)) process_user(user[1]) } function process_user(login, relation, count) { spawk_submit("SELECT `relationship` FROM `relation` WHERE `user` = " spawk_escape(user[1])) while (spawk_fetchrow(relation)) count[relation[1]]++ print user[1], count["FRIEND"] + 0, count["BLOCKED"] + 0 }
In the BEGIN
section we provide the credentials for SPAWK
to open the chat database and then submit the main SQL query to select
the login names of all the registered users in the desired order.
For each user selected we submit another SQL query to select the user's relations.
For each selected relation we increase the count of the correspondig relationship
(FRIEND or BLOCKED).
After processing all the relations for each user, we print the user's login name
and the relations' counts of interest.
Loading the SPAWK dynamic extension
In order to run SQL from within AWK scripts, AWK must be supplied with an API of
AWK functions. This API is the spawk.so
dynamic extension library and it's just a collection
of less than ten AWK functions, along with some gobal variables, all of them named
as spawk_
something, e.g. spawk_submit()
(function),
spawk_fetchrow()
(function), spawk_sesami
(array),
spawk_verbose
(variable) etc.
In order for AWK to be equipped with the SPAWK API functions and variables,
the spawk.so
shared library must be loaded.
This can be achieved either by including the spawk.so
shared library in the command line:
awk --load spawk -f relcnt.awk
or by loading the spawk.so
shared library inside the main AWK script using the
@load
directive:
@load "spawk" BEGIN { ...
In order to avoid using full pathnames it's advisable to set AWKLIBPATH
accordingly, e.g. by adding a line in the /etc/environment
file:
export AWKLIBPATH="/home/panos/lib/gawk:/usr/local/lib/gawk"
It's even better for spawk.so
file to be located in the
/usr/local/lib/gawk
directory because this directory is included
to the default AWKLIBPATH
, so there's no need to set or change anything;
however, you may need administrative permissions to locate the spawk.so
file in the /usr/local/lib/gawk
directory.