SPAWK Tutorial
In this page we setup a MySQL/MariaDB test database and use SPAWK to access this database with AWK. You can download most of the files and scripts found in this page and run everything locally, though you must be careful in order not to damage your databases or other files or scripts. Have fun!
Download and install SPAWK
In order to run this tutorial, SPAWK must be downloaded and installed locally.
After downloading SPAWK tarball, you may have to download the MySQL/MariaDB
libmysqlclient.so
dynamic library and
mysql_config
utility program in order to compile and link the
spawk
gawk dynamic extension.
Refer to the [ Download ]
page for further instructions on how to download and install SPAWK to your
local system.
Warning!
Do not remove the spawk
directory structure extracted from the SPAWK tarball
after installation, because some files will be needed for this tutorial
course to complete.
After fininsihing this tutorial course the files will no longer be needed,
so you can safely remove the whole spawk
directory structure.
The chat application
This tutorial page is about a hypothetical chat application, where users all over the world can register and start chatting to each other. Users can setup relationships to other users marking them as friends or blocking them out. Users can also send off-line messages to each other, forming something like an internal email system inside the chat application. Later on we may add rooms, dates, or other useful objects to the chat application.
Our chat application consists of the usual major components such as web server, database, server side programs, client side programs etc. In this tutorial we cope mainly with the database and some server side SPAWK programming. We'll show you how to use SPAWK effectiveley for a wide range of database operations, from creating and populating the database, to producing useful database reports and statistics.
The chat database
The whole chat application lies on a MySQL/MariaDB database, holding users, user relations, messages etc. The schema of the database is by no means complicated; there exist less than five tables with a total of no more than twenty columns.
As expected, the most significant table in our chat database is a table holding the users of the application; we name this table as the user table. Users can be related to each other, that is they can mark selected users as friends, or block other users if they feel so. User relationships are kept, of course, in another table which we name as the relation table. Users may send offline messages to each other; sent messages form the message table. For now we don't need any other tables to setup our chat database.
The schema
We quote the chat database schema in SQL/DDL format:
-- Application "chat" is about web chatting between application users. -- "chat" application, lies on the "chat" database were user data, -- user relationships, user messages etc are stored. -- (Re)create the "chat" database. DROP DATABASE IF EXISTS `chat` ; CREATE DATABASE `chat` DEFAULT CHARSET = utf8mb4 DEFAULT COLLATE = utf8mb4_general_ci ; -- Select "chat" database as the default database. USE `chat` ; -- Table "user" is the most significant table of the "chat" database. -- Each row represents a registered user of the "chat" application. CREATE TABLE `user` ( `login` VARCHAR(64) NOT NULL COMMENT 'Login name', `registration` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP COMMENT 'Registration date', `name` VARCHAR(128) NOT NULL COMMENT 'Full name', `email` VARCHAR(128) NULL DEFAULT NULL COMMENT 'e-mail', `gender` ENUM ( 'MALE', 'FEMALE' ) NULL DEFAULT NULL COMMENT 'Gender', `byear` INTEGER(4) NULL DEFAULT NULL COMMENT 'Birth year', -- user passwords are stored in SHA1 `password` CHARACTER(40) COLLATE utf8_bin NOT NULL COMMENT 'Password', PRIMARY KEY ( `login` ) USING BTREE, INDEX ( `name` ) USING BTREE, UNIQUE INDEX ( `email` ) USING BTREE ) ENGINE = InnoDB COMMENT = 'Person table' ; -- Table "relation" holds the relationships between users. -- There are three kinds of relationships between any two users: -- The users are unrelated, friends or blocked. CREATE TABLE `relation` ( `user` VARCHAR(64) NOT NULL COMMENT 'Login name', `related` VARCHAR(64) NOT NULL COMMENT 'Related login name', `relationship` ENUM ( 'FRIEND', 'BLOCKED' ) NOT NULL COMMENT 'Relationship', PRIMARY KEY ( `user`, `related` ) USING BTREE, INDEX ( `related` ) USING BTREE ) ENGINE = InnoDB COMMENT = 'Relation table' ; -- Table "message" holds the messages between users. CREATE TABLE `message` ( `id` INTEGER(9) NOT NULL AUTO_INCREMENT COMMENT 'Message ID', `sender` VARCHAR(64) NOT NULL COMMENT 'Sender', `recipient` VARCHAR(64) NOT NULL COMMENT 'Recipient', `date` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP COMMENT 'Date', `text` VARCHAR(10000) NOT NULL COMMENT 'Text', PRIMARY KEY ( `id` ) USING BTREE, INDEX ( `sender` ) USING BTREE, INDEX ( `recipient` ) USING BTREE ) ENGINE = InnoDB COMMENT = 'Message table' ; ALTER TABLE `relation` ADD FOREIGN KEY ( `user` ) REFERENCES `user` ( `login` ) ON UPDATE CASCADE ON DELETE CASCADE ; ALTER TABLE `relation` ADD FOREIGN KEY ( `related` ) REFERENCES `user` ( `login` ) ON UPDATE CASCADE ON DELETE CASCADE ; ALTER TABLE `message` ADD FOREIGN KEY ( `sender` ) REFERENCES `user` ( `login` ) ON UPDATE CASCADE ON DELETE CASCADE ; ALTER TABLE `message` ADD FOREIGN KEY ( `recipient` ) REFERENCES `user` ( `login` ) ON UPDATE CASCADE ON DELETE CASCADE ; -- Create user for generic DQL/DML access to "chat" database. DROP USER IF EXISTS 'chat'@'localhost' ; CREATE USER 'chat'@'localhost' IDENTIFIED BY 'xxx' ; GRANT SELECT, INSERT, UPDATE, DELETE ON `chat`.* TO 'chat'@'localhost' ; -- Create user for generic DQL access to "chatdql" database. DROP USER IF EXISTS 'chatdql'@'localhost' ; CREATE USER 'chatdql'@'localhost' ; GRANT SELECT ON `chat`.* TO 'chatdql'@'localhost' ; -- Create user for DDL/DQL/DML access to "chat" database. DROP USER IF EXISTS 'chatadmin'@'localhost' ; CREATE USER 'chatadmin'@'localhost' IDENTIFIED BY 'xxx' ; GRANT ALL PRIVILEGES ON `chat`.* TO 'chatadmin'@'localhost' ;
Creating the database
To create the database locally, you must download the file and run the following command as the root database user:
mysql -u root -p --force <schema.sql
However, there is no need to download the above file because this file
is included in the spawk
tarball downloaded earlier.
The file schema.sql
is located in the chat
directory under the
database
directory.
Warning!!!
If you have an existing chat database in your system, it will be deleted!
Database users chat and chatadmin will be dropped too!
Populating the database
After creating the chat database and the chat and chatadmin database users, the application is ready to accept new user registrations. After registering at the application, users will start chatting, developing relationships and sending messages to each other and after a while we will have enough data in the chat database for testing and demonstrating SPAWK. However, we cannot wait for users to register in the chat application, moreover we have not setup any web server, neither we have developed any server or client side programs yet for the chat application to be up and running, so we decide to populate the database manually using existing data files.
In order to populate the chat database with test data, visit once again
the spawk
directory extracted from the downloaded SPAWK tarball.
There exists a database
directory, and a chat
subdirectory under
the database
directory.
In this directory there exist two data files namely the user.data
containing
2000 test user data, and the relation.data
containing 30000 relationship pairs
between users.
Our intention is to load these data files into the database by using SPAWK
for the first time!
In the same directory there exists an AWK script named populate.awk
.
Let's take a look:
@load "../../src/spawk/spawk" BEGIN { FS = "\t" spawk_sesami["dbuser"] = "chat" spawk_sesami["dbpassword"] = "xxx" spawk_sesami["dbname"] = "chat" } NF == 7 { nf = 1 login = $(nf++) registration = $(nf++) name = $(nf++) email = $(nf++) gender = $(nf++) byear = $(nf++) password = $(nf++) email = email ? spawk_escape(email) : "NULL" gender = gender ? spawk_escape(gender) : "NULL" byear = byear ? byear : "NULL" if (spawk_submit("INSERT INTO `user` (" \ "`login`, " \ "`registration`, " \ "`name`, " \ "`email`, " \ "`gender`, " \ "`byear`, " \ "`password`" \ ") VALUES (" \ spawk_escape(login) ", " \ "FROM_UNIXTIME(" registration "), " \ spawk_escape(name) ", " \ email ", " \ gender ", " \ byear ", " \ spawk_escape(password) \ ")") != 2) next if (!(++ucount % 1000)) print ucount, "users inserted" next } NF == 3 { if (spawk_submit("INSERT INTO `relation` (" \ "`user`, " \ "`related`, " \ "`relationship`" \ ") VALUES (" \ spawk_escape($1) ", " \ spawk_escape($2) ", " \ spawk_escape($3) \ ")") != 2) next if (!(++rcount % 1000)) print rcount, "relations inserted" next } { print $0 ": syntax error" >"/dev/stderr" }
The test data files are the user.data
file containing 2000 users and the
relation.data
file containing 30000 relationships.
It's obvious that the relationships must be loaded after the users, because of
the foreign key constraints between the corresponding tables.
user.data
file has 7 tab separated fields, while
relation
file has 3 tab separated fields:
spawk_escape
function takes a string argument and enclose this
string in single quotes, after escaping single quotes found in the string,
e.g. spawk_escape("panos")
returns the string 'panos'
,
while spawk_escape("smith's")
returns the string 'smith\'s'
.
The data files
File user.data
:
fandrusyak0 1201744766 Freddie Andrusyak fandrusyak0@usnews.com MALE 1950 ce8b6645765ade3978743c7699314f5f40c4d9ac mberns1 1386296786 Morton Berns mberns1@hubpages.com MALE 2010 7ad26d2cb2fff6105a027bf26f49d2829994836c rgerard2 1452009325 Ricard Gerard rgerard2@upenn.edu MALE 1980 3d9431a625a1a468f67a4f06d51c890941c70046 aakerman3 1103639500 Al Akerman aakerman3@goo.gl MALE 1958 e36fcf09ca1d4cbdc3c0b30ca9bd211c12b6a96f pgrimolbie4 1358742544 Pincus Grimolbie pgrimolbie4@mayoclinic.com MALE 1984 dc268bc27174b8f8619f1266e6fd259d7895a06e mwarlaw5 1476605358 Manfred Warlaw mwarlaw5@deviantart.com MALE 1984 731bab1308a3c1e095e033390d8baf1ec8832fc2 abawdon6 1345679245 Alon Bawdon abawdon6@msu.edu MALE 1962 7814e192fd6b0a44a5d1503550a802742a89b2ae bullyatt7 1114354347 Bevon Ullyatt bullyatt7@joomla.org MALE 1980 0a969515bbd302125105fa0946c3e4f062abf97e aspridgen8 1121667581 Alaric Spridgen aspridgen8@list-manage.com MALE 1998 f2af7ca7654bf83e70f1af35937562c0caa3319a vrayment9 1424094597 Vincenz Rayment vrayment9@google.com.hk MALE 1953 de68fbd1e262a988869c4b03015abc8a9d88f293 ...
File relation.data
:
mberns1 dinger59 BLOCKED mberns1 cdackombe4f FRIEND mberns1 vgyenesfz FRIEND mberns1 mwillougheyii FRIEND mberns1 fcatterall5n FRIEND mberns1 tsiredio FRIEND mberns1 obountiffrm FRIEND mberns1 csamsinov FRIEND mberns1 ccraydenbo BLOCKED mberns1 mrugge6p FRIEND ...
Before using SPAWK to load the data in the database, it's good to trace the
populate.awk
script one section at a time:
SPAWK setup
@load "spawk" BEGIN { FS = "\t" spawk_sesami["dbuser"] = "chat" spawk_sesami["dbpassword"] = "xxx" spawk_sesami["dbname"] = "chat" } ...
The very first line of the script is just the inclusion of the spawk
dynamic extension. The extension has already been installed, possibly at the
/usr/local/lib/gawk directory. This directory is searched by default
from AWK, as AWKLIBPATH
default value is "/usr/local/lib/gawk
"
as stated in the relative
section of the AWK's user guide.
Just after loading the spawk
extension, the BEGIN
section
follows. Our test data files contain lines of tab separated fields, so we set the input
fields separator to the tab character.
Afterwards follows the spawk_sesami
array setup, where the database access
credentials are set in order to use them later, whenever a database access is needed
to start a new client.
Filtering user data
... NF == 7 { nf = 1 login = $(nf++) registration = $(nf++) name = $(nf++) email = $(nf++) gender = $(nf++) byear = $(nf++) password = $(nf++) email = email ? spawk_escape(email) : "NULL" gender = gender ? spawk_escape(gender) : "NULL" byear = byear ? byear : "NULL" ...
This part of the script uses AWK pattern NF == 7
to filter input lines
containing just 7 fields, namely the login name, the registration timestamp,
the full name, the email address, the gender, the birth year
and the password (in SHA1 form), from the user.data
file.
The email, gender and byear fields are allowed to be of null value,
so we set these values to the string NULL
whenever these
fields are empty.
Submitting queries
... if (spawk_submit("INSERT INTO `user` (" \ "`login`, " \ "`registration`, " \ "`name`, " \ "`email`, " \ "`gender`, " \ "`byear`, " \ "`password`" \ ") VALUES (" \ spawk_escape(login) ", " \ "FROM_UNIXTIME(" registration "), " \ spawk_escape(name) ", " \ email ", " \ gender ", " \ byear ", " \ spawk_escape(password) \ ")")) next if (!(++ucount % 1000)) print ucount, "users inserted" next ...
The above script section is the first SQL script to be submitted from AWK
via the SPAWK API. The function spawk_submit
takes a string argument,
that is the query to be submitted to the MySQL/MariaDB database server.
The specific SQL script is a DML script;
DML stands for Data Manipulation Language (INSERT, UPDATE, DELETE etc)
in contradistinction with DQL which stands for Data Query Language
(SELECT, SHOW etc).
DML queries don't produce any result rows, while DQL queries produce result rows.
When submitting a query via spawk_submit
function, a value is returned
to the caller; a value of 0 means that the query was processed successfully,
while non-zero values signal query failure.
This way one can check the returned value in order to proceed with further actions
or not.
In our case we skip to the next input line on failure, othewise we increment
a counter (ucount
) and print a porogress message for every 1000 users
successfully inserted.
Of course, the fact that some of the user lines may fail to be inserted into the database,
will cause problems later during the insertion of the relationships involving
the rejected users.
... NF == 3 { if (spawk_submit("INSERT INTO `relation` (" \ "`user`, " \ "`related`, " \ "`relationship`" \ ") VALUES (" \ spawk_escape($1) ", " \ spawk_escape($2) ", " \ spawk_escape($3) \ ")")) next if (!(++rcount % 1000)) print rcount, "relations inserted" next } ...
The above section refers to lines with 3 fields, that is lines of
the relation.data
file which contains 30000 user relationships.
For such lines an insertion SQL query is submitted, just as we did with the
user lines, and if the insertion fails we skip to the next input line,
else another counter is incremented and a progress message is printed every 1000
relationship rows inserted successfully.
{ print $0 ": syntax error" >"dev/stderr" }
The last part of the script has nothing to do with SPAWK but it's good to catch up input lines that do not conform to the given patterns of 7 and 3 fields per line. Now it's time to actually run the script and populate the database:
awk -f populate.awk user.data relation.data1000 users inserted 2000 users inserted 1000 relations inserted 2000 relations inserted 3000 relations inserted ... 29000 relations inserted 30000 relations inserted
That's it! Our chat database has just been populated with 2000 users and 30000 relationships, so let the games begin!
Counting users, relationships etc
Now that our chat database is populated with enough users and rleationships, we can run SPAWK scripts in order to obtain useful information. Most of this information can be extracted using plain vanilla SQL scripts, or other methods such as PHP MySQL API, Node.js etc, but SPAWK may save you a lot of time if you know how to use AWK effectively.
Counting genders
For start let's count male and female users using plain SQL:
SELECT `gender`, COUNT(*) FROM `user` GROUP BY `gender` ;
Use the standard MySQL/MariaDB database client mysql
to run the above script:
mysql --user="chat" --password="xxx" chat <gender.sql
and produce the following results:
gender COUNT(*) NULL 153 MALE 925 FEMALE 922
That is 925 male users, 922 female users and 153 users that have not specified their gender. Now let's submit the very same query via SPAWK:
@load "spawk" BEGIN { spawk_sesami["dbuser"] = "chat" spawk_sesami["dbpassword"] = "xxx" spawk_sesami["dbname"] = "chat" spawk_submit("SELECT `gender`, COUNT(*) FROM `user` GROUP BY `gender`", 0) while (spawk_fetchrow(count)) print count[0] }
Running the above script with SPAWK:
awk -f gender.awk
produces the following results:
153 MALE 925 FEMALE 922
Counting relationships by gender
As you might have been thinking, it's not of much use to run SPAWK instead of plain SQL just to count users by gender. The results are the same and we used SPAWK just to submit a query and display the result rows. But what if we make our query a little harder for plain SQL to carry out? Assume for example that we want to count relationships grouped by users' gender. Unless you are some kind of SQL wizard, you may find it difficult to produce that kind of report using plain SQL. So let's try SPAWK instead:
@load "spawk" BEGIN { OFS = "\t" spawk_sesami["dbuser"] = "chat" spawk_sesami["dbpassword"] = "xxx" spawk_sesami["dbname"] = "chat" spawk_submit("SELECT `login`, `gender` FROM `user`") while (spawk_fetchrow(user)) { spawk_submit("SELECT COUNT(*) FROM `relation` " \ "WHERE `user` = " spawk_escape(user[1])) spawk_fetchone(relcnt) count[user[2]] += relcnt[1] } for (gender in count) print gender, count[gender] }
Run the above script with SPAWK:
awk -f relgender.awk
and you'll get the following results:
2327 MALE 13936 FEMALE 13737
Counting friends/blocked relationships by gender
Let's make it even harder by adding the relatioship kind to the previous report. We don't want just how many relationships exist for the two genders, but we also want the kind of those relations.
@load "spawk" BEGIN { OFS = "\t" spawk_sesami["dbuser"] = "chat" spawk_sesami["dbpassword"] = "xxx" spawk_sesami["dbname"] = "chat" spawk_submit("SELECT `login`, `gender` FROM `user`") while (spawk_fetchrow(user)) { spawk_submit("SELECT `relationship`, COUNT(*) " \ "FROM `relation` WHERE `user` = " \ spawk_escape(user[1]) " GROUP BY `relationship`") while (spawk_fetchrow(fbcnt)) count[user[2] OFS fbcnt[1]] += fbcnt[2] } for (fbgender in count) print fbgender, count[fbgender] }
After running the above script with SPAWK we get the following results:
BLOCKED 893 FRIEND 1434 MALE FRIEND 8318 MALE BLOCKED 5618 FEMALE FRIEND 8177 FEMALE BLOCKED 5560
Popular and toxic users
We are now going to write a SPAWK script that scans user login names and prints popular users to the standard output sorted by popularity. Users considered popular if they have been marked more times as friend than as blocked by other users. Popularity is the count of friend type relationships for a given user.
@load "spawk" BEGIN { OFS = "\t" spawk_sesami["dbuser"] = "chat" spawk_sesami["dbpassword"] = "xxx" spawk_sesami["dbname"] = "chat" spawk_submit("SELECT `login` FROM `user`") while (spawk_fetchrow(user)) process_user(user[1]) } function process_user(login, relation, count) { spawk_submit("SELECT `relationship` FROM `relation` " \ "WHERE `related` = " spawk_escape(login)) while (spawk_fetchrow(relation)) count[relation[1]]++ if (count["FRIEND"] > count["BLOCKED"]) print login, count["FRIEND"] | "sort -k2n" }
After running the above script with SPAWK we get the following results:
acasley1u 3 glilleee 3 alinnitthk 4 cfarhertyb 4 ... jmcgraead 18 bbetje58 19 glaurentmq 19 rgagenfe 19 rvagg28 19 twickesb2 19 pcamamilena 23
To print toxic users instead of popular ones, just change the last lines to:
... if (count["FRIEND"] < count["BLOCKED"]) print login, count["BLOCKED"] | "sort -k2n" ...
Filtering popular/toxic users
Now it's time to write a filter for filtering login user names supplied as the first input field, in order to print only popular or only toxic users. By default our script will filter popular users read from the input stream; to print the toxic users we must use some kind of a flag variable.
@load "spawk" BEGIN { OFS = "\t" spawk_sesami["dbuser"] = "chat" spawk_sesami["dbpassword"] = "xxx" spawk_sesami["dbname"] = "chat" # Flag "poptox" is either "popular" or "toxic" and # can be specified in the command line. In case not # specified, then "poptox" is considered "popular". if (!poptox) poptox = "popular" } { delete count spawk_submit("SELECT `relationship` FROM `relation` " \ "WHERE `related` = " spawk_escape($1)) while (spawk_fetchrow(relation)) count[relation[1]]++ if (poptox == "popular") dif = count["FRIEND"] - count["BLOCKED"] else dif = count["BLOCKED"] - count["FRIEND"] if (dif > 0) print }
To filter popular users read from standard input, run:
awk -f poptox.awk
or
awk -v poptox="popular" -f poptox.awk
The program reads login names from the input, so you have to supply login names in order to get results in the output. We don't have any convenient list of login names to test our program, so let's create one:
@include "spawk.awk" BEGIN { spawk_sesami("chat", "xxx", "chat") spawk_submit("SELECT `login`, UNIX_TIMESTAMP(`registration`) FROM `user`") while (spawk_fetchrow(user)) { if ((user[2] % 100) > 95) print user[1] } }
We use the registration date as an epoch timestamp to filter a random set of user login names:
awk -f urand.awk >test1
Running the above script gives a total of 69 user login names in the file
test1
.
Let's dance!
To filter out the popular users from the test1
file:
awk -f poptox.awk test1
There exist 53 popular users in our list. For the toxic users run:
awk -v poptox="toxic" -f poptox.awk test1
There are just 11 toxic users in the list.
Let's use comm
to check the results:
comm -12 <(awk -f poptox.awk test1 | sort) <(awk -v poptox="toxic" -f poptox.awk test1 | sort)
We see no results, so there are exist no users filtered both as popular or as toxic
by the poptox.awk
script and that's a good sign.
No let's extract the users of the test1
list that didn't
filtered neither as popular nor as toxic users:
comm -23 <(sort test1) <(sort <(awk -f poptox.awk test1) <(awk -v poptox="toxic" -f poptox.awk test1))
And voilĂ ! The five missing users are:
abisgrovenp ajzhakovdp gcoramk9 hjozsefd ksummerill4x
Astern propulsion
Now let's use SPAWK one more time to check the relationships of those low profile users that seem to have equal number of friends and blockers:
@load "spawk" BEGIN { OFS = "\t" spawk_sesami["dbuser"] = "chat" spawk_sesami["dbpassword"] = "xxx" spawk_sesami["dbname"] = "chat" } { delete count spawk_submit("SELECT `relationship` FROM `relation` " \ "WHERE `related` = " spawk_escape($1)) while (spawk_fetchrow(relation)) count[relation[1]]++ print $1, count["FRIEND"], count["BLOCKED"] }
Running the above script for the five non popular neither toxic users found in our list, we get the expected results of equal number of friend/blocked relationships for each one of these users:
abisgrovenp 9 9 ajzhakovdp 5 5 gcoramk9 10 10 hjozsefd 8 8 ksummerill4x 10 10
Recreating the chat database with SPAWK
Now it's time to use SPAWK for submitting DDL queries.
For this we recreate the chat database using SPAWK.
As you might have already noticed, the schema.sql
file follows a simple rule:
the query terminator character ";
" is always placed at a line by itself.
Following this simple rule we can write a SPAWK script to create the chat
database and the relevant database users:
@load "spawk" # We intend to create a new database, so we have to use an administrative # database account. BEGIN { spawk_sesami["dbuser"] = "root" spawk_sesami["dbpassword"] = spawk_getpass("Enter root password: ") } # Skip blank lines. /^[[:space:]]*$/ { next } # Skip comment lines. /^[[:space:]]*--.*$/ { next } # Line containing just a ";" character are considered to be query # terminators, so run the query constructed already. /^[[:space:]]*;[[:space:]]*$/ { spawk_submit(query) # After submitting the query, reset the query to an empty string. query = "" next } # Add this line to the current query. { query = query " " $0 }
Run the above query using SPAWK:
awk -f create.awk schema.sqland populate the database:
awk -f populate.awk user.data relation.data