Author:
- Name: Andreas Gustafsson
Location: FI - Republic of Finland (Finland)
To build:
make all
NOTE: we FORCE DISABLE the optimiser because having it enabled causes some invocations to crash in some systems.
Bugs and (Mis)features:
The current status of this entry is:
STATUS: uses gets() - change to fgets() if possible
STATUS: INABIAF - please DO NOT fix
For more detailed information see 1992/gson in bugs.html.
To use:
./ag word word2 word3 < dictionary | less
Try:
./try.sh
The script first will use the mkdict.sh script to create a dictionary file out of the files README.md, try.sh (itself) and Makefile runs the program with a number of strings on that dictionary file.
Then the script will determine (or try and determine) where your system dictionary is located and assuming that it can find one it’ll use that. It checks the following locations though the last one is more ironic:
/usr/share/dict/words
/usr/share/lib/spell/words
/usr/ucblib/dict/words
/dev/null
# <– for machines with nothing to say
If a file is found then it will run the same process as above. The reason the made up dictionary is used first is it provides shorter and a bit more fun output.
Judges’ remarks:
Recently some newspapers printed amusing anagrams of one of the names listed above. Run this program to find the anagrams they weren’t allowed to print!
The author provided an obfuscated script that can be used to construct dictionaries which has been put in as mkdict.sh. To use try:
cat README.md | ./mkdict.sh > words
Then try using the program as shown above with the file words
.
Author’s remarks:
The name of the game:
AG
is short for either Anagram Generator
or simply AnaGram
. It might also be
construed to mean Alphabet Game
, and by pure coincidence it happens to be the
author’s initials.
What it does
AG
takes one or more words as arguments, and tries to find anagrams of those
words, i.e. words or sentences containing exactly the same letters.
How to use it
To run AG
, you need a dictionary file consisting of distinct words in the
natural language of your choice, one word on each line. If your machine doesn’t
have one already, you can make your own dictionary by concatenating a few
hundred of your favourite Usenet articles and piping them through the following
obfuscated shell script:
#!/bin/sh
z=a-z];tr [A-Z\] \[$z|sed s/[\^$z[\^$z*/_/g|tr _ \\012|grep ..|sort -u
Using articles from alt.folklore.computers
is likely to make
a more professional-looking dictionary than rec.arts.erotica
.
AG
must be run with the dictionary file as standard input.
Because anagrams consisting of just a few words are generally more
meaningful than those consisting of dozens of very short words, the
number of words in the anagrams is limited to 3 by default. This
limit can be changed using a numeric command line option, as in
./ag -4 international obfuscated c code contest </usr/dict/words
.
Bugs and limitations
- There is no error checking.
- Standard input must be seekable, so you can’t pipe the dictionary into
AG
. - The input sentence and each line in the dictionary may contain at most 32 distinct letters, and each letter may occur at most 15 times.
- Words in the dictionary may be at most 255 bytes long.
AG
cannot handle characters that sign-extend to negative values.- Although
AG
works on both 16-bit and 32-bit machines, the size of the problems it can solve is severely limited on machines that limit the stack size to 64k or less.
NOTICE to those who wish for a greater challenge:
If you want a greater challenge, don’t read any further: just try to understand the program via the source.
If you get stuck, come back and read below for additional hints and information.
Obfuscatory notes
As you can see, AG
takes advantage of the new ‘92 whitespace rules’ to
achieve a clear, readable, self-documenting layout. The identifiers
have been chosen in a way appropriate for an alphabet game, and common
sources of bugs such as goto statements and malloc/free have been
eliminated. As AG
also refrains from abusing the preprocessor, it
doesn’t really have much to offer in terms of “surface obfuscation”.
Instead, it tries to achieve both its speed and its obscurity through a
careful choice of algorithms. Some of the finer points of those
algorithms are outlined the section below.
How this entry works:
Here follows a description of some of the data structures and
algorithms used by AG
. It is by no means complete, but it may help
you get an idea about the general principles.
Internally, AG
represents words and sentences as arrays of 32
4-bit integer elements. Each element represents the number of
times a letter occurs in the word/sentence. There are 32 elements
because 32 is a convenient power of two larger than the number of
letters in most western alphabets, and the elements are 4 bits
each because the same letter is unlikely to occur more than 15
times in a practical anagram generation problem.
These 32*4
-bit arrays are actually stored in memory in a
“bit-transposed” format, as arrays of four long
values. It is
assumed that a long
is at least 32 bits. The first 4-bit letter
count is formed by the least significant (2^0
) bit in each of the
four long
s, the next one is formed by the 2^1
bits, etc.
This storage format makes it possible to add or subtract two such vectors of 32 4-bit values in parallel by simulating a set of 32 binary full adders in software using bitwise logical operations. E.g., all the LSBs of the result are formed in parallel by taking the exclusive OR of the LSBs in each summand, and 32 carry bits are formed in parallel in a similar way using a logical AND. Thus, 32 independent 4-bit additions can be performed by just four iterations of a loop containing some 32-bit bitwise logical operations, but no arithmetic operations other than those implied by array indexing.
Subtraction works similarly, and in fact AG
only implements
subtraction directly, handling addition by means of the identity
a+b = a-(0-b)
.
In addition to this 32*4
-bit representation, AG
also forms a so-called
“signature” that is the bitwise OR of the four long
s, which is
equivalent to saying that the signature of a word contains a logical 1
in the bit positions corresponding to letters occurring at least once
in that word.
The first thing AG
does is to construct a lookup table of 256
long
s, one for each 8-bit character value. The entry for a
character will be zero if that character doesn’t appear in the
sentence given on the command line, or it will have a single bit
set if the character does appear in the sentence. By adding
together the bit masks for all the letters in the input sentence
using the transpose addition method described above, AG
forms the
32*4
bit array representation of the input sentence.
The next action performed is reading the dictionary. Those words that
contain letters not in the input sentence are immediately discarded.
Words containing the right letters but in excessive numbers are
eliminated in a separate check involving the 32*4
bit array.
The remaining words, which will be referred to as “candidate words”,
are stored in 32*4
-bit representation, together with their signatures
and offsets into the dictionary file so that the plain-text version of
a word can later be found for printing. This information is kept in a
local struct
in the dictionary-reading function, and memory is
allocated for each candidate word simply by making another recursive
call to that function.
Each struct
so allocated is linked into a fixed-size hash table of
4096 entries indexed by the 12 low bits of the word’s signature.
When the dictionary-reading function encounters end-of-file, all the
candidate words have been stored in nested activation records on the
stack, accessible through the hash table.
Generating the anagrams is then done by traversing the hash table and subtracting the letters of each word in the hash table from the “current sentence”, which initially is the sentence given on the command line.
The subtraction is performed in parallel on the 4-bit letter counts
as described above, and if all 32 results are zero, an anagram has
been found. If the result is negative for one or more of the letters
(as indicated by one or more 1
s in a vector of 32 borrow bits
returned by the subtraction routine), the word did not match the
current sentence and is ignored. Finally, if the result contained
only non-negative letter counts, we have found a partial anagram:
a word containing some, but not all, of the letters in the current
sentence. In this case we recursively try to find an anagram of the
remaining letters. The depth of the recursion is limited to the
maximum number of words in the anagram, as specified by the user.
When the deepest recursion level has been reached, an optimization can
be applied: because no further recursion will be done, there is no
need to look for partial anagrams, and therefore AG
only needs to
check for words that contain exactly the same letters as the current
sentence. Those words can be found simply by indexing the hash table
with the signature of the current sentence.
Even when not on the deepest recursion level, AG
generally avoids
examining all the entries of the hash table. The idea is that we are
not interested in hash buckets whose words contain any letters not
in the current sentence; these buckets are exactly those whose index
has a logical one in a bit position where the signature of the current
sentence has a zero. Put another way, we want to loop through only
those hash bucket indices i
that contain zeroes in all the bit
positions where the signature s
of the current sentence contains
a zero; this can be expressed in C as (i & ~s == 0
).
It is possible to loop through all such numbers in an efficient way by
taking advantage of certain properties of binary arithmetic: by
forcing the bits corresponding to zeroes in s
to ones, we can make
the carries generated in incrementing i
propagate straight across
those bits that should remain zero. For example, the following
program prints all those 16-bit integers that contain zeroes in all
even bit positions:
main(){int i=0,s=0xAAAA;do{printf("%04x\t",i);}while(i=((i|~s)+1)&s);}
AG
uses a similar method but works in the opposite direction, finding
the next lower value with zeroes in given bit positions by propagating
borrows across those bits. Some additional adjustments are made
to the hash table index when initiating a recursive search, using
similar bit-twiddling techniques.
Whenever an anagram has been found, it is printed by traversing a
linked list formed by struct
s in the activation records of the
recursive invocations of the search function, seeking to the beginning
of the word matched by that invocation, and copying the characters of
the word directly from standard input to standard output.
Inventory for 1992/gson
Primary files
- gson.c - entry source code
- Makefile - entry Makefile
- gson.orig.c - original source code
- try.sh - script to try entry
- mkdict.sh - author supplied obfuscated script to make a dictionary
Secondary files
- 1992_gson.tar.bz2 - download entry tarball
- README.md - markdown source for this web page
- .entry.json - entry summary and manifest in JSON
- .gitignore - list of files that should not be committed under git
- .path - directory path from top level directory
- index.html - this web page