Parsing actually starts to get interesting with tokenisation. This is where input is finally converted into a form that the players couldn't have typed in directly.
Here's the gist of it. You take a pre-processed input line and chunk it into units called symbols. A symbol is one of:
1. Anything between matching quotation marks, or between an unmatched quotation mark and the end of the line. Example: "this is a string".
2. Any other piece of punctuation. Examples: ? ; , ! .
3. Any series of digits not bounded by alphabetical characters. Examples: 7, 26. Minus signs can count as part of integers too.
4. Any series of alphanumerical characters. Examples: WALK, DRAGON, COIN22.
5. Whatever special characters you use to talk to the tokenisation process directly. I'll discuss these later.
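The first four rules can be sketched with a single regular expression. This is only an illustration in Python, not a definitive implementation: the name symbols and the group names are made up for the example, input is assumed to be plain ASCII, a leading minus sign is always treated as part of the integer that follows it, and rule 5 is left out because those special characters are discussed later.

```python
import re

# One alternative per rule; order matters, because the first alternative
# that matches at a given position wins.
SYMBOL_RE = re.compile(r'''
    (?P<string>"[^"]*"?)                 # rule 1: closing quote is optional
  | (?P<integer>-?\d+(?![A-Za-z0-9]))    # rule 3: digits not running into letters
  | (?P<word>[A-Za-z0-9]+)               # rule 4: alphanumeric run
  | (?P<punct>[^\sA-Za-z0-9])            # rule 2: any other single character
''', re.VERBOSE)

def symbols(line):
    """Chunk a pre-processed input line into (kind, text) pairs."""
    return [(m.lastgroup, m.group()) for m in SYMBOL_RE.finditer(line)]

print(symbols('SAY "hi there" 7 COIN22!'))
# [('word', 'SAY'), ('string', '"hi there"'), ('integer', '7'),
#  ('word', 'COIN22'), ('punct', '!')]
```

String symbols keep their quotation marks here; stripping them is left to whatever builds the tokens.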
OK, now what you need from the tokeniser is a list of tokens. These are nodes that represent multiple parts of speech (nouns, verbs etc.), of which the main part of the parser can then attempt to make sense. They usually consist of three elements:
1. A type identifier, so you know what kind of token it is.
2. Data (for freestyle tokens).
3. A set of parts of speech that the token can take.
For strings, the type will be some predefined constant, such as T_STRING, StringType or whatever your naming convention decrees. The data will be the body of the string, e.g. WHAT?!!. The set of parts of speech will contain some representation for nouns, and maybe also for verbs. I'll write this as [noun, verb]. Don't panic, I shall explain parts of speech in detail when I reach the main parsing process in a later article.
For integers, the type will be T_INTEGER or IntegerType or whatever, and the data will be a number such as 142857. The set of parts of speech will be at least [noun, adjective], with maybe some others thrown in too.
Punctuation marks will have their own predefined nodes. You can look them up in a table; it's simple enough. If you like, you can point some of them to the same record, e.g. question marks and exclamation marks could be mapped to the same node as a full stop (my apologies to American readers, I know you call these latter two "exclamation points" and "periods").
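As a rough sketch of the three-element token described above, here is one way it might look in Python. T_STRING and T_INTEGER come from the text; T_PUNCT, the Token class, the helper functions and the choice to give punctuation an empty set of parts of speech are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import FrozenSet, Union

T_STRING, T_INTEGER, T_PUNCT = "T_STRING", "T_INTEGER", "T_PUNCT"

@dataclass(frozen=True)
class Token:
    type: str                            # what kind of token this is
    data: Union[str, int, None] = None   # payload for freestyle tokens
    pos: FrozenSet[str] = frozenset()    # parts of speech it can take

def string_token(body):
    return Token(T_STRING, body, frozenset({"noun", "verb"}))

def integer_token(text):
    return Token(T_INTEGER, int(text), frozenset({"noun", "adjective"}))

# Predefined punctuation nodes; "?" and "!" share the full stop's record.
FULL_STOP = Token(T_PUNCT, ".")
COMMA = Token(T_PUNCT, ",")
PUNCTUATION = {".": FULL_STOP, "?": FULL_STOP, "!": FULL_STOP, ",": COMMA}
```

With this layout, PUNCTUATION["?"] and PUNCTUATION["."] are the very same object, so the main parser never needs to know which mark the player actually typed.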
This brings us to words...
The Vocabulary
Words must be translated into atoms (from the inheritance hierarchy, as I described earlier in this set of articles). The data structure linking the two is the vocabulary. This consists of a symbol table that connects words, parts of speech (PoS) and atoms. Here's an extract showing what a vocabulary might contain:
Word PoS Atom Comment
<eat verb eat>
<egg noun egg>
<hit verb hit>
<orange adjective orange_colour> the colour
<orange noun orange> the fruit
<box verb hit> as in the sport of boxing
<box noun box> the container
If a player typed HIT ORANGE BOX then the tokeniser would need to look up all definitions of each word and the appropriate possible meanings, i.e.:
HIT <verb hit>
ORANGE <adjective orange_colour><noun orange>
BOX <verb hit><noun box>
This is done by means of a dictionary mechanism. I'm not going to go into the details of writing one of these, since dictionaries are fairly common data structures. If you're not using one from a library, a hash table with binary tree overflow usually does the business. So long as you have a reasonably efficient mechanism by which a word can be used to retrieve its matching record, that's enough.
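To make the lookup concrete, here is a minimal sketch using Python's built-in dict (itself a hash table) as the library dictionary. The entries are the ones from the extract above; the names VOCABULARY, build_index and lookup are invented for the example, and lower-casing words before lookup is an assumption about what the pre-processing stage has or hasn't done.

```python
from collections import defaultdict

# (word, part of speech, atom), as in the vocabulary extract.
VOCABULARY = [
    ("eat", "verb", "eat"),
    ("egg", "noun", "egg"),
    ("hit", "verb", "hit"),
    ("orange", "adjective", "orange_colour"),
    ("orange", "noun", "orange"),
    ("box", "verb", "hit"),
    ("box", "noun", "box"),
]

def build_index(entries):
    """Group every (PoS, atom) definition under its word."""
    index = defaultdict(list)
    for word, pos, atom in entries:
        index[word.lower()].append((pos, atom))
    return dict(index)

def lookup(index, word):
    """Return all possible meanings of a word, or [] if it is unknown."""
    return index.get(word.lower(), [])

index = build_index(VOCABULARY)
for word in "HIT ORANGE BOX".split():
    print(word, lookup(index, word))
# HIT [('verb', 'hit')]
# ORANGE [('adjective', 'orange_colour'), ('noun', 'orange')]
# BOX [('verb', 'hit'), ('noun', 'box')]
```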
There are two further points to consider about vocabularies. Firstly, you might want to add a fourth component to each tuple to represent privilege level. If there are some commands that only admins should have, then there's no reason these should be in the vocabulary for non-admins: it adds an extra level of security.
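One way the privilege component might work is to build a separate, filtered vocabulary per privilege level. The sketch below assumes just two levels, represented as integers, and uses a made-up admin-only command called shutdown; none of these names come from the articles.

```python
PLAYER, ADMIN = 0, 1   # hypothetical privilege levels

# (word, part of speech, atom, minimum privilege required)
ENTRIES = [
    ("hit", "verb", "hit", PLAYER),
    ("box", "noun", "box", PLAYER),
    ("shutdown", "verb", "shutdown", ADMIN),   # made-up admin command
]

def vocabulary_for(entries, level):
    """Build the vocabulary visible to someone at the given level."""
    vocab = {}
    for word, pos, atom, required in entries:
        if level >= required:
            vocab.setdefault(word, []).append((pos, atom))
    return vocab

print("shutdown" in vocabulary_for(ENTRIES, PLAYER))   # False
print("shutdown" in vocabulary_for(ENTRIES, ADMIN))    # True
```

Because the word is simply absent from an ordinary player's vocabulary, the tokeniser treats it like any other unknown word, and the command's code never has to be reached to refuse it.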
Secondly, some links need only be unidirectional. In the above example, the verb for BOX is just a synonym that points to the same atom as HIT. If during execution of [hit]() you wished to refer to the issuing command by looking backwards through the vocabulary, you wouldn't want it to come up with BOX. Therefore, some kind of flag to note that a command is a synonym or an abbreviation is also in order.
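Here is a small sketch of such a flag in use. The boolean fourth field and the function word_for are assumptions for the example; the synonym is deliberately listed first to show that the backwards lookup skips it.

```python
# (word, part of speech, atom, is_synonym)
ENTRIES = [
    ("box", "verb", "hit", True),    # synonym: forward lookup only
    ("hit", "verb", "hit", False),
    ("box", "noun", "box", False),
]

def word_for(entries, pos, atom):
    """Look backwards from a (PoS, atom) pair to its primary word."""
    for word, entry_pos, entry_atom, is_synonym in entries:
        if (entry_pos, entry_atom) == (pos, atom) and not is_synonym:
            return word
    return None

print(word_for(ENTRIES, "verb", "hit"))   # hit
```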
Aside: if you did want [hit]() to refer to BOX then you would use
<box verb box_hit>
which when invoked would be [box_hit](). If box_hit were declared as a subclass of hit, then the code which was invoked would be the same as for [hit]() but when the action/verb atom was referred to it would come up as box_hit.
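The subclassing idea can be mimicked with ordinary Python classes. This is only an analogy for the atom hierarchy, not how that hierarchy is actually implemented: the classes Hit and BoxHit, the execute method and the VERBS table are all invented for the illustration.

```python
class Hit:
    atom = "hit"

    def execute(self, target):
        # Shared code; self.atom reports whichever atom was invoked.
        return f"[{self.atom}] {target}"

class BoxHit(Hit):
    # Inherits execute() unchanged, but is known by its own atom.
    atom = "box_hit"

# Hypothetical verb table: the word BOX now maps to box_hit, not hit.
VERBS = {"hit": Hit, "box": BoxHit}

print(VERBS["hit"]().execute("orange"))   # [hit] orange
print(VERBS["box"]().execute("orange"))   # [box_hit] orange
```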