Sample Project #1

Purpose:

The four sample projects focus on building software tools for code analysis in a series of steps, one step for each project. We will emphasize C++ code but want our tools to be easily extendable to other similar languages like C# and Java.

Code analysis consists of extracting lexical content from source code files, analyzing the code's syntax from its lexical content, and building an Abstract Syntax Tree (AST) that holds the results of our analysis. It is then fairly easy to build several backends that can do further analyses on the AST to construct code metrics, search for particular constructs, or examine other interesting features of the code.

You will find it useful to look at the Parsing blog for a brief introduction to parsing and code analysis.

In this first project we will build and test a lexical scanner in C++ that consists of two packages:

Tokenizer
extracts words, called tokens, from a stream of characters. Token boundaries are white-space characters, transitions between alphanumeric and punctuator characters, and comment and string boundaries. Certain classes of punctuator characters belong to single character or two character tokens so they require special rules for extraction.
SemiExpression
groups tokens into sets, each of which contains all the information needed to analyze some grammatical construct without containing extra tokens that have to be saved for subsequent analyses. SemiExpressions are determined by special terminating characters: semicolon, open brace, closed brace, newline when preceded on the same line by '#', and colon when preceded by one of the three tokens "public", "protected", or "private".
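
The SemiExpression termination rules above can be sketched as a small grouping function. This is only an illustration under stated assumptions, not the required design: the name groupIntoSemiExps and the Tokens alias are hypothetical, the tokens are assumed to arrive already split (including "\n" as its own token), and newlines that do not end a '#' line are simply dropped.

```cpp
#include <string>
#include <vector>

using Tokens = std::vector<std::string>;  // hypothetical alias for a token sequence

// Groups an already tokenized stream into SemiExpressions.
// Assumption: "\n" is delivered as a token; newlines that are not
// terminators are discarded rather than stored.
std::vector<Tokens> groupIntoSemiExps(const Tokens& input) {
    std::vector<Tokens> result;
    Tokens current;
    bool atLineStart = true;  // next token is the first one on its line
    bool hashLine = false;    // current line began with '#'

    for (const std::string& tok : input) {
        if (tok == "\n") {
            if (hashLine) {   // newline terminates only on a '#' line
                current.push_back(tok);
                result.push_back(current);
                current.clear();
            }
            atLineStart = true;
            hashLine = false;
            continue;
        }
        if (atLineStart) {
            hashLine = (tok == "#");
            atLineStart = false;
        }
        // ':' terminates only directly after an access specifier.
        bool accessColon = tok == ":" && !current.empty() &&
            (current.back() == "public" || current.back() == "protected" ||
             current.back() == "private");
        current.push_back(tok);
        if (tok == ";" || tok == "{" || tok == "}" || accessColon) {
            result.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) result.push_back(current);  // trailing partial group
    return result;
}
```

Each returned group ends with its terminator, so a later parsing rule can inspect one complete construct at a time.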

Requirements:

Your Lexical Scanner:

Shall be written in C++, using the standard C++ libraries. You may also use helper code provided in the course Repository.
Shall use Visual Studio; the Community Edition is available at no cost.
The tokenizer package shall provide a Toker class implemented using the State Pattern¹.
Single and double quoted strings shall be collected as single tokens. Comments shall be discarded².
The tokenizer shall support the collection of specified single characters as tokens, even if surrounded by other punctuators. Please make the characters { '.', ';', ':', '#', '=', '>', '<', '[', ']', '{', '}', '\n' } default single character tokens.
The tokenizer shall supersede the above for specified two character sequences { "<<", ">>", "==", "::" } which shall be collected as two character tokens.
Shall provide tokenizer member functions to specify alternate one and two character token sequences.
Shall provide a semi-expression package as specified below.
The semi-expression package shall provide a SemiExp class that holds an ordered sequence of tokens, collected from the tokenizer. The order is the order retrieved from the Toker class.
This package shall use the characters { '{', '}', ';', ':', '\n' } as SemiExpression terminators. The character ':' is a terminator only if preceded by the keywords "public", "protected", or "private". The character '\n' is a terminator only if '#' is the first character on the line containing '\n'.
Instances of the SemiExp class shall be indexable, with input an integer, and output the token at that index of the token collection held by the instance.
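
Two of the requirements above lend themselves to a short sketch: the configurable one and two character tokens, and the indexable SemiExp. The code below is one possible shape, not the required interface. The class SpecialTokens and the member names setSingleChars, setCharPairs, match, add, and length are assumptions made for illustration; only the class name SemiExp and the default character sets come from the requirements.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical helper that holds the configurable special-token sets.
class SpecialTokens {
public:
    // Replace the default single character tokens.
    void setSingleChars(const std::string& chars) {
        singles_ = std::set<char>(chars.begin(), chars.end());
    }
    // Replace the default two character tokens.
    void setCharPairs(const std::vector<std::string>& pairs) {
        pairs_ = std::set<std::string>(pairs.begin(), pairs.end());
    }
    // Returns the special token starting at text[pos], or "" if there is none.
    // Pairs are tried first so that they supersede single characters.
    std::string match(const std::string& text, std::size_t pos) const {
        if (pos + 1 < text.size() && pairs_.count(text.substr(pos, 2)) > 0)
            return text.substr(pos, 2);
        if (pos < text.size() && singles_.count(text[pos]) > 0)
            return std::string(1, text[pos]);
        return "";
    }
private:
    std::set<char> singles_{'.', ';', ':', '#', '=', '>', '<',
                            '[', ']', '{', '}', '\n'};
    std::set<std::string> pairs_{"<<", ">>", "==", "::"};
};

// Ordered token collection, indexable by integer position.
class SemiExp {
public:
    void add(const std::string& token) { tokens_.push_back(token); }
    std::size_t length() const { return tokens_.size(); }
    // at() throws std::out_of_range for an invalid index.
    std::string& operator[](std::size_t n) { return tokens_.at(n); }
    const std::string& operator[](std::size_t n) const { return tokens_.at(n); }
private:
    std::vector<std::string> tokens_;
};
```

Checking pairs before single characters is what lets "::" come back as one token even though ':' is itself a single character token.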

You may find that additional member functions will be useful for parsing C++ source code.

Here's a reasonably useful tutorial on the State Pattern.
You will find it useful to define character consumers as states, e.g., eatWhiteSpace, eatAlphNum, eatPunctuator, ...
You may wish to make collecting comments as single tokens an option, but if so, please make discarding comments the default.
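
As a rough illustration of character consumers as states, here is a deliberately reduced Toker sketch. It is an assumption-laden starting point rather than a solution: the names Context, ConsumeState, next, and getTok are hypothetical, newlines are treated as ordinary whitespace, and comments, quoted strings, and the special one and two character tokens are not handled at all.

```cpp
#include <cctype>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>

// Shared scanning context; every state reads from it and advances pos.
struct Context {
    std::string text;
    std::size_t pos = 0;
    bool atEnd() const { return pos >= text.size(); }
    char peek() const { return text[pos]; }
};

static bool isSpace(char c) { return std::isspace(static_cast<unsigned char>(c)) != 0; }
static bool isAlnum(char c) {
    return std::isalnum(static_cast<unsigned char>(c)) != 0 || c == '_';
}

// Abstract state: consumes one run of characters and returns it as a token.
class ConsumeState {
public:
    virtual ~ConsumeState() = default;
    virtual std::string eat(Context& ctx) = 0;
    // Chooses the state for the character at ctx.pos (ctx must not be at end).
    static std::unique_ptr<ConsumeState> next(const Context& ctx);
};

class EatWhitespace : public ConsumeState {
public:
    std::string eat(Context& ctx) override {
        while (!ctx.atEnd() && isSpace(ctx.peek())) ++ctx.pos;
        return "";  // whitespace produces no token
    }
};

class EatAlphanum : public ConsumeState {
public:
    std::string eat(Context& ctx) override {
        std::string tok;
        while (!ctx.atEnd() && isAlnum(ctx.peek())) tok += ctx.text[ctx.pos++];
        return tok;
    }
};

class EatPunctuator : public ConsumeState {
public:
    std::string eat(Context& ctx) override {
        std::string tok;
        while (!ctx.atEnd() && !isSpace(ctx.peek()) && !isAlnum(ctx.peek()))
            tok += ctx.text[ctx.pos++];
        return tok;
    }
};

std::unique_ptr<ConsumeState> ConsumeState::next(const Context& ctx) {
    if (isSpace(ctx.peek())) return std::make_unique<EatWhitespace>();
    if (isAlnum(ctx.peek())) return std::make_unique<EatAlphanum>();
    return std::make_unique<EatPunctuator>();
}

class Toker {
public:
    explicit Toker(std::string text) { ctx_.text = std::move(text); }
    // Returns the next token, or "" once the input is exhausted.
    std::string getTok() {
        while (!ctx_.atEnd()) {
            std::string tok = ConsumeState::next(ctx_)->eat(ctx_);
            if (!tok.empty()) return tok;
        }
        return "";
    }
private:
    Context ctx_;
};
```

Adding states for comments and quoted strings would mean new ConsumeState subclasses plus extra branches in next, leaving Toker::getTok unchanged. Allocating a state per token keeps the sketch short; reusing one instance of each state would avoid the allocations.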

What you need to know:

In order to successfully meet these requirements you will need to know:

The definition of the term package. You should also have looked carefully at a few examples.
How to use C++ inheritance and some of the simpler STL containers, e.g., std::vector<std::string>.