
Project #1 - Lexical Scanner OCD

Using State-Based Tokenizer

Version 1.1
Due Date: Wednesday, September 12th, 2018
Project #1 helper files

Purpose:

The acronym OCD stands for Operational Concept Document. Its purpose is to make you think critically about the design and implementation of a project before committing to code. It also serves to publish your concept to the development team, which for this course is you (and only you). For this project we will be writing an Operational Concept Document for the remaining projects, e.g., Projects #2, #3, and #4.

One focus area for this course is understanding how to structure and implement big software systems. By big we mean systems that may consist of hundreds or even thousands of packages[1] and perhaps several million lines of code. We won't be building anything quite that large, but our projects may be considerably bigger than anything you've worked on before.

In order to successfully implement big systems we need to partition code into relatively small parts and thoroughly test each of the parts before inserting them into the software baseline[2]. As new parts are added to the baseline and as we make changes to fix latent errors or performance problems we will re-run test sequences for those parts and, perhaps, for the entire baseline. Managing that process efficiently requires effective tools for code analysis as well as testing. How we do that code analysis is illustrated by the projects for this year.

The projects this Fall focus on building software tools for code analysis. We will emphasize C# code but want our tools to be easily extendable to other similar languages like C++ and Java.

Code analysis consists of extracting lexical content from source code files, analyzing the code's syntax from its lexical content, and building an Abstract Syntax Tree (AST) that holds the results of our analysis. It is then fairly easy to build several backends that do further analyses on the AST to construct code metrics, search for particular constructs, evaluate package dependencies, or examine other interesting features of the code.

You will find it useful to look at the Parsing blog for a brief introduction to parsing and code analysis.

In the second project we will build and test a lexical scanner in C# that consists of two packages, Tokenizer and SemiExpression, both described in the requirements below.

For this first project, you will prepare an Operational Concept Document for the Lexical Scanner of Project #2. Here, you describe the project concept, e.g., the Users and Uses of the Lexical Scanner, its package structure, any important implementation ideas, and critical issues[3].

Requirements:

Your Lexical Scanner OCD:
  1. Shall be prepared as a document with text and diagrams that describe your concept for this project.
  2. Shall use Unified Modeling Language (UML) diagrams, e.g., package, activity, and perhaps class diagrams.
  3. Shall discuss Users and Uses of the Lexical Scanner. Note that the users for Project #2 are other software packages, but the uses include extracting information useful to the people who use our analysis software.
  4. For each diagram, shall provide text that explains what the diagram contains, why it appears in the document, and what conclusions the reader should draw from the diagram.
  5. Shall discuss the Tokenizer package that declares and defines a Toker class that implements the State Pattern[4] with an abstract ConsumeState[5] class and derived classes for collecting the following token types:
    • alphanumeric tokens
    • punctuator tokens
    • special one[6] and two[7] character tokens, with defaults that may be changed by calling setSpecialSingleChars(string ssc) and/or setSpecialCharPairs(string scp)
    • single-line comments returned as a single token, e.g., // ...
    • multi-line comments returned as a single token, e.g., /* ... */
    • quoted strings[8]
  6. Shall discuss a SemiExpression package that contains a class SemiExp used to retrieve collections of tokens by calling Toker::getTok() repeatedly until one of the SemiExpression termination conditions, below, is satisfied.
  7. Shall terminate a token collection after extracting any of the single character tokens: semicolon, open brace, or closed brace. Shall also terminate on extracting a newline if 'using' or '#' is the first token on that line.
  8. Shall provide a facility for rules that ignore certain termination characters under special circumstances. You are required to provide a rule to ignore the (two) semicolons within parentheses in a for(;;) expression[9].
  9. Shall have the SemiExp class implement the interface ITokenCollection with a declared method get().
  10. Shall discuss provision of an automated unit test suite that exercises all of the special cases that seem appropriate for these two packages[10].
  11. Shall discuss critical issues related to the development of the Lexical Scanner, and the results of that development, e.g., discuss both process and product.
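
The termination rules in requirements 6 through 8 can be sketched in a few lines of C#. This is only a sketch under stated assumptions, not the reference design: the requirements name SemiExp, ITokenCollection, and get(), but the bool return type, the Tokens property, and the constructor that takes a ready-made token sequence (standing in for repeated getTok() calls) are hypothetical. It also assumes that newlines that do not terminate a collection are discarded, which the requirements leave open.

```csharp
using System;
using System.Collections.Generic;

// Only SemiExp, ITokenCollection, and get() come from the requirements.
// The bool return type, Tokens property, and constructor are assumptions.
public interface ITokenCollection
{
    bool get();   // true if a token collection was extracted
}

public class SemiExp : ITokenCollection
{
    private readonly IEnumerator<string> source_;  // stands in for repeated getTok() calls
    private string firstOnLine_;                   // first token seen on the current line

    public List<string> Tokens { get; } = new List<string>();

    public SemiExp(IEnumerable<string> tokens) { source_ = tokens.GetEnumerator(); }

    public bool get()
    {
        Tokens.Clear();
        bool inFor = false;   // true between "for" and its closing parenthesis
        int depth = 0;        // parenthesis nesting inside the for header

        while (source_.MoveNext())
        {
            string tok = source_.Current;

            if (tok == "\n")
            {
                // Requirement 7: newline terminates only lines that start with 'using' or '#'.
                bool directive = firstOnLine_ == "using" || firstOnLine_ == "#";
                firstOnLine_ = null;
                if (directive && Tokens.Count > 0) return true;
                continue;   // assumption: all other newlines are discarded
            }

            if (firstOnLine_ == null) firstOnLine_ = tok;
            Tokens.Add(tok);

            // Requirement 8: track whether we are inside for( ... ).
            if (tok == "for") inFor = true;
            else if (inFor && tok == "(") depth++;
            else if (inFor && tok == ")" && --depth == 0) inFor = false;

            bool ignoreSemi = inFor && depth > 0;
            if (tok == "{" || tok == "}" || (tok == ";" && !ignoreSemi)) return true;
        }
        return Tokens.Count > 0;   // flush whatever remains at end of input
    }
}
```

With this sketch, the tokens of `for (i = 0; i < n; ++i) {` come back as one collection ending in the open brace, because both semicolons fall inside the parentheses of the for header. A real design would likely express the for(;;) case as one rule among several pluggable rules rather than hard-coding it in get().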

Footnotes:

  1. In C#, a package is a single source code file that contains:
    • prologue, providing a name, brief descriptive phrase, author information, and environment information
    • description of the package's responsibilities and required files
    • maintenance history
    • class definitions
    • a main function that implements construction tests for all the defined code
  2. A software baseline is the set of all code considered to be part of the current system, excluding experimental code that individual developers are working on.
  3. For commercial and industrial packages, critical issues usually are concerned with: performance, scalability, safety for health and wealth, and complexity.
  4. https://en.wikipedia.org/wiki/State_pattern
  5. You don't have to use the ConsumeState name. In my demo code I used the name TokenState.
  6. Special one character tokens: <, >, [, ], (, ), {, }, :, =, +, -, *, \n
  7. Special two character tokens: <<, >>, ::, ++, --, ==, +=, -=, *=, /=
  8. "abc" becomes the token abc and the outer quotes are discarded. "\"abc\"" becomes the token "abc" with the outer quotes discarded.
  9. This will be discussed in class.
  10. This is in addition to the construction tests you include as part of every package you submit.
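
Footnotes 4 through 8 can be made concrete with a small State Pattern sketch. It is a hedged illustration, not the demo code mentioned in footnote 5: only Toker, ConsumeState, and getTok() are named in this statement, while TokerContext, the derived state names, the Next() selector, and the string return type are all hypothetical. The special-token defaults are hard-coded from footnotes 6 and 7, the comment states are omitted for brevity, and the only escape sequence handled inside quoted strings is \".

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper: the text being scanned plus the read position.
public class TokerContext
{
    public readonly string Src;
    public int Pos;
    public TokerContext(string src) { Src = src; }
    public bool AtEnd => Pos >= Src.Length;
    public char Peek(int ahead = 0) => Pos + ahead < Src.Length ? Src[Pos + ahead] : '\0';
}

// Abstract state: each derived class collects exactly one kind of token.
public abstract class ConsumeState
{
    public const string OneChars = "<>[](){}:=+-*\n";                 // footnote 6
    public static readonly HashSet<string> TwoChars = new HashSet<string>
        { "<<", ">>", "::", "++", "--", "==", "+=", "-=", "*=", "/=" }; // footnote 7

    public abstract string Consume(TokerContext c);

    // Chooses the state for the token starting at the current position.
    // Returns null on whitespace (other than newline) or end of input.
    public static ConsumeState Next(TokerContext c)
    {
        char ch = c.Peek();
        if (ch == '"') return new QuotedState();
        if (char.IsLetterOrDigit(ch) || ch == '_') return new AlphaState();
        if (TwoChars.Contains($"{ch}{c.Peek(1)}")) return new SpecialState(2);
        if (OneChars.IndexOf(ch) >= 0) return new SpecialState(1);
        if (char.IsWhiteSpace(ch) || c.AtEnd) return null;
        return new PunctState();
    }
}

public class AlphaState : ConsumeState
{
    public override string Consume(TokerContext c)
    {
        int start = c.Pos;
        while (char.IsLetterOrDigit(c.Peek()) || c.Peek() == '_') c.Pos++;
        return c.Src.Substring(start, c.Pos - start);
    }
}

public class SpecialState : ConsumeState
{
    private readonly int len_;
    public SpecialState(int len) { len_ = len; }
    public override string Consume(TokerContext c)
    {
        string tok = c.Src.Substring(c.Pos, len_);
        c.Pos += len_;
        return tok;
    }
}

public class PunctState : ConsumeState
{
    public override string Consume(TokerContext c)
    {
        int start = c.Pos;
        do { c.Pos++; } while (Next(c) is PunctState);   // stop when another state would take over
        return c.Src.Substring(start, c.Pos - start);
    }
}

public class QuotedState : ConsumeState
{
    public override string Consume(TokerContext c)
    {
        var sb = new StringBuilder();
        c.Pos++;                                               // discard opening quote
        while (!c.AtEnd && c.Peek() != '"')
        {
            if (c.Peek() == '\\' && c.Peek(1) == '"') c.Pos++; // drop backslash, keep the quote
            sb.Append(c.Src[c.Pos++]);
        }
        c.Pos++;                                               // discard closing quote
        return sb.ToString();
    }
}

public class Toker
{
    private readonly TokerContext ctx_;
    public Toker(string source) { ctx_ = new TokerContext(source); }

    // Returns the next token, or null when the source is exhausted.
    public string getTok()
    {
        // Skip whitespace except newline, which is itself a special token.
        while (!ctx_.AtEnd && ctx_.Peek() != '\n' && char.IsWhiteSpace(ctx_.Peek())) ctx_.Pos++;
        return ConsumeState.Next(ctx_)?.Consume(ctx_);
    }
}
```

Calling getTok() repeatedly on the text `x += "hi";` yields the tokens x, +=, hi, and ; in that order, then null. Choosing the next state in one static selector keeps the derived classes small, at the cost of allocating a state object per token; a fuller design might reuse state instances and add the comment states and the setSpecialSingleChars and setSpecialCharPairs setters.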

What you need to know:

In order to successfully meet these requirements you will need to know:
  1. Basics of the C# language: C# tutorial - PROGRAMIZ
  2. How to implement a simple class hierarchy. This will be covered briefly in lecture #3 and in more detail later.
  3. The .NET containers.
  4. How to use Visual Studio. We will discuss this in one of the Help Sessions.