There are a lot of reasons you may wish to parse source code beyond compiling its text. For example:
Using existing parsers for the fairly small tasks in which we are interested seems like killing flies with a sledgehammer - too much work and not enough reward. Our goals are to build a facility that is quick to deploy, can be easily ported to different platforms, and for which the parsing model can be built incrementally as we learn more about the work we are trying to accomplish.
That prototype has since been used in a couple of doctoral research projects and by many of my graduate classes on a variety of projects. We've found it to be an effective facility for learning language structure in the classroom and building research tools in the lab.
Each token is a text word or a group of punctuators. Some tokens are constrained to consist of a single instance of a special character, like braces, brackets, and such. Often the transitions between characters and white space are taken as token boundaries, but many special cases have been incorporated over years of use.
The tokenizer collects quoted strings and comments as single tokens, independent of their contents. You can ask the tokenizer to return comment tokens or to throw them away. Aside from comments and white space, the tokenizer promises to return all of the input stream's characters, in the order provided by the source. What it does is segregate them into the words we call tokens. Thus it removes all textual structure from the source while preserving its compilable information.
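To make that behavior concrete, here is a minimal tokenizer sketch in C++. It is only a sketch of the rules described above - the class and function names are illustrative, not the actual package interface, and block comments along with the many accumulated special cases are omitted for brevity.

#include <cctype>
#include <iostream>
#include <string>
#include <vector>

// Minimal tokenizer sketch: words, single-character punctuator tokens,
// quoted strings and line comments kept whole.
class Toker
{
public:
  explicit Toker(bool returnComments = false) : returnComments_(returnComments) {}

  std::vector<std::string> tokenize(const std::string& src)
  {
    std::vector<std::string> toks;
    size_t i = 0;
    while (i < src.size())
    {
      unsigned char c = static_cast<unsigned char>(src[i]);
      if (std::isspace(c)) { ++i; continue; }                  // white space separates tokens
      if (src[i] == '"' || src[i] == '\'')
        toks.push_back(collectQuote(src, i));                  // quoted string -> one token
      else if (src.compare(i, 2, "//") == 0)
      {
        std::string comment = collectLineComment(src, i);      // comment -> one token,
        if (returnComments_) toks.push_back(comment);          // returned or thrown away
      }
      else if (std::ispunct(c))
        toks.push_back(std::string(1, src[i++]));              // single special character
      else
        toks.push_back(collectWord(src, i));                   // identifier, keyword, number
    }
    return toks;
  }

private:
  bool returnComments_;

  std::string collectQuote(const std::string& s, size_t& i)
  {
    char q = s[i];
    size_t start = i++;
    while (i < s.size() && s[i] != q) ++i;
    if (i < s.size()) ++i;                                     // consume closing quote
    return s.substr(start, i - start);
  }
  std::string collectLineComment(const std::string& s, size_t& i)
  {
    size_t start = i;
    while (i < s.size() && s[i] != '\n') ++i;
    return s.substr(start, i - start);
  }
  std::string collectWord(const std::string& s, size_t& i)
  {
    size_t start = i;
    while (i < s.size() && (std::isalnum(static_cast<unsigned char>(s[i])) || s[i] == '_')) ++i;
    return s.substr(start, i - start);
  }
};

int main()
{
  Toker toker;
  for (const std::string& tok : toker.tokenize("int x = 42;  // the answer"))
    std::cout << tok << '\n';                                  // int  x  =  42  ;
}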
For code analysis we collect tokens into grammatical sequences, called SemiExpressions, by terminating each collection on a semicolon, a curly brace, or a newline when the line begins with "#". Each of these collections is usually sufficient to detect a single grammatical construct without including tokens from the next grammatical entity. If we are parsing XML text we use a package called XmlParts, which has responsibilities similar to those of the SemiExpression package.
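A sketch of that grouping follows, assuming tokens arrive as a flat sequence like the one the tokenizer sketch produces, and that newline tokens are preserved so preprocessor lines can be terminated. The names are again assumed, not the real SemiExpression interface.

#include <iostream>
#include <string>
#include <vector>

using Tokens = std::vector<std::string>;

// Group a flat token stream into SemiExpression-like collections, terminating
// on ";", "{", "}", or a newline when the collection began with "#".
std::vector<Tokens> collectSemiExps(const Tokens& toks)
{
  std::vector<Tokens> semiExps;
  Tokens current;
  for (const std::string& tok : toks)
  {
    current.push_back(tok);
    bool terminated =
      tok == ";" || tok == "{" || tok == "}" ||
      (tok == "\n" && current.front() == "#");         // preprocessor line
    if (terminated)
    {
      semiExps.push_back(current);                     // one grammatical construct
      current.clear();
    }
  }
  if (!current.empty())
    semiExps.push_back(current);                       // any trailing partial collection
  return semiExps;
}

int main()
{
  Tokens toks = { "#", "include", "<", "iostream", ">", "\n",
                  "int", "main", "(", ")", "{", "return", "0", ";", "}" };
  for (const Tokens& se : collectSemiExps(toks))
  {
    for (const std::string& tok : se)
      std::cout << (tok == "\n" ? std::string("\\n") : tok) << ' ';
    std::cout << '\n';
  }
}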
The Parser package uses the scanner to collect token sequences for analysis. It is essentially a container of rule detectors that implement an IRule interface. The Parser simply feeds the current token sequence to each rule in its internal rule container. When that pass is complete, it requests another token sequence and repeats until there is nothing more to collect.
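In outline, the relationship between the Parser and its rules might look like the sketch below. The names and signatures are assumptions for illustration, not the actual IRule or Parser interfaces.

#include <iostream>
#include <string>
#include <vector>

using SemiExp = std::vector<std::string>;              // one token collection

// A rule detector: examines a token collection and reports whether it
// recognized a grammatical construct.
struct IRule
{
  virtual ~IRule() = default;
  virtual bool doTest(const SemiExp& se) = 0;
};

// The Parser is little more than a container of rules: it hands each
// incoming collection to every registered rule in turn.
class Parser
{
public:
  void addRule(IRule* rule) { rules_.push_back(rule); }
  void parse(const SemiExp& se)
  {
    for (IRule* rule : rules_)
      rule->doTest(se);
  }
private:
  std::vector<IRule*> rules_;
};

// trivial demonstration rule: fires on preprocessor collections
struct PreprocRule : IRule
{
  bool doTest(const SemiExp& se) override
  {
    if (!se.empty() && se.front() == "#")
    {
      std::cout << "preprocessor directive found\n";
      return true;
    }
    return false;
  }
};

int main()
{
  PreprocRule preproc;
  Parser parser;
  parser.addRule(&preproc);

  // in the real facility these collections would come from the scanner
  std::vector<SemiExp> collections = { { "#", "include", "<", "iostream", ">" },
                                       { "int", "x", "=", "0", ";" } };
  for (const SemiExp& se : collections)
    parser.parse(se);                                  // one collection at a time
}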
The Parser doesn't need to know anything about how its token sequences are collected, nor how a rule will handle a sequence. It is simply a traffic cop that supplies the rules with what they need. Each rule has a collection of actions. When the rule is satisfied by a token collection, it invokes its actions. It is up to each action to decide what to do with the information contained in the token sequence when its rule fires. Each action is handed a reference to a data repository where it stores and retrieves the information it needs to carry out its task. Note that the rules don't need to know what the actions do, and the actions don't even know the rules exist. They just do their thing when asked.
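The sketch below fills in that rule/action/repository relationship under the same assumed names. The construct detection is deliberately crude - a collection that contains "(" and ends in "{" is treated as a function definition - but it shows how a rule fires its actions and how an action uses the repository without either side knowing more than it needs to.

#include <iostream>
#include <string>
#include <vector>

using SemiExp = std::vector<std::string>;

struct Repository                                      // simple stand-in for the data store
{
  std::vector<std::string> functionNames;
};

struct IAction                                         // actions do the work when a rule fires
{
  virtual ~IAction() = default;
  virtual void doAction(const SemiExp& se) = 0;
};

struct IRule                                           // as in the previous sketch
{
  virtual ~IRule() = default;
  virtual bool doTest(const SemiExp& se) = 0;
};

struct ARule : IRule                                   // rule base that carries its actions
{
  void addAction(IAction* action) { actions_.push_back(action); }
protected:
  void doActions(const SemiExp& se)                    // invoked when the rule is satisfied
  {
    for (IAction* action : actions_)
      action->doAction(se);
  }
private:
  std::vector<IAction*> actions_;
};

// crude rule: a collection containing "(" and ending in "{" looks like a function definition
struct FunctionDefinition : ARule
{
  bool doTest(const SemiExp& se) override
  {
    bool hasParen = false;
    for (const std::string& tok : se)
      if (tok == "(") { hasParen = true; break; }
    if (hasParen && !se.empty() && se.back() == "{")
    {
      doActions(se);                                   // the rule fired - its actions decide what to do
      return true;
    }
    return false;
  }
};

// action: record the token just before "(" as the function's name
struct PushFunction : IAction
{
  explicit PushFunction(Repository* repo) : repo_(repo) {}
  void doAction(const SemiExp& se) override
  {
    for (size_t i = 1; i < se.size(); ++i)
      if (se[i] == "(") { repo_->functionNames.push_back(se[i - 1]); return; }
  }
private:
  Repository* repo_;
};

int main()
{
  Repository repo;
  PushFunction pushFunction(&repo);
  FunctionDefinition functionDef;
  functionDef.addAction(&pushFunction);                // the rule never learns what the action does

  SemiExp se = { "void", "foo", "(", "int", "x", ")", "{" };
  functionDef.doTest(se);
  std::cout << "found function: " << repo.functionNames.front() << '\n';
}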
For each parsing application, the only things that need to change are the rules and actions, the ConfigureParser builder that assembles the analyzer's parts, the Repository, and the Display. All the complex parts - the tokenizer, SemiExpression, and Parser - don't need to change at all. The rules and actions will probably be modeled after existing rules and actions, so reuse in a new application is fairly straightforward.
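Gathering the sketches above into one program, a ConfigureParser-style builder might wire the parts together roughly as shown below. The class name and build interface are assumptions for illustration, not the actual package.

#include <iostream>
#include <memory>

// Assumes the Toker, collectSemiExps, Parser, Repository, FunctionDefinition,
// and PushFunction types from the sketches above are combined into one program.
class ConfigureParser
{
public:
  Parser* build()
  {
    repo_     = std::make_unique<Repository>();
    parser_   = std::make_unique<Parser>();
    funcDef_  = std::make_unique<FunctionDefinition>();
    pushFunc_ = std::make_unique<PushFunction>(repo_.get());

    funcDef_->addAction(pushFunc_.get());              // attach action to its rule
    parser_->addRule(funcDef_.get());                  // register rule with the parser
    // a new application adds or swaps rules and actions here; the tokenizer,
    // SemiExpression, and Parser machinery is reused unchanged
    return parser_.get();
  }
  Repository* repository() { return repo_.get(); }

private:
  std::unique_ptr<Repository>         repo_;
  std::unique_ptr<Parser>             parser_;
  std::unique_ptr<FunctionDefinition> funcDef_;
  std::unique_ptr<PushFunction>       pushFunc_;
};

int main()
{
  ConfigureParser configure;
  Parser* parser = configure.build();

  Toker toker;
  Tokens tokens = toker.tokenize("void foo(int x) { return; }");
  for (const Tokens& se : collectSemiExps(tokens))     // one collection at a time
    parser->parse(se);

  for (const std::string& name : configure.repository()->functionNames)
    std::cout << "function: " << name << '\n';         // prints "function: foo"
}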
Our research group has used the Parser as the analysis engine for dependency analysis, code restructuring, and some code visualization. This Parser design has allowed us to focus on asking and answering questions about code rather than the details of its syntactical analysis.