#ifndef TOKENIZER_H #define TOKENIZER_H ///////////////////////////////////////////////////////////////////// // Tokenizer.h - Reads words from a file // // ver 2.0 // // // // Language: Visual C++ 2008, SP1 // // Platform: Dell Precision T7400, Vista Ultimate SP1 // // Application: Prototype for CSE687 Pr1, Sp09 // // Author: Jim Fawcett, CST 4-187, Syracuse University // // (315) 443-3948, jfawcett@twcny.rr.com // ///////////////////////////////////////////////////////////////////// /* Module Operations: ================== This module defines a tokenizer class. Its instances read words from an attached file or string. Word boundaries occur when a character sequence read from the file or string: - changes between any of the character types: alphanumeric, punctuator, or white space. - certain characters are treated as single character tokens, e.g., "(", ")", "{", "}", "[". "]", ";", ".", and "\n". You can change the selection with Toker::setMode(Toker::code) or Toker::setMode(Toker::xml), or Toker::setSingleCharTokens(somestring); - Toker::code is the default tokenizing mode. - Double quoted and single quoted strings are returned as single tokens. You can turn off collecting single quoted strings as tokens by calling Toker::returnSingleQuotedStringsAsTokens(false). That is needed if you are parsing text with single quotes, as possessives, for example. A tokenizer is an important part of a scanner, used to read and interpret source code or XML. Public Interface: ================= Toker t; // create tokenizer instance returnComments(); // request comments return as tokens if(t.attach(someFileName)) // select file for tokenizing string tok = t.getTok(); // extract first token int numLines = t.lines(); // return number of lines encountered t.lines() = 0; // reset line count Build Process: ============== Required files - tokenizer.h, tokenizer.cpp Build commands (either one) - devenv Project1HelpS06.sln - cl /EHsc /DTEST_TOKENIZER tokenizer.cpp /link setargv.obj Maintenance History: ==================== ver 2.0 : 09 Jun 11 - deleted pIn in destructor ver 1.9 : 20 Feb 11 - TA Mehmet Kaya fixed bug handling many contiguous escape chars, reported by Ider (Jian) Zhang and Harshal Bhakta ver 1.8 : 07 Feb 10 - added returnSingleQuotedStringAsToken(bool doReturn) to support XML parsing where text body has single quotes. Note that if you set as false, then single quoted attribute values won't be returned as a single token. ver 1.7 : 31 Jan 09 - made putback(int) a public, rather than private, member function ver 1.6 : 22 Jan 09 - added Toker::setSingleTokenChars(somestring) - added notes about setMode and setSingleTokenChars to Manual and Maintenance information. ver 1.5 : 17 Jan 09 - added Toker::mode, either code or xml - changed container for single char token ids from vector to string - added test for end of file in isTokEnd() to fix bug, found by Jim DelloStritto that sometimes caused Toker to fail returning last character if not a singleCharToken ver 1.4 : 10 Feb 07 - fixed bug in braceCount to eliminate changing count when brace is in a quoted string or comment ver 1.3 : 24 Feb 06 - fixed bug in eat comment that hung on ending comment with no newline, by testing for stream state good. ver 1.2 : 06 Feb 06 - added stream closing to destructor and attach memeber functions ver 1.1 : 01 Feb 06 - added if test at end of getTok() to avoid returning space after C comment as a token ver 1.0 : 12 Jan 06 - first release */ // #include #include #include class Toker { public: enum mode { code, xml, custom }; Toker(const std::string& src = "", bool isFile = true); ~Toker(); void setMode(mode md); void setSingleCharTokens(std::string tokChars); bool attach(const std::string& filename, bool isFile = true); std::string getTok(); void returnComments(bool doReturn = true); void returnSingleQuotedStringAsToken(bool doCollect=true); int& lines(); int braceLevel(); bool isFileEnd(); int peek(); // peek at next char (not token) void putback(int ch); // put back char on stream (not token) enum state { default_state, comment_state, quote_state }; private: std::istream* pIn; char prevprevChar, prevChar, currChar, nextChar; std::string scTok; std::vector putbacks; int numLines; int braceCount; bool doReturnComments; bool doReturnSingleQuotesAsToken; bool aCppComment; enum state _state; mode _mode; bool EndQuoteCounter; // private helper functions int get(); bool getChar(); bool isSingleCharTok(char ch); bool isTokEnd(); void stripWhiteSpace(); bool isIdentifierChar(char ch); bool isBeginComment(); bool isEndComment(); std::string eatComment(); bool aSingleQuote; bool isBeginQuote(); bool isEndQuote(); std::string eatQuote(); // prohibit copying and assignment Toker(const Toker &tkr); Toker& operator=(const Toker&); }; inline void Toker::returnComments(bool doReturn) { doReturnComments = doReturn; } inline bool Toker::isFileEnd() { return (nextChar == -1); } inline int& Toker::lines() { return numLines; } inline int Toker::braceLevel() { return braceCount; } #endif