As part of a class project to implement a compiler I am also implementing a hash table for use as the compiler's symbol table.
The implementation of the hash table is intended to be very low-level, a raw block of memory that is manually packed, and it is only intended to store Token objects. Thus, to optimize the hash table for serializability, I decided to simply inline the Tokens in the table, that is, simply memcpy the Token object into the table's memory when a Token is first inserted.
I am aware that one should not memcpy a class that has virtual functions or pointers, and that in general using memcpy on objects of a class is bad practice. However, the Token class does not have virtual functions or pointers as can be seen from its declaration below, and I would not use memcpy if this was not an exercise in low-level programming.
class Token {
public:
Token() : tag(BAD) {}
Token(Tag t) : tag(t) {}
Tag tag;
size_t getSize(){ return sizeof(Token); }
};
The problem that I am having, is that the hash table inserts the Tokens correctly, and it finds the same memory location when looking for the same key, however, when I try to access the members of the returned Token pointer, it appears that the data has been corrupted.
I wrote a program to test the symbol table on a simple input. The program does the following:
While the symbol table returns the same location in memory that it inserted the Token into, the Token's tag attribute no longer matches the original one that was inserted. Below is the log output when the program inserts the keyword program into the symbol table:
[debug] Entering SymbolTable.insert(const char* key)
[debug] bin: 48 Searching for the last element in this bin's linked list.
[debug] Last entry found... writing to entry's next location field.
[debug] Writing the new entry's identifier to the table.
[debug] The identifier: program has been written to the table.
[debug] The memory blocks are not equal prior to assignment.
[debug] The memory blocks are equal.
[debug] nextLoc: 571 tag: 46
[debug] Location of Token: 14287688
And below is the log output when the program subsequently looks up the identifier program in the symbol table.
[debug] Entering Lexer.getToken()
[debug] Entering SymbolTable.contains(const char* key)
[debug] Entering SymbolTable.find(const char* key) key: program
[debug] bin: 48
[debug] Search location: 541
[debug] Comparing key char: p to table char: p
[debug] Comparing key char: r to table char: a
[debug] Tag of entry: 1684368227
[debug] Location of Token: 14287653
[debug] Search location: 557
[debug] Comparing key char: p to table char: p
[debug] Comparing key char: r to table char: r
[debug] Comparing key char: o to table char: o
[debug] Comparing key char: g to table char: c
[debug] Tag of entry: 1920296037
[debug] Location of Token: 14287668
[debug] Search location: 0
[debug] Comparing key char: p to table char: p
[debug] Comparing key char: r to table char: r
[debug] Comparing key char: o to table char: o
[debug] Comparing key char: g to table char: g
[debug] Comparing key char: r to table char: r
[debug] Comparing key char: a to table char: a
[debug] Comparing key char: m to table char: m
[debug] Tag of entry: 1207959598
[debug] Location of Token: 14287688
The 1th token: 60
So as can be seen from the Location of Token message, the symbol table is finding the same location in memory where it wrote the Token to, but the tag of the Token is different. I'm stumped as to why this might be.
For completeness, here is the definition of the SymbolTable class.
template<class sizeType>
class SymbolTable{
public:
SymbolTable();
~SymbolTable();
Token* operator[](const char* key);
bool contains(const char* key);
Token* insert(const char* key, Token value);
private:
void* find(const char* key);
static const size_t nbins = 64;
sizeType nextLoc;
void* table;
size_t hash(const char* key){
return (size_t)key[0] % nbins;
}
};
And here is the source code for the symbol table's insert, find and operator[] functions.
template<class sizeType> Token* SymbolTable<sizeType>::insert(const char* key, Token value){
BOOST_LOG_TRIVIAL(debug) << "Entering SymbolTable.insert(const char* key)";
size_t bin = hash(key);
void *sizeType_ptr,
*tbl_char_ptr,
*idSize_ptr;
unsigned char idSize = 0;
const char *key_ptr = key;
Token *token_ptr = NULL;
// Find the last entry in this bin's linked list.
BOOST_LOG_TRIVIAL(debug) << "bin: " << bin
<< " Searching for the last element in this bin's linked list.";
sizeType_ptr = table + sizeof(sizeType)*bin;
while(*((sizeType*)sizeType_ptr) != 0){
sizeType_ptr = table + *((sizeType*)sizeType_ptr);
}
BOOST_LOG_TRIVIAL(debug) << "Last entry found... writing to entry's next location field.";
// Write the location of the new entry to this entry's next field.
*((sizeType*)sizeType_ptr) = nextLoc;
// Move to new entry's location.
sizeType_ptr = table + nextLoc;
// Write identifier
BOOST_LOG_TRIVIAL(debug) << "Writing the new entry's identifier to the table.";
tbl_char_ptr = sizeType_ptr + sizeof(sizeType) + sizeof(unsigned char);
while(*key_ptr != '\0'){
*((char*)tbl_char_ptr) = *key_ptr;
tbl_char_ptr = tbl_char_ptr + sizeof(char);
++key_ptr;
++idSize;
}
BOOST_LOG_TRIVIAL(debug) << "The identifier: " << key << " has been written to the table.";
// Write length of identifier.
idSize_ptr = sizeType_ptr + sizeof(sizeType);
*((unsigned char*)idSize_ptr) = idSize;
token_ptr = (Token*)(tbl_char_ptr + sizeof(char));
void *tk = &value,
*tb = token_ptr;
for(int i = 0; i < value.getSize(); ++i){
if(*((char*)tk) != *((char*)tb)){
BOOST_LOG_TRIVIAL(debug) << "The memory blocks are not equal prior to assignment.";
break;
}
}
memcpy((void*)token_ptr, &value, value.getSize());
bool areEqual = true;
for(int i = 0; i < value.getSize(); ++i){
if(*((char*)tk) != *((char*)tb)){
areEqual = false;
BOOST_LOG_TRIVIAL(debug) << "The memory blocks are not after assignment.";
break;
}
}
if(areEqual){
BOOST_LOG_TRIVIAL(debug) << "The memory blocks are equal.";
}
nextLoc = nextLoc + sizeof(sizeType) + sizeof(unsigned char)
+ idSize*sizeof(char) + value.getSize();
BOOST_LOG_TRIVIAL(debug) << "nextLoc: " << nextLoc
<< " tag: " << token_ptr->tag;
BOOST_LOG_TRIVIAL(debug) << "Location of Token: " << (size_t)token_ptr;
return token_ptr;
}
template<class sizeType>
void* SymbolTable<sizeType>::find(const char* key){
BOOST_LOG_TRIVIAL(debug) << "Entering SymbolTable.find(const char* key) "
<< "key: " << key;
bool found = false;
size_t bin = hash(key);
void *idSize,
*sizeType_ptr,
*tbl_char_ptr,
*result_ptr = NULL;
const char* key_ptr = key;
BOOST_LOG_TRIVIAL(debug) << "bin: " << bin;
sizeType_ptr = table + sizeof(sizeType)*bin;
while(!found){
found = true;
// Is this the last element in this bin?
if(*((sizeType*)sizeType_ptr) == 0){
result_ptr = NULL;
return result_ptr;
}
// Advance to the next element in this bin's linked list.
sizeType_ptr = table + *((sizeType*)sizeType_ptr);
idSize = sizeType_ptr + sizeof(sizeType);
tbl_char_ptr = idSize + sizeof(unsigned char);
BOOST_LOG_TRIVIAL(debug) << "Search location: " << *((sizeType*)sizeType_ptr);
// Check if the passed key matches the current key in the table.
for(int i = 0; i < *((unsigned char*)idSize); ++i){
BOOST_LOG_TRIVIAL(debug) << "Comparing key char: " << *key_ptr
<< "to table char: " << *((const char*)tbl_char_ptr);
// Check if the key is too short or if the chars do not match.
if(*key_ptr != *((const char*)tbl_char_ptr)){
found = false;
break;
}
++key_ptr;
tbl_char_ptr = tbl_char_ptr + sizeof(char);
BOOST_LOG_TRIVIAL(debug) << "*(const char*)tbl_char_ptr: "
<< *((const char*)tbl_char_ptr);
}
result_ptr = tbl_char_ptr + sizeof(char);
BOOST_LOG_TRIVIAL(debug) << "Tag of entry: " << ((Token*)result_ptr)->tag;
BOOST_LOG_TRIVIAL(debug) << "Location of Token: " << (size_t)result_ptr;
key_ptr = key;
}
return result_ptr;
}
template<class sizeType>
Token* SymbolTable<sizeType>::operator[](const char* key){
BOOST_LOG_TRIVIAL(debug) << "Entering SymbolTable.operator[](const char* key)";
void* void_ptr = find(key);
Token* token_ptr = (Token*)void_ptr;
return token_ptr;
}
And here is the test program's source code.
int main(){
cout << "Executing testLexer.cpp" << endl;
ifstream file("./pascalPrograms/helloworld.pas");
string program((istreambuf_iterator<char>(file)), istreambuf_iterator<char>());
cout << "program:\n\n" << program << endl;
int fileSize = program.length();
const char* buffer = program.c_str();
const char* scanp = buffer;
cout << "Instantiating Lexer" << endl << endl;
Lexer lexer = Lexer(scanp);
Token* tok;
int i = 0;
cout << "Entering while loop to fetch tags." << endl << endl;
do{
i++;
tok = lexer.getToken();
cout << "The " << i << "th token: " << tok->tag << endl;
} while(tok->tag != END_OF_FILE);
return 0;
}
Thank you in advance for your help! :D
Edit:
Here is the input Pascal program:
program Hello;
begin
writeln ('Hello, world.');
readln
end.
And to clarify the question:
Why is the tag of the Token retrieved from the symbol table different from the tag of the Token put into the symbol table when the Token in the symbol table is an exact copy of the original?
Found it. You're overwriting the first byte of your tag with an 'H'. The other bytes are fine. Now to find where that H is coming from...
nextLoc = nextLoc + sizeof(sizeType) + sizeof(unsigned char)
+ idSize*sizeof(char) + value.getSize();
You need to add one more here. You have the skip (sizeType), the length byte (unsigned char), the id itself (idSize * sizeof(char)) and the value (value.getSize()), but you also leave a byte between id and value that you're not accounting for. That's why the last byte of your tag is getting overwritten - and because you're testing on a little-endian machine that results in the highest byte being corrupted.
for(int i = 0; i < *((unsigned char*)idSize); ++i){
...
tbl_char_ptr = tbl_char_ptr + sizeof(char);
...
}
result_ptr = tbl_char_ptr + sizeof(char);
That's one more than idSize.