I have a very large dataset of HTML tables (originally extracted from Wikipedia). I want to extract meaningful triples from each of these tables. (This is not to be confused with extracting triples from Wikipedia infoboxes, which is a considerably easier task.)
The triples have to be semantically meaningful to humans, unlike DBpedia, where triples are extracted as URIs and other formats. So I am OK with just extracting the tables' text values.
Keep in mind the variety of table orientations and shapes. The main task, as I see it, is to extract the main entity of the table records (the student name in a school record, for example) so that it can be used as the triple's subject.
Example
For a table like this (one record shown):

| Server | Developed by | Open Source | Software license | Last stable version | Release date |
| --- | --- | --- | --- | --- | --- |
| AOLserver | NaviSoft | Yes | Mozilla | 4.5.1 | 2009-02-02 |

we should recognize that the main entity is "Server" and the other columns are only objects, so the relations should be:
<AOLserver> <Developed by> <NaviSoft>.
<AOLserver> <Open Source> <Yes>.
<AOLserver> <Software license> <Mozilla>.
<AOLserver> <Last stable version> <4.5.1>.
<AOLserver> <Release date> <2009-02-02>.
Also, keep in mind that the main entity does not always lie in the first column of the table; there are even tables that do not talk about a single subject at all.
This is a table where the main entity is in the last column, not the first:

| Position | Name |
| --- | --- |
| Manager | Arsène Wenger |
| Assistant manager | Steve Bould |

This table should generate relations like:
<Arsène Wenger> <Position> <Manager>.
<Steve Bould> <Position> <Assistant manager>.
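For concreteness, once the subject column is known, emitting the triples is mechanical. Here is a minimal Python sketch (the function name and the table encoding are mine, for illustration only):

```python
def emit_triples(header, rows, subject_col):
    """Given a table as a header list plus data rows, and the index of the
    column holding the main entity, yield (subject, predicate, object)
    triples using the header labels as predicates."""
    for row in rows:
        subject = row[subject_col]
        for col, value in enumerate(row):
            if col != subject_col:
                yield (subject, header[col], value)

# The first example table from above:
header = ["Server", "Developed by", "Open Source", "Software license",
          "Last stable version", "Release date"]
rows = [["AOLserver", "NaviSoft", "Yes", "Mozilla", "4.5.1", "2009-02-02"]]
for s, p, o in emit_triples(header, rows, subject_col=0):
    print(f"<{s}> <{p}> <{o}>.")
```

The hard part, of course, is everything before this step: deciding which column holds the subject.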
Questions
My first question: can this be done using rule-based methods, i.e., crafting rules around examples and trying to generalize them so that I can detect the right entity? Can you suggest example rules?
My second question is about evaluation: how can I evaluate such a system, and how can I measure its performance so that I can improve it?
So, finally, I've achieved the goal of my project; it required a lot of work and testing, but it was done.
The idea rested mainly on a pipeline like the following:

1. A component to extract the tables and load them into an in-memory object.
2. A component to exclude bad tables, i.e., markup that uses table tags but isn't really a table (sometimes page authors put content in a table just to organize its appearance).
3. A component to strip off the tables' styling and to resolve column/row spans by repeating the spanned data as many times as the span (see the sketch after this list).
4. A machine-learning-based classifier to classify the orientation of the table (horizontal/vertical) and the header row/column for that table.
5. A machine-learning-based classifier to classify the rows/columns that should be the "subject" of the relationship triple <subject> <predicate> <object>.
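As an illustration of step 3, here is a minimal sketch (assuming BeautifulSoup; the function name is mine) of flattening a `<table>` into a rectangular grid, repeating each cell's text across its colspan/rowspan extent:

```python
from bs4 import BeautifulSoup

def table_to_grid(table_html):
    """Flatten an HTML <table> into a rectangular list of lists,
    repeating each cell's text across its colspan/rowspan extent."""
    soup = BeautifulSoup(table_html, "html.parser")
    grid = []
    pending = {}  # (row, col) -> text pushed down by an earlier rowspan
    for r, tr in enumerate(soup.find_all("tr")):
        row = []
        cells = iter(tr.find_all(["td", "th"]))
        c = 0
        cell = next(cells, None)
        while cell is not None or (r, c) in pending:
            if (r, c) in pending:            # slot already claimed by a rowspan
                row.append(pending.pop((r, c)))
                c += 1
                continue
            text = cell.get_text(strip=True)
            colspan = int(cell.get("colspan", 1))
            rowspan = int(cell.get("rowspan", 1))
            for dc in range(colspan):
                row.append(text)
                for dr in range(1, rowspan): # copy into the rows below
                    pending[(r + dr, c + dc)] = text
            c += colspan
            cell = next(cells, None)
        grid.append(row)
    return grid

html = '<table><tr><td rowspan="2">A</td><td>B</td></tr><tr><td>C</td></tr></table>'
print(table_to_grid(html))   # [['A', 'B'], ['A', 'C']]
```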
The first classifier is a support vector machine that takes features like character count, the table-to-row cell-count ratio, the numbers-to-text ratio, capitalization, etc. We achieved about 80-85% on both precision and recall.
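To make this concrete, here is a hedged sketch of the kind of per-table feature vector I mean, using scikit-learn; the exact features and their implementation here are illustrative, not our production code:

```python
import numpy as np
from sklearn.svm import SVC

def orientation_features(grid):
    """Per-table features of the kind described above: average character
    count, numeric-vs-text ratio, and capitalization, computed for the
    first row and the first column so the classifier can compare them."""
    first_row = grid[0]
    first_col = [row[0] for row in grid]
    def stats(cells):
        avg_chars = sum(len(c) for c in cells) / len(cells)
        numeric = sum(c.replace(".", "").isdigit() for c in cells) / len(cells)
        caps = sum(c[:1].isupper() for c in cells) / len(cells)
        return [avg_chars, numeric, caps]
    return np.array(stats(first_row) + stats(first_col))

grid = [["Position", "Name"],
        ["Manager", "Arsène Wenger"],
        ["Assistant manager", "Steve Bould"]]
print(orientation_features(grid))

# With labelled tables (0 = horizontal, 1 = vertical):
# clf = SVC(kernel="rbf").fit(X_train, y_train)
```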
The second classifier is a random forest that takes features more related to the relevance of the cells inside a single row/column. We also achieved about 85% on both precision and recall.
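Similarly, a sketch of per-column features for the subject classifier, evaluated with precision/recall on a held-out labelled set; the feature choices here are my own illustrative guesses at the kind of "cell relevance" signals meant above:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score

def subject_column_features(grid, col):
    """Features for 'is this column the subject of the triples?'."""
    cells = [row[col] for row in grid[1:]]       # skip the header row
    distinct = len(set(cells)) / len(cells)      # subjects tend to be unique keys
    avg_len = sum(len(c) for c in cells) / len(cells)
    numeric = sum(c.replace(".", "").isdigit() for c in cells) / len(cells)
    return [distinct, avg_len, numeric, col]     # column position matters too

# With a labelled corpus of (grid, column, is_subject) examples:
# clf = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)
# y_pred = clf.predict(X_test)
# print(precision_score(y_test, y_pred), recall_score(y_test, y_pred))
```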
Some other refinement components and heuristics were involved in the process to make the output cleaner and more related to the context of the table.
In general, no additional Wikipedia-specific data was used, so that the tool stays general enough for any HTML table on the web; however, the classifiers' training data was mainly biased towards Wikipedia content!
I'll update this post with the source code once it's finalized.