I've undertaken a small project recently where I have to filter an Excel files' table contents. I plan to use C# and EPPLUS for this project and convert the input Excel table into a Datatable to handle all the data querying.
However, the Excel files I've gotten from the client are of dubious formatting where the tables are all over the place on the sheet with filler descriptions and blank spaces in between. Each Excel file seems to have its own way of arranging the data too, the following being an example of what I've seen so far:
I've only done some experimenting with converting an Excel table to C#, but from what I've seen so far the system basically iterates row by row of data on a sheet starting from cell A1. Is there a way to detect a table during this iteration process, and separate them from the title/description/empty rows? I thought of simply checking for table borders to determine if there is a table or not, but with Scenario D some of the Excel files didn't even set table borders.
I plan to inform the client on this so we can agree on a standard formatting for the Excel file, but this is my first project working with interpreting Excel files so I also have a small meta question: Would this be the right thing to do? I'm assuming these different Excel formatting may be due to business choices (ie. from different departments), so for those who have encountered this sort of problem in the past would it be within the norm to convince the client to change their current methods or should I just go along with what they currently work with?
As mentioned in the comments, that is a real mess if the "tables" are just scattered clusters of cells. You would have to come up with some very elaborate AI to try to detect the various scenarios that are possible.
If it is not likley to get the client to alter their practice but they are willing to adjust just a little I would suggest using true ExcelTables
. With this, excel has done the work for you by way of the client or you applying the tables. For example, here are two random tables in a worksheet:
Note that I just copy/pasted the same cells BUT I then formatted both as tables via the button in the upper right corner. This does more then just make it look nice - it actually create an ExcelTable
object that you can directly reference in EPPlus. Here is something that will demo that:
public void Detect_Tables_Test()
{
//http://stackoverflow.com/questions/37901408/interpreting-an-excel-file-in-c-sharp
var fileInfo = new FileInfo(@"c:\temp\DetectTablesTest.xlsx");
using (var pck = new ExcelPackage(fileInfo))
{
var workbook = pck.Workbook;
var worksheet = workbook.Worksheets.First();
var tables = worksheet.Tables;
tables.ToList().ForEach(table =>
{
Console.WriteLine($"{{Name: {table.Name}, Address: {table.Address}, Columns: {table.Columns.Count}}}");
});
}
}
Which gives this in the output:
{Name: 'Table1', Address: 'G10:I20', Columns: 3}
{Name: 'Table2', Address: 'A1:C11', Columns: 3}
So now that you have the row/col addresses you can use that as reference to get the data, look for a description row above it, etc.