Search code examples
c#.netexcellinqepplus

Stripping data from a EPPlus output, from a date range


Quick overview: The main goal is to read in the data from a set date, from the row and get the ref number form the set date e.g. Start date.

For example if i just wanted the data from the date set to the 1st of last month and upwards.

I currently have to extract some data from the excel spreadsheet example below:

Start date  Ref number
29/07/2015  2342326
01/07/2016  5697455
02/08/2016  3453787
02/08/2016  5345355
02/08/2015  8364456
03/08/2016  1479789
04/07/2015  9334578

enter image description here

Output using EPPlus:

29/07/2015
2342326
29/07/2016
5697455
02/08/2016
3453787
02/08/2016
5345355
02/08/2015
8364456
03/08/2016
1479789
04/07/2015
9334578

This part is fine, but when i try to strip the output via a date range i'm getting errors, e.g. using LINQ i get the following error output.

An unhandled exception of type 'System.InvalidCastException' occurred in System.Data.DataSetExtensions.dll

Additional information: Specified cast is not valid.

LINQ code:

var rowsOfInterest = tbl.AsEnumerable()
.Where(row => row.Field<DateTime>("Start date") >= new DateTime(2016, 7, 1))
.ToList();

I've also tried to modify the from date range using the datatable:

DataRow[] result = tbl.Select("'Start date' >= #1/7/2016#");

But get the following error:

An unhandled exception of type 'System.Data.EvaluateException' occurred in System.Data.dll

Additional information: Cannot perform '>=' operation on System.String and System.Double.

The last attempt was to try and see if i can strip the date from within the loop.

Code used:

DateTime dDate;
row[cell.Start.Column - 1] = cell.Text;
string dt = cell.Text.ToString();

if (DateTime.TryParse(dt, out dDate))
{
    DateTime dts = Convert.ToDateTime(dt);
}

DateTime date1 = new DateTime(2016, 7, 1);

if (dDate >= date1)
{
    Console.WriteLine(row[cell.Start.Column - 1] = cell.Text);
}

This sort of works but just lists the set dates and not there values, which is understandable, If i take this route how would i get the dates with there values?

Output:

29/07/2016
02/08/2016
02/08/2016
03/08/2016

The full code example used:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Data.OleDb;
using System.Text.RegularExpressions;
using Microsoft.Office.Interop.Excel;
using System.Data;
using System.IO;

namespace Number_Cleaner
{
    public class NumbersReport
    {

        //ToDo: Look in to fixing the code so it filters the date correctly with the right output data.
        public System.Data.DataTable GetDataTableFromExcel(string path, bool hasHeader = true)
        {
            using (var pck = new OfficeOpenXml.ExcelPackage())
            {
                using (var stream = File.OpenRead(path))
                {
                    pck.Load(stream);
                }
                var ws = pck.Workbook.Worksheets.First();
                System.Data.DataTable tbl = new System.Data.DataTable();
                foreach (var firstRowCell in ws.Cells[1, 1, 1, ws.Dimension.End.Column])
                {
                    tbl.Columns.Add(hasHeader ? firstRowCell.Text : string.Format("Column {0}", firstRowCell.Start.Column));
                }
                var startRow = hasHeader ? 2 : 1;
                for (int rowNum = startRow; rowNum <= ws.Dimension.End.Row; rowNum++)
                {
                    var wsRow = ws.Cells[rowNum, 1, rowNum, ws.Dimension.End.Column];
                    DataRow row = tbl.Rows.Add();
                    foreach (var cell in wsRow)
                    {

                        DateTime dDate;
                        row[cell.Start.Column - 1] = cell.Text;
                        string dt = cell.Text.ToString();
                        //Console.WriteLine(dt);

                        if (DateTime.TryParse(dt, out dDate))
                        {
                            DateTime dts = Convert.ToDateTime(dt);
                        }

                        DateTime date1 = new DateTime(2016, 7, 1);

                        if (dDate >= date1)
                        {
                            Console.WriteLine(row[cell.Start.Column - 1] = cell.Text);
                        }

                        //Console.WriteLine(row[cell.Start.Column - 1] = cell.Text);
                    }
                }
                //var rowsOfInterest = tbl.AsEnumerable()
                 // .Where(row => row.Field<DateTime>("Start date") >= new DateTime(2016, 7, 1))
                 //.ToList();
                //Console.WriteLine(tbl);
                //DataRow[] result = tbl.Select("'Start date' >= #1/7/2016#");

                return tbl;
            }
       }

Modified from: How to match date to row then get the final column value using EPPlus?


Solution

  • Based on your code, you are storing everything in your DataTable as strings by calling cell.Text. But using that you are loosing valuable information - the cell data type. You are much better off using cell.Value which will either be a string or a double. With Excel, dates, integers, and decimal values are all stored as doubles.

    The error you are seeing has to do with the fact that you store the values as string but query them like DateTime here:

    .Where(row => row.Field<DateTime>("Start date") >= new DateTime(2016, 7, 1))
    

    and here:

    "'Start date' >= #1/7/2016#"
    

    If you look at my post here: How to parse excel rows back to types using EPPlus you will see the helper function ConvertSheetToObjects which deals pretty much with what you are trying to do. With a little modification we can turn it into something that takes a WorkSheet and converts it to a DataTable. Like the Object converstion method you should still provide it the expected structure in the form as a DataTable passed it rather then having it try to guess it by casting cell values:

    public static void ConvertSheetToDataTable(this ExcelWorksheet worksheet, ref DataTable dataTable)
    {
        //DateTime Conversion
        var convertDateTime = new Func<double, DateTime>(excelDate =>
        {
            if (excelDate < 1)
                throw new ArgumentException("Excel dates cannot be smaller than 0.");
    
            var dateOfReference = new DateTime(1900, 1, 1);
    
            if (excelDate > 60d)
                excelDate = excelDate - 2;
            else
                excelDate = excelDate - 1;
            return dateOfReference.AddDays(excelDate);
        });
    
        //Get the names in the destination TABLE
        var tblcolnames = dataTable
            .Columns
            .Cast<DataColumn>()
            .Select(dcol => new {Name = dcol.ColumnName, Type = dcol.DataType})
            .ToList();
    
        //Cells only contains references to cells with actual data
        var cellGroups = worksheet.Cells
            .GroupBy(cell => cell.Start.Row)
            .ToList();
    
        //Assume first row has the column names and get the names of the columns in the sheet that have a match in the table
        var colnames = cellGroups
            .First()
            .Select((hcell, idx) => new { Name = hcell.Value.ToString(), index = idx })
            .Where(o => tblcolnames.Select(tcol => tcol.Name).Contains(o.Name))
            .ToList();
    
    
        //Add the rows - skip the first cell row
        for (var i = 1; i < cellGroups.Count(); i++)
        {
            var cellrow = cellGroups[i].ToList();
            var tblrow = dataTable.NewRow();
            dataTable.Rows.Add(tblrow);
    
            colnames.ForEach(colname =>
            {
                //Excel stores either strings or doubles
                var cell = cellrow[colname.index];
                var val = cell.Value;
                var celltype = val.GetType();
                var coltype = tblcolnames.First(tcol => tcol.Name ==  colname.Name).Type;
    
                //If it is numeric it is a double since that is how excel stores all numbers
                if (celltype == typeof(double))
                {
                    //Unbox it
                    var unboxedVal = (double)val;
    
                    //FAR FROM A COMPLETE LIST!!!
                    if (coltype == typeof (int))
                        tblrow[colname.Name] = (int) unboxedVal;
                    else if (coltype == typeof (double))
                        tblrow[colname.Name] = unboxedVal;
                    else
                        throw new NotImplementedException($"Type '{coltype}' not implemented yet!");
                }
                else if (coltype == typeof (DateTime))
                {
                    //Its a date time
                    tblrow[colname.Name] = val;
                }
                else if (coltype == typeof (string))
                {
                    //Its a string
                    tblrow[colname.Name] = val;
                }
                else
                {
                    throw new DataException($"Cell '{cell.Address}' contains data of type {celltype} but should be of type {coltype}!");
                }
            });
    
        }
    
    }
    

    To use it on something like this:

    enter image description here

    You would run this:

    [TestMethod]
    public void Sheet_To_Table_Test()
    {
        //https://stackoverflow.com/questions/38915006/stripping-data-from-a-epplus-output-from-a-date-range
    
        //Create a test file
        var fi = new FileInfo(@"c:\temp\Sheet_To_Table.xlsx");
    
        using (var package = new ExcelPackage(fi))
        {
            var workbook = package.Workbook;
            var worksheet = workbook.Worksheets.First();
    
            var datatable = new DataTable();
            datatable.Columns.Add("Col1", typeof(int));
            datatable.Columns.Add("Col2", typeof(string));
            datatable.Columns.Add("Col3", typeof(double));
            datatable.Columns.Add("Col4", typeof(DateTime));
    
            worksheet.ConvertSheetToDataTable(ref datatable);
    
            foreach (DataRow row in datatable.Rows)
                Console.WriteLine(
                    $"row: {{Col1({row["Col1"].GetType()}): {row["Col1"]}" +
                    $", Col2({row["Col2"].GetType()}): {row["Col2"]}" +
                    $", Col3({row["Col3"].GetType()}): {row["Col3"]}" +
                    $", Col4({row["Col4"].GetType()}):{row["Col4"]}}}");
    
            //To Answer OP's questions
            datatable
                .Select("Col4 >= #01/03/2016#")
                .Select(row => row["Col1"])
                .ToList()
                .ForEach(num => Console.WriteLine($"{{{num}}}"));
        }
    }
    

    Which gives this in the output:

    row: {Col1(System.Int32): 12345, Col2(System.String): sf, Col3(System.Double): 456.549, Col4(System.DateTime):1/1/2016 12:00:00 AM}
    row: {Col1(System.Int32): 456, Col2(System.String): asg, Col3(System.Double): 165.55, Col4(System.DateTime):1/2/2016 12:00:00 AM}
    row: {Col1(System.Int32): 8, Col2(System.String): we, Col3(System.Double): 148.5, Col4(System.DateTime):1/3/2016 12:00:00 AM}
    row: {Col1(System.Int32): 978, Col2(System.String): wer, Col3(System.Double): 668.456, Col4(System.DateTime):1/4/2016 12:00:00 AM}
    {8}
    {978}