Search code examples
c#algorithmperformancelinqdatatable

C# DataTable Filter Fastest Way


I have a datatable that include over 300.000 records and this equals over 2.000.000 words. I have an algorithm that get all words from each records and working well but I feel it is looks like little slow.

I want to speed up this program.

Program works like that;

  1. Getting all records from database to datatable.
  2. Loop datatable and get all words from each record.
  3. For each word check new datatable contains that word, if not add to new datatable.

I search some titles about speed up this program.

  1. using foreach or for ? (I think, there is no difference)
  2. for checking datatable has words or not, using DataTable.Select() or Linq (actually Linq is faster)

But I worry about that check each word contains or make a word field primary (I don't know it works or not) or getting all words without checking and after all works using a distinc.

Can you give me some advice. Thanks.

        DataTable dtProducts = getProducts();
        bool contains;
        string[] keys;

        dt.Columns.Add("keyword", typeof(string));

        for (int i = 0; i < dtProducts.Rows.Count; i++) //114040
        {
            keys = GetWords(dtProducts.Rows[i]["name"].ToString().Trim());

            foreach (string key in keys)
            {
                DataRow dr = dt.NewRow();
                dr["keyword"] = key;

                contains = dt.AsEnumerable().Any(row => key == row.Field<string>("keyword"));

                if (!contains)
                {
                    dt.Rows.Add(dr);
                }
            }
        }

Solution

  • Probably the most efficient way is to filter in the database and not to load all into memory first. But you can also improve performance by not loading all into a DataTable but if you use a SqlDataReader and a HashSet<string> instead:

    private static readonly char[] WordSeparator = { ' ', '\t', ',', '.', ':' }; // to be continued
    // ....
    
    
    HashSet<string> allUniqueWords = new HashSet<string>();
    using (var con = new SqlConnection(Properties.Settings.Default.ConnectionString))
    using (var cmd = new SqlCommand("SELECT DISTINCT ColumnName FROM dbo.TableName", con))
    {
        con.Open();
        using (var rd = cmd.ExecuteReader())
        {
            string[] words = rd.GetString(0).Split(WordSeparator, StringSplitOptions.RemoveEmptyEntries);
            foreach (string word in words)
                allUniqueWords.Add(word);
        }
    }