I have a DataTable that contains over 300,000 records, which amounts to over 2,000,000 words. I have an algorithm that extracts all the words from each record and it works correctly, but it feels a little slow.
I want to speed this program up.
I searched a few topics about making it faster, but I am not sure which way to go: keep checking whether the table already contains each word, make the keyword field a primary key (I don't know whether that would even work), or collect all the words without any checking and apply a Distinct at the end (see the sketch after my code below).
Can you give me some advice? Thanks.
The program works like this:
DataTable dtProducts = getProducts();
DataTable dt = new DataTable(); // collects the unique keywords
bool contains;
string[] keys;
dt.Columns.Add("keyword", typeof(string));
for (int i = 0; i < dtProducts.Rows.Count; i++) // 114040 rows
{
    keys = GetWords(dtProducts.Rows[i]["name"].ToString().Trim());
    foreach (string key in keys)
    {
        DataRow dr = dt.NewRow();
        dr["keyword"] = key;
        // linear scan over every keyword added so far
        contains = dt.AsEnumerable().Any(row => key == row.Field<string>("keyword"));
        if (!contains)
        {
            dt.Rows.Add(dr);
        }
    }
}
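For example, the "collect everything and Distinct afterwards" variant I mentioned would look roughly like this (just a sketch reusing my existing GetWords helper and the same dtProducts/dt tables):

// sketch: gather every word first, then de-duplicate once with Distinct
var allKeys = dtProducts.AsEnumerable()
    .SelectMany(row => GetWords(row["name"].ToString().Trim()))
    .Distinct();
foreach (string key in allKeys)
{
    DataRow dr = dt.NewRow();
    dr["keyword"] = key;
    dt.Rows.Add(dr);
}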
Probably the most efficient way is to filter in the database and not load everything into memory first. But you can also improve performance on the C# side by not loading it all into a DataTable and using a SqlDataReader and a HashSet<string> instead:
private static readonly char[] WordSeparator = { ' ', '\t', ',', '.', ':' }; // add more separators as needed
// ....
HashSet<string> allUniqueWords = new HashSet<string>();
using (var con = new SqlConnection(Properties.Settings.Default.ConnectionString))
using (var cmd = new SqlCommand("SELECT DISTINCT ColumnName FROM dbo.TableName", con))
{
    con.Open();
    using (var rd = cmd.ExecuteReader())
    {
        while (rd.Read()) // stream row by row instead of filling a DataTable
        {
            string[] words = rd.GetString(0).Split(WordSeparator, StringSplitOptions.RemoveEmptyEntries);
            foreach (string word in words)
                allUniqueWords.Add(word); // HashSet.Add ignores duplicates; lookup is O(1)
        }
    }
}
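If you want to push even more of the work into the database (the "filter in the database" option mentioned above) and you are on SQL Server 2016 or later, you could let STRING_SPLIT produce the distinct words directly. This is only a sketch with the same dbo.TableName/ColumnName placeholders: STRING_SPLIT takes a single-character separator, so the other separators are normalised to spaces with REPLACE first (a tab would need the same treatment).

// sketch: de-duplicate words in SQL Server (requires compatibility level 130+ for STRING_SPLIT)
string sql = @"
    SELECT DISTINCT s.value
    FROM dbo.TableName t
    CROSS APPLY STRING_SPLIT(
        REPLACE(REPLACE(REPLACE(t.ColumnName, ',', ' '), '.', ' '), ':', ' '), ' ') AS s
    WHERE s.value <> ''";

HashSet<string> allUniqueWords = new HashSet<string>();
using (var con = new SqlConnection(Properties.Settings.Default.ConnectionString))
using (var cmd = new SqlCommand(sql, con))
{
    con.Open();
    using (var rd = cmd.ExecuteReader())
    {
        while (rd.Read())
            allUniqueWords.Add(rd.GetString(0)); // each row is already a distinct word
    }
}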