I am working on getting a textsum implementation running and recently fed in my own scraped data. I started training last night against 38,000 articles. This morning the average loss was around 5.2. When I was playing with the textsum toy set, I was able to quickly get down to around 0.0000054, but that was against only about 20 articles.
I was hoping that someone with a bit more experience might be able to give me a sense of how long training should take. I am currently running this on an Nvidia 980M. Last week I did try out an AWS g2.2xlarge instance, but ironically my local machine seemed to process things faster than the GRID K520s. I still want to test out the P2 instances and Google Cloud, but for now I think I am just going to work with my local machine.
Any info anyone can provide here about what I should expect would be appreciated. Thanks!
So I'm just going to answer this myself since I can pretty much do so at this point. One thing I found interesting from another post is that with a large dataset you really shouldn't train below an 'average loss' of about 1, because past that point you start overfitting. In my current training against 40k articles on my laptop's Nvidia 980M, with a vocab file of 65,997 words, it has taken about a day on average to drop the 'average loss' by a whole number. Currently I am seeing numbers around 1.2 to 2.8.
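For context on what that 'average loss' number represents: as far as I can tell, textsum reports an exponentially decayed running average of the per-step training loss rather than the raw loss, which is why the number moves over hours rather than steps. Here is a minimal sketch of that kind of running average; the decay value of 0.999 and the function name are my assumptions for illustration, not something I've verified against the repo:

```python
def running_avg_loss(loss, avg_loss, decay=0.999):
    """Exponentially decayed running average of the training loss.

    A decay close to 1.0 means each new step barely moves the number,
    so the reported 'average loss' drops slowly even as training progresses.
    """
    if avg_loss == 0:  # first step: seed with the raw loss
        return loss
    return avg_loss * decay + (1 - decay) * loss


# Hypothetical usage inside a training loop with dummy per-step losses:
avg = 0.0
for step, step_loss in enumerate([5.3, 5.1, 4.9, 5.0]):
    avg = running_avg_loss(step_loss, avg)
    print(step, round(avg, 4))
```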
------ Edit ------
When I ran decode against the data with my avg loss at the numbers above, the results were pretty bad. After thinking about this more, I realized that my dataset is probably not a "large" dataset. Those like Xin Pan and others who have access to the Gigaword dataset are training against 1 million+ articles, so my 40k articles are nothing in comparison. Also, when the statement above was made, I'm not sure whether he meant an average loss of 1 or 0.01. Either way, I am now using TensorBoard to try to visualize "overfitting", and I am continuing my training until I get a lower avg loss. I will add to this at a later time when my results are better.
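Since I mentioned TensorBoard: the way I'm trying to spot overfitting is by comparing the training loss curve against a held-out eval loss curve, since a fixed "don't go below X" threshold doesn't really transfer across dataset sizes. A rough sketch of writing both curves as scalar summaries (TF 1.x style; the log directories, tag name, and loss values below are placeholders, not values from my run):

```python
import tensorflow as tf  # TensorFlow 1.x API

# Separate writers so TensorBoard can overlay the train and eval curves.
train_writer = tf.summary.FileWriter('log_root/train')
eval_writer = tf.summary.FileWriter('log_root/eval')

def log_loss(writer, tag, value, step):
    """Write a single scalar point that TensorBoard can plot."""
    summary = tf.Summary(value=[tf.Summary.Value(tag=tag, simple_value=value)])
    writer.add_summary(summary, step)
    writer.flush()

# Placeholder values: eval loss climbing while train loss keeps falling
# is the pattern that suggests overfitting.
log_loss(train_writer, 'running_avg_loss', 1.8, step=40000)
log_loss(eval_writer, 'running_avg_loss', 2.6, step=40000)
```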
Hopefully this gives a bit of a reference point for those of you wondering the same thing.