c#database entity-framework .net-core many-to-many

How to prevent duplicate insert into many-to-many tables?

I have Article and Tag entites. Articles have Tags as well as Tags have Articles.

At first my entities were look like this:

public class Article : AuditableBaseEntity
{
    public string Title { get; set; }

    public string Content { get; set; }

    public virtual ICollection<ArticleTag> ArticleTags { get; set;} 
}

public class Tag : AuditableBaseEntity
{
    public string Name { get; set; }

    public virtual ICollection<ArticleTag> ArticleTags { get; set;} 
}

And I had this method for adding existing tag to an existing article:

public async Task<Article> AddExistingTagToArticle(Article article, Tag tag, CancellationToken cancellationToken)
    {
        ArticleTag articleTag = new(article, tag);

        article.ArticleTags.Add(articleTag);
        tag.ArticleTags.Add(articleTag);

        _context.ArticleTags.Add(articleTag);

        await _context.SaveChangesAsync(cancellationToken);

        return article;
    }

But later, when I try to add the same ArticleTag again, I got "can not track because another instance of ArticleTag with the same Id already already being track..." error.

But I was expecting a DbUpdate error for violating PK instead.

Later I decided to turn ICollection<ArticleTag> ArticleTags into HashSet<ArticleTag> ArticleTags so that I cannot add a duplicate record at c# level and I thought it is better than expecting an exception from db.

And so, I changed AddExistingTagToArticle method to this:

public async Task<Article> AddExistingTagToArticle(Article article, Tag tag, CancellationToken cancellationToken)
    {
        ArticleTag articleTag = new(article, tag);

        bool isAlreadyExists = !article.ArticleTags.Add(articleTag);

        if(isAlreadyExists){
             return article;
        }

        tag.ArticleTags.Add(articleTag);

        _context.ArticleTags.Add(articleTag);

        await _context.SaveChangesAsync(cancellationToken);

        return article;
    }

This solves my concern but am I doing it right? Is there a design issue? What are you doing in such case?

Solution

Depending on the completion state of the entities in question, EF may or may not give you the "instance already tracked" error vs. a Db exception on SaveChanges.

A Hashset for the many-to-many relationship is fine, in many cases the Hashset is recommended for collections to provide just that kind of first line of validation, but it will only work so long as the collection of ArticleTags is populated at the time you ad them. Otherwise you would end up with the Db Exception on SaveChanges when it tries to insert the duplicate.

When working with entities and DbContexts/DbSets, arguably the biggest factor for encountering problems is the lifespan of the DbContext and the entities themselves. For instance working with entities that out-live the DbContext that read them (I.e. Detached entities) brings a considerable amount of complexity with needing to re-attach the entities and deal with stale scenarios. Long-lived DbContexts end up becoming stale or bloated with tracked references.

For applications I like to keep my entity use short, and rely on projection to get out data that my consumers (i.e. views) need, then fetch the entities as needed, and with required relationships populated in order to perform the update. For instance if I have an MVC View/Controller to review an Article and that article displays tags and has an action to add a tag which I want persisted to the DB immediately, I might have something like:

public async Task<AddResult> AddExistingTagToArticle(int articleId, int tagId, CancellationToken cancellationToken)
{
    try
    {
        var article = await _context.Articles
            .Include(x => x.ArticleTags)
            .SingleAsync(x => x.ArticleId == articleId);

       if (article.ArticleTags.Any(x => x.TagId == tagId))
           return AddResult.Failure("Tag is already assigned to this article.");

       var tag = await _context.Tags.SingleAsync(x => x.TagId == tagId);
       article.ArticleTags.Add( new (article, tag));
       await _context.SaveChangesAsync();
    
       return AddResult.Success("Tag added to article.");
    }
    catch(Exception ex)
    {
        var logId = _logger.Exception(ex);
        return AddResult.Failure($"Tag could not be added to article. (Ref #{logId}");
    }
}

This assumes a more async operation that the client code would then refresh the UI or take some action to reload in the event that the client's data is probably quite stale. It also represents a more compact call, passing just IDs rather than entire entities/view models. The response DTO can convey whether the operation was successful or not, and any details that the client code may need to convey that result. For instance if the tag is already associated and the client-side code would have actively prevented that, it may have been added by another user so we might want to suggest they refresh their view.

One issue with passing entities around is that there is no guarantee about the completion state of those entities, and in order for them to be reliable they need to be tracked by the DbContext. That introduces risks of unintended changes slipping in, or "poisoning" an update/insert operation because a call to SaveChanges updates all modified, tracked entities. We also don't need to worry about references to other attached or detached references. For instance I can associate the Tag to the loaded Article, I don't need to worry about Tag.ArticleTags or an ArticleTags DbSet. Operations should request the current Db state to ensure they take into account all changes across all user sessions.

If your entities are in a transitionary state where you might be waiting on a user to click a "Save" button before attempting to save, but you want to support an atomic "Add Tag" operation that doesn't rely on waiting for an overall "Save", then that could risk introducing a side-effect that any changes to the Article might be attempted to be saved when the user adds a tag. For this reason I recommend using things like POCO view models for views and loading entities as needed, and with the relationships needed for individual operations, and keeping them loaded only as long as needed. A big draw to passing entities around is the consideration to avoid round trips to the DB, however retrieving entities by ID (with only related data that is absolutely necessary) is quite fast, and it guarantees that requests are dealing with current state with info the code can use to provide useful feedback to the user and avoid stale data overwrites or obscure exception results.