c# sql-server database-design database-normalization

database design to count distinct values


I want to keep track of different aspects in my application. These aspects are reported by the client to the server. The client reports data in the following format:

AspectReport
  ClientId int, //client that reported the aspect
  AspectId int, //aspect that we got a value for
  Value string  //arbitrary string

The same aspect can be reported several times with the same or different value.

On the server side I only want to keep track of unique values per client and aspect. Thus I need to be able to store which values have been reported, and by which client.

A normalized structure would probably look like:

Aspect
  Id
  Name

// used to store all reported values for 
// the aspect, no matter which client reported it
// (each client can report a subset of all values)
ApplicationAspectValues   
  Id
  AspectId 
  Value

// Which unique values a specific client reported
ClientAspectValues
  Id
  ClientId
  AspectId
  Value

There can be thousands of clients (our largest customer has 20,000 clients to date), each of which can report thousands of values per aspect for a handful of aspects.

Writes are a lot more frequent than reads. I expect a couple of hundred reported aspects per second in the long run. Reads happen at most a couple of times per day (for a client).

The aspect values themselves will rarely be read, only to show the reported values when analyzing the system in more depth (in a web app). Instead, the values are typically used to calculate the percentage of values reported for each aspect and client (ClientValues.Count / ApplicationValues.Count) and to make sure that only unique values are stored.
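A minimal sketch of that percentage calculation, written in Python purely for illustration. The function name `coverage_percent` and the choice to return 0.0 when no application values exist yet are assumptions, not part of the design above.

```python
def coverage_percent(client_values, application_values):
    """Percentage of an aspect's known values that one client has reported.

    Mirrors ClientValues.Count / ApplicationValues.Count from the text.
    Both arguments are iterables of strings; duplicates are ignored.
    """
    application = set(application_values)
    if not application:
        # Assumption: with no known values for the aspect, report 0 percent
        # instead of dividing by zero.
        return 0.0
    # A client can only report a subset of the application-wide values,
    # so the intersection guards against stray entries.
    client = set(client_values) & application
    return 100.0 * len(client) / len(application)
```

For example, a client that has reported "a" and "b" (with "a" reported twice) out of four known values gets 50 percent.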

Following 3NF would mean that several rows are both queried and updated each time a new aspect value is received, which seems a bit inefficient.

Is there a better design that allows:

  • Optimized writes
  • Making sure that only unique values are stored
  • Occasionally retrieving all values for a client/aspect (to be displayed to the user)

My naive implementation is to do the duplicate check in code and then store a gzipped, comma-separated string with all values in a column in ApplicationAspectValues and ClientAspectValues, along with a counter field.

Like this:

ClientAspectValues
  Id
  ClientId
  AspectId
  NumberOfUniqueValues
  Values

In that way there is only one row used to check for duplicates and to update reported values. And for most read scenarios the values are never fetched.
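The single-row approach described above could be sketched roughly as follows, in Python for illustration only. The helper names (`pack_values`, `unpack_values`, `add_value`) are hypothetical, and the sketch assumes that values are non-empty and contain no commas, which the comma-separated format cannot represent safely.

```python
import gzip


def pack_values(values):
    """Serialize a set of strings as a gzipped, comma-separated blob."""
    return gzip.compress(",".join(sorted(values)).encode("utf-8"))


def unpack_values(blob):
    """Inverse of pack_values; None or an empty blob means no values yet."""
    if not blob:
        return set()
    text = gzip.decompress(blob).decode("utf-8")
    return set(text.split(",")) if text else set()


def add_value(blob, value):
    """Do the duplicate check in code and return the updated row fields.

    Returns (new_blob, number_of_unique_values, was_new). The blob is
    rewritten only when the value has not been seen before.
    """
    if value == "" or "," in value:
        # Assumption: such values are rejected because the separator
        # would make them ambiguous when the blob is read back.
        raise ValueError("value must be non-empty and contain no commas")
    values = unpack_values(blob)
    was_new = value not in values
    if was_new:
        values.add(value)
        blob = pack_values(values)
    return blob, len(values), was_new
```

Note that every new value still requires reading, decompressing, and rewriting the whole blob, so the cost of a write grows with the number of values already stored.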

I'm certain that there are much better solutions?

This question is posted here since a solution combining code and DB is perfectly fine.

(we have several other features in our product that write to the DB at the same frequency, the optimization isn't just to accommodate this feature but for the complete picture)


Solution

  • I would use your 3NF design, but use the ClientAspectValue table to store all new writes, duplicates included.

    Then I'd create a batch job that regularly reads ClientAspectValue, extracts the unique values, and stores them in a dedicated table, ClientAspectUniqueValues.
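The write-then-batch idea can be sketched as follows. This is an illustration under stated assumptions, not the answerer's implementation: it uses Python with SQLite as a stand-in for SQL Server, the column layout and function names are hypothetical, and clearing processed rows from ClientAspectValue after each batch is an added assumption.

```python
import sqlite3


def create_schema(conn):
    conn.executescript("""
        -- Append-only table: every incoming report is a plain INSERT.
        CREATE TABLE ClientAspectValue (
            Id       INTEGER PRIMARY KEY AUTOINCREMENT,
            ClientId INTEGER NOT NULL,
            AspectId INTEGER NOT NULL,
            Value    TEXT    NOT NULL
        );
        -- Deduplicated table: the primary key enforces uniqueness.
        CREATE TABLE ClientAspectUniqueValues (
            ClientId INTEGER NOT NULL,
            AspectId INTEGER NOT NULL,
            Value    TEXT    NOT NULL,
            PRIMARY KEY (ClientId, AspectId, Value)
        );
    """)


def record_report(conn, client_id, aspect_id, value):
    """Hot path: no duplicate check, no read, just an insert."""
    conn.execute(
        "INSERT INTO ClientAspectValue (ClientId, AspectId, Value) VALUES (?, ?, ?)",
        (client_id, aspect_id, value),
    )


def count_unique(conn):
    return conn.execute("SELECT COUNT(*) FROM ClientAspectUniqueValues").fetchone()[0]


def run_batch(conn):
    """Fold pending reports into the unique table; return how many were new."""
    with conn:  # commit on success, roll back on error
        high = conn.execute("SELECT MAX(Id) FROM ClientAspectValue").fetchone()[0]
        if high is None:
            return 0  # nothing pending
        before = count_unique(conn)
        # Only rows up to the high-water mark are processed, so reports
        # arriving while the batch runs are left for the next run.
        conn.execute(
            "INSERT OR IGNORE INTO ClientAspectUniqueValues (ClientId, AspectId, Value) "
            "SELECT DISTINCT ClientId, AspectId, Value FROM ClientAspectValue WHERE Id <= ?",
            (high,),
        )
        # Assumption: processed rows are removed to keep the write table small.
        conn.execute("DELETE FROM ClientAspectValue WHERE Id <= ?", (high,))
        return count_unique(conn) - before
```

On SQL Server, `INSERT OR IGNORE` is not available; the same effect would need something like an `INSERT ... WHERE NOT EXISTS` or a `MERGE` statement. The trade-off of this design is that unique counts lag behind the latest reports by up to one batch interval.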