I have this Database:
Clients => Incident => File => Filename
Clients have an ID Incidents have an ID and a reportedOn property Files have an ID and a fileSize, mimeType, malware property Filenames have an ID Client have a outgoing Edge to Incidents (reported), incident have a outgoing Edge to file (containsFile), file have a outgoing Edge to filename (hasName).
Here is some sample DATA:
g.addV('client').property('id','1').as('1').
addV('incident').property('id','11').property('reportedON', '2/15/2019 8:01:19 AM').as('11').
addV('file').property('id','100').property('fileSize', '432534').as('100').
addV('fileName').property('id','file.pdf').as('file.pdf').
addE('reported').from('1').to('11').
addE('containsFile').from('11').to('100').
addE('hasName').from('100').to('file.pdf').iterate()
In the C# Code below I am checking every fileName in the Database for special fileextensions. After that I take the fileNames which have these special fileextensions to get all their values and the vertices around them with their values in the second query which is in the foreachloop:
var resultSet = await SubmitQueryAsync("g.V().hasLabel('fileName')");
if (resultSet.Length > 0)
{
foreach (var result in resultSet)
{
JObject jsonData = result;
string fileId = jsonData["Id"].Value<string>();
string fileExtension = "";
string[] fileExtensions = { ".ace", ".arj", ".iso", ".rar", ".gz", ".acrj", ".lnk", ".z", ".tar", ".xz" };
HashSet<string> hSet = new HashSet<string>(fileExtensions);
if (fileId.Contains("."))
{
fileExtension = fileId.Substring(fileId.LastIndexOf('.'));
}
if (hSet.Contains(fileExtension))
{
var resultSet2 = await SubmitQueryAsync("g.V().has(id, '" + fileId + "').as('FILENAME').in('hasName').as('FILE').in('containsFile').as('INCIDENT').select('FILE').valueMap().as('FILEVALUES').select('INCIDENT').valueMap().as('INCIDENTVALUES').select('FILE', 'FILEVALUES', 'FILENAME', 'INCIDENTVALUES')");
list = FillList(list, resultSet2);
}
}
}
So for every fileName which have one of the special fileextensions I am executing one query in the foreachloop. The problem is that this are too many queries for the database. So how can I get this more efficient?
The first thing you probably need to do is change your data model and include a "ext" (i.e. "fileExtension") property on "fileName" so that you can search on it easily (I don't think CosmosDB supports TextP
or similar options yet for text searches), thus:
g.addV('client').property('id','1').as('1').
addV('incident').property('id','11').property('reportedON', '2/15/2019 8:01:19 AM').as('11').
addV('file').property('id','100').property('fileSize', '432534').as('100').
addV('fileName').property('id','file.pdf').property('ext','.pdf').as('file.pdf').
addE('reported').from('1').to('11').
addE('containsFile').from('11').to('100').
addE('hasName').from('100').to('file.pdf').iterate()
Then, it's pretty simple to roll all of that C# into a single Gremlin traversal:
gremlin> g.V().has('fileName','ext',within(".ace", ".arj", ".iso", ".rar", ".gz", ".acrj", ".lnk", ".z", ".tar", ".xz", ".pdf")).as('FILENAME').
......1> in('hasName').as('FILE').
......2> in('containsFile').as('INCIDENT').
......3> select('FILE').valueMap().as('FILEVALUES').
......4> select('INCIDENT').valueMap().as('INCIDENTVALUES').
......5> select('FILE', 'FILEVALUES', 'FILENAME', 'INCIDENTVALUES')
==>[FILE:v[5],FILEVALUES:[fileSize:[432534],id:[100]],FILENAME:v[8],INCIDENTVALUES:[reportedON:[2/15/2019 8:01:19 AM],id:[11]]]
Note that I added ".pdf" to your list of "extensions" so that it would return a result given your sample data. Aside from that, I think your query is really more complex that it should be - let's try to simplify because all the step labeling makes this hard to follow. I'd prefer some use of project()
:
gremlin> g.V().has('fileName','ext',within(".ace", ".arj", ".iso", ".rar", ".gz", ".acrj", ".lnk", ".z", ".tar", ".xz", ".pdf")).
......1> project('FILE','FILEVALUES','FILENAME','INCIDENTVALUES').
......2> by(__.in('hasName')).
......3> by(__.in('hasName').valueMap()).
......4> by().
......5> by(__.in('hasName').in('containsFile').valueMap())
==>[FILE:v[5],FILEVALUES:[fileSize:[432534],id:[100]],FILENAME:v[8],INCIDENTVALUES:[reportedON:[2/15/2019 8:01:19 AM],id:[11]]]
which Then makes me realize that "FILE" and "FILEVALUES" are basically the same thing and can be combined:
gremlin> g.V().has('fileName','ext',within(".ace", ".arj", ".iso", ".rar", ".gz", ".acrj", ".lnk", ".z", ".tar", ".xz", ".pdf")).
......1> project('FILEVALUES','FILENAME','INCIDENTVALUES').
......2> by(__.in('hasName').valueMap(true)).
......3> by().
......4> by(__.in('hasName').in('containsFile').valueMap())
==>[FILEVALUES:[id:5,fileSize:[432534],id:[100],label:file],FILENAME:v[8],INCIDENTVALUES:[reportedON:[2/15/2019 8:01:19 AM],id:[11]]]
I don't like that we traverser in('hasName')
twice so:
gremlin> g.V().has('fileName','ext',within(".ace", ".arj", ".iso", ".rar", ".gz", ".acrj", ".lnk", ".z", ".tar", ".xz", ".pdf")).
......1> project('FILEVALUES','FILENAME').
......2> by(__.in('hasName').
......3> project('FILE','INCIDENT').
......4> by(valueMap(true)).
......5> by(__.in('containsFile').valueMap())).
......6> by()
==>[FILEVALUES:[FILE:[id:5,fileSize:[432534],id:[100],label:file],INCIDENT:[reportedON:[2/15/2019 8:01:19 AM],id:[11]]],FILENAME:v[8]]
but that changes the structure of your returned result a little bit. I suppose that could be flattened back to what you had with more transformations, but I'm not sure you're worried about that. I'm just trying to help make the query more readable at this point.