I'm trying to use Apache Solr as a full-text search engine in my .NET app (via SolrNet). My app has this data model:
class Document
{
public int Id { get; set; }
public string Name { get; set; }
public DateTime CreateDate { get; set;}
public Attach[] Attaches { get; set; }
}
class Attach
{
public int Id { get; set; }
public Document Parent { get; set; }
//files are stored in filesystem, only path stored in database!
public string FilePath { get; set; }
}
Now I'm trying to index these files (using Castle Windsor):
_container.AddFacility("solr",
new SolrNetFacility("http://localhost:8983/solr"));
var solr = _container.Resolve<ISolrOperations<Document>>();
solr.Delete(SolrQuery.All);
var conn = _container.Resolve<ISolrConnection>();
var docs = from o in Documents
where o.Attaches.Count > 0
select o;
foreach (var doc in docs)
{
foreach (var att in doc.Attaches)
{
try
{
var file = Directory.GetFiles("C:\\Attachments\\" + doc.Id );
foreach (var s in file)
{
var a = File.ReadAllText(s);
conn.Post("/update", a);
}
}
catch (Exception)
{
throw;
}
}
}
solr.Commit();
solr.BuildSpellCheckDictionary();
As the code shows, I look up the file paths and read the file contents directly from disk. But when I post a file's text to Solr, I receive this error:
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
<int name="status">400</int><int name="QTime">2</int>
</lst>
<lst name="error">
<str name="msg">Unexpected character 'Т' (code 1058 / 0x422) in prolog; expected '<'
at [row,col {unknown-source}]: [1,1]</str>
<int name="code">400</int>
</lst>
</response>
And I have these questions:
To answer your questions:
From your example code, it looks like you are interested in just indexing the plain text of the files. Based on that, I would create the following class for passing data to Solr.
public class IndexItem
{
[SolrField("id")]
public string Id { get; set; }
[SolrField("content")]
public string Content { get; set; }
}
Use this class to store the Id (must be a unique value) for each file that you read. The filename (also including the path) may be unique enough.
Change your example to the following:
_container.AddFacility("solr",
new SolrNetFacility("http://localhost:8983/solr"));
var solr = _container.Resolve<ISolrOperations<IndexItem>>();
solr.Delete(SolrQuery.All);
var docs = from o in Documents
where o.Attaches.Length > 0
select o;
foreach (var doc in docs)
{
foreach (var att in doc.Attaches)
{
try
{
var file = Directory.GetFiles("C:\\Attachments\\" + doc.Id );
foreach (var s in file)
{
var indexItem = new IndexItem();
// Directory.GetFiles returns full paths as strings, so the path itself serves as the unique id
indexItem.Id = s;
indexItem.Content = File.ReadAllText(s);
solr.Add(indexItem);
}
}
catch (Exception)
{
throw;
}
}
}
solr.Commit();
solr.BuildSpellCheckDictionary();
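For context on why the original conn.Post("/update", a) call failed: Solr's XML update handler expects an XML message, so the raw file text was rejected at its very first character ('Т' where '<' was expected). Going through solr.Add lets SolrNet build that message for you. The sketch below only illustrates the general shape of such an XML add message using System.Xml.Linq; the class name, method name, and sample path are hypothetical, and this is not SolrNet's actual serialization code.

```csharp
using System;
using System.Xml.Linq;

static class UpdateMessageSketch
{
    // Builds an XML <add> message of the kind Solr's XML update handler accepts.
    // XElement escapes special characters in the text, so arbitrary file content is safe.
    public static string BuildAddMessage(string id, string content)
    {
        var message = new XElement("add",
            new XElement("doc",
                new XElement("field", new XAttribute("name", "id"), id),
                new XElement("field", new XAttribute("name", "content"), content)));
        return message.ToString(SaveOptions.DisableFormatting);
    }

    static void Main()
    {
        // Hypothetical path and content, for illustration only.
        Console.WriteLine(BuildAddMessage(@"C:\Attachments\1\notes.txt", "Текст файла"));
    }
}
```

In practice you would not build this by hand; it is shown only to make clear what the update handler was expecting in place of plain text.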
If you need to index additional properties for each file (I noticed that the Document class above has Name and CreateDate), you can add them to the IndexItem class. You will just need to map each one to an appropriate Solr field. Please see the SolrNet Mapping page for more details.
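As a rough sketch of that extension, the class might look like the following. The Solr field names "name" and "createdate" are assumptions; they have to match fields that actually exist in your Solr schema. The snippet assumes the SolrNet package is referenced.

```csharp
using System;
using SolrNet.Attributes;

public class IndexItem
{
    [SolrField("id")]
    public string Id { get; set; }

    [SolrField("content")]
    public string Content { get; set; }

    // Assumed field name; it must be defined in your Solr schema.
    [SolrField("name")]
    public string Name { get; set; }

    // Assumed field name; it should map to a date field in your Solr schema.
    [SolrField("createdate")]
    public DateTime CreateDate { get; set; }
}
```

Inside the indexing loop you would then also set indexItem.Name = doc.Name and indexItem.CreateDate = doc.CreateDate before calling solr.Add(indexItem).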