c# solr solrnet

Solr - store multiple Word documents for a single unique ID


We'd like to index and store a group of Word documents in Solr so that they appear as values of a single multivalued text field, with the content of each document stored as one value under a single index entry. In other words, the entry would look like this:

  • ID
    • abcdef [text of Word_1.docx]
    • xyzabc [text of Word_2.docx]
    • efghij [text of Word_3.docx]

We don't want each indexed document to have its own unique ID; a group of documents will be children of a particular ID, and there can be any number of documents for that ID. How can we do this?

UPDATE: Here's my C# code. How would I read multiple documents into this for the unique ID that is set with (++_count).ToString()?

using (FileStream fileStream = File.OpenRead(path))
{
    // Sends the file to Solr for extraction; every call creates an entry with a new ID.
    solr.Extract(
        new ExtractParameters(fileStream, (++_count).ToString())
        {
            ExtractFormat = ExtractFormat.Text,
            ExtractOnly = false,
            Fields = new List<ExtractField>()
            {
                new ExtractField("action", actionTo),
                new ExtractField("actiondate", actionDate),
                // "abstract" is a reserved word in C#, so the variable needs the @ prefix.
                new ExtractField("abstract", @abstract),
                new ExtractField("docval", docval),
                new ExtractField("documentgeo", documentgeo),
                new ExtractField("filename", filename),
                new ExtractField("isprimary", IsPrimary.ToString())
            },
            AutoCommit = true
        });
}

Solution

  • In your Solr schema, define two fields, id and text, and make text multivalued. Then, for each ID, aggregate the text of all its documents into a single SolrInputDocument and index that.

    <field name="id" type="int" multiValued="false" stored="true" indexed="true" />
    <field name="text" type="text" multiValued="true" stored="true" indexed="true" />
    

    I don't know the C# API, but with SolrJ it is fairly easy to aggregate: call SolrInputDocument.addField("fieldname", "value") once per value, and each call appends another value to that field.

    Example update

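    import org.apache.solr.common.SolrInputDocument;

    // documents holds the extracted text of every file that belongs to ID 1
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", 1);
    for (String docText : documents) {
        // each call appends another value to the multivalued "text" field
        doc.addField("text", docText);
    }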
    

    Example .NET update

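    I would define my class in the following way:

    using System.Collections.Generic;
    using SolrNet.Attributes;

    public class Document
    {
        [SolrUniqueKey("id")]
        public int Id { get; set; }

        [SolrField("text")]
        public ICollection<string> Texts { get; set; }
    }

    Then I would populate it and submit it with something like the following sketch. Here groupId and paths are placeholders for the shared ID and the files that belong to it, and the extraction step assumes that SolrNet's Extract with ExtractOnly = true returns the file's text in response.Content:

    var solr = ServiceLocator.Current.GetInstance<ISolrOperations<Document>>();
    var doc = new Document { Id = groupId, Texts = new List<string>() };
    foreach (string path in paths)
    {
        using (FileStream fileStream = File.OpenRead(path))
        {
            // ExtractOnly = true asks Solr for the extracted text without indexing the file itself.
            var response = solr.Extract(new ExtractParameters(fileStream, path)
            {
                ExtractOnly = true,
                ExtractFormat = ExtractFormat.Text
            });
            // Each file's text becomes one value of the multivalued "text" field.
            doc.Texts.Add(response.Content);
        }
    }
    solr.Add(doc);
    solr.Commit();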
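
When files for several IDs arrive mixed together, the aggregation step can be done first, before any Solr document is built. The following is a minimal sketch in plain Java with no Solr dependency; the class GroupTextsById, the holder ExtractedText, and the method groupByParentId are hypothetical names introduced only for illustration. Each resulting map entry would correspond to one Solr document: the key is the id, and every string in the list would be added as one value of the multivalued text field.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class GroupTextsById {

    /** Hypothetical holder: the text of one Word file and the ID it belongs to. */
    public static final class ExtractedText {
        final int parentId;
        final String text;

        public ExtractedText(int parentId, String text) {
            this.parentId = parentId;
            this.text = text;
        }
    }

    /**
     * Groups extracted texts by parent ID, preserving the order in which
     * IDs and texts were first seen.
     */
    public static Map<Integer, List<String>> groupByParentId(List<ExtractedText> extracted) {
        Map<Integer, List<String>> grouped = new LinkedHashMap<>();
        for (ExtractedText item : extracted) {
            // Create the list for this ID on first use, then append the text.
            grouped.computeIfAbsent(item.parentId, id -> new ArrayList<>()).add(item.text);
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<ExtractedText> extracted = Arrays.asList(
                new ExtractedText(1, "abcdef"),
                new ExtractedText(2, "efghij"),
                new ExtractedText(1, "xyzabc"));

        // Prints {1=[abcdef, xyzabc], 2=[efghij]}
        System.out.println(groupByParentId(extracted));
    }
}
```

Grouping first keeps the indexing loop simple: one pass over the map, one document per entry, and a single commit at the end.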