Search code examples
mysqlcorpus

Populating sentences from large corpus table


I have five tables containing lexical data. I want to display sentences from corpus to the given Icelandic lemma (including all word forms). Using following approach, it takes 2 seconds to find 5 sentences. I am looking for a solution that can display all sentences available.

The expected result:

The list of sentences that contain all word forms of given lemma specified in the query.

Current result:

Current results returns only the sentences that match the word with the keyword in basic form:

 word_form  w_id    s_id    pos     sentence
 hest   11484   794930  1   Sentence 1. .....
 hest   11484   795623  12  Sentence 2 .....

Expected result:

 word_form  w_id    s_id    pos     sentence
 hest   11484   794930  1   Sentence 1. .....
 hest   11484   795623  12  Sentence 2 .....
 ...
 hestur ..   ..   ..  Sentence 13.
 hestur ..    ..  .. Sentence 14.
 ...
 hesti ..  ..  .. Sentence 21.
 ...

Proposed query with changes, but ends with error.

SELECT w0.keyword, w.word_form, w3.w_id, w4.s_id, w4.pos, s.sentence 
FROM `1_headword` w0
INNER JOIN `2_wordform` w ON w.keyword = w0.keyword
INNER JOIN `3_words` w3 ON w3.word = w.word_form
INNER JOIN `4_inv_w` w4 ON w4.w_id = w3.w_id 
INNER JOIN `5_sentences` s 
ON s.s_id = w4.s_id WHERE w0.keyword like 'hestur' group by w4.s_id

Notes: Keyword is one, the basic form - in this case "hestur". The word forms are in this case - "hest", "hesti", "hestar" (see the Insert table) etc. In other words, the query should take all wordform of given lemma and match it sentences in which the wordforms occur.

Update II. Few observation. 1.The following simplified query to receive w_id for all word forms returns rows with repeated w_id of the first word form. 2. The word forms can have several rows in 3_words table.

   SELECT w.keyword, w.word_form, w3.w_id FROM `2_wordform1` w 
   JOIN `3_words` w3 
   ON w3.word = w.keyword and w3.gram = w.gram 
   WHERE w.keyword like 'tala' and w.gram = 'f' 

Rows

    tala    tala    8809
    tala    tala    89664
    tala    tala    97991
    Tala    Tala    8809
    Tala    Tala    89664
    Tala    Tala    97991
    tala    tölur   8809
    tala    tölur   89664
    tala    tölur   97991

Tables and data

table - headwords, 70000 rows

CREATE TABLE IF NOT EXISTS `1_headword` (
`id` int(9) NOT NULL,
  `keyword` varchar(100) CHARACTER SET utf8 COLLATE utf8_icelandic_ci NOT NULL,
  `num_keyword` int(9) NOT NULL DEFAULT '0',
  `gram` varchar(40) NOT NULL
) ENGINE=MyISAM  DEFAULT CHARSET=utf8 ROW_FORMAT=DYNAMIC AUTO_INCREMENT=55328 ;
ALTER TABLE `1_headword`
 ADD PRIMARY KEY (`id`), ADD KEY `keyword` (`keyword`);

table - word forms - 700 000 rows

CREATE TABLE IF NOT EXISTS `2_wordform` (
`id` int(10) NOT NULL,
  `keyword` varchar(120) CHARACTER SET utf8 COLLATE utf8_icelandic_ci NOT NULL,
  `num_keyword` int(4) NOT NULL,
  `word_form` varchar(120) CHARACTER SET utf8 COLLATE utf8_icelandic_ci NOT NULL
) ENGINE=MyISAM  DEFAULT CHARSET=utf8 AUTO_INCREMENT=678480 ;
ALTER TABLE `2_wordform`
 ADD PRIMARY KEY (`id`), ADD KEY `word_form` (`word_form`);

table - word forms tagged from corpus with w_id (word id), 1 million of rows

CREATE TABLE `3_words` (
  `w_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `word` varchar(255) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL,
  `gram` varchar(255) DEFAULT NULL,
  `freq` int(10) unsigned DEFAULT NULL,
  PRIMARY KEY (`w_id`),
  KEY `word` (`word`),
  KEY `w_id` (`w_id`)
) ENGINE=MyISAM AUTO_INCREMENT=800468 DEFAULT CHARSET=utf8;

table - w_id (word id) connected to s_id (sentence id), word can be found in several sentences, plus position in the sentence, 22 millions of rows

CREATE TABLE `4_inv_w` (
  `w_id` int(10) unsigned NOT NULL DEFAULT '0',
  `s_id` int(10) unsigned NOT NULL DEFAULT '0',
  `pos` mediumint(2) unsigned NOT NULL DEFAULT '0',
  KEY `w_id` (`w_id`),
  KEY `s_id` (`s_id`),
  KEY `w_s` (`w_id`,`s_id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

table - s_id (sentence id) with sentence, 1 million of rows

CREATE TABLE `5_sentences` (
  `s_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `sentence` text,
  KEY `s_id` (`s_id`)
) ENGINE=MyISAM AUTO_INCREMENT=999953 DEFAULT CHARSET=utf8;

Process

select all word forms of given lemma f.e "hestur" (horse in English)

SELECT `word_form` FROM `2_wordform` WHERE `keyword` like 'hestur'

result consist from 16 to 50 results, now cycle the result as f.e. with accusative "hest" of "hestur"

SELECT `w_id` FROM `3_words` WHERE `word` like 'hest'

the result can contain several w_id, f.e. with '10138'

SELECT `s_id`, `pos` FROM `4_inv_w` WHERE `w_id` = '10138' group by `s_id`

the result can contain several sentences, to display f.e. sentence '7201'

SELECT `sentence` FROM `5_sentences` WHERE `s_id` = '7201'

Update INSERT INTO 2_wordform (id, keyword, num_keyword, word_form) VALUES(42490, 'hestur', 0, 'hest'); INSERT INTO 2_wordform (id, keyword, num_keyword, word_form) VALUES(42498, 'hestur', 0, 'hesta'); INSERT INTO 2_wordform (id, keyword, num_keyword, word_form) VALUES(42501, 'hestur', 0, 'hestana'); INSERT INTO 2_wordform (id, keyword, num_keyword, word_form) VALUES(42503, 'hestur', 0, 'hestanna'); INSERT INTO 2_wordform (id, keyword, num_keyword, word_form) VALUES(42497, 'hestur', 0, 'hestar'); INSERT INTO 2_wordform (id, keyword, num_keyword, word_form) VALUES(42500, 'hestur', 0, 'hestarnir'); INSERT INTO 2_wordform (id, keyword, num_keyword, word_form) VALUES(42491, 'hestur', 0, 'hesti'); INSERT INTO 2_wordform (id, keyword, num_keyword, word_form) VALUES(42494, 'hestur', 0, 'hestinn'); INSERT INTO 2_wordform (id, keyword, num_keyword, word_form) VALUES(42495, 'hestur', 0, 'hestinum'); INSERT INTO 2_wordform (id, keyword, num_keyword, word_form) VALUES(42492, 'hestur', 0, 'hests'); INSERT INTO 2_wordform (id, keyword, num_keyword, word_form) VALUES(42496, 'hestur', 0, 'hestsins'); INSERT INTO 2_wordform (id, keyword, num_keyword, word_form) VALUES(42499, 'hestur', 0, 'hestum'); INSERT INTO 2_wordform (id, keyword, num_keyword, word_form) VALUES(42502, 'hestur', 0, 'hestunum'); INSERT INTO 2_wordform (id, keyword, num_keyword, word_form) VALUES(42489, 'hestur', 0, 'hestur'); INSERT INTO 2_wordform (id, keyword, num_keyword, word_form) VALUES(42493, 'hestur', 0, 'hesturinn');

INSERT INTO 3_words (w_id, word, gram, freq) VALUES (11484, 'hestur', 'nken', 122), (60681, 'Hestur', 'nken', 15), (484318, 'HESTUR', 'nken', 1), (491111, 'Hestur', 'nken-s', 1);

INSERT INTO 3_words (w_id, word, gram, freq) VALUES (10138, 'hest', 'nkeo', 141), (159967, 'Hest', 'nkeo', 4), (491114, 'Hest', 'ssm', 1);

INSERT INTO 4_inv_w (w_id, s_id, pos) VALUES (11484, 2671, 4), (11484, 22522, 7), (11484, 30169, 8), (11484, 32487, 4), (11484, 33841, 9), (11484, 38116, 5), (11484, 40450, 6), (11484, 42741, 32), (11484, 45789, 10), (11484, 58998, 3), (11484, 74343, 4), (11484, 76001, 3), (11484, 99014, 9), (11484, 99688, 6), (11484, 109849, 21), (11484, 119708, 21), (11484, 131353, 34), (11484, 147820, 6), (11484, 148326, 25), (11484, 160475, 40), (11484, 167227, 2), (11484, 170401, 3), (11484, 178416, 18), (11484, 197295, 12), (11484, 197295, 6), (11484, 198420, 19), (11484, 203446, 28), (11484, 204448, 1), (11484, 215402, 1), (11484, 237323, 4), (11484, 249282, 4), (11484, 263949, 1), (11484, 263949, 22), (11484, 266489, 27), (11484, 270540, 5), (11484, 272543, 5), (11484, 272560, 1), (11484, 272560, 8), (11484, 282170, 20), (11484, 284407, 26), (11484, 290524, 6), (11484, 291438, 10), (11484, 293344, 6), (11484, 294034, 49), (11484, 317007, 7), (11484, 325049, 22), (11484, 328392, 14), (11484, 368188, 47), (11484, 391892, 14), (11484, 401157, 11), (11484, 412656, 24), (11484, 421635, 17), (11484, 439320, 3), (11484, 467063, 5), (11484, 469324, 23), (11484, 477392, 2), (11484, 480318, 4), (11484, 487883, 1), (11484, 490577, 42), (11484, 499783, 9), (11484, 500405, 23), (11484, 501118, 15), (11484, 527227, 3), (11484, 539686, 25), (11484, 543056, 9), (11484, 544261, 3), (11484, 547700, 20), (11484, 555638, 19), (11484, 570234, 2), (11484, 592710, 2), (11484, 616662, 1), (11484, 619011, 16), (11484, 632123, 2), (11484, 633124, 2), (11484, 636792, 8), (11484, 636792, 3), (11484, 646603, 17), (11484, 664738, 4), (11484, 670017, 4), (11484, 685997, 4), (11484, 686202, 1), (11484, 691794, 12), (11484, 698341, 2), (11484, 715281, 3), (11484, 715984, 37), (11484, 716970, 10), (11484, 716970, 4), (11484, 752605, 36), (11484, 756660, 19), (11484, 760277, 3), (11484, 776593, 3), (11484, 785701, 24), (11484, 789099, 3), (11484, 794930, 1), (11484, 795623, 12), (11484, 802997, 6), (11484, 812806, 6), (11484, 814046, 21), (11484, 820178, 6), (11484, 823173, 22), (11484, 843094, 3), (11484, 844156, 1), (11484, 844736, 24), (11484, 853350, 18), (11484, 869322, 3), (11484, 885176, 2), (11484, 899545, 22), (11484, 904086, 16), (11484, 907863, 9), (11484, 909396, 9), (11484, 912876, 3), (11484, 919994, 4), (11484, 927840, 24), (11484, 927840, 5), (11484, 934220, 40), (11484, 936941, 11), (11484, 952837, 13), (11484, 969201, 11), (11484, 970240, 1), (11484, 970836, 19), (11484, 972107, 1), (11484, 990474, 6);

INSERT INTO 4_inv_w (w_id, s_id, pos) VALUES (10138, 7201, 27), (10138, 18772, 3), (10138, 30001, 6), (10138, 42089, 4), (10138, 42089, 14), (10138, 42234, 4), (10138, 49383, 5), (10138, 54795, 18), (10138, 57564, 23), (10138, 88542, 7), (10138, 93027, 10), (10138, 101097, 21), (10138, 134312, 12), (10138, 139116, 33), (10138, 139522, 6), (10138, 159109, 7), (10138, 159109, 16), (10138, 161497, 21), (10138, 163948, 2), (10138, 165301, 20), (10138, 166478, 21), (10138, 183452, 6), (10138, 184390, 20), (10138, 189930, 25), (10138, 201629, 9), (10138, 204590, 4), (10138, 211374, 5), (10138, 216483, 14), (10138, 223617, 5), (10138, 233652, 12), (10138, 236571, 11), (10138, 241302, 8), (10138, 246485, 10), (10138, 256910, 16), (10138, 262349, 3), (10138, 262925, 5), (10138, 267047, 28), (10138, 291988, 18), (10138, 292680, 22), (10138, 294814, 32), (10138, 326917, 6), (10138, 330019, 12), (10138, 333411, 35), (10138, 337880, 5), (10138, 342003, 13), (10138, 355325, 12), (10138, 356409, 13), (10138, 363795, 5), (10138, 365735, 26), (10138, 376570, 25), (10138, 378214, 10), (10138, 379159, 11), (10138, 379236, 4), (10138, 379533, 2), (10138, 388753, 8), (10138, 420633, 18), (10138, 433121, 5), (10138, 434645, 10), (10138, 435895, 3), (10138, 455575, 5), (10138, 461900, 23), (10138, 464040, 6), (10138, 466657, 6), (10138, 469848, 11), (10138, 475569, 17), (10138, 482701, 41), (10138, 527708, 29), (10138, 527708, 16), (10138, 529426, 7), (10138, 530753, 10), (10138, 538071, 27), (10138, 542685, 10), (10138, 553742, 22), (10138, 553742, 13), (10138, 557216, 4), (10138, 563747, 9), (10138, 564716, 4), (10138, 569146, 7), (10138, 578368, 3), (10138, 581713, 9), (10138, 595890, 9), (10138, 599015, 5), (10138, 608570, 30), (10138, 610218, 11), (10138, 610218, 2), (10138, 612099, 9), (10138, 612568, 14), (10138, 612894, 9), (10138, 615361, 19), (10138, 618001, 14), (10138, 624969, 7), (10138, 628252, 16), (10138, 628635, 12), (10138, 635977, 10), (10138, 643675, 8), (10138, 650487, 9), (10138, 651489, 3), (10138, 657552, 18), (10138, 672884, 12), (10138, 677130, 2), (10138, 678841, 7), (10138, 678841, 26), (10138, 682904, 4), (10138, 691251, 19), (10138, 706325, 9), (10138, 714680, 45), (10138, 717460, 5), (10138, 717489, 11), (10138, 722393, 5), (10138, 729972, 12), (10138, 735745, 12), (10138, 738334, 7), (10138, 740791, 21), (10138, 775696, 8), (10138, 776984, 16), (10138, 786073, 31), (10138, 793185, 17), (10138, 821475, 4), (10138, 835234, 7), (10138, 842713, 3), (10138, 842730, 8), (10138, 847372, 9), (10138, 849612, 20), (10138, 861768, 26), (10138, 864231, 6), (10138, 865927, 7), (10138, 873939, 7), (10138, 883591, 29), (10138, 884260, 19), (10138, 894952, 17), (10138, 898453, 19), (10138, 899290, 4), (10138, 909225, 29), (10138, 910173, 4), (10138, 922447, 2), (10138, 939319, 2), (10138, 956278, 4), (10138, 967342, 18), (10138, 977090, 3), (10138, 991346, 31), (10138, 991346, 40);

INSERT INTO 5_sentences (s_id, sentence) VALUES (2671, 'Hrímnir|nken-s frá|aþ Hrafnagili|nkeþ-s Glæsilegasti|lkenve hestur|nken aldar|nvee !|!');

INSERT INTO 5_sentences (s_id, sentence) VALUES (7201, 'Hann|fpken heilsar|sfg3en öllum|fokfþ nema|c Braga|nkeþ-s sem|ct nú|aa dregur|sfg3en í|ao land|nheo og|c vill|sfg3en friðmælast|snm við|ao Loka|nkeo-s með|aþ loforði|nheþ um|ao góðar|lvfosf gjafir|nvfo ,|, sverð|nhfo ,|, hest|nkeo og|c hring|nkeo en|c hann|fpken svarar|sfg3en bara|aa með|aþ illu|lheþsf .|.');


Solution

  • Any sample of data would be very helpful together with expected result.

    So far you can start form this attempt:

    http://sqlfiddle.com/#!9/7110ef/4

    SELECT w.word_form,
           w3.w_id,
           w4.s_id,
           w4.pos,
           s.sentence
    FROM `2_wordform` w 
    INNER JOIN `3_words` w3
    ON w3.`word` = w.keyword
    INNER JOIN `4_inv_w`  w4
    ON w4.w_id = w3.w_id
    INNER JOIN `5_sentences` s
    ON s.s_id = w4.s_id
    WHERE w.keyword like '%hestur%'
    

    if you need only sentence you can:

    SELECT DISTINCT s.sentence
    FROM `2_wordform` w 
    INNER JOIN `3_words` w3
    ON w3.`word` = w.keyword
    INNER JOIN `4_inv_w`  w4
    ON w4.w_id = w3.w_id
    INNER JOIN `5_sentences` s
    ON s.s_id = w4.s_id
    WHERE w.keyword like '%hestur%'