
Keeping the delimiter in preg_split() in PHP


I am trying to split a text file into pieces based on a specific delimiter.

A snippet of the text file

    1_1_ABA-BUL
Sat           Tjedan(i)   Datum            Učiona     Predavač(i)        Kolegij             Način   Grupa
Ponedjeljak
08:00-10:00   1 - 17      5.10    - 25.1   DV 12      ŠTAMPALIJA ALKA    POSLOVNI NJEMAČKI   P       1_1_ABA-BUL, 1_2_BULJ-GAB, 1_3_GAC-JUK,
                                                                         JEZIK I                     1_4_JUL-LOR, 1_5_LOS-MOR
10:00-12:00   1 - 17      5.10    - 25.1   SPORTSKA   HERCEG ROMINA      TJELESNA I          S       1_1_2_BES-BUL
                                                                         ZDRAVSTVENA
                                                                         KULTURA I
12:00-14:00   1 - 17      5.10    - 25.1   DV 26      VARGA MLADEN       INFORMATIKA         P       1_1_ABA-BUL
Utorak
08:00-10:00   1 - 17      6.10    - 26.1   DV 12      ŠTAMPALIJA ALKA    POSLOVNI NJEMAČKI   S       1_1_ABA-BUL, 1_2_BULJ-GAB, 1_3_GAC-JUK,
                                                                         JEZIK I                     1_4_JUL-LOR, 1_5_LOS-MOR
08:00-10:00   1 - 17      6.10    - 26.1   DV 20      SLADOLJEV AGEJEV   POSLOVNI ENGLESKI   P       1_1_ABA-BUL
                                                      TAMARA             JEZIK I
12:00-14:00   1 - 17      6.10    - 26.1   DV 40      ZOROJA JOVANA      INFORMATIKA         S       1_1_1_ABA-BER
12:00-14:00   1 - 17      6.10    - 26.1   DV 18      SLADOLJEV AGEJEV   POSLOVNI ENGLESKI   S       1_1_2_BES-BUL
                                                      TAMARA             JEZIK I
Srijeda
08:00-11:00   1 - 17      7.10    - 27.1   DV 01      PULJIĆ KRUNOSLAV   MATEMATIKA          P       1_1_ABA-BUL
11:00-14:00   1 - 17      7.10    - 27.1   DV 01      KRPAN MIRA         OSNOVE EKONOMIJE    P       1_1_ABA-BUL
14:00-16:00   1 - 17      7.10    - 27.1   DV 11      SLADOLJEV AGEJEV   POSLOVNI ENGLESKI   S       1_1_1_ABA-BER
                                                      TAMARA             JEZIK I
Četvrtak
11:00-14:00   1 - 17      8.10    - 28.1   DV 04      SLIŠKOVIĆ MARINA   MATEMATIKA          S       1_1_1_ABA-BER
Petak
09:00-12:00   1 - 17      9.10    - 29.1   DV 20      KRPAN MIRA         OSNOVE EKONOMIJE    S       1_1_2_BES-BUL
10:00-12:00   1 - 17      9.10    - 29.1   SPORTSKA   HERCEG ROMINA      TJELESNA I          S       1_1_1_ABA-BER
ZDRAVSTVENA
KULTURA I

12:00-15:00   1 - 17   9.10   - 29.1   DV 09   KRPAN MIRA         OSNOVE EKONOMIJE   S   1_1_1_ABA-BER
13:00-16:00   1 - 17   9.10   - 29.1   DV 01   SLIŠKOVIĆ MARINA   MATEMATIKA         S   1_1_2_BES-BUL
16:00-18:00   1 - 17   9.10   - 29.1   DV 40   AVDIĆ AMMAR,       INFORMATIKA        S   1_1_2_BES-BUL
VANJSKI INF

1_2_BULJ-GAB
Sat            Tjedan(i)   Datum            Učiona     Predavač(i)          Kolegij             Način   Grupa
Ponedjeljak
08:00-10:00    1 - 17      5.10    - 25.1   DV 12      ŠTAMPALIJA ALKA      POSLOVNI NJEMAČKI   P       1_1_ABA-BUL, 1_2_BULJ-GAB, 1_3_GAC-JUK,
                                                                            JEZIK I                     1_4_JUL-LOR, 1_5_LOS-MOR
10:00-13:00    1 - 17      5.10    - 25.1   DV 16      ŠEGO BOŠKO           MATEMATIKA          P       1_2_BULJ-GAB
14:00-16:00    1 - 17      5.10    - 25.1   DV 17      LEKAJ LUBINA BORKA   POSLOVNI ENGLESKI   P       1_2_BULJ-GAB
                                                                            JEZIK I
Utorak
08:00-10:00    1 - 17      6.10    - 26.1   DV 12      ŠTAMPALIJA ALKA      POSLOVNI NJEMAČKI   S       1_1_ABA-BUL, 1_2_BULJ-GAB, 1_3_GAC-JUK,
                                                                            JEZIK I                     1_4_JUL-LOR, 1_5_LOS-MOR
10:00-12:00    1 - 17      6.10    - 26.1   DV 01      PEJIĆ BACH MIRJANA   INFORMATIKA         P       1_2_BULJ-GAB
15:00-18:00    1 - 17      6.10    - 26.1   DV 11      HERCEG TOMISLAV      OSNOVE EKONOMIJE    P       1_2_BULJ-GAB
Srijeda
08:00-10:00    1 - 17      7.10    - 27.1   DV 42      MILANOVIĆ GLAVAN     INFORMATIKA         S       1_2_1_BULJ-DAJ
LJUBICA
10:00-12:00    1 - 17      7.10    - 27.1   DV 42      MILANOVIĆ GLAVAN     INFORMATIKA         S       1_2_2_DAK-GAB
                                                       LJUBICA
10:00-12:00    1 - 17      7.10    - 27.1   SPORTSKA   HERCEG ROMINA        TJELESNA I          S       1_2_1_BULJ-DAJ
                                                                            ZDRAVSTVENA
                                                                            KULTURA I
15:00-18:00    1 - 17      7.10    - 27.1   DV 23      HERCEG TOMISLAV      OSNOVE EKONOMIJE    S       1_2_1_BULJ-DAJ
18:00-21:00    1 - 17      7.10    - 27.1   DV 23      HERCEG TOMISLAV      OSNOVE EKONOMIJE    S       1_2_2_DAK-GAB
Četvrtak
Petak
10:00-12:00    1 - 17      9.10    - 29.1   SPORTSKA   HERCEG ROMINA        TJELESNA I          S       1_2_2_DAK-GAB
                                                                            ZDRAVSTVENA
                                                                            KULTURA I
11:00-14:00    1 - 17      9.10    - 29.1   DV 19      ŠKRINJARIĆ TIHANA    MATEMATIKA          S       1_2_1_BULJ-DAJ

12:00-14:00   1 - 17   9.10   - 29.1   DV 16   LEKAJ LUBINA BORKA   POSLOVNI ENGLESKI   S   1_2_2_DAK-GAB
                                                                    JEZIK I
14:00-17:00   1 - 17   9.10   - 29.1   DV 02   ŠKRINJARIĆ TIHANA    MATEMATIKA          S   1_2_2_DAK-GAB
14:00-16:00   1 - 17   9.10   - 29.1   DV 10   LEKAJ LUBINA BORKA   POSLOVNI ENGLESKI   S   1_2_1_BULJ-DAJ
JEZIK I

I want to split the string at the 1_1_ABA-BUL line and at every other line of the same format (each one sits directly above a line that starts with "Sat").

This is my preg_split line

$grupe = preg_split("/((.*?)\nSat)/", $source, PREG_SPLIT_NO_EMPTY |  PREG_SPLIT_DELIM_CAPTURE);

This preg_split call doesn't keep the delimiter (1_1_ABA-BUL etc.). If I change it to

  $grupe = preg_split("/((.*?)\nSat)/", $source, -1, PREG_SPLIT_NO_EMPTY |  PREG_SPLIT_DELIM_CAPTURE);

the resulting array is not correct (I get a corrupt result).

What am I doing wrong here?


Solution

  • The first way you use preg_split() seems to work but in fact it doesn't.

    Its third argument ($limit) is the maximum number of pieces to return; the flags belong in the fourth argument. Because you passed PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE (which evaluates to 3) as the limit, no flags were set at all. The call only appeared to work because the texts you used were short enough to produce no more than 3 pieces.
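
    The mix-up can be reproduced on a small input. This is a minimal sketch, assuming a made-up $source with three short groups (G1, G2 and G3 are placeholders, not the real schedule data):

    ```php
    <?php
    // Hypothetical miniature input: three group names, each followed by a "Sat" line.
    $source = "G1\nSat a\nG2\nSat b\nG3\nSat c\n";

    // PREG_SPLIT_NO_EMPTY is 1 and PREG_SPLIT_DELIM_CAPTURE is 2, so OR-ed together they are 3.
    $flags = PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE;

    // In the third position the value is read as $limit = 3; the fourth argument ($flags) stays 0.
    $asLimit = preg_split("/((.*?)\nSat)/", $source, $flags);

    var_dump($flags);
    print_r($asLimit);
    ```

    With this input the call should return exactly three pieces: an empty string, " a\n", and the unsplit remainder " b\nG3\nSat c\n". No group name is captured, and the third group is never split off, because the limit was reached.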

    This is how you can make it work:

    $grupe = preg_split('/(.*?)\n(?=Sat)/', $source, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);
    

    This is a (truncated) listing of print_r($grupe):

    Array
    (
        [0] =>     1_1_ABA-BUL
        [1] => Sat           Tjedan(i)   Datum            Učiona     Predavač(i)        Kolegij             Način   Grupa
    Ponedjeljak
    08:00-10:00   1 - 17      5.10    - 25.1   DV 12      ŠTAMPALIJA ALKA    POSLOVNI NJEMAČKI   P       1_1_ABA-BUL, 1_2_BULJ-GAB, 1_3_GAC-JUK,
                                                                             JEZIK I                     1_4_JUL-LOR, 1_5_LOS-MOR
    ...
    16:00-18:00   1 - 17   9.10   - 29.1   DV 40   AVDIĆ AMMAR,       INFORMATIKA        S   1_1_2_BES-BUL
    VANJSKI INF
    
    
        [2] => 1_2_BULJ-GAB
        [3] => Sat            Tjedan(i)   Datum            Učiona     Predavač(i)          Kolegij             Način   Grupa
    Ponedjeljak
    08:00-10:00    1 - 17      5.10    - 25.1   DV 12      ŠTAMPALIJA ALKA      POSLOVNI NJEMAČKI   P       1_1_ABA-BUL, 1_2_BULJ-GAB, 1_3_GAC-JUK,
    ...
    14:00-16:00   1 - 17   9.10   - 29.1   DV 10   LEKAJ LUBINA BORKA   POSLOVNI ENGLESKI   S   1_2_1_BULJ-DAJ
    JEZIK I
    
    )
    

    How it works:

    The important change to the regular expression is the lookahead assertion (?=Sat). It makes the preceding part of the regex ((.*?)\n) match only where the text that comes right after it is Sat. The assertion checks the next characters of the input string but doesn't consume them, so Sat does not become part of the delimiter.
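
    The difference between consuming Sat and only asserting it can be seen side by side. This is a small sketch on a made-up two-group input (G1 and G2 are placeholder names):

    ```php
    <?php
    // Hypothetical miniature input: two group names, each followed by a "Sat" line.
    $source = "G1\nSat a\nG2\nSat b\n";
    $flags  = PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE;

    // "Sat" inside the pattern: it is part of the delimiter and disappears from the pieces.
    $consumed = preg_split('/(.*?)\nSat/', $source, -1, $flags);

    // "Sat" inside a lookahead: it is only checked, so it stays at the start of the next piece.
    $kept = preg_split('/(.*?)\n(?=Sat)/', $source, -1, $flags);

    print_r($consumed); // G1, " a\n", G2, " b\n"
    print_r($kept);     // G1, "Sat a\n", G2, "Sat b\n"
    ```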

    Apart from dropping the outer capturing group, the rest is unchanged. The flag PREG_SPLIT_DELIM_CAPTURE makes preg_split() return the captured parts of the regex delimiter as individual pieces of the output. Find them above at indexes 0 and 2. The Sat substring, being matched only by an assertion, is not part of the delimiter; it is returned at the start of the next piece (where it belongs).
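
    Since the pieces alternate between a group name and its schedule text, they can be paired up afterwards. This is only a sketch under assumptions: the input is made up, the $schedules array is a name introduced here for illustration, and the pairing relies on every group having a non-empty body (otherwise PREG_SPLIT_NO_EMPTY would break the alternation). The array destructuring in foreach needs PHP 7.1 or later.

    ```php
    <?php
    // Hypothetical miniature input, with leading spaces and a blank line as in the real file.
    $source = "  G1\nSat a\nrow 1\n\nG2\nSat b\nrow 2\n";

    $pieces = preg_split('/(.*?)\n(?=Sat)/', $source, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);

    // Pair each captured group name with the piece that follows it.
    $schedules = [];
    foreach (array_chunk($pieces, 2) as [$name, $body]) {
        // trim() drops the indentation around the name; rtrim() drops trailing blank lines.
        $schedules[trim($name)] = rtrim($body);
    }

    print_r($schedules);
    ```

    For this input $schedules should hold two entries, keyed G1 and G2, each starting with its own Sat line.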