I am trying to split a text file into pieces based on a specific delimiter.
Here is a snippet of the text file:
1_1_ABA-BUL
Sat Tjedan(i) Datum Učiona Predavač(i) Kolegij Način Grupa
Ponedjeljak
08:00-10:00 1 - 17 5.10 - 25.1 DV 12 ŠTAMPALIJA ALKA POSLOVNI NJEMAČKI P 1_1_ABA-BUL, 1_2_BULJ-GAB, 1_3_GAC-JUK,
JEZIK I 1_4_JUL-LOR, 1_5_LOS-MOR
10:00-12:00 1 - 17 5.10 - 25.1 SPORTSKA HERCEG ROMINA TJELESNA I S 1_1_2_BES-BUL
ZDRAVSTVENA
KULTURA I
12:00-14:00 1 - 17 5.10 - 25.1 DV 26 VARGA MLADEN INFORMATIKA P 1_1_ABA-BUL
Utorak
08:00-10:00 1 - 17 6.10 - 26.1 DV 12 ŠTAMPALIJA ALKA POSLOVNI NJEMAČKI S 1_1_ABA-BUL, 1_2_BULJ-GAB, 1_3_GAC-JUK,
JEZIK I 1_4_JUL-LOR, 1_5_LOS-MOR
08:00-10:00 1 - 17 6.10 - 26.1 DV 20 SLADOLJEV AGEJEV POSLOVNI ENGLESKI P 1_1_ABA-BUL
TAMARA JEZIK I
12:00-14:00 1 - 17 6.10 - 26.1 DV 40 ZOROJA JOVANA INFORMATIKA S 1_1_1_ABA-BER
12:00-14:00 1 - 17 6.10 - 26.1 DV 18 SLADOLJEV AGEJEV POSLOVNI ENGLESKI S 1_1_2_BES-BUL
TAMARA JEZIK I
Srijeda
08:00-11:00 1 - 17 7.10 - 27.1 DV 01 PULJIĆ KRUNOSLAV MATEMATIKA P 1_1_ABA-BUL
11:00-14:00 1 - 17 7.10 - 27.1 DV 01 KRPAN MIRA OSNOVE EKONOMIJE P 1_1_ABA-BUL
14:00-16:00 1 - 17 7.10 - 27.1 DV 11 SLADOLJEV AGEJEV POSLOVNI ENGLESKI S 1_1_1_ABA-BER
TAMARA JEZIK I
Četvrtak
11:00-14:00 1 - 17 8.10 - 28.1 DV 04 SLIŠKOVIĆ MARINA MATEMATIKA S 1_1_1_ABA-BER
Petak
09:00-12:00 1 - 17 9.10 - 29.1 DV 20 KRPAN MIRA OSNOVE EKONOMIJE S 1_1_2_BES-BUL
10:00-12:00 1 - 17 9.10 - 29.1 SPORTSKA HERCEG ROMINA TJELESNA I S 1_1_1_ABA-BER
ZDRAVSTVENA
KULTURA I
12:00-15:00 1 - 17 9.10 - 29.1 DV 09 KRPAN MIRA OSNOVE EKONOMIJE S 1_1_1_ABA-BER
13:00-16:00 1 - 17 9.10 - 29.1 DV 01 SLIŠKOVIĆ MARINA MATEMATIKA S 1_1_2_BES-BUL
16:00-18:00 1 - 17 9.10 - 29.1 DV 40 AVDIĆ AMMAR, INFORMATIKA S 1_1_2_BES-BUL
VANJSKI INF
1_2_BULJ-GAB
Sat Tjedan(i) Datum Učiona Predavač(i) Kolegij Način Grupa
Ponedjeljak
08:00-10:00 1 - 17 5.10 - 25.1 DV 12 ŠTAMPALIJA ALKA POSLOVNI NJEMAČKI P 1_1_ABA-BUL, 1_2_BULJ-GAB, 1_3_GAC-JUK,
JEZIK I 1_4_JUL-LOR, 1_5_LOS-MOR
10:00-13:00 1 - 17 5.10 - 25.1 DV 16 ŠEGO BOŠKO MATEMATIKA P 1_2_BULJ-GAB
14:00-16:00 1 - 17 5.10 - 25.1 DV 17 LEKAJ LUBINA BORKA POSLOVNI ENGLESKI P 1_2_BULJ-GAB
JEZIK I
Utorak
08:00-10:00 1 - 17 6.10 - 26.1 DV 12 ŠTAMPALIJA ALKA POSLOVNI NJEMAČKI S 1_1_ABA-BUL, 1_2_BULJ-GAB, 1_3_GAC-JUK,
JEZIK I 1_4_JUL-LOR, 1_5_LOS-MOR
10:00-12:00 1 - 17 6.10 - 26.1 DV 01 PEJIĆ BACH MIRJANA INFORMATIKA P 1_2_BULJ-GAB
15:00-18:00 1 - 17 6.10 - 26.1 DV 11 HERCEG TOMISLAV OSNOVE EKONOMIJE P 1_2_BULJ-GAB
Srijeda
08:00-10:00 1 - 17 7.10 - 27.1 DV 42 MILANOVIĆ GLAVAN INFORMATIKA S 1_2_1_BULJ-DAJ
LJUBICA
10:00-12:00 1 - 17 7.10 - 27.1 DV 42 MILANOVIĆ GLAVAN INFORMATIKA S 1_2_2_DAK-GAB
LJUBICA
10:00-12:00 1 - 17 7.10 - 27.1 SPORTSKA HERCEG ROMINA TJELESNA I S 1_2_1_BULJ-DAJ
ZDRAVSTVENA
KULTURA I
15:00-18:00 1 - 17 7.10 - 27.1 DV 23 HERCEG TOMISLAV OSNOVE EKONOMIJE S 1_2_1_BULJ-DAJ
18:00-21:00 1 - 17 7.10 - 27.1 DV 23 HERCEG TOMISLAV OSNOVE EKONOMIJE S 1_2_2_DAK-GAB
Četvrtak
Petak
10:00-12:00 1 - 17 9.10 - 29.1 SPORTSKA HERCEG ROMINA TJELESNA I S 1_2_2_DAK-GAB
ZDRAVSTVENA
KULTURA I
11:00-14:00 1 - 17 9.10 - 29.1 DV 19 ŠKRINJARIĆ TIHANA MATEMATIKA S 1_2_1_BULJ-DAJ
12:00-14:00 1 - 17 9.10 - 29.1 DV 16 LEKAJ LUBINA BORKA POSLOVNI ENGLESKI S 1_2_2_DAK-GAB
JEZIK I
14:00-17:00 1 - 17 9.10 - 29.1 DV 02 ŠKRINJARIĆ TIHANA MATEMATIKA S 1_2_2_DAK-GAB
14:00-16:00 1 - 17 9.10 - 29.1 DV 10 LEKAJ LUBINA BORKA POSLOVNI ENGLESKI S 1_2_1_BULJ-DAJ
JEZIK I
I want to split the string at the 1_1_ABA-BUL line and at the other lines of the same format (each one sits directly above a line starting with "Sat").
This is my preg_split line:
$grupe = preg_split("/((.*?)\nSat)/", $source, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);
The preg_split line above doesn't keep the delimiter (1_1_ABA-BUL etc.). If I change it to
$grupe = preg_split("/((.*?)\nSat)/", $source, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);
the resulting array is not correct (I get a corrupt result).
What am I doing wrong here?
The first way you call preg_split() seems to work, but in fact it doesn't. Its third argument ($limit) is the maximum number of pieces to return; the flags belong in the fourth argument. Because the flags landed in $limit, no flags were in effect, which is why the delimiter was not kept. It just happened that the texts you used were short enough that the number of pieces was smaller than or equal to 3 (the value of PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE).
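To see the effect of the misplaced argument in isolation, here is a minimal sketch. The $source string is a made-up miniature input (three hypothetical group headers G1, G2, G3, each followed by a line starting with Sat), not your real file, and the pattern is the one from the question.

```php
<?php
// Hypothetical miniature input: three group headers, each followed by a "Sat" line.
$source = "G1\nSat a\nG2\nSat b\nG3\nSat c\n";

// The flags are plain integers (1 and 2), so OR-ing them gives 3.
$combined = PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE;
var_dump($combined);

// Flags passed in the $limit slot: preg_split() returns at most 3 pieces
// and runs with no flags at all, so delimiters are not captured.
$asLimit = preg_split('/((.*?)\nSat)/', $source, $combined);

print_r($asLimit);
```

With only three pieces allowed, the G3 header is never split off; it stays buried inside the last piece. On a short input this can look like a plausible result, which is likely why the first call seemed to work.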
This is how you can make it work:
$grupe = preg_split('/(.*?)\n(?=Sat)/', $source, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);
This is a (truncated) listing of print_r($grupe):
Array
(
[0] => 1_1_ABA-BUL
[1] => Sat Tjedan(i) Datum Učiona Predavač(i) Kolegij Način Grupa
Ponedjeljak
08:00-10:00 1 - 17 5.10 - 25.1 DV 12 ŠTAMPALIJA ALKA POSLOVNI NJEMAČKI P 1_1_ABA-BUL, 1_2_BULJ-GAB, 1_3_GAC-JUK,
JEZIK I 1_4_JUL-LOR, 1_5_LOS-MOR
...
16:00-18:00 1 - 17 9.10 - 29.1 DV 40 AVDIĆ AMMAR, INFORMATIKA S 1_1_2_BES-BUL
VANJSKI INF
[2] => 1_2_BULJ-GAB
[3] => Sat Tjedan(i) Datum Učiona Predavač(i) Kolegij Način Grupa
Ponedjeljak
08:00-10:00 1 - 17 5.10 - 25.1 DV 12 ŠTAMPALIJA ALKA POSLOVNI NJEMAČKI P 1_1_ABA-BUL, 1_2_BULJ-GAB, 1_3_GAC-JUK,
...
14:00-16:00 1 - 17 9.10 - 29.1 DV 10 LEKAJ LUBINA BORKA POSLOVNI ENGLESKI S 1_2_1_BULJ-DAJ
JEZIK I
)
How it works:
The important change in the regular expression is the assertion (?=Sat). It says that the previous part of the regex, (.*?)\n, matches a part of the string only if that part is followed by Sat. The assertion only checks the next characters of the input string but doesn't consume them, so Sat does not become part of the delimiter.
The rest is unchanged. The flag PREG_SPLIT_DELIM_CAPTURE makes preg_split() return the captured parts of the regex delimiter as individual pieces of the output. Find them above at offsets 0 and 2. The Sat substring, being just an assertion, is not part of the delimiter, so it is returned in the next piece (where it belongs).
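The difference between consuming Sat and merely asserting it can be checked on a small input. The sketch below uses a made-up two-group sample in the same shape as the timetable file. The $groups pairing at the end is only an illustration: it assumes the input starts with a group header, so that names land on even offsets and bodies on odd ones.

```php
<?php
// Hypothetical miniature input shaped like the timetable file.
$source = "1_1_ABA-BUL\nSat Tjedan(i)\nPonedjeljak\n"
        . "1_2_BULJ-GAB\nSat Tjedan(i)\nUtorak\n";

$flags = PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE;

// "Sat" is part of the delimiter here, so it is cut off the following piece.
$consumed = preg_split('/(.*?)\nSat/', $source, -1, $flags);

// "Sat" is only asserted here, so it stays at the start of the following piece.
$asserted = preg_split('/(.*?)\n(?=Sat)/', $source, -1, $flags);

print_r($consumed);
print_r($asserted);

// Illustration only: pair each group name (even offset) with its body (odd offset).
$groups = [];
for ($i = 0; $i + 1 < count($asserted); $i += 2) {
    $groups[$asserted[$i]] = rtrim($asserted[$i + 1], "\n");
}
print_r($groups);
```

In $consumed the bodies start with " Tjedan(i)" because Sat was swallowed by the delimiter; in $asserted they start with "Sat Tjedan(i)", as in the listing above.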