In a class I'm taking, the professor gave us two datasets, one of 301 late-type galaxies and one of 301 early-type galaxies, and we built a model in Keras to differentiate between them:
from tensorflow.keras.layers import Input, Conv2D, MaxPooling2D, Flatten, Dense, Dropout
from tensorflow.keras.models import Model

input_img = Input(shape=(128, 128, 3))
x = Conv2D(filters=16, kernel_size=(3, 3), strides=(1, 1), activation='relu', padding='same')(input_img)
x = MaxPooling2D((2, 2), padding='same')(x)
x = Conv2D(filters=32, kernel_size=(3, 3), strides=(1, 1), activation='relu', padding='same')(x)
x = MaxPooling2D((2, 2), padding='same')(x)
x = Conv2D(filters=64, kernel_size=(3, 3), strides=(1, 1), activation='relu', padding='same')(x)
x = MaxPooling2D((2, 2), padding='same')(x)
x = Flatten()(x)
x = Dense(32, activation='relu')(x)
x = Dropout(0.3)(x)
x = Dense(16, activation='relu')(x)
out = Dense(1, activation='sigmoid')(x)
model = Model(inputs=input_img, outputs=out)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
history = model.fit(X_train, Y_train, batch_size=32, epochs=20)
Since I like Julia more than Python, I tried to build the same model in Flux.jl. According to what I read in the Flux docs, this is what the model looks like:
using Flux

model2 = Chain(
    Conv((3, 3), 3 => 16, relu, pad=SamePad(), stride=(1, 1)),
    MaxPool((2, 2), pad=SamePad()),
    Conv((3, 3), 16 => 32, relu, pad=SamePad(), stride=(1, 1)),
    MaxPool((2, 2), pad=SamePad()),
    Conv((3, 3), 32 => 64, relu, pad=SamePad(), stride=(1, 1)),
    MaxPool((2, 2), pad=SamePad()),
    Flux.flatten,
    Dense(16384 => 32, relu),
    Dense(32 => 16, relu),
    Dense(16 => 1),
    sigmoid
)
But when I train the two models under what I think are the same conditions, I get very different results. In Keras the final loss after 20 epochs is 0.0267, while in Flux the loss after 30 epochs is 0.4082335f0. I don't know where this difference could come from, since I'm using the same batch size in both models and the data treatment is the same (I think).
Python:
import numpy as np

X1 = np.load('/home/luis/Descargas/cosmo-late.npy')
X2 = np.load('/home/luis/Descargas/cosmo-early.npy')
X = np.concatenate((X1,X2), axis = 0).astype(np.float32)/256.0
Y = np.zeros(X.shape[0])
Y[0:len(X1)] = 1
rand_ind = np.arange(0,X.shape[0])
np.random.shuffle(rand_ind)
X = X[rand_ind]
Y = Y[rand_ind]
X_train = X[50:]
Y_train = Y[50:]
X_test = X[0:50]
Y_test = Y[0:50]
Julia:
using NPZ, Random

X1 = npzread("./Descargas/cosmo-late.npy")
X2 = npzread("./Descargas/cosmo-early.npy")
X = cat(X1,X2,dims=1)
X = Float32.(X)./256
Y = zeros(1,size(X)[1])
Y[1,1:length(X1[:,1,1,1])] .= 1
ind = collect(1:length(Y[1,:]))
shuffle!(ind)
X = X[ind,:,:,:]
Y = Y[:,ind]
X_train = X[51:length(X[:,1,1,1]),:,:,:]
Y_train = Y[:,51:length(Y)]
X_test = X[1:50,:,:,:]
Y_test = Y[:,1:50]
X_train = permutedims(X_train, (2, 3, 4, 1))
X_test = permutedims(X_test, (2, 3, 4, 1))
And the training loop in Julia looks like this:
using ChainRules

train_set = Flux.DataLoader((X_train, Y_train), batchsize=32)
loss(x, y) = Flux.logitbinarycrossentropy(x, y)
opt = Flux.setup(Adam(), model2)
loss_history = Float32[]
for epoch = 1:30
Flux.train!(model2, train_set, opt) do m,x,y
err = loss(m(x), y)
ChainRules.ignore_derivatives() do
push!(loss_history, err)
end
return err
end
end
Can anyone please help me? I can't figure this out.
Based on my comment about skipping the sigmoid when using logitbinarycrossentropy, I had a quick go at testing this on some scrap data. With your current implementation I also ended up at a loss of around 0.5, while after removing the sigmoid I reached much lower values.
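Concretely, here is a minimal sketch of that change, keeping your architecture but dropping the trailing sigmoid so the Chain outputs raw logits, which is what logitbinarycrossentropy expects. You then apply sigmoid yourself only when you want probabilities at prediction time:

# Same as your model2, minus the final sigmoid (assumes `using Flux` as above)
model2 = Chain(
    Conv((3, 3), 3 => 16, relu, pad=SamePad(), stride=(1, 1)),
    MaxPool((2, 2), pad=SamePad()),
    Conv((3, 3), 16 => 32, relu, pad=SamePad(), stride=(1, 1)),
    MaxPool((2, 2), pad=SamePad()),
    Conv((3, 3), 32 => 64, relu, pad=SamePad(), stride=(1, 1)),
    MaxPool((2, 2), pad=SamePad()),
    Flux.flatten,
    Dense(16384 => 32, relu),
    Dense(32 => 16, relu),
    Dense(16 => 1)   # raw logits, no sigmoid here
)

# At prediction time, convert logits to probabilities explicitly:
# probs = sigmoid.(model2(X_test))

The rest of your training loop can stay exactly as it is; only the model definition changes.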
You can also choose to keep the sigmoid and use binarycrossentropy instead, though that seems to be less numerically stable, so it is better to stick with logitbinarycrossentropy.
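If you do prefer keeping the sigmoid inside the Chain, the matching loss would be something like this (a sketch; mathematically logitbinarycrossentropy(ŷ, y) agrees with binarycrossentropy(sigmoid(ŷ), y), it just skips the intermediate sigmoid for stability):

# Alternative: keep the trailing sigmoid in model2 and use the non-logit loss
loss(ŷ, y) = Flux.binarycrossentropy(ŷ, y)   # expects ŷ already in (0, 1)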