Search code examples
multithreadingoptaplanner

Optaplanner config with moveThreadCount=1 not same as no moveThreadCount


I have upgraded to Optaplanner 7.12, and in hunting for a potential issue with multithreading mixed with VariableListeners, noticed an oddity in reproducible execution: if the configuration file has <moveThreadCount>1</moveThreadCount>, the execution is NOT the same as when the moveThreadCount line is absent, which seems unexpected to me as a user, and might be intertwined with the potential optaplanner race condition I am seeing (noted at the end of this post).

Code details

I observe this for a configuration file in REPRODUCIBLE mode with a fixed seed: <environmentMode>REPRODUCIBLE</environmentMode> <randomSeed>50</randomSeed>

The difference in Optaplanner behavior is visible during my use of a VariableListener. I have a set of custom MoveFactory classes for a nurse-rostering-derived model. Each custom factory generates a different set of moves for each STEP, and each decides on their moves based on a common set of state-dependent computationally intensive pre-calculations. I created a MoveFactoryHelper class that makes the precalculations, and then call the helper at the beginning of each custom MoveFactory's createMoveList method (I have not yet attempted to migrate to the newer Optaplanner iterative move generation option).

To avoid repeating the computations for each of the move factories, the MoveFactoryHelper stores its result for re-use, and decides when to re-compute based on a 'dirty' flag that is set in a VariableListener registered on a (completely unused) shadow to the model's PlanningEntity, and cleared by the MoveFactoryHelper when the re-computations occur:

ShiftAssignment.java

    @PlanningEntity(movableEntitySelectionFilter = MovableShiftAssignmentSelectionFilter.class,
        difficultyComparatorClass = ShiftAssignmentDifficultyComparator.class)
    @XStreamAlias("ShiftAssignment")
    public class ShiftAssignment extends AbstractPersistable {

        ...

        @PlanningVariable(valueRangeProviderRefs = {"employeeRange"},
            strengthComparatorClass = EmployeeStrengthComparator.class
            )
        private Employee employee;

        ...

        @CustomShadowVariable( variableListenerClass=UpdatingEmployeeVariableListener.class, 
            sources={@PlanningVariableReference(variableName="employee", entityClass=ShiftAssignment.class)})
        private Employee notifierEmployee;  // TODO is there a better way to notify move factory of changes in problem facts?

UpdatingEmployeeVariableListener.java

    private static final Logger logger = LoggerFactory.getLogger(UpdatingEmployeeVariableListener.class);

    private static final boolean initiallyDirty = true;
    private static Map<Thread, Boolean> employeeShiftAssignmentEntityDirty = new HashMap<Thread, Boolean>();
    private static Map<Thread, Boolean> employeeShiftAssignmentMapDirty = new HashMap<Thread, Boolean>();

    private static final boolean useThreadFlags = false;

    // debug monitoring
    private static Map<Thread, Integer> countDirtyAllFlags = new HashMap<Thread, Integer>();
    private static Map<Thread, Integer> countBeforeEntityAdded = new HashMap<Thread, Integer>();
    private static Map<Thread, Integer> countAfterEntityAdded = new HashMap<Thread, Integer>();
    private static Map<Thread, Integer> countBeforeVariableChanged = new HashMap<Thread, Integer>();
    private static Map<Thread, Integer> countAfterVariableChanged = new HashMap<Thread, Integer>();
    private static Map<Thread, Integer> countBeforeEntityRemoved = new HashMap<Thread, Integer>();
    private static Map<Thread, Integer> countAfterEntityRemoved = new HashMap<Thread, Integer>();

    public UpdatingEmployeeVariableListener() {
        // no action
    }

    private static Thread getActiveThread() {
        return useThreadFlags ? Thread.currentThread() : null;
    }

    public static void setFlagsDirty() {
        countDirtyAllFlags.put(getActiveThread(), 1+countDirtyAllFlags.getOrDefault(getActiveThread(), 0));
        employeeShiftAssignmentEntityDirty.put(getActiveThread(), true);
        employeeShiftAssignmentMapDirty.put(getActiveThread(), true);
    }

    @Override
    public void beforeEntityAdded(@SuppressWarnings("rawtypes") ScoreDirector scoreDirector, ShiftAssignment entity) {
        countBeforeEntityAdded.put(getActiveThread(), 1+countBeforeEntityAdded.getOrDefault(getActiveThread(), 0));
        employeeShiftAssignmentMapDirty.put(getActiveThread(), true);
    }

    @Override
    public void afterEntityAdded(@SuppressWarnings("rawtypes") ScoreDirector scoreDirector, ShiftAssignment entity) {
        countAfterEntityAdded.put(getActiveThread(), 1+countAfterEntityAdded.getOrDefault(getActiveThread(), 0));
        employeeShiftAssignmentMapDirty.put(getActiveThread(), true);
    }

    @Override
    public void beforeVariableChanged(@SuppressWarnings("rawtypes") ScoreDirector scoreDirector,
            ShiftAssignment entity) {
        countBeforeVariableChanged.put(getActiveThread(), 1+countBeforeVariableChanged.getOrDefault(getActiveThread(), 0));
        employeeShiftAssignmentMapDirty.put(getActiveThread(), true);
    }

    @Override
    public void afterVariableChanged(@SuppressWarnings("rawtypes") ScoreDirector scoreDirector,
            ShiftAssignment entity) {
        countAfterVariableChanged.put(getActiveThread(), 1+countAfterVariableChanged.getOrDefault(getActiveThread(), 0));
        employeeShiftAssignmentMapDirty.put(getActiveThread(), true);
    }

    @Override
    public void beforeEntityRemoved(@SuppressWarnings("rawtypes") ScoreDirector scoreDirector, ShiftAssignment entity) {
        countBeforeEntityRemoved.put(getActiveThread(), 1+countBeforeEntityRemoved.getOrDefault(getActiveThread(), 0));
        employeeShiftAssignmentMapDirty.put(getActiveThread(), true);
    }

    @Override
    public void afterEntityRemoved(@SuppressWarnings("rawtypes") ScoreDirector scoreDirector, ShiftAssignment entity) {
        countAfterEntityRemoved.put(getActiveThread(), 1+countAfterEntityRemoved.getOrDefault(getActiveThread(), 0));
        employeeShiftAssignmentMapDirty.put(getActiveThread(), true);
    }

    /**
     * @return the employeeShiftAssignmentEntityDirty
     */
    public static boolean isEmployeeShiftAssignmentEntityDirty() {
        return employeeShiftAssignmentEntityDirty.getOrDefault(getActiveThread(), initiallyDirty);
    }

    /**
     * clears isEntityDirty, implying that the (externally maintained) employee shift assignment entity list has been updated 
     */
    public static void clearEmployeeShiftAssignmentEntityDirty() {
        employeeShiftAssignmentEntityDirty.put(getActiveThread(), false);       
    }

    /**
     * @return the mapDirty (which is depending also on entityDirty)
     */
    public static boolean isEmployeeShiftAssignmentMapDirty() {
        return employeeShiftAssignmentMapDirty.getOrDefault(getActiveThread(), initiallyDirty) || isEmployeeShiftAssignmentEntityDirty();
    }

    /**
     * clears isMapDirty, implying that the (externally maintained) employee shift assignment map has been updated (as well as the underlying entity) 
     */
    public static void clearEmployeeShiftAssignmentMapDirty() {
        clearEmployeeShiftAssignmentEntityDirty();
        employeeShiftAssignmentMapDirty.put(getActiveThread(), false);
        logger.debug("Clearing dirty flag: (AF={}, BEA={}, AEA={}, BVC={}, AVC={}, BER={}, AER={}) thread={}, employeeShiftAssignmentEntityDirty={}, employeeShiftAssignmentMapDirty={}", 
                countDirtyAllFlags.getOrDefault(getActiveThread(), 0),
                countBeforeEntityAdded.getOrDefault(getActiveThread(), 0),
                countAfterEntityAdded.getOrDefault(getActiveThread(), 0),
                countBeforeVariableChanged.getOrDefault(getActiveThread(), 0),
                countAfterVariableChanged.getOrDefault(getActiveThread(), 0),
                countBeforeEntityRemoved.getOrDefault(getActiveThread(), 0),
                countAfterEntityRemoved.getOrDefault(getActiveThread(), 0),
                getActiveThread(),
                employeeShiftAssignmentEntityDirty, 
                employeeShiftAssignmentMapDirty);
        clearCounts();
    }

    private static void clearCounts() {
        countDirtyAllFlags.put(getActiveThread(), 0);
        countBeforeEntityAdded.put(getActiveThread(), 0);
        countAfterEntityAdded.put(getActiveThread(), 0);
        countBeforeVariableChanged.put(getActiveThread(), 0);
        countAfterVariableChanged.put(getActiveThread(), 0);
        countBeforeEntityRemoved.put(getActiveThread(), 0);
        countAfterEntityRemoved.put(getActiveThread(), 0);
    }
}

(note that the maps-of-booleans and maps-of-integers are effectively single booleans and integers here, because the Thread in the map lookup is currently always null due to the final useThreadFlags=false)

I confirmed that only the MoveFactory objects call the MoveFactoryHelper. Similarly, other than the VariableListener annotation above, and the flag queries/clear from MoveFactoryHelper, the only other calls to the UpdatingEmployeeVariableListener are to call UpdatingEmployeeVariableListener.setFlagsDirty() just before solving is started:

        @Override
        public void actionPerformed(ActionEvent e) {
            UpdatingEmployeeVariableListener.setFlagsDirty();
            setSolvingState(true);
            Solution_ problem = solutionBusiness.getSolution();
            new SolveWorker(problem).execute();
        }

and just after solving is stopped:

    solver.terminateEarly();
    UpdatingEmployeeVariableListener.setFlagsDirty();

The template of maps-by-thread is new, but the underlying use of boolean flags has been executing for years successfully:

  1. the flags get set dirty due to optaplanner calls to beforeVariableChanged and afterVariableChanged on the planning entities
  2. the first MoveFactory calls the MoveFactoryHelper which calls UpdatingEmployeeVariableListener.isEmployeeShiftAssignmentMapDirty() which has a result of true. MoveFactoryHelper re-computes based on the current state and then calls to clear the dirty flags
  3. the rest of the MoveFactory objects call MoveFactoryHelper which sees false on the is...Dirty() query, and hence can re-use its computations.
  4. Optaplanner tests the many candidate moves, which dirties the flags again, and after choosing a move for this Step, calls the MoveFactory.createMoveList methods again early in the next step, which repeats the cycle.

Log details showing odd Optaplanner behavior

With the upgrade to 7.12, and without a moveThreadCount configuration line, the code continues to run correctly and reproducibly when I do not have the moveThreadCount xml element defined:

11:20:37.274 INFO  Solving started: time spent (422), best score (0hard/-5340soft), environment mode (REPRODUCIBLE), random (JDK with seed 50).
11:20:37.280 DEBUG     CH step (0), time spent (428), score (0hard/-5340soft), selected move count (1), picked move ((NullEmployee-nochange) 2018-12-25/D/0 {...}).
11:20:37.280 INFO  Construction Heuristic phase (0) ended: time spent (428), best score (0hard/-5340soft), score calculation speed (1000/sec), step total (1).

11:20:37.561 DEBUG Clearing dirty flag: (AF=1, BEA=0, AEA=0, BVC=0, AVC=0, BER=0, AER=0) thread=null, employeeShiftAssignmentEntityDirty={null=false}, employeeShiftAssignmentMapDirty={null=false}
11:20:44.303 DEBUG     LS step (0), time spent (7451), score (0hard/-4919soft), new best score (0hard/-4919soft), accepted/selected move count (1/300), picked move ([(WeekAlign-f) {...}, (WeekAlign-f) {...}]).
11:20:44.310 DEBUG Factories(10) STEP moves: 1594020

11:20:44.312 DEBUG Clearing dirty flag: (AF=0, BEA=0, AEA=0, BVC=13800, AVC=13800, BER=0, AER=0) thread=null, employeeShiftAssignmentEntityDirty={null=false}, employeeShiftAssignmentMapDirty={null=false}
11:20:46.609 DEBUG     LS step (1), time spent (9757), score (0hard/-5266soft),     best score (0hard/-4919soft), accepted/selected move count (1/24), picked move ((SlidePair) 2019-06-04/1/0... 1 shifts {...} <-slide-> {...} 3 shifts ...2019-06-07/1/0).
11:20:46.610 DEBUG Factories(10) STEP moves: 473969

11:20:46.613 DEBUG Clearing dirty flag: (AF=0, BEA=0, AEA=0, BVC=746, AVC=746, BER=0, AER=0) thread=null, employeeShiftAssignmentEntityDirty={null=false}, employeeShiftAssignmentMapDirty={null=false}
11:20:48.124 DEBUG     LS step (2), time spent (11272), score (0hard/-5083soft),     best score (0hard/-4919soft), accepted/selected move count (1/110), picked move ((CloseSlack-newEmplS) 2019-05-28/D/2(7 shifts) <-swap-> {...} 2019-05-21/D/3(7 shifts)).
11:20:48.124 DEBUG Factories(10) STEP moves: 477083

(the Factories debug log line after each step is just to show how many moves the 10 custom factories offered up to the solver in that previous step)

However, when I add the <moveThreadCount>1</moveThreadCount> line to the config file, then I see intermittent calls to rebuild the MoveFactoryHelper in the middle of Optaplanner making variable changes (see LS step 2, below):

10:46:05.413 INFO  Solving started: time spent (360), best score (0hard/-5340soft), environment mode (REPRODUCIBLE), random (JDK with seed 50).
10:46:05.746 DEBUG     CH step (0), time spent (693), score (0hard/-5340soft), selected move count (1), picked move ((NullEmployee-nochange) 2018-12-25/D/0 {...}).
10:46:05.746 INFO  Construction Heuristic phase (0) ended: time spent (693), best score (0hard/-5340soft), score calculation speed (9/sec), step total (1).

10:46:05.949 DEBUG Clearing dirty flag: (AF=1, BEA=0, AEA=0, BVC=0, AVC=0, BER=0, AER=0) thread=null, employeeShiftAssignmentEntityDirty={null=false}, employeeShiftAssignmentMapDirty={null=false}
10:46:13.014 DEBUG     LS step (0), time spent (7961), score (0hard/-4919soft), new best score (0hard/-4919soft), accepted/selected move count (1/300), picked move ([(WeekAlign-f) {...}, (WeekAlign-f) {...}]).
10:46:13.019 DEBUG Factories(10) STEP moves: 1594020

10:46:13.021 DEBUG Clearing dirty flag: (AF=0, BEA=0, AEA=0, BVC=13844, AVC=13844, BER=0, AER=0) thread=null, employeeShiftAssignmentEntityDirty={null=false}, employeeShiftAssignmentMapDirty={null=false}
10:46:14.741 DEBUG     LS step (1), time spent (9688), score (0hard/-5266soft),     best score (0hard/-4919soft), accepted/selected move count (1/19), picked move ((SlidePair) 2019-06-04/1/0... 1 shifts {...} <-slide-> {...} 3 shifts ...2019-06-07/1/0).
10:46:14.741 DEBUG Factories(10) STEP moves: 473969

10:46:14.743 DEBUG Clearing dirty flag: (AF=0, BEA=0, AEA=0, BVC=582, AVC=582, BER=0, AER=0) thread=null, employeeShiftAssignmentEntityDirty={null=false}, employeeShiftAssignmentMapDirty={null=false}
10:46:14.743 DEBUG Clearing dirty flag: (AF=0, BEA=0, AEA=0, BVC=20, AVC=20, BER=0, AER=0) thread=null, employeeShiftAssignmentEntityDirty={null=false}, employeeShiftAssignmentMapDirty={null=false}
10:46:16.444 DEBUG     LS step (2), time spent (11391), score (0hard/-5083soft),     best score (0hard/-4919soft), accepted/selected move count (1/97), picked move ((CloseSlack-newEmplS) {...} 2019-05-28/D/2(7 shifts) <-swap-> {...} 2019-05-21/D/3(7 shifts)).
10:46:16.445 DEBUG Factories(10) STEP moves: 1580032

Two comments: First, there is some loss of reproducible execution, for example notice there are 13800 before/after variable changes originally, and 13844 now. I am presuming this is related to the "enabling" of multithreading, even though only one thread will be used.

Second, the number of variable changes and the details of the "split", where two calls are seen to clear the dirty flags (after rebuilding MoveFactoryHelper) has some variation from run to run, leading me to think it is a multithread race issue, for example:

12:16:27.712 INFO  Solving started: time spent (375), best score (0hard/-5340soft), environment mode (REPRODUCIBLE), random (JDK with seed 50).
12:16:28.043 DEBUG     CH step (0), time spent (706), score (0hard/-5340soft), selected move count (1), picked move ((NullEmployee-nochange) 2018-12-25/D/0 {...}).
12:16:28.043 INFO  Construction Heuristic phase (0) ended: time spent (706), best score (0hard/-5340soft), score calculation speed (9/sec), step total (1).

12:16:28.288 DEBUG Clearing dirty flag: (AF=1, BEA=0, AEA=0, BVC=0, AVC=0, BER=0, AER=0) thread=null, employeeShiftAssignmentEntityDirty={null=false}, employeeShiftAssignmentMapDirty={null=false}
12:16:35.148 DEBUG     LS step (0), time spent (7811), score (0hard/-4919soft), new best score (0hard/-4919soft), accepted/selected move count (1/300), picked move ([(WeekAlign-f) {...}, (WeekAlign-f) {...}]).
12:16:35.158 DEBUG Factories(10) STEP moves: 1594020

12:16:35.160 DEBUG Clearing dirty flag: (AF=0, BEA=0, AEA=0, BVC=13821, AVC=13821, BER=0, AER=0) thread=null, employeeShiftAssignmentEntityDirty={null=false}, employeeShiftAssignmentMapDirty={null=false}
12:16:35.160 DEBUG Clearing dirty flag: (AF=0, BEA=0, AEA=0, BVC=0, AVC=0, BER=0, AER=0) thread=null, employeeShiftAssignmentEntityDirty={null=false}, employeeShiftAssignmentMapDirty={null=false}
12:16:37.050 DEBUG     LS step (1), time spent (9713), score (0hard/-5266soft),     best score (0hard/-4919soft), accepted/selected move count (1/22), picked move ((SlidePair) 2019-06-04/1/0... 1 shifts {...} <-slide-> {...} 3 shifts ...2019-06-07/1/0).
12:16:37.053 DEBUG Factories(10) STEP moves: 1576812

12:16:37.054 DEBUG Clearing dirty flag: (AF=0, BEA=0, AEA=0, BVC=763, AVC=763, BER=0, AER=0) thread=null, employeeShiftAssignmentEntityDirty={null=false}, employeeShiftAssignmentMapDirty={null=false}
12:16:37.055 DEBUG Clearing dirty flag: (AF=0, BEA=0, AEA=0, BVC=23, AVC=23, BER=0, AER=0) thread=null, employeeShiftAssignmentEntityDirty={null=false}, employeeShiftAssignmentMapDirty={null=false}
12:16:39.414 DEBUG     LS step (2), time spent (12077), score (0hard/-5083soft),     best score (0hard/-4919soft), accepted/selected move count (1/98), picked move ((CloseSlack-newEmplS) {...} 2019-05-28/D/2(7 shifts) <-swap-> {...} 2019-05-21/D/3(7 shifts)).
12:16:39.414 DEBUG Factories(10) STEP moves: 1580534

Thus my two questions:

  1. Is it right that Optaplanner does not behave the same with no moveThreadCount definition versus a definition of 1? That seems unexpected for a user.

  2. What might I, or Optaplanner, be doing that is causing the early call to a custom MoveFactory (to generate Move lists) before all Optaplanner's variable changes are complete after the previous step, even in a single-thread configuration? I'm wondering if the "chosen move" is implemented and new createMoveList calls begin before the last of the move-scoring/testing threads from the "last" move are all paused. Even if so, I don't know why that would result in non-reproducible execution here, unless the still-running threads are making random move selections (which would seem to yield a non-reproducible execution overall).

This is occurring in both "run" and "debug" execution environments.

Thank you.


Solution

  • Appears resolved by Optaplanner 7.15.0. Updating to this release will fix the problem.