Trying to do a simple set difference in Python, and getting results that indicate that the difference operator is doing nothing. E.g., I have this code:
python version 2.7.15+
assert isinstance(left_frame, h2o.H2OFrame)
assert isinstance(right_frame, h2o.H2OFrame)
assert isinstance(left_key, str)
assert isinstance(right_key, str)
# ensure that the primary_key exists in both frames
assert left_key in left_frame.columns, 'left_key: {} does not exist in left_frame'.format(left_key)
assert right_key in right_frame.columns, 'right_key: {} does not exist in right_frame'.format(right_key)
# ensure that the primary_key is the only common column between the left and right frame
left_non_pk_cols = set(left_frame.columns) - set(left_key)
assert left_key not in left_non_pk_cols, '%s' % left_key
right_non_pk_cols = set(right_frame.columns) - set(right_key)
assert right_key not in right_non_pk_cols, '%s' % right_key
left_non_pk_cols_in_right = left_non_pk_cols.intersection(right_non_pk_cols)
assert len(left_non_pk_cols_in_right) == 0,\
'The primary_key is not the only common column between frames, h2o merge will not work as expected\n%s\n%s\n%s' \
% (left_non_pk_cols, right_non_pk_cols, left_non_pk_cols_in_right)
I get the error
assert left_key not in left_non_pk_cols, '%s' % left_key
AssertionError: <the left_key value>
This is really weird to me. Running a simplified example case in a terminal (with the same Python version)
assert u'1' not in (set([u'1', u'2', u'3']) - set(u'1'))
# noting that the H2OFrames `.columns` field is a list of unicode strings
throws no error at all and completes as expected (when printing the resulting set, everything looks as it should, with no u'1' element).
Using the .difference() method rather than the - operator does not result in any difference either.
Does anyone know what could be going on here or other things to do to get more debugging info?
The argument to set() is an iterable, and it creates a set of each element of the iterable. So if left_key is a string, set(left_key) will create a set of each of the unique characters of the string, not a set whose element is the string.
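A short sketch of that behaviour may help. The column names below are made up for illustration (they are not from the question's frames). It also suggests why the simplified terminal test passed: with a one-character key, the set of characters happens to be the same as the one-element set.

```python
# Hypothetical column names, used only for illustration.
columns = ['id', 'name', 'age']
key = 'id'

# set(key) iterates over the string, giving one element per character.
chars = set(key)
assert chars == {'i', 'd'}

# Neither 'i' nor 'd' is a column name, so the difference removes nothing
# and the key is still present afterwards.
remaining = set(columns) - chars
assert key in remaining

# With a one-character key, set(key) coincides with the one-element set,
# so the simplified test case in the question passes.
assert set('1') == {'1'}
assert '1' not in (set(['1', '2', '3']) - set('1'))
```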
The solution is to use set([left_key]). The argument will be the list, and the set will then contain its single element, the string. Or you can just use a set literal, {left_key}:
left_non_pk_cols = set(left_frame.columns) - {left_key}
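As a rough sketch, the corrected check can be tried on plain lists standing in for the frames' .columns. The helper name non_key_columns and the sample column names are assumptions made for this example, not part of h2o or the original code.

```python
def non_key_columns(columns, key):
    """Return the column names other than `key` (hypothetical helper)."""
    # {key} is a one-element set holding the whole string.
    return set(columns) - {key}


# Plain lists standing in for left_frame.columns and right_frame.columns.
left_columns = ['id', 'name', 'age']
right_columns = ['id', 'city']

left_rest = non_key_columns(left_columns, 'id')
right_rest = non_key_columns(right_columns, 'id')

# The key is gone from both sets, and no other column is shared.
assert 'id' not in left_rest
assert 'id' not in right_rest
assert left_rest.intersection(right_rest) == set()
```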
Another way is to just remove the element from the set.
left_non_pk_cols = set(left_frame.columns)
left_non_pk_cols.discard(left_key)
I used discard rather than remove because it doesn't signal an error if the element isn't found.
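The difference between the two methods can be seen in a small sketch (the set contents here are made up for illustration):

```python
cols = set(['id', 'name'])

# discard() silently does nothing when the element is absent.
cols.discard('missing')
assert cols == {'id', 'name'}

# remove() raises KeyError when the element is absent.
try:
    cols.remove('missing')
    raised = False
except KeyError:
    raised = True
assert raised

# Both methods remove an element that is present.
cols.discard('id')
assert cols == {'name'}
```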