I'm trying to select rows in a dataframe that don't share a uniqueID with an interim dataframe but when I use the following
NotProcessed= FileListDF.loc[FileListDF['DocID'] != InterimDF.DocID.tolist()]
I get
ValueError: Arrays were different lengths: 255 vs 246
when I'm expecting a dataframe containing 9 rows that are in FileListDF but not in InterimDF
This question appears to have encountered a similar issue but the answer isn't clear and when I try it I get
'UUID' object is not iterable
I'm not sure if its connected to me using uuid.uuid4()
to create the DocID in the 1st instance
Sample Data
FileListDF..to_json
<bound method NDFrame.to_json of Class DocID
0 CC c2571c68-0705-4c87-a42b-0971a73fedf4
1 CC ab317908-8720-47ce-a8f6-453169259e48
2 CC 2b3d1aee-b4f6-4278-9db2-c7b6819bd14f
3 CC 9d4b9084-f623-445c-86d9-0458d6fb4135
4 CC afe9dbb5-38a2-4c4e-9640-b1da1d37b66f
5 CC 56fa7edb-5020-44e7-98bc-5b0878fbf9ac
6 CC 7c27bcb8-b381-465e-a6e9-edc38cf24e5a
7 CC a56e6f13-bf05-474a-9930-5f0ebad33eba
8 CC 81c31d76-cec3-4f1c-8ec1-107ecc1169d4
9 CC 1f8acf61-0a08-4b42-8ad1-38eeebc85ff2
10 CC d6a6704c-21c6-4dc5-b9bd-de9e67ee7970
11 CC d63dde24-9a2b-410d-aa7d-6a854b646a12
12 CC a80367f9-36af-45bd-878e-d19461a3d3df
13 CC 31f74aff-e8df-43bb-a7d4-314caea6de20
14 CC 99052ee4-2481-4c81-92dd-81b446e76c15
15 CC 3827d0b4-5d0a-490a-a561-bf631bae6492
16 CC fdbde364-776d-4532-8aca-874144a85df8
17 CC 4dfa630f-20b0-4a9e-bcf0-cdd4f18bb215
18 CC f5b16b14-a058-4f2d-8461-25e68b02c68b
19 CC a9f23eee-5b44-4d4e-baf8-19b78b4e3508
20 CC 344bfb49-3b22-4e2a-b815-a9eb46cbed1b
21 CC ade5b2c3-5962-4d6d-9252-4f670b127682
22 CC 43fa1fb2-3681-4372-ab5a-f12059b3741c
23 CC 77f19056-fef8-4353-9740-01afa5557f47
24 CC 1c7cfc8a-57c6-4ad3-bce7-63b006352ed1
25 CC e6db5e3c-4b3e-4b27-9558-e7334032c9c0
26 CC 68adb3b6-dfeb-4645-8ed2-610ae896e8c4
27 CC c0163ee2-9e34-4636-a734-051151c50f29
28 CC 7b7d8549-8697-4b19-a033-e0068f5aa4d7
29 CC 7a976925-57c5-4972-a9e6-278dd322f6b4
.. ... ...
225 JA cd3a901e-a077-472e-a100-f082f011cc72
226 JA d462729a-e1bc-4502-abc1-8dbd2b21103d
227 JA f5338d8d-13dc-4a03-b720-1056419deb68
228 JA 37a41e22-3698-459d-b3cd-0d419a0f3f48
229 JA e48c4463-565d-4f09-9ccc-03e0506ddf1e
230 JA 8531a863-4b0b-4bfa-9c4a-05056e7c1165
231 JA 01e72b43-730d-42a7-9405-6fd45b2141b5
232 JA 5ebbf1d4-48f1-4d1f-a5bc-1e865c92146f
233 JA c3103348-2209-42de-8005-06a9af3ca43f
234 JA b603b2cd-c4c9-4a2e-a52f-361a1e06b277
235 JA 32bf7eba-7425-4f9c-ad1f-95680c8ad7cc
236 JA 514f667e-5ba8-40fa-8c6f-bbf06bb6ff2f
237 JA d08b06d1-b8fc-432b-94af-a25d9ef8b4fe
238 JA d32419e8-e327-470b-95ba-1a2e20cff6d7
239 NA abdd46ff-3b81-4b5e-85a1-95db783ff2fa
240 NA 3587088d-a592-4f21-93fe-760e9276fbd9
241 NA 8eb7ae93-ab3a-4e3c-ad42-129949dd0545
242 NA 8ff017c0-aae2-43e8-b87b-2889b9bb0df8
243 NA 1ad7c9d6-8da4-46e8-97ee-c0a6c3299d27
244 NA fa5e928f-a906-480f-824c-4615a155cc2a
245 NA 1d1b709d-3472-4bd1-961b-453368b6845a
246 NA e302d87f-30d7-46aa-9358-5705a31cf31b
247 NA ad1043ba-5807-482b-a4a0-624bbfbd36af
248 NA 94c4453f-1309-4931-9a8f-5400a1b104f5
249 NA 37aca929-a2fd-4055-a416-865b92ebe58c
250 NA ad64522f-054a-4fcc-b1c2-fbf70eb06a62
251 NA c30a75f1-c144-4d62-a3bb-9f41615b9ca0
252 NA 559c0a5e-7e76-4e4f-a560-1c4d52b31384
253 NA 5a123bb7-9ffb-46b2-b315-6e2f0aca7aed
254 NA 595da36e-ec30-40e4-abc1-b98400ae9303
This is the dataframe of DocIDs that I want to remove matching values from FileListDF above to find the same number of rows that are in FileListDF but not in InterimDF
InterimDF.to_json
<bound method NDFrame.to_json of DocID
0 c2571c68-0705-4c87-a42b-0971a73fedf4
1 ab317908-8720-47ce-a8f6-453169259e48
2 2b3d1aee-b4f6-4278-9db2-c7b6819bd14f
3 9d4b9084-f623-445c-86d9-0458d6fb4135
4 56fa7edb-5020-44e7-98bc-5b0878fbf9ac
5 7c27bcb8-b381-465e-a6e9-edc38cf24e5a
6 a56e6f13-bf05-474a-9930-5f0ebad33eba
7 81c31d76-cec3-4f1c-8ec1-107ecc1169d4
8 1f8acf61-0a08-4b42-8ad1-38eeebc85ff2
9 d6a6704c-21c6-4dc5-b9bd-de9e67ee7970
10 d63dde24-9a2b-410d-aa7d-6a854b646a12
11 a80367f9-36af-45bd-878e-d19461a3d3df
12 31f74aff-e8df-43bb-a7d4-314caea6de20
13 99052ee4-2481-4c81-92dd-81b446e76c15
14 3827d0b4-5d0a-490a-a561-bf631bae6492
15 fdbde364-776d-4532-8aca-874144a85df8
16 4dfa630f-20b0-4a9e-bcf0-cdd4f18bb215
17 f5b16b14-a058-4f2d-8461-25e68b02c68b
18 a9f23eee-5b44-4d4e-baf8-19b78b4e3508
19 344bfb49-3b22-4e2a-b815-a9eb46cbed1b
20 ade5b2c3-5962-4d6d-9252-4f670b127682
21 43fa1fb2-3681-4372-ab5a-f12059b3741c
22 77f19056-fef8-4353-9740-01afa5557f47
23 1c7cfc8a-57c6-4ad3-bce7-63b006352ed1
24 e6db5e3c-4b3e-4b27-9558-e7334032c9c0
25 68adb3b6-dfeb-4645-8ed2-610ae896e8c4
26 c0163ee2-9e34-4636-a734-051151c50f29
27 7b7d8549-8697-4b19-a033-e0068f5aa4d7
28 7a976925-57c5-4972-a9e6-278dd322f6b4
29 6533a84b-a81d-411a-b44e-3090cd01396d
.. ...
216 f7217b94-b939-4bcf-b8c8-d385e4473c63
217 74522cf3-7fd5-4b84-b175-23f84ede25ae
218 cd3a901e-a077-472e-a100-f082f011cc72
219 d462729a-e1bc-4502-abc1-8dbd2b21103d
220 f5338d8d-13dc-4a03-b720-1056419deb68
221 37a41e22-3698-459d-b3cd-0d419a0f3f48
222 e48c4463-565d-4f09-9ccc-03e0506ddf1e
223 8531a863-4b0b-4bfa-9c4a-05056e7c1165
224 01e72b43-730d-42a7-9405-6fd45b2141b5
225 5ebbf1d4-48f1-4d1f-a5bc-1e865c92146f
226 c3103348-2209-42de-8005-06a9af3ca43f
227 b603b2cd-c4c9-4a2e-a52f-361a1e06b277
228 d08b06d1-b8fc-432b-94af-a25d9ef8b4fe
229 d32419e8-e327-470b-95ba-1a2e20cff6d7
230 abdd46ff-3b81-4b5e-85a1-95db783ff2fa
231 3587088d-a592-4f21-93fe-760e9276fbd9
232 8eb7ae93-ab3a-4e3c-ad42-129949dd0545
233 8ff017c0-aae2-43e8-b87b-2889b9bb0df8
234 1ad7c9d6-8da4-46e8-97ee-c0a6c3299d27
235 fa5e928f-a906-480f-824c-4615a155cc2a
236 1d1b709d-3472-4bd1-961b-453368b6845a
237 e302d87f-30d7-46aa-9358-5705a31cf31b
238 ad1043ba-5807-482b-a4a0-624bbfbd36af
239 94c4453f-1309-4931-9a8f-5400a1b104f5
240 37aca929-a2fd-4055-a416-865b92ebe58c
241 ad64522f-054a-4fcc-b1c2-fbf70eb06a62
242 c30a75f1-c144-4d62-a3bb-9f41615b9ca0
243 559c0a5e-7e76-4e4f-a560-1c4d52b31384
244 5a123bb7-9ffb-46b2-b315-6e2f0aca7aed
245 595da36e-ec30-40e4-abc1-b98400ae9303
So if I understand correctly, you have two dataframes that each have a column with some overlap between the two.
For simplicity, I'm going to call them df_one
and df_two
and they both have an id
column.
You can use isin()
to filter down to just the overlap, or the converse to filter those out.
eg
import pandas as pd
df_one = pd.DataFrame(data={'id':[1,2,3,4,5]})
df_two = pd.DataFrame(data={'id':[3,4,5,6,7]})
new_ids = ~df_one['id'].isin(df_two['id'])
df_only_in_one = df_one[new_ids]