Search code examples
pythondataframeselectuuid

select rows in dataframe not in different dataframe with different shape arrays and where object is not iterable


I'm trying to select rows in a dataframe that don't share a uniqueID with an interim dataframe but when I use the following NotProcessed= FileListDF.loc[FileListDF['DocID'] != InterimDF.DocID.tolist()] I get

ValueError: Arrays were different lengths: 255 vs 246

when I'm expecting a dataframe containing 9 rows that are in FileListDF but not in InterimDF

This question appears to have encountered a similar issue but the answer isn't clear and when I try it I get

'UUID' object is not iterable

I'm not sure if its connected to me using uuid.uuid4() to create the DocID in the 1st instance

Sample Data

FileListDF..to_json

<bound method NDFrame.to_json of     Class                                 DocID
0      CC  c2571c68-0705-4c87-a42b-0971a73fedf4
1      CC  ab317908-8720-47ce-a8f6-453169259e48
2      CC  2b3d1aee-b4f6-4278-9db2-c7b6819bd14f
3      CC  9d4b9084-f623-445c-86d9-0458d6fb4135
4      CC  afe9dbb5-38a2-4c4e-9640-b1da1d37b66f
5      CC  56fa7edb-5020-44e7-98bc-5b0878fbf9ac
6      CC  7c27bcb8-b381-465e-a6e9-edc38cf24e5a
7      CC  a56e6f13-bf05-474a-9930-5f0ebad33eba
8      CC  81c31d76-cec3-4f1c-8ec1-107ecc1169d4
9      CC  1f8acf61-0a08-4b42-8ad1-38eeebc85ff2
10     CC  d6a6704c-21c6-4dc5-b9bd-de9e67ee7970
11     CC  d63dde24-9a2b-410d-aa7d-6a854b646a12
12     CC  a80367f9-36af-45bd-878e-d19461a3d3df
13     CC  31f74aff-e8df-43bb-a7d4-314caea6de20
14     CC  99052ee4-2481-4c81-92dd-81b446e76c15
15     CC  3827d0b4-5d0a-490a-a561-bf631bae6492
16     CC  fdbde364-776d-4532-8aca-874144a85df8
17     CC  4dfa630f-20b0-4a9e-bcf0-cdd4f18bb215
18     CC  f5b16b14-a058-4f2d-8461-25e68b02c68b
19     CC  a9f23eee-5b44-4d4e-baf8-19b78b4e3508
20     CC  344bfb49-3b22-4e2a-b815-a9eb46cbed1b
21     CC  ade5b2c3-5962-4d6d-9252-4f670b127682
22     CC  43fa1fb2-3681-4372-ab5a-f12059b3741c
23     CC  77f19056-fef8-4353-9740-01afa5557f47
24     CC  1c7cfc8a-57c6-4ad3-bce7-63b006352ed1
25     CC  e6db5e3c-4b3e-4b27-9558-e7334032c9c0
26     CC  68adb3b6-dfeb-4645-8ed2-610ae896e8c4
27     CC  c0163ee2-9e34-4636-a734-051151c50f29
28     CC  7b7d8549-8697-4b19-a033-e0068f5aa4d7
29     CC  7a976925-57c5-4972-a9e6-278dd322f6b4
..    ...                                   ...
225    JA  cd3a901e-a077-472e-a100-f082f011cc72
226    JA  d462729a-e1bc-4502-abc1-8dbd2b21103d
227    JA  f5338d8d-13dc-4a03-b720-1056419deb68
228    JA  37a41e22-3698-459d-b3cd-0d419a0f3f48
229    JA  e48c4463-565d-4f09-9ccc-03e0506ddf1e
230    JA  8531a863-4b0b-4bfa-9c4a-05056e7c1165
231    JA  01e72b43-730d-42a7-9405-6fd45b2141b5
232    JA  5ebbf1d4-48f1-4d1f-a5bc-1e865c92146f
233    JA  c3103348-2209-42de-8005-06a9af3ca43f
234    JA  b603b2cd-c4c9-4a2e-a52f-361a1e06b277
235    JA  32bf7eba-7425-4f9c-ad1f-95680c8ad7cc
236    JA  514f667e-5ba8-40fa-8c6f-bbf06bb6ff2f
237    JA  d08b06d1-b8fc-432b-94af-a25d9ef8b4fe
238    JA  d32419e8-e327-470b-95ba-1a2e20cff6d7
239    NA  abdd46ff-3b81-4b5e-85a1-95db783ff2fa
240    NA  3587088d-a592-4f21-93fe-760e9276fbd9
241    NA  8eb7ae93-ab3a-4e3c-ad42-129949dd0545
242    NA  8ff017c0-aae2-43e8-b87b-2889b9bb0df8
243    NA  1ad7c9d6-8da4-46e8-97ee-c0a6c3299d27
244    NA  fa5e928f-a906-480f-824c-4615a155cc2a
245    NA  1d1b709d-3472-4bd1-961b-453368b6845a
246    NA  e302d87f-30d7-46aa-9358-5705a31cf31b
247    NA  ad1043ba-5807-482b-a4a0-624bbfbd36af
248    NA  94c4453f-1309-4931-9a8f-5400a1b104f5
249    NA  37aca929-a2fd-4055-a416-865b92ebe58c
250    NA  ad64522f-054a-4fcc-b1c2-fbf70eb06a62
251    NA  c30a75f1-c144-4d62-a3bb-9f41615b9ca0
252    NA  559c0a5e-7e76-4e4f-a560-1c4d52b31384
253    NA  5a123bb7-9ffb-46b2-b315-6e2f0aca7aed
254    NA  595da36e-ec30-40e4-abc1-b98400ae9303

This is the dataframe of DocIDs that I want to remove matching values from FileListDF above to find the same number of rows that are in FileListDF but not in InterimDF

InterimDF.to_json

<bound method NDFrame.to_json of                                     DocID
0    c2571c68-0705-4c87-a42b-0971a73fedf4
1    ab317908-8720-47ce-a8f6-453169259e48
2    2b3d1aee-b4f6-4278-9db2-c7b6819bd14f
3    9d4b9084-f623-445c-86d9-0458d6fb4135
4    56fa7edb-5020-44e7-98bc-5b0878fbf9ac
5    7c27bcb8-b381-465e-a6e9-edc38cf24e5a
6    a56e6f13-bf05-474a-9930-5f0ebad33eba
7    81c31d76-cec3-4f1c-8ec1-107ecc1169d4
8    1f8acf61-0a08-4b42-8ad1-38eeebc85ff2
9    d6a6704c-21c6-4dc5-b9bd-de9e67ee7970
10   d63dde24-9a2b-410d-aa7d-6a854b646a12
11   a80367f9-36af-45bd-878e-d19461a3d3df
12   31f74aff-e8df-43bb-a7d4-314caea6de20
13   99052ee4-2481-4c81-92dd-81b446e76c15
14   3827d0b4-5d0a-490a-a561-bf631bae6492
15   fdbde364-776d-4532-8aca-874144a85df8
16   4dfa630f-20b0-4a9e-bcf0-cdd4f18bb215
17   f5b16b14-a058-4f2d-8461-25e68b02c68b
18   a9f23eee-5b44-4d4e-baf8-19b78b4e3508
19   344bfb49-3b22-4e2a-b815-a9eb46cbed1b
20   ade5b2c3-5962-4d6d-9252-4f670b127682
21   43fa1fb2-3681-4372-ab5a-f12059b3741c
22   77f19056-fef8-4353-9740-01afa5557f47
23   1c7cfc8a-57c6-4ad3-bce7-63b006352ed1
24   e6db5e3c-4b3e-4b27-9558-e7334032c9c0
25   68adb3b6-dfeb-4645-8ed2-610ae896e8c4
26   c0163ee2-9e34-4636-a734-051151c50f29
27   7b7d8549-8697-4b19-a033-e0068f5aa4d7
28   7a976925-57c5-4972-a9e6-278dd322f6b4
29   6533a84b-a81d-411a-b44e-3090cd01396d
..                                    ...
216  f7217b94-b939-4bcf-b8c8-d385e4473c63
217  74522cf3-7fd5-4b84-b175-23f84ede25ae
218  cd3a901e-a077-472e-a100-f082f011cc72
219  d462729a-e1bc-4502-abc1-8dbd2b21103d
220  f5338d8d-13dc-4a03-b720-1056419deb68
221  37a41e22-3698-459d-b3cd-0d419a0f3f48
222  e48c4463-565d-4f09-9ccc-03e0506ddf1e
223  8531a863-4b0b-4bfa-9c4a-05056e7c1165
224  01e72b43-730d-42a7-9405-6fd45b2141b5
225  5ebbf1d4-48f1-4d1f-a5bc-1e865c92146f
226  c3103348-2209-42de-8005-06a9af3ca43f
227  b603b2cd-c4c9-4a2e-a52f-361a1e06b277
228  d08b06d1-b8fc-432b-94af-a25d9ef8b4fe
229  d32419e8-e327-470b-95ba-1a2e20cff6d7
230  abdd46ff-3b81-4b5e-85a1-95db783ff2fa
231  3587088d-a592-4f21-93fe-760e9276fbd9
232  8eb7ae93-ab3a-4e3c-ad42-129949dd0545
233  8ff017c0-aae2-43e8-b87b-2889b9bb0df8
234  1ad7c9d6-8da4-46e8-97ee-c0a6c3299d27
235  fa5e928f-a906-480f-824c-4615a155cc2a
236  1d1b709d-3472-4bd1-961b-453368b6845a
237  e302d87f-30d7-46aa-9358-5705a31cf31b
238  ad1043ba-5807-482b-a4a0-624bbfbd36af
239  94c4453f-1309-4931-9a8f-5400a1b104f5
240  37aca929-a2fd-4055-a416-865b92ebe58c
241  ad64522f-054a-4fcc-b1c2-fbf70eb06a62
242  c30a75f1-c144-4d62-a3bb-9f41615b9ca0
243  559c0a5e-7e76-4e4f-a560-1c4d52b31384
244  5a123bb7-9ffb-46b2-b315-6e2f0aca7aed
245  595da36e-ec30-40e4-abc1-b98400ae9303

Solution

  • So if I understand correctly, you have two dataframes that each have a column with some overlap between the two.

    For simplicity, I'm going to call them df_one and df_two and they both have an id column.

    You can use isin() to filter down to just the overlap, or the converse to filter those out.

    eg

    import pandas as pd
    
    df_one = pd.DataFrame(data={'id':[1,2,3,4,5]})
    df_two = pd.DataFrame(data={'id':[3,4,5,6,7]})
    
    new_ids = ~df_one['id'].isin(df_two['id'])
    
    df_only_in_one = df_one[new_ids]