Cannot find correct intersection of two string arrays when there is a comma in the strings


I have two CSV files: "userfeatures" and "itemfeatures". Each line in "userfeatures" relates to a specific user. For example, the first line of the "userfeatures" file is:

"005c2e08","Action","nm0000148","dir_ nm0764316","USA"

I need to find the intersection of this line with every line of the second file, "itemfeatures". (Actually, I need to repeat this procedure for all users, i.e., for all lines of "userfeatures".)

So, the first comparison will be with the first line of "itemfeatures", which is:

"tt0306047","Comedy,Action","nm0267506,nm0000221,nm0356021","dir_ nm0001878","USA"

The result of the intersection should be ["Action", "USA"], but unfortunately my code only finds ["USA"] as a match. Here is what I've tried so far:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class Main {
  public static void main(String[] args) throws Exception {
    // Read every item line up front: a BufferedReader can only be consumed once,
    // so reading it inside the user loop would serve only the first user line.
    List<String> itemlines = new ArrayList<>();
    try (BufferedReader itemfeatures = new BufferedReader(new FileReader("ItemFeatureVectorsTest.csv"))) {
      for (String Iline = itemfeatures.readLine(); Iline != null; Iline = itemfeatures.readLine()) {
        itemlines.add(Iline);
      }
    }
    try (BufferedReader userfeatures = new BufferedReader(new FileReader("userFeatureVectorsTest.csv"))) {
      String Uline;
      while ((Uline = userfeatures.readLine()) != null) {
        for (String Iline : itemlines) {
          System.out.println(Uline);
          System.out.println(Iline);
          System.out.println(intersect(Uline, Iline));
          System.out.println(union(Uline, Iline));
        }
      }
    }
  }

  // Splits both lines on every comma and keeps the tokens they have in common.
  static Set<String> intersect(String Uline, String Iline) {
    Set<String> result = new HashSet<String>(Arrays.asList(Uline.split(",")));
    Set<String> IlineSet = new HashSet<String>(Arrays.asList(Iline.split(",")));
    result.retainAll(IlineSet);
    return result;
  }

  // Splits both lines on every comma and merges the tokens of both.
  static Set<String> union(String Uline, String Iline) {
    Set<String> result = new HashSet<String>(Arrays.asList(Uline.split(",")));
    Set<String> IlineSet = new HashSet<String>(Arrays.asList(Iline.split(",")));
    result.addAll(IlineSet);
    return result;
  }
}

I think the problem is related to Uline.split(",") and Iline.split(",") because they consider "Comedy,Action" as one word, so the code cannot find [Action] as the intersection of "Comedy,Action" and "Action". I would appreciate any ideas on how to fix this issue.


Solution

  • Try removing the double quotes from both strings before splitting them.

    When you split

    "tt0306047","Comedy,Action","nm0267506,nm0000221,nm0356021","dir_ nm0001878","USA"

    on commas, the quoted field "Comedy,Action" is cut in two, so you get an

    Action"

    token with only a trailing quote, which will never match the

    "Action"

    token from the user line.
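
    The suggestion above can be sketched as follows. This is a minimal illustration, not the asker's final code: the class name QuoteStrippingIntersection and the helper tokens are hypothetical names introduced here, and the sketch assumes that no field value contains a literal double quote and that every comma, including those inside quoted fields, should separate values.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

class QuoteStrippingIntersection {

    // Hypothetical helper: drop every double quote, then split on commas.
    // A LinkedHashSet keeps the tokens in the order they appear in the line.
    static Set<String> tokens(String line) {
        return new LinkedHashSet<>(Arrays.asList(line.replace("\"", "").split(",")));
    }

    // Tokens that occur in both lines, in the order of the user line.
    static Set<String> intersect(String uline, String iline) {
        Set<String> result = tokens(uline);
        result.retainAll(tokens(iline));
        return result;
    }

    public static void main(String[] args) {
        String uline = "\"005c2e08\",\"Action\",\"nm0000148\",\"dir_ nm0764316\",\"USA\"";
        String iline = "\"tt0306047\",\"Comedy,Action\",\"nm0267506,nm0000221,nm0356021\","
                + "\"dir_ nm0001878\",\"USA\"";
        System.out.println(intersect(uline, iline)); // prints [Action, USA]
    }
}
```

    With the quotes gone, "Comedy,Action" contributes the plain tokens Comedy and Action, so Action matches the user's Action. The same tokens helper could back the union method as well.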