
Reading a txt file and recording its pages - Java


I'm a new software student and this is my first time here, so sorry if I'm posting in the wrong place. I have an assignment that consists of reading a text file with a lot of lines (40 lines make a page), splitting it into words and, for each word, recording the number of occurrences as well as all the pages where it occurs.

The catch is that I'm only allowed to use linked lists (I can write my own methods) and arrays. I've spent a good while on it, but so far I can only store the split words, and I'm struggling with the logic for recording the pages and frequencies of the words. Where do I store the page numbers for each word? Should it be in an array? Or should I create a "Word" class to store the occurrences and the page numbers? If so, should I also create a "Page" class to manage it? Honestly, I've already tried both, but neither worked for me, and in the end the code got slow and messy and I only got confused.

Thanks in advance for any help!

EDIT: Here's what I have for the reading and splitting part of the problem. This time I decided to store all of the words in a linked list. Here I changed to 3 lines = 1 page. I have also included a few lines of the text at the end. The problem is that I don't know where and how to keep track of the page numbers for each group of 3 lines.

public void loadBook(){  
    Path path1 = Paths.get("alice.txt");
    int countLines = 0;
    stopwords(); //another method to load the stopwords
    try (BufferedReader reader = Files.newBufferedReader(path1, Charset.defaultCharset())) {
        String line = null;
        while ((line = reader.readLine()) != null) {
            String[] split = line.split(" ");
            ++countLines;
            for (int i = 0; i < split.length; ++i){
                if (stopwords.notContains(split[i].toLowerCase())){
                    pages.add(split[i]);
                }
            }   
            if (countLines % 3 == 0) {
                countPages++;
                countLines = 0;
            }
        }
    } catch (IOException e) {
        System.err.format("Error reading the file: %s%n", e);
    }
}

CHAPTER I. Down the Rabbit-Hole

Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, 'and what is the use of a book,' thought Alice 'without pictures or conversations?'

So she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid), whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit with pink eyes ran close by her.

There was nothing so VERY remarkable in that; nor did Alice think it so VERY much out of the way to hear the Rabbit say to itself, 'Oh dear!


Solution

  • First I thought about a class to encapsulate the data you need to extract from the text. You need the individual words and, for each word, both a count of how many times it appears in the entire text and a list of the pages where it occurs. So I wrote the following WordRef class.

    import java.util.LinkedList;
    import java.util.Objects;
    
    public class WordRef {
        /** Number of times 'word' occurs in the text. */
        private int occurrences;
    
        /** The actual word. */
        private String word;
    
        /** List of page numbers where 'word' appears. */
        private LinkedList<Integer> pages;
    
        /**
         * Creates and returns instance of this class.
         * 
         * @param word - the actual word.
         */
        public WordRef(String word) {
            Objects.requireNonNull(word, "null word");
            this.word = word;
            occurrences = 1;
            pages = new LinkedList<Integer>();
        }
    
        /** Increment the number of occurrences of 'word'. */
        public void addOccurrence() {
            occurrences++;
        }
    
        /**
         * Add 'page' to the list of pages containing 'word'.
         * 
         * @param page - number of page to add.
         */
        public void addPage(Integer page) {
            if (!pages.contains(page)) {
                pages.add(page);
            }
        }
    
        /**
         * @return Number of occurrences of 'word'.
         */
        public int getOccurrences() {
            return occurrences;
        }
    
        /**
         * @return The actual 'word'.
         */
        public String getWord() {
            return word;
        }
    
        /**
         * Two 'WordRef' instances are equal if they both contain the exact, same word.
         */
        @Override
        public boolean equals(Object obj) {
            boolean equal = false;
            if (obj != null) {
                Class<?> objClass = obj.getClass();
                if (objClass.equals(getClass())) {
                    WordRef other = (WordRef) obj;
                    String otherWord = other.getWord();
                    equal = word.equals(otherWord);
                }
            }
            return equal;
        }
    
        /**
         * Equal 'WordRef' instances should each return the same hash code.
         */
        @Override
        public int hashCode() {
            return word.hashCode();
        }
    
        /**
         * Returns a string representation of this instance.
         */
        @Override
        public String toString() {
            return String.format("%s {%d} %s", word, occurrences, pages);
        }
    }
    

    Note that the elements of a LinkedList must be objects, hence the use of Integer rather than the primitive int. Also note that we need to determine whether two WordRef instances contain the same word, because the lookup code relies on LinkedList.indexOf(), which compares elements with equals(). Hence class WordRef overrides method equals() and, as the javadoc for class java.lang.Object recommends, it also overrides method hashCode() so that equal instances return the same hash code.
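
To illustrate why equals() matters for the lookup, here is a minimal sketch. Token and EqualsDemo are hypothetical names used only for this illustration (Token is a cut-down stand-in for WordRef). The point is that indexOf() finds an element through equals(), so a freshly created probe object is enough to locate an existing entry.

```java
import java.util.LinkedList;

class EqualsDemo {
    /** Returns the index of 'word' in 'tokens', or -1 if it is absent. */
    static int indexOfWord(LinkedList<Token> tokens, String word) {
        // indexOf() compares elements via equals(), so a new probe object works.
        return tokens.indexOf(new Token(word));
    }

    public static void main(String[] args) {
        LinkedList<Token> tokens = new LinkedList<>();
        tokens.add(new Token("Alice"));
        tokens.add(new Token("was"));
        System.out.println(indexOfWord(tokens, "was"));    // 1
        System.out.println(indexOfWord(tokens, "rabbit")); // -1
    }
}

// Hypothetical stand-in for WordRef, reduced to the parts that matter for lookups.
class Token {
    private final String word;

    Token(String word) {
        this.word = word;
    }

    @Override
    public boolean equals(Object obj) {
        // Another Token holding the same word counts as equal.
        return obj instanceof Token && word.equals(((Token) obj).word);
    }

    @Override
    public int hashCode() {
        // Equal tokens must return the same hash code.
        return word.hashCode();
    }
}
```

Without the equals() override, indexOf() would fall back to identity comparison and the probe object would never be found.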

    Now for the code that reads the text and processes it. In your question you placed all that code in a method named loadBook(). However for the purposes of creating a minimal, reproducible example, I wrote a separate class and put the text reading and processing code into method main() as well as some helper methods. Here is the code for that class.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.charset.Charset;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.ArrayList;
    import java.util.LinkedList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    
    public class AliceTxt {
        private static final int PAGE = 3;
        private static final Pattern REGEX = Pattern.compile("\\b\\w+\\b");
    
        private static LinkedList<WordRef> wordRefs;
    
        private static List<String> getWords(String line) {
            if (line == null) {
                line = "";
            }
            Matcher matcher = REGEX.matcher(line);
            List<String> words = new ArrayList<>();
            while (matcher.find()) {
                words.add(matcher.group());
            }
            return words;
        }
    
        private static void updateWordRefs(List<String> words, int page) {
            if (words != null) {
                for (String word : words) {
                    WordRef wordRef = new WordRef(word);
                    int index = wordRefs.indexOf(wordRef);
                    if (index < 0) {
                        wordRefs.add(wordRef);
                    }
                    else {
                        wordRef = wordRefs.get(index);
                        wordRef.addOccurrence();
                    }
                    wordRef.addPage(Integer.valueOf(page));
                }
            }
        }
    
        public static void main(String[] args) {
            Path path1 = Paths.get("alice.txt");
            try (BufferedReader reader = Files.newBufferedReader(path1, Charset.defaultCharset())) {
                wordRefs = new LinkedList<>();
                String line = reader.readLine();
                int countLines = 0;
                int page;
                while (line != null) {
                    page = (countLines / PAGE) + 1;
                    if (line.length() > 0) {
                        // Don't count empty lines.
                        countLines++;
                    }
                    List<String> words = getWords(line);
                    updateWordRefs(words, page);
                    line = reader.readLine();
                }
                wordRefs.forEach(System.out::println);
            }
            catch (IOException xIo) {
                xIo.printStackTrace();
            }
        }
    }
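
The page computation in main() can be checked in isolation. The sketch below pulls the arithmetic into a hypothetical helper, pageOf() (not part of the code above), assuming pages are numbered from 1 and that the counter holds the number of non-empty lines read before the current one.

```java
class PageMath {
    /**
     * Maps the number of non-empty lines already read to a one-based page number.
     *
     * @param linesBefore  non-empty lines read before the current line
     * @param linesPerPage how many lines make up one page
     */
    static int pageOf(int linesBefore, int linesPerPage) {
        // Integer division rounds down, so lines 0..2 map to page 1 when linesPerPage is 3.
        return (linesBefore / linesPerPage) + 1;
    }

    public static void main(String[] args) {
        int linesPerPage = 3;
        for (int linesBefore = 0; linesBefore < 7; linesBefore++) {
            System.out.println("line " + (linesBefore + 1) + " -> page " + pageOf(linesBefore, linesPerPage));
        }
    }
}
```

With 40 lines per page, as in the original assignment, pageOf(39, 40) is 1 and pageOf(40, 40) is 2.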
    

    The above class uses another LinkedList to hold all the distinct words in the text as separate WordRef objects. Note that in the above code words are case-sensitive, which means that So and so are considered separate words. If you want the words to be case-insensitive, i.e. So and so should be considered the same word, then convert each word to lower case before it is stored, by changing the line in method getWords() that adds the matched word to the list as follows

    words.add(matcher.group().toLowerCase());
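
A regex flag such as Pattern.CASE_INSENSITIVE only affects how the pattern matches, and the matched text keeps its original case, so one way to merge So and so is to normalize each word before it is stored. A minimal sketch, assuming lower-casing with Locale.ROOT is acceptable for the text (LowerCaseWords is a hypothetical class name, and getWords() mirrors the helper above):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LowerCaseWords {
    private static final Pattern REGEX = Pattern.compile("\\b\\w+\\b");

    /** Extracts the words of 'line', folded to lower case so "So" and "so" coincide. */
    static List<String> getWords(String line) {
        List<String> words = new ArrayList<>();
        if (line == null) {
            return words;
        }
        Matcher matcher = REGEX.matcher(line);
        while (matcher.find()) {
            // Normalize here, before the word is ever compared or stored.
            words.add(matcher.group().toLowerCase(Locale.ROOT));
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(getWords("So VERY much, so very")); // [so, very, much, so, very]
    }
}
```

Since WordRef compares words with String.equals(), normalizing before the WordRef is created keeps equals() and hashCode() consistent without further changes.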
    

    Below is the output of running the above code. It follows the format I described in my comment on your question, which you confirmed was correct.
    Each line below begins with the actual word, followed by the number of occurrences in braces and then the list of page numbers where the word occurs in the text. Refer to method toString() in class WordRef.

    Alice {3} [1, 3]
    was {4} [1, 2, 3]
    beginning {1} [1]
    to {4} [1, 3]
    get {1} [1]
    very {2} [1, 2]
    tired {1} [1]
    of {6} [1, 2, 3]
    sitting {1} [1]
    by {2} [1, 2]
    her {5} [1, 2]
    sister {2} [1]
    on {1} [1]
    the {9} [1, 2, 3]
    bank {1} [1]
    and {4} [1, 2]
    having {1} [1]
    nothing {2} [1, 3]
    do {1} [1]
    once {1} [1]
    or {3} [1]
    twice {1} [1]
    she {3} [1, 2]
    had {2} [1]
    peeped {1} [1]
    into {1} [1]
    book {2} [1]
    reading {1} [1]
    but {1} [1]
    it {3} [1, 3]
    no {1} [1]
    pictures {2} [1]
    conversations {2} [1]
    in {3} [1, 2, 3]
    what {1} [1]
    is {1} [1]
    use {1} [1]
    a {3} [1, 2]
    thought {1} [1]
    without {1} [1]
    So {1} [2]
    considering {1} [2]
    own {1} [2]
    mind {1} [2]
    as {2} [2]
    well {1} [2]
    could {1} [2]
    for {1} [2]
    hot {1} [2]
    day {1} [2]
    made {1} [2]
    feel {1} [2]
    sleepy {1} [2]
    stupid {1} [2]
    whether {1} [2]
    pleasure {1} [2]
    making {1} [2]
    daisy {1} [2]
    chain {1} [2]
    would {1} [2]
    be {1} [2]
    worth {1} [2]
    trouble {1} [2]
    getting {1} [2]
    up {1} [2]
    picking {1} [2]
    daisies {1} [2]
    when {1} [2]
    suddenly {1} [2]
    White {1} [2]
    Rabbit {2} [2, 3]
    with {1} [2]
    pink {1} [2]
    eyes {1} [2]
    ran {1} [2]
    close {1} [2]
    There {1} [3]
    so {2} [3]
    VERY {2} [3]
    remarkable {1} [3]
    that {1} [3]
    nor {1} [3]
    did {1} [3]
    think {1} [3]
    much {1} [3]
    out {1} [3]
    way {1} [3]
    hear {1} [3]
    say {1} [3]
    itself {1} [3]
    Oh {1} [3]
    dear {1} [3]
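
The question says that only arrays and linked lists are allowed. If that rules out java.util.LinkedList and ArrayList, the list of pages inside WordRef could be replaced by a small hand-written list. The following is only a sketch under that assumption. PageList and addIfAbsent() are hypothetical names, not part of the code above.

```java
class PageList {
    /** One node of the singly linked list. */
    private static final class Node {
        final int page;
        Node next;

        Node(int page) {
            this.page = page;
        }
    }

    private Node head;
    private Node tail;
    private int size;

    /** Appends 'page' unless it is already present; returns true if it was added. */
    boolean addIfAbsent(int page) {
        for (Node n = head; n != null; n = n.next) {
            if (n.page == page) {
                return false;
            }
        }
        Node node = new Node(page);
        if (head == null) {
            head = node;
        } else {
            tail.next = node;
        }
        tail = node;
        size++;
        return true;
    }

    int size() {
        return size;
    }

    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder("[");
        for (Node n = head; n != null; n = n.next) {
            if (n != head) {
                sb.append(", ");
            }
            sb.append(n.page);
        }
        return sb.append("]").toString();
    }

    public static void main(String[] args) {
        PageList pages = new PageList();
        pages.addIfAbsent(1);
        pages.addIfAbsent(1); // duplicate, ignored
        pages.addIfAbsent(3);
        System.out.println(pages); // [1, 3]
    }
}
```

Because pages are processed in increasing order, checking only the last node would be enough to detect a duplicate. The full scan is kept here for simplicity.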