
Retrieving word definitions from Google in Java


I have a list of words (1K+) in a file, and I would like to get their definitions and save them. I was thinking about getting the definitions from Google, as the definition is the first thing it shows. My idea is fairly rudimentary: create a URL instance pointing to the Google search for the given word and read the content using streams. Then "filter" out the definition, which is always between "data-dobid="dfn"><span>" and "</span>".

For example:

[...]data-dobid="dfn"><span>unwilling or refusing to change one's views or to agree about something</span>[...]

This is the definition of "intransigent".

However, I would like to know if there is a more "efficient" way of doing so, for example without retrieving all the other search results. Also, is it possible to load multiple results in a background thread, so that when I want to "decode" a definition and save it, I don't always have to wait for the search to complete?


Solution

  • The more efficient approach is to download a dictionary which you can then load locally. This gives you a local file or database that is readily searchable.

    This approach is not only computationally efficient; it also ensures you are using the information correctly under its license. What you are proposing is commonly called "scraping" and may violate various licenses and terms of service.

    This blog post lists several freely available and freely licensed dictionaries.

    This AskUbuntu.SE question describes some more of the technical work required to acquire a free dictionary and reference it from the command line. You would want to replicate these reading patterns to load the data in Java.
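
    As a rough sketch of what loading such a dictionary in Java might look like, assume it has been converted to a plain-text file with one "word<TAB>definition" entry per line. That layout and the class and method names below are assumptions made for illustration; real dictionary downloads come in various formats and will need their own parsing.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: an in-memory word -> definition table.
final class LocalDictionary {
    private final Map<String, String> definitions = new HashMap<>();

    // Builds the table from "word<TAB>definition" lines (assumed layout).
    static LocalDictionary fromLines(Iterable<String> lines) {
        LocalDictionary dict = new LocalDictionary();
        for (String line : lines) {
            int tab = line.indexOf('\t');
            if (tab <= 0) {
                continue; // skip blank or malformed lines
            }
            String word = line.substring(0, tab).trim().toLowerCase(Locale.ROOT);
            String definition = line.substring(tab + 1).trim();
            // Keep only the first definition seen for each word.
            dict.definitions.putIfAbsent(word, definition);
        }
        return dict;
    }

    // Reads the whole file into memory, then parses it.
    static LocalDictionary load(Path file) throws IOException {
        return fromLines(Files.readAllLines(file, StandardCharsets.UTF_8));
    }

    // Returns the definition, or null if the word is not in the table.
    String define(String word) {
        return definitions.get(word.trim().toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        // In-memory sample data so the sketch runs without a dictionary file.
        LocalDictionary dict = fromLines(Arrays.asList(
                "intransigent\tunwilling or refusing to change one's views or to agree about something"));
        System.out.println(dict.define("Intransigent"));
    }
}
```

    With a real file you would call LocalDictionary.load(path) once at startup; every later lookup is then a plain hash-map read with no network access.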

    Yet another approach would be to use a freely available and appropriately licensed API such as https://dictionaryapi.com/. This would still use HTTP calls, but it is clearly licensed and is an explicit API for looking up human-language word definitions. This is an advantage over scraping Google because you won't have to parse HTML and the API is appropriately licensed for you to use.

    Finally, there are some similar, if not duplicate, questions on Stack Overflow and Stack Exchange, such as this one: How to implement an English dictionary in Java?
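
    On the background-thread part of the question: whichever source you settle on, the lookups can be submitted to a small thread pool so that slow ones overlap instead of running one after another. The sketch below is only one possible shape for that. The lookup function is a stand-in for whatever actually fetches a definition (a local table or an HTTP client), and the class and method names are made up for this example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

// Hypothetical helper: runs word lookups concurrently on a fixed-size pool.
final class BackgroundLookup {

    // Returns word -> definition in input order. Words whose lookup throws
    // or returns null are left out of the result.
    static Map<String, String> defineAll(List<String> words,
                                         Function<String, String> lookup,
                                         int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit every lookup first so they can run at the same time.
            Map<String, CompletableFuture<String>> pending = new LinkedHashMap<>();
            for (String word : words) {
                pending.computeIfAbsent(word, w ->
                        CompletableFuture.supplyAsync(() -> lookup.apply(w), pool)
                                .exceptionally(error -> null)); // failure = no definition
            }
            // Collect results; join() waits only for that one lookup.
            Map<String, String> results = new LinkedHashMap<>();
            for (Map.Entry<String, CompletableFuture<String>> entry : pending.entrySet()) {
                String definition = entry.getValue().join();
                if (definition != null) {
                    results.put(entry.getKey(), definition);
                }
            }
            return results;
        } finally {
            pool.shutdown(); // let the worker threads exit
        }
    }

    public static void main(String[] args) {
        // Stand-in data source; replace with a real lookup.
        Map<String, String> sample = new HashMap<>();
        sample.put("intransigent",
                "unwilling or refusing to change one's views or to agree about something");

        Map<String, String> found =
                defineAll(Arrays.asList("intransigent", "unknownword"), sample::get, 4);
        found.forEach((word, definition) -> System.out.println(word + ": " + definition));
    }
}
```

    If the lookup calls a remote API, keep the pool small and check the provider's rate limits; a thousand words do not need a thousand simultaneous requests.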