
Extracting information using XPaths


Good afternoon dear community,

I have finally compiled a list of working XPaths required to scrape all of the information I need from my URLs.

I would like to ask for your suggestions: for a newbie in coding, what is the best way to scrape around 50k links using only XPaths (around 100 XPaths for each link)?

Import.io is my best tool at the moment, along with SEO Tools for Excel, but they both have their own limitations: Import.io is expensive, and SEO Tools for Excel isn't suited to extracting more than 1000 links.

I am willing to learn whatever system you suggest, so please recommend a good way of scraping for my project!

EDIT:

SOLVED! The SEO Tools crawler is actually super useful, and I believe I've found what I need. I guess I'll hold off on Python or Java until I encounter another tough obstacle. Thank you all!


Solution

  • That strongly depends on what you mean by "scraping information". What exactly do you want to mine from the websites? All major languages (certainly Java and Python, which you mentioned) have good solutions for connecting to websites, reading content, parsing the HTML into a DOM, and using XPath to extract specific fragments. For example, Java has JTidy, which lets you parse even "dirty" real-world HTML into a DOM and manipulate it to some extent. However, the right tools will depend on the exact data-processing needs of your project.
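To make the parse-then-XPath idea above concrete, here is a minimal sketch in Python using only the standard library. The HTML snippet, element names, and class attributes are made up for illustration; note that `xml.etree.ElementTree` only handles well-formed markup and a limited XPath subset, so for messy real-world pages you would reach for a lenient parser with full XPath 1.0 support, such as lxml in Python or JTidy plus `javax.xml.xpath` in Java.

```python
import xml.etree.ElementTree as ET

# A small, well-formed snippet standing in for one fetched page
# (hypothetical structure, for illustration only).
html = """<html><body>
<div class="product"><span class="name">Widget</span><span class="price">9.99</span></div>
<div class="product"><span class="name">Gadget</span><span class="price">19.99</span></div>
</body></html>"""

root = ET.fromstring(html)

# ElementTree supports a subset of XPath: descendant search and
# attribute predicates are enough for many extraction tasks.
names = [el.text for el in root.findall(".//span[@class='name']")]
prices = [el.text for el in root.findall(".//span[@class='price']")]

rows = list(zip(names, prices))
print(rows)  # one (name, price) pair per product div
```

For 50k links with ~100 XPaths each, you would wrap this in a loop over your URL list, fetching each page and applying all of your XPaths to the parsed tree, and then write the results out to CSV or a database.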