I need to write a script that inserts one million records of usernames or emails, gathered by crawling the web, into a database. The script can be in any language, such as Python, Ruby, or PHP.
Is this possible? If so, please provide information on how I can build such a script.
Thanks
It's possible, though it may take some time depending on your machine's performance and your internet connection.
You could use PHP's cURL library to send web requests automatically, and then parse the data easily with a library such as Simple HTML DOM or the native PHP DOM extension. Beware of running out of memory, and I highly recommend running the script from the shell rather than through a web browser. Also consider using the multi-cURL functions to speed up the process.
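The same fetch-and-parse idea in Python (the question also mentions it) might look roughly like this; the URL and the email regex are illustrative only, and a real crawler would also need politeness delays, error handling, and deduplication:

```python
import re
import urllib.request

# Deliberately rough pattern for illustration; real-world email
# extraction is considerably messier than this.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def extract_emails(html: str) -> set:
    """Return the set of email-like strings found in a page."""
    return set(EMAIL_RE.findall(html))

def fetch(url: str, timeout: float = 10.0) -> str:
    """Download one page; for long crawls, run from the shell, not a web server."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")

# Parsing works the same on any HTML string:
sample = '<a href="mailto:alice@example.com">Alice</a> or bob@example.org'
print(extract_emails(sample))
```

For structured scraping you would swap the regex for an HTML parser, but the shape of the loop (fetch, extract, store) stays the same.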
This is extremely easy and fast to implement, although multi-threading would give a huge performance boost in this scenario, so I suggest using one of the other languages you proposed. In Java, for instance, you could do this easily with the Apache HttpClient library, and extract data from the DOM using native XPath support, regexes, or one of the many third-party DOM implementations for Java.
I also strongly recommend checking out the Java library HtmlUnit, which could make your life much easier, though possibly at some cost in performance. A good multi-threading implementation gives a huge performance boost, but a bad one can make your program run worse.
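To illustrate the multi-threading point without tying it to Java, here is a minimal Python sketch using a thread pool; `fetch` is a stand-in for a real download function, since the win comes from overlapping network waits, not CPU work:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url: str) -> str:
    # Stand-in for a real HTTP download. Threads help here because each
    # worker spends most of its time waiting on the socket, not computing.
    return f"<html>page for {url}</html>"

def crawl(urls, max_workers: int = 8):
    """Fetch many URLs concurrently, preserving input order in the results.

    Too many workers can make things worse (the 'bad implementation' case
    above): you exhaust sockets or hammer the remote server.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))

pages = crawl([f"http://example.com/page/{i}" for i in range(4)])
print(len(pages))
```

Replacing the stub `fetch` with a real HTTP call is all that changes for an actual crawl.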
Here are some resources for Python:
http://docs.python.org/library/httplib.html
http://www.boddie.org.uk/python/HTML.html
http://www.tutorialspoint.com/python/python_multithreading.htm
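Since the question is also about the database half of the task: at a million rows, batching inserts into transactions matters a lot. A sketch using Python's built-in sqlite3 module (the table and column names are made up for illustration; swap in your real database driver):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path or your real DB in practice
conn.execute(
    "CREATE TABLE contacts (id INTEGER PRIMARY KEY, email TEXT UNIQUE)"
)

def insert_batch(conn, emails):
    """Insert many rows in one transaction; ignore duplicates from re-crawls."""
    with conn:  # one commit per batch, not per row -- crucial at 1M records
        conn.executemany(
            "INSERT OR IGNORE INTO contacts (email) VALUES (?)",
            ((e,) for e in emails),
        )

insert_batch(conn, ["alice@example.com", "bob@example.org", "alice@example.com"])
count = conn.execute("SELECT COUNT(*) FROM contacts").fetchone()[0]
print(count)
```

The `UNIQUE` constraint plus `INSERT OR IGNORE` keeps the table deduplicated, so crawling the same page twice doesn't inflate the count.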