Sunday 2 October 2011

TOOLS FOR WEB CRAWLING
1. How does a web crawler work?
A web crawler is simply an automated program, or more precisely a script, that methodically scans or "crawls" the World Wide Web in an automated manner, creating an index of the data it is looking for.
Web crawlers start by parsing a specified web page, noting any hypertext links on that page that point to other pages. They then parse those pages for new links, and so on, recursively. A web crawler needs a starting point, which is a web address, or URL. To browse the Internet we use the HTTP network protocol, which allows us to talk to web servers and download data from or upload data to them. The crawler fetches this URL and then looks for hyperlinks in the page.
A web crawler doesn't actually move around the web to different computers to retrieve information; instead it resides on a single computer and sends HTTP requests for documents to other machines on the Internet, just as any other web browser does when the user clicks on links. All the crawler really does is automate the process of following links. Web crawlers are bots that save a copy of every visited page in a database (after processing it) so that the information on those websites can be indexed to provide fast searches on the web.
So there are basically three steps involved in the web crawling procedure. First, the search bot starts by crawling the pages of your site. Then it indexes the words and content of your site, and finally it visits the links (web page addresses, or URLs) that are found on your site.
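The same cycle can be sketched in a few shell commands. This is only a rough illustration (the example.com seed URL is a placeholder, and it goes just one level deep instead of recursing):

#! /bin/bash
# rough sketch of the crawl -> index -> follow-links cycle (one level deep only)
start="http://example.com"            # the seed URL (placeholder)
wget -q "$start" -O page.html         # step 1: crawl - fetch the seed page over HTTP
# step 2: index - note every hyperlink found on the page
grep -o 'href="[^"]*"' page.html | sed 's/href="//;s/"$//' > links.txt
# step 3: follow - a real crawler would now fetch each of these and repeat
while read -r url
do
    echo "would visit next: $url"
done < links.txt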
Web crawlers, sometimes called scrapers, automatically scan the Internet attempting to work out the meaning of the content they find. The web wouldn't function without them. Crawlers are the backbone of the search engines (for example, Google's crawler is Googlebot), which, combined with other algorithms, work out the relevance of your page to a given keyword context.
Uses of web crawlers:
Web crawlers are not used only by the search engines themselves. Linguists use them to perform textual analysis, combing the net to determine which words are commonly used today; market researchers use them to assess trends in a given market; and so on.
(Note: this is not wholly our own work; it is the result of our frequent research on the net.)
There are many free and easily available utilities for Windows for crawling the web; we have covered only WebSPHINX (a very good one, especially for advanced users).
SPHINX (A WINDOWS WEB CRAWLER):
(It is one of the free and easily available utilities on the net.)
WebSPHINX, or Website-Specific Processors for HTML INformation eXtraction, is a Java class library and interactive development environment for web crawlers.
It is basically designed for advanced users who want to crawl a small part of the web (such as a single web site) automatically; it is not made for crawling the entire web, but essentially for crawling only a small part of it.
WebSPHINX uses the built-in Java classes URL and URLConnection to fetch web pages.
1. USING THE WEB SPHINX FOR CRAWLING AND GIVING THE OUTPUT IN THE FORM OF A GRAPH SHOWING THE VARIOUS WEB PAGES CRAWLED.
(crawling for google)
2. USING THE WEB SPHINX FOR CRAWLING google.com and saving all the web pages crawled in a directory on the desktop (specified by the user).
3. The crawler also concatenates the pages visited into a single web page for printing.
4. EXTRACTING THE IMAGES FROM A SET OF PAGES (the output, if opened in Windows Explorer, actually opens up as HTML code).

 
UBUNTU:
In Ubuntu, wget is a very powerful tool which can be used to crawl web pages. Apart from crawling, it can also be used to make a mirror of any website, which can then be viewed offline; this is extremely useful on unstable network connections. It can also be used to download a file, which is another powerful feature. Information about wget is easily available using the commands 'info wget' and 'man wget'. wget is also available for almost all platforms, including Windows, and is usually pre-installed on all Linux/Unix based distros.
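For example, mirroring a site for offline viewing, or resuming a half-finished download, looks roughly like this (example.com is only a placeholder):

# make a local, browsable copy of a site (links rewritten so they work offline)
wget --mirror --convert-links --page-requisites --no-parent http://example.com/
# download a single file, resuming from where a broken connection stopped
wget -c http://example.com/big-file.iso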
The most useful option in wget is -r, which recursively crawls websites. Apart from wget, HarvestMan is also a freely available piece of software which we tried out. HarvestMan is based on the principle of queues, and it can also be used from the Python prompt (refer to http://code.google.com/p/harvestman-crawler/wiki/WorldsSimplestCrawler).
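A typical recursive crawl limits the depth and the file types so wget does not wander off over the whole web; a small example (the site, depth and extension are only placeholders):

# follow links up to 2 levels deep, never go above the start directory (-np),
# and keep only .pdf files (-A); HTML pages are still fetched to find links, then discarded
wget -r -l 2 -np -A pdf http://example.com/docs/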
In this project we have focused on wget, and we have made two shell scripts which perform the following tasks:
1) The first script asks the user for a website and crawls it; it then displays the links with the extension the user wants.
2) The second script asks the user for a song name and then downloads it in mp3 format.
WGET1.SH:
#! /bin/bash
echo "Enter the website you want to crawl for links"
read site
echo "Enter the link extension you want to match (eg. png)"
read ext
# fetch the site and save the page in xyz.txt
wget "$site" -O xyz.txt
# pull out everything that follows href=, strip the href=" prefix,
# and keep only the links ending in the requested extension
links=`cat xyz.txt | egrep -o "href=\"?[^<>\"\\;}]{3,}" | sed -e 's/href="//g'| egrep $ext$`
# print each matching link on its own line
for link in $links
do
echo $link
done
wget "$site" -O xyz.txt downloads the page and stores the result in xyz.txt (adding the -r option makes wget crawl the whole website recursively and concatenate every page it finds into xyz.txt).
links=`cat xyz.txt | egrep -o "href=\"?[^<>\"\\;}]{3,}" | sed -e 's/href="//g'| egrep $ext$` reads the file and extracts the links using egrep, as all the links are stored after href (basic HTML) :) and the characters in the square brackets were obtained by trial and error. The sed command substitutes href=" with a blank, and finally egrep $ext$ keeps only the links that end with the extension the user specified.
Finally, the for loop prints all the relevant links.
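To try it out, make the script executable and run it, then answer the two prompts (the site and extension below are only examples):

chmod +x wget1.sh
./wget1.sh     # give e.g. http://example.com at the first prompt and png at the second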

WGET2.sh:

#! /bin/bash
echo "Please enter the song you want to download"
read song
# lower-case the song name and replace spaces with %20 so it fits in a URL
song=`echo $song | sed -e 's/\s/%20/g' | awk '{print tolower($0)}'`
# build a Google query that hunts for "Index of" style mp3 listings
query="Index%20Of:mp3%20$song"
wget "http://www.google.co.in/search?q=$query" -O results.txt --user-agent="Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3 (FM Scene 4.6.1)"
link=`cat results.txt | egrep -o 'h3 class="r">

wget "http://www.google.co.in/search?q=$query" -O results.txt --user-agent="Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3 (FM Scene 4.6.1)"
In this command wget fetches the Google results page and stores it in results.txt. The user-agent option is perhaps the most important one, because it tells Google that we are actually using Mozilla Firefox to search for the song (which we are not) and that this is not an automated bot (which it is :)).
The link= line then reads results.txt and uses egrep on 'h3 class="r">' to pull out the result links, since every Google result is wrapped in an h3 heading with class "r".
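The rest of the job is taking one of those result links and pulling the mp3 files off it. A rough sketch of how that could look, assuming we simply take the first result and fetch every .mp3 linked from that "Index of" page (the patterns below are guesses, not the original commands):

# sketch only: grab the first result URL out of results.txt
first=`egrep -o 'h3 class="r"><a href="[^"]*"' results.txt | head -1 | sed 's/.*href="//;s/"$//'`
# (Google sometimes wraps links as /url?q=..., which a real script would also have to strip)
echo "Trying index page: $first"
# fetch every .mp3 linked from that page, one level deep, without recreating directories
wget -r -l 1 -nd -A mp3 "$first" --user-agent="Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3 (FM Scene 4.6.1)"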

NOTE:
Wget2.sh won't always work, because some sites use tricks to fake indexed pages (an Apache web server configuration that allows indexes, manages redirects and so on is enough for this). What happens in the background on those nasty sites is that they contain a redirect routine for mp3 files and send you to another page instead of the file.
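One way to spot such a redirect before wasting bandwidth is to ask wget for just the server's response headers (the URL below is only a placeholder):

# --spider: check without downloading; -S: print the server response headers;
# --max-redirect=0: stop at the first redirect so we can see where it points
wget --spider -S --max-redirect=0 "http://example.com/song.mp3"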
PS: Please do not use this mp3 script excessively in IIIT. A similar script for downloading movies was made by Apoorv, Inshu and me (Tiru), and Apoorv was banned from using the net as we had exceeded the bandwidth limit (we all used his account to download movies, although he does not have any of them and all of those films reside with me and Inshu).

EXPERIENCE:



It was a great feeling working together on the blog.
We learnt a lot from the project.
Some of the links we preferred:
google.com (our best friend), Yahoo Answers, Wikipedia
Contributed by -
Tiru Sharma (2011116)
Kushagar Lall (2011061)
