Python script to get links from Yahoo search

This is a quick script I wrote to pull links from Yahoo search using the BOSS search API and then list the unique domains.

If you want the full links rather than just the domains, modify the script so that the whole URL is appended to the list (the script already collects the full URLs in linksdump). Yahoo does not return every result, only a predefined maximum number, so this code extracts only about 800 domains. That is still good enough as a start and for most uses.
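The domain-deduplication step can be tried on its own, without calling the API. Here is a minimal sketch, written for Python 3 rather than the Python 2 of the script itself. The function name unique_domains, the keep_full_links flag, and the sample URLs are made up for illustration and are not part of the original script.

```python
from urllib.parse import urlparse


def unique_domains(urls, keep_full_links=False):
    """Return unique scheme://host entries (or unique full URLs) in first-seen order.

    Hypothetical helper: the name and the keep_full_links flag are assumptions.
    """
    seen = set()
    result = []
    for url in urls:
        if keep_full_links:
            key = url
        else:
            parts = urlparse(url)
            # Keep only the scheme and host, dropping path and query string.
            key = parts.scheme + "://" + parts.netloc
        if key not in seen:
            seen.add(key)
            result.append(key)
    return result


# Made-up sample data standing in for search results.
sample = [
    "http://example.com/a",
    "http://example.com/b",
    "https://example.org/page?x=1",
]
print(unique_domains(sample))
```

Using a set for the membership check avoids rescanning the whole list for every URL, which starts to matter once there are a few thousand results.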

I am also working on getting citation values from Google Scholar for a friend, and I will post that here soon. Here's the code for now.

#! /usr/bin/python
import urllib,json
from urlparse import urlparse

yahoo_application_id="Ht18VqTV34EMRWTJKOOh4rNBWTqkrjTSSQj9JwWlsqTMK41_3oFWFnhivJipX0wnvU4qzXc9VAw-"
nextresult=0
links=list()
linksdump=list()
#print yahoo_application_id

#print "http://boss.yahooapis.com/ysearch/web/v1/Jeba+Singh+Emmanuel?appid="+yahoo_application_id+"&format=xml"
while(True):
	print "trying result from " + str(nextresult)
	f = urllib.urlopen("http://boss.yahooapis.com/ysearch/web/v1/search+engine+optimization+software?appid="+yahoo_application_id+"&format=json&count=100&start="+str(nextresult))
	ss=json.JSONDecoder()
	ssjson= ss.decode(f.read())
	#count=ssjson["ysearchresponse"]["count"]
	#start=ssjson["ysearchresponse"]["start"]
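	response=ssjson["ysearchresponse"]
	totalhits=int(response["totalhits"])
	print totalhits
	# An empty page means Yahoo has no more results to give, so stop instead of looping forever.
	results=response.get("resultset_web",[])
	if not results:
		break
	for x in results:
		url= x["url"]
		o = urlparse(url)
		linksdump.append(url)
		# Keep only the scheme and host, e.g. http://example.com
		link = o[0]+"://"+o[1]
		if link not in links:
			links.append(link)
		nextresult=nextresult+1
	if (nextresult>=totalhits or nextresult>10000):
		break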
print "Obtained results: " + str(nextresult) + " of which " + str(len(links)) + " were unique."
for x in links:
	print x

Cool huh? If you want any help modifying this, drop me a line.
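One thing to watch when modifying the loop is that Yahoo can report more total hits than it will actually serve. The paging logic can be sketched and tested without network access, as below. This is a Python 3 sketch under assumptions: collect_all, fake_fetch, and FAKE_RESULTS are hypothetical names, and the fake fetcher only imitates a source that claims 1000 hits but serves 250.

```python
def collect_all(fetch_page, page_size=100, hard_limit=10000):
    """Call fetch_page(start, count) repeatedly until the results run out.

    fetch_page is assumed to return (total_hits, list_of_urls).
    """
    urls = []
    start = 0
    while start < hard_limit:
        total_hits, page = fetch_page(start, page_size)
        if not page:
            # The source has stopped serving results, whatever it claimed.
            break
        urls.extend(page)
        start += len(page)
        if start >= total_hits:
            break
    return urls


# Hypothetical stand-in for the API: claims 1000 hits but serves only 250.
FAKE_RESULTS = ["http://site%d.example/" % i for i in range(250)]


def fake_fetch(start, count):
    return 1000, FAKE_RESULTS[start:start + count]


print(len(collect_all(fake_fetch)))
```

Passing the fetcher in as an argument keeps the stopping conditions separate from the HTTP call, so they can be checked against fake data first.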
