I have been trying to write a website indexer in Python, but I have had some trouble getting it to work. Please help!!!
def get_page(url, staypage):
    onsite = url.find(staypage)
    if onsite == -1:
        return "page not on site"
    import urllib
    # Get a file-like object for the Python Web site's home page.
    f = urllib.urlopen(url)
    # Read from the object, storing the page's contents in 's'.
    s = f.read()
    f.close()
    return s

def get_next_target(page):
    start_link = page.find('<a href=')
    if start_link == -1:
        url = "none"
        return url
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote + 1:end_quote]
    if url not in links:
        return url, end_quote
    return none, end_quote

def union(p, q):
    for e in q:
        if e not in p:
            p.append(e)

def get_all_links(page, targetpage):
    global links
    links = []
    global relpathlink
    relpathlink = 0
    while True:
        url, endpos = get_next_target(page)
        print url
        if url != None:
            if url[0] == '/':
                url = targetpage + url
                url.find('=', relpathlink)
                if relpathlink == -1:
                    url = "Error code 74318433. Relative path links are not capable of being mapped at this time."
                    if url not in sitelist:
                        sitelist.append(url + "(Error code 74318433. Relative path links are not capable of being mapped at this time.)")
            print url
            if url:
                links.append(url)
            page = page[endpos:]
        else:
            break
        return links

def crawl_web(seed, targetpage):
    global tocrawl
    global crawled
    global sitelist
    sitelist = []
    tocrawl = [seed]
    crawled = []
    while tocrawl:
        print tocrawl
        print sitelist
        page = tocrawl.pop()
        if page not in crawled:
            union(tocrawl, get_all_links(get_page(page, targetpage), targetpage))
            crawled.append(page)
            sitelist.append(page)
    return sitelist

print crawl_web('http://www.futuresight.org', 'http://www.futuresight.org')
Offline
I don't think you can use "while tocrawl". Why don't you try "for i in tocrawl" instead, and replace tocrawl in the loop with i?
Offline
Gravitation wrote:
I don't think you can use "while tocrawl". Why don't you try "for i in tocrawl" instead, and replace tocrawl in the loop with i?
"while tocrawl" means that while tocrawl has something listed, it keeps going. I don't think "for i in tocrawl" would work. tocrawl.pop() gets me the last thing listed in the list tocrawl.
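For example:

Code:

>>> tocrawl = ['a', 'b', 'c']
>>> tocrawl.pop()
'c'
>>> tocrawl
['a', 'b']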
The problem is that the only link it is finding on a page is /. I have tried it on apple.com and futuresight.org, but with no results other than / each time. Please help.
Offline
You can use "while tocrawl". This just evaluates the "tocrawl" list as a boolean. Empty lists evaluate to false, true otherwise — so it just keeps going until there's nothing left in the list.
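For example:

Code:

>>> bool([])
False
>>> bool(['http://www.futuresight.org'])
True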
Your code's quite difficult to read, but I think your main problem is that the "return links" at the end of get_all_links() is indented inside the while loop.
Offline
Thanks for the help. I must have accidentally indented it while redoing my indentations. I fixed that, and a few other problems, but I am getting the same result.
How is my code difficult to read? Is it the lack of syntax highlighting, or how I coded it?
I also upgraded to Python 3.3, so print is slightly different.
Code:
def get_page(url, staypage):
    onsite = url.find(staypage)
    if onsite == -1:
        return "page not on site"
    import urllib
    # Get a file-like object for the Python Web site's home page.
    f = urllib.urlopen(url)
    # Read from the object, storing the page's contents in 's'.
    s = f.read()
    f.close()
    return s

def get_next_target(page):
    start_link = page.find('<a href=')
    if start_link == -1:
        url = "none"
        return url, 0
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote + 1:end_quote]
    if url not in links:
        return url, end_quote
    return none, end_quote

def union(p, q):
    for e in q:
        if e not in p:
            p.append(e)

def get_all_links(page, targetpage):
    global links
    links = []
    global relpathlink
    relpathlink = 0
    while True:
        url, endpos = get_next_target(page)
        if url != None:
            if url[0] == '/':
                url = targetpage + url
                url.find('=', relpathlink)
                if relpathlink == -1:
                    if url not in sitelist:
                        sitelist.append(url + "(Error code 74318433. Relative path links are not capable of being mapped at this time.)")
            if url:
                print(url)
                links.append(url)
            page = page[endpos:]
        else:
            break
    return links

def crawl_web(seed, targetpage):
    global tocrawl
    global crawled
    global sitelist
    sitelist = []
    tocrawl = [seed]
    crawled = []
    while tocrawl:
        page = tocrawl.pop()
        if page not in crawled:
            union(tocrawl, get_all_links(get_page(page, targetpage), targetpage))
            crawled.append(page)
            sitelist.append(page)
    return sitelist

print(crawl_web("http://www.futuresight.org", "http://www.futuresight.org"))
Offline
FutureSightTech wrote:
How is my code difficult to read? Is it the lack of syntax highlighting, or how I coded it?
It looks too complicated; I feel it should be simpler, somehow...
Anyway, try changing get_next_target to read like so:
def get_next_target(page):
    start_link = page.find('<a href=')
    if start_link == -1:
        return None, None
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote+1 : end_quote]
    if url not in links:
        return url, end_quote
    return None, end_quote
At the moment sometimes you're returning "none", which is a string containing the letters n+o+n+e, and sometimes the variable none (which doesn't exist!). You're then comparing the value to None (the line "if url != None:" in get_all_links) — this will only work if the value of url actually is None. (afaik).
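For example:

Code:

>>> url = "none"
>>> url != None
True
>>> url = None
>>> url != None
False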
Offline
blob8108 wrote:
FutureSightTech wrote:
How is my code difficult to read? Is it the lack of syntax highlighting, or how I coded it?
It looks too complicated; I feel it should be simpler, somehow...
Sorry, on further thought — I'll expand on this:
For a start, I'd use BeautifulSoup to parse HTML: searching for <a> tags has various problems (what if the link starts with <a id="home-link" href="...">, for example?). You don't need to write your own union(): just use sets instead of lists, and then use their built-in union. Those are my first two thoughts — hope that helps!
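Something like this, as a rough sketch (you'd need to install the beautifulsoup4 package first, and the HTML string here is just made up for illustration):

Code:

from bs4 import BeautifulSoup

def get_all_links(page):
    # Parse the HTML properly instead of searching for '<a href='.
    soup = BeautifulSoup(page, 'html.parser')
    # href=True skips <a> tags with no href, and extra attributes
    # like id="home-link" no longer confuse anything.
    return set(tag['href'] for tag in soup.find_all('a', href=True))

html = '<a id="home-link" href="/">Home</a> <a href="/about">About</a>'
links = get_all_links(html)
print(links)  # {'/', '/about'}

# Sets have union built in, so no hand-written union() needed:
tocrawl = {'http://www.futuresight.org'}
tocrawl = tocrawl | links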
Offline
I am not sure what Beautiful Soup or sets are, but I was taking an online course which had me make my first version like this. I will look into your suggestions, but I would rather use these for now.
I changed my get_next_target, but it's still getting / as the only link on the page. Perhaps the problem is with my page = page[endpos:].
Offline
It works for me...
Offline
When you run it, do you get a list of all the pages on www.futuresight.org?
Offline
FutureSightTech wrote:
When you run it, do you get a list of all the pages on www.futuresight.org?
Yes. Did you make the tweak I suggested?
Offline
The one about get_next_target, or the one about sets and Beautiful Soup?
Offline
FutureSightTech wrote:
The one about get_next_target, or the one about sets and Beautiful Soup?
The one about get_next_target not actually returning None.
Offline
Can you post your code? We must have some difference.
Edit: I now have the ability to edit posts, so here is my latest code. It is only returning None right now.
def get_page(url, staypage):
    onsite = url.find(staypage)
    if onsite == -1:
        return "page not on site"
    import urllib
    # Get a file-like object for the Python Web site's home page.
    f = urllib.urlopen(url)
    # Read from the object, storing the page's contents in 's'.
    s = f.read()
    f.close()
    return s

def get_next_target(page):
    url = "foo"
    start_link = page.find('<a href=')
    if start_link == -1:
        print(url)
        return None
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote+1 : end_quote]
    if url not in links:
        page = page[end_quote:]
        print(url)
        return url
    return None

def union(p, q):
    for e in q:
        if e not in p:
            p.append(e)

def get_all_links(page, targetpage):
    global url
    url = targetpage
    global links
    links = []
    global relpathlink
    while True:
        url = get_next_target(page)
        relpathlink = 0
        if url != None:
            if url[0] == '/':
                url = targetpage + url
                url.find('=', relpathlink)
                if relpathlink == -1:
                    if url not in sitelist:
                        sitelist.append(url + "(Error code 74318433. Relative path links are not capable of being mapped at this time.)")
                    else:
                        return url
            if url:
                print(url)
                links.append(url)
        else:
            break
    return links

def crawl_web(seed, targetpage):
    global tocrawl
    global crawled
    global sitelist
    sitelist = []
    tocrawl = [seed]
    crawled = []
    while tocrawl:
        page = tocrawl.pop()
        if page not in crawled:
            union(tocrawl, get_all_links(get_page(page, targetpage), targetpage))
            crawled.append(page)
            sitelist.append(page)
    return sitelist

print(get_all_links("http://www.futuresight.org", "http://www.futuresight.org"))
Last edited by FutureSightTech (2012-11-09 06:13:30)
Offline
FutureSightTech wrote:
Can you post your code? We must have some difference.
I simply applied my suggested fix to the code you posted.
Edit: I now have the ability to edit posts, so here is my latest code. It is only returning None right now.
Presumably you meant `print(crawl_web("http://www.futuresight.org","http://www.futuresight.org"))` ?
Either way: on line 25ish, inside get_next_target, there's this line here:
page = page[end_quote:]
The reason this doesn't work is that you're expecting the assignment to "page" to also update the variable inside get_all_links, which it won't. The "page" passed as an argument to get_next_target refers to the same value as the caller's variable, but reassigning the local name only rebinds it; it doesn't change the variable in get_all_links.
Variables in Python are like identifiers or labels, not like boxes. (Apologies for the quality of that analogy.)
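Here's a tiny demo of what I mean (using a cut-down get_next_target just for illustration):

Code:

def get_next_target(page):
    page = page[6:]   # rebinds the *local* name only

s = "hello world"
get_next_target(s)
print(s)              # still prints "hello world"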
Ask if you need further clarification.
Offline
Thank you for all your help. I am running the code right now (I assume it will take a few minutes). If I have any problems I will contact you. Once again, thanks.
Offline
blob8108 wrote:
Variables in Python are like identifiers or labels, not like boxes. (Apologies for the quality of that analogy.)
One of the things I never liked and never will like. Oh well.
Offline
Hardmath123 wrote:
blob8108 wrote:
Variables in Python are like identifiers or labels, not like boxes. (Apologies for the quality of that analogy.)
One of the things I never liked and never will like. Oh well.
Why?
Offline
I dunno, this seemed ugly:
>>> a = [1,2,3]
>>> b = a
>>> b[0] = 5
>>> a
[5, 2, 3]
I always thought making a reference should be explicit while copying should be default/implied (which is the opposite of coding today). Ideally references would be entirely new kinds of objects which, when modified, also mucked with the original object.
Also, it's weird how some data is copied and other data is referenced. Especially lists, which everyone needs to copy all the time.
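To actually copy a list you have to ask for it explicitly:

Code:

>>> a = [1, 2, 3]
>>> b = a[:]    # or list(a), or copy.copy(a)
>>> b[0] = 5
>>> a
[1, 2, 3]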
Last edited by Hardmath123 (2012-11-09 13:02:29)
Offline
Hardmath123 wrote:
I dunno, this seemed ugly:
Code:
>>> a = [1,2,3]
>>> b = a
>>> b[0] = 5
>>> a
[5, 2, 3]

I always thought making a reference should be explicit while copying should be default/implied (which is the opposite of coding today). Ideally references would be entirely new kinds of objects which, when modified, also mucked with the original object.
Is there a language that doesn't work like this? I'm interested...
I do see your point — maybe I'm just used to usually getting references. What would the syntax for referencing look like?
Also, it's weird how some data is copied and other data is referenced. Especially lists, which everyone needs to copy all the time.
What data is "copied"? Are you thinking of eg. strings, which are immutable, vs. lists, which aren't?
Offline
blob8108 wrote:
Hardmath123 wrote:
I dunno, this seemed ugly:
Code:
>>> a = [1,2,3]
>>> b = a
>>> b[0] = 5
>>> a
[5, 2, 3]

I always thought making a reference should be explicit while copying should be default/implied (which is the opposite of coding today). Ideally references would be entirely new kinds of objects which, when modified, also mucked with the original object.
Is there a language that doesn't work like this? I'm interested...
I do see your point — maybe I'm just used to usually getting references. What would the syntax for referencing look like?
I really don't think any language doesn't auto-reference lists; even BYOB does it. Which is strange. Maybe it's just a tradition carried down from when memory constraints prevented much copying-around of lists; today that's not a problem.
I suppose a syntax for referencing explicitly could look like this:
>>> a = [1,2,3]
>>> b = ref(a)    # or a.ref(), or maybe <a>
>>> c = a
>>> b[0] = 5
>>> c[0] = 1
>>> a
[5, 2, 3]
Maybe referencing should be function-specific, so you can specify in a function's argument whether you want to mess with the original or make a copy for your argument variable:
>>> def f(a):
...     a.append("hi")
...     return a
...
>>> x = [1,2]
>>> f(x)
[1, 2, "hi"]
>>> x
[1, 2, "hi"]

But:

>>> def f(<a>):
...     a.append("hi")
...     return a
...
>>> x = [1,2]
>>> f(x)
[1, 2, "hi"]
>>> x
[1, 2]
Also, it's weird how some data is copied and other data is referenced. Especially lists, which everyone needs to copy all the time.
What data is "copied"? Are you thinking of eg. strings, which are immutable, vs. lists, which aren't?
Yes. But there shouldn't be a divide between strings and lists in the first place, right? It's all data in the end, and you can have some useful string-mutating functions (replace, strip spaces, etc.). I always thought of strings as a list of characters anyway, so it's hard to see a big difference. Numbers I can understand, because you can't change a number, only get a new one by operating on it.
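For instance:

Code:

>>> s = "abc"
>>> s[0] = "z"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'str' object does not support item assignment
>>> l = ["a", "b", "c"]
>>> l[0] = "z"
>>> l
['z', 'b', 'c']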
Offline