I have been trying to write a website indexer in Python, but I have had some trouble getting it to work. Please help!!!
def get_page(url, staypage):
    onsite = url.find(staypage)
    if onsite == -1:
        return "page not on site"
    import urllib
    # Get a file-like object for the Python Web site's home page.
    f = urllib.urlopen(url)
    # Read from the object, storing the page's contents in 's'.
    s = f.read()
    f.close()
    return s

def get_next_target(page):
    start_link = page.find('<a href=')
    if start_link == -1:
        url = "none"
        return url
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote + 1:end_quote]
    if url not in links:
        return url, end_quote
    return none, end_quote

def union(p, q):
    for e in q:
        if e not in p:
            p.append(e)

def get_all_links(page, targetpage):
    global links
    links = []
    global relpathlink
    relpathlink = 0
    while True:
        url, endpos = get_next_target(page)
        print url
        if url != None:
            if url[0] == '/':
                url = targetpage + url
                url.find('=', relpathlink)
                if relpathlink == -1:
                    url = "Error code 74318433. Relative path links are not capable of being mapped at this time."
                    if url not in sitelist:
                        sitelist.append(url + "(Error code 74318433. Relative path links are not capable of being mapped at this time.)")
            print url
            if url:
                links.append(url)
            page = page[endpos:]
        else:
            break
        return links

def crawl_web(seed, targetpage):
    global tocrawl
    global crawled
    global sitelist
    sitelist = []
    tocrawl = [seed]
    crawled = []
    while tocrawl:
        print tocrawl
        print sitelist
        page = tocrawl.pop()
        if page not in crawled:
            union(tocrawl, get_all_links(get_page(page, targetpage), targetpage))
            crawled.append(page)
            sitelist.append(page)
    return sitelist

print crawl_web('http://www.futuresight.org', 'http://www.futuresight.org')
Offline
I don't think you can use "while tocrawl". Why don't you try "for i in tocrawl" instead, and replace tocrawl in the loop with i?
Offline
Gravitation wrote:
I don't think you can use "while tocrawl". Why don't you try "for i in tocrawl" instead, and replace tocrawl in the loop with i?
"while tocrawl" means that while tocrawl has something listed, it keeps going. I don't think "for i in tocrawl" would work. tocrawl.pop() gets me the last thing listed in the list tocrawl.
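For example:

Code:

>>> tocrawl = ['a', 'b', 'c']
>>> tocrawl.pop()
'c'
>>> tocrawl
['a', 'b']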
The problem is that the only link it is finding on a page is /. I have tried it on apple.com and futuresight.org, but with no results other than / each time. Please help.
Offline
You can use "while tocrawl". This just evaluates the "tocrawl" list as a boolean. Empty lists evaluate to false, true otherwise — so it just keeps going until there's nothing left in the list.
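For example:

Code:

>>> bool([])
False
>>> bool(['http://www.futuresight.org'])
True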
Your code's quite difficult to read, but I think your main problem is that the "return links" at the end of get_all_links() is indented inside the while loop.
Offline
Thanks for the help. I must have accidentally indented it while redoing my indentations. I fixed that, and a few other problems, but I am getting the same result.
How is my code difficult to read? Is it the lack of syntax highlighting, or how I coded it?
I also upgraded to Python 3.3, so print is slightly different.
Code:
def get_page(url, staypage):
    onsite = url.find(staypage)
    if onsite == -1:
        return "page not on site"
    import urllib
    # Get a file-like object for the Python Web site's home page.
    f = urllib.urlopen(url)
    # Read from the object, storing the page's contents in 's'.
    s = f.read()
    f.close()
    return s

def get_next_target(page):
    start_link = page.find('<a href=')
    if start_link == -1:
        url = "none"
        return url, 0
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote + 1:end_quote]
    if url not in links:
        return url, end_quote
    return none, end_quote

def union(p, q):
    for e in q:
        if e not in p:
            p.append(e)

def get_all_links(page, targetpage):
    global links
    links = []
    global relpathlink
    relpathlink = 0
    while True:
        url, endpos = get_next_target(page)
        if url != None:
            if url[0] == '/':
                url = targetpage + url
                url.find('=', relpathlink)
                if relpathlink == -1:
                    if url not in sitelist:
                        sitelist.append(url + "(Error code 74318433. Relative path links are not capable of being mapped at this time.)")
            if url:
                print(url)
                links.append(url)
            page = page[endpos:]
        else:
            break
    return links

def crawl_web(seed, targetpage):
    global tocrawl
    global crawled
    global sitelist
    sitelist = []
    tocrawl = [seed]
    crawled = []
    while tocrawl:
        page = tocrawl.pop()
        if page not in crawled:
            union(tocrawl, get_all_links(get_page(page, targetpage), targetpage))
            crawled.append(page)
            sitelist.append(page)
    return sitelist

print(crawl_web("http://www.futuresight.org", "http://www.futuresight.org"))
Offline
FutureSightTech wrote:
How is my code difficult to read? Is it the lack of syntax highlighting, or how I coded it?
It looks too complicated; I feel it should be simpler, somehow...
Anyway, try changing get_next_target to read like so:
def get_next_target(page):
    start_link = page.find('<a href=')
    if start_link == -1:
        return None, None
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote+1 : end_quote]
    if url not in links:
        return url, end_quote
    return None, end_quote
At the moment sometimes you're returning "none", which is a string containing the letters n+o+n+e, and sometimes the variable none (which doesn't exist!). You're then comparing the value to None (the line "if url != None:" in get_all_links) — this will only work if the value of url actually is None. (afaik).
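For example:

Code:

>>> url = "none"
>>> url != None
True
>>> url = None
>>> url != None
False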
Offline
blob8108 wrote:
FutureSightTech wrote:
How is my code difficult to read? Is it the lack of syntax highlighting, or how I coded it?
It looks too complicated; I feel it should be simpler, somehow...
Sorry, on further thought — I'll expand on this:
For a start, I'd use BeautifulSoup to parse HTML: searching for <a> tags has various problems (what if the link starts with <a id="home-link" href="...">, for example?). You don't need to write your own union(): just use sets instead of lists, and then use their built-in union. Those are my first two thoughts — hope that helps!
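Something like this, as a rough sketch (you'd need to install the beautifulsoup4 package first, and the HTML string here is just made up for illustration):

Code:

from bs4 import BeautifulSoup

def get_all_links(page):
    # Parse the HTML properly instead of searching for '<a href='.
    soup = BeautifulSoup(page, 'html.parser')
    # href=True skips <a> tags with no href, and extra attributes
    # like id="home-link" no longer confuse anything.
    return set(tag['href'] for tag in soup.find_all('a', href=True))

html = '<a id="home-link" href="/">Home</a> <a href="/about">About</a>'
links = get_all_links(html)
print(links)  # {'/', '/about'}

# Sets have union built in, so no hand-written union() needed:
tocrawl = {'http://www.futuresight.org'}
tocrawl = tocrawl | links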
Offline
I am not sure what Beautiful Soup or sets are, but I was taking an online course which had me make my first version like this. I will look into your suggestions, but I would rather use these for now.
I changed my get_next_target, but it's still getting / as the only link on the page. Perhaps the problem is with my page = page[endpos:].
Offline
It works for me...
Offline
When you run it, do you get a list of all the pages on www.futuresight.org?
Offline
FutureSightTech wrote:
When you run it, do you get a list of all the pages on www.futuresight.org?
Yes. Did you make the tweak I suggested?
Offline
The one about get_next_target, or the one about sets and Beautiful Soup?
Offline
FutureSightTech wrote:
The one about get_next_target, or the one about sets and Beautiful Soup?
The one about get_next_target not actually returning None.
Offline
Can you post your code? We must have some difference.
Edit: I now have the ability to edit posts, so here is my latest code. It is only returning None right now.
def get_page(url, staypage):
    onsite = url.find(staypage)
    if onsite == -1:
        return "page not on site"
    import urllib
    # Get a file-like object for the Python Web site's home page.
    f = urllib.urlopen(url)
    # Read from the object, storing the page's contents in 's'.
    s = f.read()
    f.close()
    return s

def get_next_target(page):
    url = "foo"
    start_link = page.find('<a href=')
    if start_link == -1:
        print(url)
        return None
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote+1 : end_quote]
    if url not in links:
        page = page[end_quote:]
        print(url)
        return url
    return None

def union(p, q):
    for e in q:
        if e not in p:
            p.append(e)

def get_all_links(page, targetpage):
    global url
    url = targetpage
    global links
    links = []
    global relpathlink
    while True:
        url = get_next_target(page)
        relpathlink = 0
        if url != None:
            if url[0] == '/':
                url = targetpage + url
                url.find('=', relpathlink)
                if relpathlink == -1:
                    if url not in sitelist:
                        sitelist.append(url + "(Error code 74318433. Relative path links are not capable of being mapped at this time.)")
                    else:
                        return url
            if url:
                print(url)
                links.append(url)
        else:
            break
    return links

def crawl_web(seed, targetpage):
    global tocrawl
    global crawled
    global sitelist
    sitelist = []
    tocrawl = [seed]
    crawled = []
    while tocrawl:
        page = tocrawl.pop()
        if page not in crawled:
            union(tocrawl, get_all_links(get_page(page, targetpage), targetpage))
            crawled.append(page)
            sitelist.append(page)
    return sitelist

print(get_all_links("http://www.futuresight.org", "http://www.futuresight.org"))
Last edited by FutureSightTech (2012-11-09 06:13:30)
Offline
FutureSightTech wrote:
Can you post your code? We must have some difference.
I simply applied my suggested fix to the code you posted.
Edit: I now have the ability to edit posts, so here is my latest code. It is only returning None right now.
Presumably you meant `print(crawl_web("http://www.futuresight.org","http://www.futuresight.org"))` ?
Either way: on line 25ish, inside get_next_target, there's this line here:
page = page[end_quote:]
The reason this doesn't work is that you're expecting the assignment to "page" to also update the variable inside get_all_links, which it won't. The "page" passed as an argument to get_next_target refers to the same value as the caller's variable, but reassigning the local name only rebinds it; it doesn't change the variable in get_all_links.
Variables in Python are like identifiers or labels, not like boxes. (Apologies for the quality of that analogy.)
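Here's a tiny demo of what I mean (using a cut-down get_next_target just for illustration):

Code:

def get_next_target(page):
    page = page[6:]   # rebinds the *local* name only

s = "hello world"
get_next_target(s)
print(s)              # still prints "hello world"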
Ask if you need further clarification.
Offline
Thank you for all your help. I am running the code right now (I assume it will take a few minutes). If I have any problems I will contact you. Once again, thanks.
Offline
blob8108 wrote:
Variables in Python are like identifiers or labels, not like boxes. (Apologies for the quality of that analogy.)
One of the things I never liked and never will like. Oh well.
Offline
Hardmath123 wrote:
blob8108 wrote:
Variables in Python are like identifiers or labels, not like boxes. (Apologies for the quality of that analogy.)
One of the things I never liked and never will like. Oh well.
Why?
Offline
I dunno, this seemed ugly:
>>> a = [1,2,3]
>>> b = a
>>> b[0] = 5
>>> a
[5, 2, 3]
I always thought making a reference should be explicit while copying should be default/implied (which is the opposite of coding today). Ideally references would be entirely new kinds of objects which, when modified, also mucked with the original object.
Also, it's weird how some data is copied and other data is referenced. Especially lists, which everyone needs to copy all the time.
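To actually copy a list you have to ask for it explicitly:

Code:

>>> a = [1, 2, 3]
>>> b = a[:]    # or list(a), or copy.copy(a)
>>> b[0] = 5
>>> a
[1, 2, 3]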
Last edited by Hardmath123 (2012-11-09 13:02:29)
Offline
Hardmath123 wrote:
I dunno, this seemed ugly:
Code:
>>> a = [1,2,3]
>>> b = a
>>> b[0] = 5
>>> a
[5, 2, 3]

I always thought making a reference should be explicit while copying should be default/implied (which is the opposite of coding today). Ideally references would be entirely new kinds of objects which, when modified, also mucked with the original object.
Is there a language that doesn't work like this? I'm interested...
I do see your point — maybe I'm just used to usually getting references. What would the syntax for referencing look like?
Also, it's weird how some data is copied and other data is referenced. Especially lists, which everyone needs to copy all the time.
What data is "copied"? Are you thinking of eg. strings, which are immutable, vs. lists, which aren't?
Offline
blob8108 wrote:
Hardmath123 wrote:
I dunno, this seemed ugly:
Code:
>>> a = [1,2,3]
>>> b = a
>>> b[0] = 5
>>> a
[5, 2, 3]

I always thought making a reference should be explicit while copying should be default/implied (which is the opposite of coding today). Ideally references would be entirely new kinds of objects which, when modified, also mucked with the original object.
Is there a language that doesn't work like this? I'm interested...
I do see your point — maybe I'm just used to usually getting references. What would the syntax for referencing look like?
I really don't think any language doesn't auto-reference lists; even BYOB does it. Which is strange. Maybe it's just a tradition carried down from when memory constraints prevented much copying-around of lists; today that's not a problem.
I suppose a syntax for referencing explicitly could look like this:
>>> a = [1,2,3]
>>> b = ref(a)    # or a.ref(), or maybe <a>
>>> c = a
>>> b[0] = 5
>>> c[0] = 1
>>> a
[5, 2, 3]
Maybe referencing should be function-specific, so you can specify in a function's argument whether you want to mess with the original or make a copy for your argument variable:
>>> def f(a):
...     a.append("hi")
...     return a
...
>>> x = [1,2]
>>> f(x)
[1, 2, "hi"]
>>> x
[1, 2, "hi"]

But:

>>> def f(<a>):
...     a.append("hi")
...     return a
...
>>> x = [1,2]
>>> f(x)
[1, 2, "hi"]
>>> x
[1, 2]
Also, it's weird how some data is copied and other data is referenced. Especially lists, which everyone needs to copy all the time.
What data is "copied"? Are you thinking of eg. strings, which are immutable, vs. lists, which aren't?
Yes. But there shouldn't be a divide between strings and lists in the first place, right? It's all data in the end, and you can have some useful string-mutating functions (replace, strip spaces, etc.). I always thought of strings as a list of characters anyway, so it's hard to see a big difference. Numbers I can understand, because you can't change a number, only get a new one by operating on it.
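For instance:

Code:

>>> s = "abc"
>>> s[0] = "z"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'str' object does not support item assignment
>>> l = ["a", "b", "c"]
>>> l[0] = "z"
>>> l
['z', 'b', 'c']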
Offline