This is a read-only archive of the old Scratch 1.x Forums.

#1 2012-11-05 05:58:32

FutureSightTech
Scratcher
Registered: 2012-05-25
Posts: 35

Python website indexer help

I have been trying to write a website indexer in Python, but I have had some trouble getting it to work. Please help!!!

def get_page(url,staypage):
    onsite = url.find(staypage)
    if onsite == -1:
        return "page not on site"
    import urllib
    # Get a file-like object for the Python Web site's home page.
    f = urllib.urlopen(url)
    # Read from the object, storing the page's contents in 's'.
    s = f.read()
    f.close()
    return s

def get_next_target(page):
    start_link = page.find('<a href=')
    if start_link == -1:
        url = "none"
        return url 
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote + 1:end_quote]
    if url not in links:
        return url, end_quote
    return none, end_quote

def union(p,q):
    for e in q:
        if e not in p:
            p.append(e)

def get_all_links(page,targetpage):
    global links
    links = []
    global relpathlink
    relpathlink = 0
    while True:
        url,endpos = get_next_target(page)
        print url
        if url != None:
            if url[0]=='/':
                url= targetpage+url
            url.find('=',relpathlink)
            if relpathlink == -1:
                url = "Error code 74318433. Relative path links are not capable of being mapped at this time."
                if url not in sitelist:
                    sitelist.append(url + "(Error code 74318433. Relative path links are not capable of being mapped at this time.)")
        print url
        if url:
            links.append(url)
            page = page[endpos:]
        else:
            break
        return links

def crawl_web(seed,targetpage):
    global tocrawl
    global crawled
    global sitelist
    sitelist = []
    tocrawl = [seed]
    crawled = []
    while tocrawl:
        print tocrawl
        print sitelist
        page = tocrawl.pop()
        if page not in crawled:
            union(tocrawl,get_all_links(get_page(page,targetpage),targetpage))
            crawled.append(page)
            sitelist.append(page)
    return sitelist

print crawl_web('http://www.futuresight.org','http://www.futuresight.org')


 

#2 2012-11-05 06:02:19

Gravitation
New Scratcher
Registered: 2012-09-26
Posts: 500+

Re: Python website indexer help

I don't think you can use "while tocrawl". Why don't you try "for i in tocrawl" instead, and replace tocrawl in the loop with i?


 

#3 2012-11-05 08:00:47

roijac
Scratcher
Registered: 2010-01-19
Posts: 1000+

Re: Python website indexer help

What is the problem?


 

#4 2012-11-05 11:29:25

FutureSightTech
Scratcher
Registered: 2012-05-25
Posts: 35

Re: Python website indexer help

Gravitation wrote:

I don't think you can use "while tocrawl". Why don't you try "for i in tocrawl" instead, and replace tocrawl in the loop with i?

"while tocrawl" means that while tocrawl has something listed, it keeps going. I don't think "for i in tocrawl" would work. tocrawl.pop() gets me the last thing listed in the list tocrawl.

The problem is that the only link it is finding on a page is /. I have tried it on apple.com and futuresight.org, but with no results other than / each time. Please help.


 

#5 2012-11-05 15:34:26

blob8108
Scratcher
Registered: 2007-06-25
Posts: 1000+

Re: Python website indexer help

You can use "while tocrawl". This just evaluates the "tocrawl" list as a boolean: an empty list evaluates to False and a non-empty one to True, so the loop keeps going until there's nothing left in the list.

Your code's quite difficult to read, but I think your main problem is that the "return links" at the end of get_all_links() is indented inside the while loop, so the function returns after the first pass through the loop. :)
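
To make both points concrete, here is a minimal sketch. The page names and the two collect functions are made up for illustration and are not part of the code posted above.

```python
# Hypothetical worklist, standing in for the thread's "tocrawl" list.
tocrawl = ["page-a", "page-b"]
crawled = []

# An empty list is falsy, so this loop ends once everything has been popped.
while tocrawl:
    page = tocrawl.pop()          # removes and returns the last item
    crawled.append(page)
    # The worklist can grow while the loop runs; the while/pop form copes
    # with that without iterating over a list that is being changed.
    if page == "page-b":
        tocrawl.append("page-c")

print(crawled)                    # ['page-b', 'page-c', 'page-a']
print(bool([]), bool(["x"]))      # False True


def collect_indented_return(items):
    found = []
    for item in items:
        found.append(item)
        return found              # inside the loop: exits after the first item


def collect(items):
    found = []
    for item in items:
        found.append(item)
    return found                  # after the loop: every item is collected


print(collect_indented_return([1, 2, 3]))   # [1]
print(collect([1, 2, 3]))                   # [1, 2, 3]
```

The only difference between the two collect functions is the indentation of the return line, which is the same kind of slip as in get_all_links().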


Things I've made: kurt | scratchblocks2 | this cake


 

#6 2012-11-07 11:47:21

FutureSightTech
Scratcher
Registered: 2012-05-25
Posts: 35

Re: Python website indexer help

Thanks for the help. I must have accidentally indented it while redoing my indentations. I fixed that, and a few other problems, but I am getting the same result.

How is my code difficult to read? Is it the lack of syntax highlighting or how I coded it?

I also upgraded to Python 3.3, so print is slightly different.

Code:
def get_page(url,staypage):
    onsite = url.find(staypage)
    if onsite == -1:
        return "page not on site"
    import urllib
    # Get a file-like object for the Python Web site's home page.
    f = urllib.urlopen(url)
    # Read from the object, storing the page's contents in 's'.
    s = f.read()
    f.close()
    return s

def get_next_target(page):
    start_link = page.find('<a href=')
    if start_link == -1:
        url = "none"
        return url, 0 
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote + 1:end_quote]
    if url not in links:
        return url, end_quote
    return none, end_quote

def union(p,q):
    for e in q:
        if e not in p:
            p.append(e)

def get_all_links(page,targetpage):
    global links
    links = []
    global relpathlink
    relpathlink = 0
    while True:
        url,endpos = get_next_target(page)
        if url != None:
            if url[0]=='/':
                url= targetpage+url
            url.find('=',relpathlink)
            if relpathlink == -1:
                if url not in sitelist:
                    sitelist.append(url + "(Error code 74318433. Relative path links are not capable of being mapped at this time.)")
        if url:
            print(url)
            links.append(url)
            page = page[endpos:]
        else:
            break
    return links

def crawl_web(seed,targetpage):
    global tocrawl
    global crawled
    global sitelist
    sitelist = []
    tocrawl = [seed]
    crawled = []
    while tocrawl:
        page = tocrawl.pop()
        if page not in crawled:
            union(tocrawl,get_all_links(get_page(page,targetpage),targetpage))
            crawled.append(page)
            sitelist.append(page)
    return sitelist
     
print(crawl_web("http://www.futuresight.org","http://www.futuresight.org"))
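
A side note on the Python 3 upgrade: in Python 3, urlopen lives in urllib.request, so a plain "import urllib" followed by urllib.urlopen(url) raises an AttributeError there. The following is only a sketch of how get_page might look under Python 3. It assumes the pages are UTF-8 text, and it swaps the url.find(staypage) test for startswith and the "page not on site" string for an empty string, which are choices made for this sketch rather than anything from the thread.

```python
from urllib.request import urlopen


def get_page(url, staypage):
    """Return the page's HTML as text, or '' if url is outside staypage."""
    # Assumption for this sketch: "on site" means the URL starts with staypage.
    if not url.startswith(staypage):
        return ""
    # urlopen returns bytes in Python 3, so decode before using str.find on it.
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")
```

Returning an empty string for off-site pages means the link search simply finds nothing, instead of scanning the text "page not on site" for links.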


 

#7 2012-11-07 14:32:50

blob8108
Scratcher
Registered: 2007-06-25
Posts: 1000+

Re: Python website indexer help

FutureSightTech wrote:

How is my code difficult to read? Is it the lack of syntax highlighting or how I coded it?

It looks too complicated :P I feel it should be simpler, somehow...

Anyway, try changing get_next_target to read like so:

Code:

def get_next_target(page):    
    start_link = page.find('<a href=')
    if start_link == -1: 
        return None, None
    
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote+1 : end_quote]
    
    if url not in links:
        return url, end_quote
    
    return None, end_quote

At the moment you're sometimes returning "none", which is a string containing the letters n+o+n+e, and sometimes the variable none (which doesn't exist!). You're then comparing the value to None (the line "if url != None:" in get_all_links), which will only work if the value of url actually is None (afaik). :)
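
A small sketch of the three cases, using throwaway variable names chosen for illustration:

```python
url = "none"                 # a four-letter string, not the None object
print(url != None)           # True: the string is not None
print(bool(url))             # True: non-empty strings are truthy

caught = False
try:
    none                     # lowercase "none" is just an undefined name
except NameError as error:
    caught = True
    print("NameError:", error)

result = None
print(result is None)        # True; "is None" is the usual idiom
```

So a loop that relies on "if url != None:" or "if url:" to stop never sees a stopping value when the function hands back the string "none".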



 

#8 2012-11-07 14:38:59

blob8108
Scratcher
Registered: 2007-06-25
Posts: 1000+

Re: Python website indexer help

blob8108 wrote:

FutureSightTech wrote:

How is my code difficult to read? Is it the lack of syntax highlighting or how I coded it?

It looks too complicated :P I feel it should be simpler, somehow...

Sorry, on further thought, I'll expand on this:

For a start, I'd use BeautifulSoup to parse HTML: searching for the literal text <a href= has various problems (what if the link starts with <a id="home-link" href="...">, for example?). You don't need to write your own union(): just use sets instead of lists, and then use their built-in union. Those are my first two thoughts; hope that helps! :)
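
As a rough illustration of both ideas, here is a sketch that uses the standard library's html.parser instead of BeautifulSoup (so it runs without installing anything) together with a set. The LinkCollector class, the extract_links helper, and the sample HTML are invented for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, wherever href sits in the tag."""

    def __init__(self):
        super().__init__()
        self.links = set()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.add(value)


def extract_links(html):
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


sample = '<a id="home-link" href="/">Home</a> <a href="/about">About</a>'
tocrawl = {"/contact"}
tocrawl |= extract_links(sample)       # in-place set union, no helper needed
print(sorted(tocrawl))                 # ['/', '/about', '/contact']
```

Because a set never holds duplicates, the "if e not in p" check from union() is no longer needed. One thing to keep in mind is that set.pop() removes an arbitrary element rather than the last one added.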



 

#9 2012-11-07 15:07:15

FutureSightTech
Scratcher
Registered: 2012-05-25
Posts: 35

Re: Python website indexer help

I am not sure what beautiful soup or sets are, but I was taking an online course which had me make my first version like this. I will look into your functions, but I would rather use these for now.

I changed my get_next_target, but it's still getting / as the only link on the page. Perhaps the problem is with my page=page[endpos:].


 

#10 2012-11-07 15:21:08

blob8108
Scratcher
Registered: 2007-06-25
Posts: 1000+

Re: Python website indexer help

It works for me... :P



 

#11 2012-11-07 15:39:07

FutureSightTech
Scratcher
Registered: 2012-05-25
Posts: 35

Re: Python website indexer help

When you run it you get a list of all the pages on www.futuresight.org?


 

#12 2012-11-07 16:01:34

blob8108
Scratcher
Registered: 2007-06-25
Posts: 1000+

Re: Python website indexer help

FutureSightTech wrote:

When you run it you get a list of all the pages on www.futuresight.org?

Yes. Did you make the tweak I suggested?



 

#13 2012-11-08 12:45:29

FutureSightTech
Scratcher
Registered: 2012-05-25
Posts: 35

Re: Python website indexer help

The one about get_next_target or the one about sets and beautiful soup?


 

#14 2012-11-08 15:13:13

blob8108
Scratcher
Registered: 2007-06-25
Posts: 1000+

Re: Python website indexer help

FutureSightTech wrote:

The one about get_next_target or the one about sets and beautiful soup?

The one about get_next_target not actually returning None.



 

#15 2012-11-08 15:42:08

FutureSightTech
Scratcher
Registered: 2012-05-25
Posts: 35

Re: Python website indexer help

Can you post your code? We must have some difference.


Edit: I now have the ability to edit posts, so here is my latest code. It is only returning None right now.

def get_page(url,staypage):
    onsite = url.find(staypage)
    if onsite == -1:
        return "page not on site"
    import urllib
    # Get a file-like object for the Python Web site's home page.
    f = urllib.urlopen(url)
    # Read from the object, storing the page's contents in 's'.
    s = f.read()
    f.close()
    return s
   
def get_next_target(page):
    url = "foo"
    start_link = page.find('<a href=')
    if start_link == -1:
        print (url)
        return None
       
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote+1 : end_quote]

    if url not in links:
        page = page[end_quote:]
        print(url)
        return url
   
    return None

def union(p,q):
    for e in q:
        if e not in p:
            p.append(e)

def get_all_links(page,targetpage):
    global url
    url = targetpage
    global links
    links = []
    global relpathlink
    while True:
        url = get_next_target(page)
        relpathlink = 0
        if url != None:
            if url[0]=='/':
                url= targetpage+url
            url.find('=',relpathlink)
            if relpathlink == -1:
                if url not in sitelist:
                    sitelist.append(url + "(Error code 74318433. Relative path links are not capable of being mapped at this time.)")
        else:
            return url
        if url:
            print(url)
            links.append(url)
           
        else:
            break
    return links

def crawl_web(seed,targetpage):
    global tocrawl
    global crawled
    global sitelist
    sitelist = []
    tocrawl = [seed]
    crawled = []
    while tocrawl:
        page = tocrawl.pop()
        if page not in crawled:
            union(tocrawl,get_all_links(get_page(page,targetpage),targetpage))
            crawled.append(page)
            sitelist.append(page)
    return sitelist
     
print(get_all_links("http://www.futuresight.org","http://www.futuresight.org"))

Last edited by FutureSightTech (2012-11-09 06:13:30)


 

#16 2012-11-09 11:08:04

blob8108
Scratcher
Registered: 2007-06-25
Posts: 1000+

Re: Python website indexer help

FutureSightTech wrote:

Can you post your code? We must have some difference.

I simply applied my suggested fix to the code you posted.

Edit: I now have the ability to edit posts, so here is my latest code. It is only returning None right now.

Presumably you meant `print(crawl_web("http://www.futuresight.org","http://www.futuresight.org"))` ?

Either way: on line 25ish, inside get_next_target, there's this line here:

    page = page[end_quote:]

The reason this doesn't work is that you're expecting the assignment to "page" to also update the variable of the same name inside get_all_links, which it won't. The "page" argument passed to get_next_target is kinda a reference to the same value that get_all_links holds. You get the value, not the caller's variable, so reassigning the local variable only changes what "page" means inside get_next_target.

Variables in Python are like identifiers or labels, not like boxes. (Apologies for the quality of that analogy.)
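
A short sketch of the difference, followed by one way the two functions could cooperate. This is a simplified variation for illustration, not the exact code from the earlier posts: the chop helpers and the sample HTML are invented, and the duplicate check is moved into get_all_links so that no global is needed.

```python
def chop_locally(page):
    # Rebinds the local name only; the caller's variable is untouched.
    page = page[5:]


def chop_and_return(page):
    # Hands the new value back so the caller can rebind its own name.
    return page[5:]


def get_next_target(page):
    """Return (url, end_quote) for the first '<a href="..."', else (None, None)."""
    start_link = page.find('<a href=')
    if start_link == -1:
        return None, None
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    return page[start_quote + 1:end_quote], end_quote


def get_all_links(page):
    links = []
    while True:
        url, end_quote = get_next_target(page)
        if url is None:
            break
        if url not in links:
            links.append(url)
        page = page[end_quote:]   # the caller moves its own "page" forward
    return links


text = "0123456789"
chop_locally(text)
print(text)                       # 0123456789 (unchanged)
print(chop_and_return(text))      # 56789

html = '<a href="/one">1</a><a href="/two">2</a><a href="/one">again</a>'
print(get_all_links(html))        # ['/one', '/two']
```

The slicing happens in get_all_links, using the end position that get_next_target returns, so each pass starts searching after the previous link.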

Ask if you need further clarification :P



 

#17 2012-11-09 11:33:08

FutureSightTech
Scratcher
Registered: 2012-05-25
Posts: 35

Re: Python website indexer help

Thank you for all your help. I am running the code right now (I assume it will take a few minutes). If I have any problems I will contact you. Once again, thanks.


 

#18 2012-11-09 11:58:17

blob8108
Scratcher
Registered: 2007-06-25
Posts: 1000+

Re: Python website indexer help

Oh, good :)



 

#19 2012-11-09 12:23:49

Hardmath123
Scratcher
Registered: 2010-02-19
Posts: 1000+

Re: Python website indexer help

blob8108 wrote:

Variables in Python are like identifiers or labels, not like boxes. (Apologies for the quality of that analogy.)

One of the things I never liked and never will like. Oh well.


Hardmaths-MacBook-Pro:~ Hardmath$ sudo make $(whoami) a sandwich


 

#20 2012-11-09 12:34:51

blob8108
Scratcher
Registered: 2007-06-25
Posts: 1000+

Re: Python website indexer help

Hardmath123 wrote:

blob8108 wrote:

Variables in Python are like identifiers or labels, not like boxes. (Apologies for the quality of that analogy.)

One of the things I never liked and never will like. Oh well.

Why? :)



 

#21 2012-11-09 12:52:46

Hardmath123
Scratcher
Registered: 2010-02-19
Posts: 1000+

Re: Python website indexer help

I dunno, this seemed ugly:

Code:

>>> a = [1,2,3]
>>> b = a
>>> b[0] = 5
>>> a
[5,2,3]

I always thought making a reference should be explicit while copying should be default/implied (which is the opposite of coding today). Ideally references would be entirely new kinds of objects which when modified also mucked with the original object.

Also, it's weird how some data is copied and other data is referenced. Especially lists, which everyone needs to copy all the time.
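
For reference, copying in Python is explicit, as this small sketch shows. The variable names are arbitrary.

```python
import copy

a = [1, 2, 3]
alias = a             # second name for the same list object
shallow = list(a)     # new list; a[:] and a.copy() do the same
alias[0] = 5
print(a)              # [5, 2, 3]
print(shallow)        # [1, 2, 3]

nested = [[1, 2], [3, 4]]
deep = copy.deepcopy(nested)   # also copies the inner lists
nested[0][0] = 99
print(deep)           # [[1, 2], [3, 4]]
```

A shallow copy is enough for a flat list; copy.deepcopy is only needed when the inner objects must be independent as well.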

Last edited by Hardmath123 (2012-11-09 13:02:29)



 

#22 2012-11-09 16:01:45

blob8108
Scratcher
Registered: 2007-06-25
Posts: 1000+

Re: Python website indexer help

Hardmath123 wrote:

I dunno, this seemed ugly:

Code:

>>> a = [1,2,3]
>>> b = a
>>> b[0] = 5
>>> a
[5,2,3]

I always thought making a reference should be explicit while copying should be default/implied (which is the opposite of coding today). Ideally references would be entirely new kinds of objects which when modified also mucked with the original object.

Is there a language that doesn't work like this? I'm interested...

I do see your point — maybe I'm just used to usually getting references. What would the syntax for referencing look like?

Also, it's weird how some data is copied and other data is referenced. Especially lists, which everyone needs to copy all the time.

What data is "copied"? Are you thinking of eg. strings, which are immutable, vs. lists, which aren't?
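
A quick sketch of that distinction, with arbitrary names: assignment never copies in either case, and the visible difference comes from strings having no operations that change them in place.

```python
s = "abc"
t = s
print(t is s)         # True: assignment did not copy the string either

t = t.upper()         # builds a new string and rebinds t
print(s, t)           # abc ABC

try:
    s[0] = "x"        # strings do not support item assignment
except TypeError as error:
    print("TypeError:", error)

items = [1, 2, 3]
other = items
other.append(4)       # mutates the one shared list
print(items)          # [1, 2, 3, 4]
```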



 

#23 2012-11-09 22:55:57

Hardmath123
Scratcher
Registered: 2010-02-19
Posts: 1000+

Re: Python website indexer help

blob8108 wrote:

Hardmath123 wrote:

I dunno, this seemed ugly:

Code:

>>> a = [1,2,3]
>>> b = a
>>> b[0] = 5
>>> a
[5,2,3]

I always thought making a reference should be explicit while copying should be default/implied (which is the opposite of coding today). Ideally references would be entirely new kinds of objects which when modified also mucked with the original object.

Is there a language that doesn't work like this? I'm interested...

I do see your point — maybe I'm just used to usually getting references. What would the syntax for referencing look like?

I really don't think any language doesn't auto-reference lists; even BYOB does it. Which is strange. Maybe it's just a tradition carried down from when memory constraints prevented much copying-around of lists; today that's not a problem.

I suppose a syntax for referencing explicitly could look like this:

Code:

>>> a = [1,2,3]
>>> b = ref(a), a.ref(), maybe <a>
>>> c = a
>>> b[0] = 5
>>> c[0] = 1
>>> a
[5,2,3]

Maybe referencing should be function-specific, so you can specify in a function's argument whether you want to mess with the original or make a copy for your argument variable:

Code:

>>> def f(a):
...     a.append("hi")
...     return a
...
>>> x = [1,2]
>>> f(x)
[1,2,"hi"]
>>> x
[1,2,"hi"]

BUT

>>> def f(<a>):
...     a.append("hi")
...     return a
...
>>> x = [1,2]
>>> f(x)
[1,2,"hi"]
>>> x
[1,2]
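
For comparison, the second behaviour can be approximated in Python as it exists today by copying inside the function. This is only a sketch with made-up function names, and the <a> syntax above remains hypothetical.

```python
def append_in_place(a):
    a.append("hi")        # mutates the caller's list
    return a


def append_to_copy(a):
    a = list(a)           # explicit shallow copy; caller's list is left alone
    a.append("hi")
    return a


x = [1, 2]
print(append_in_place(x))   # [1, 2, 'hi']
print(x)                    # [1, 2, 'hi']

y = [1, 2]
print(append_to_copy(y))    # [1, 2, 'hi']
print(y)                    # [1, 2]
```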

Also, it's weird how some data is copied and other data is referenced. Especially lists, which everyone needs to copy all the time.

What data is "copied"? Are you thinking of eg. strings, which are immutable, vs. lists, which aren't?

Yes. But there shouldn't be a divide between strings and numbers in the first place, right? It's all data in the end, and you can have some useful string-mutating functions (replace, strip spaces, etc). I always thought of strings as a list of characters anyway, so it's hard to see a big difference. Numbers I can understand, because you can't change a number, only get a new one by operating on it. :)



 
