It's a common myth that any two people are separated by at most six degrees of separation. That is, if I know Alice, Alice knows Bob, Bob knows Charlie, Charlie knows Dylan, and Dylan knows Barack Obama, five degrees separate me from Obama. I want to test that for Scratch via the friend system. Using the APIs and my newly acquired Python knowledge, I want to find my friends, their friends, and so on, out to six degrees. Then I want to count the unique Scratchers found, and compare that to the number of Scratchers who have made any projects (spam accounts won't be friends of anyone, will they?).
The hitch is that I predict it will take a lot of computation. Crunching Paddle2See's friends alone could take an hour or two, not to mention the bandwidth. So as a corollary exercise, I want to try to set up a shared computing system where YOU can run a program that crunches some numbers for me, while still minimizing the number of friends counted (since there will be lots of repeats). The programs would ideally upload their data to a shared server. I could then use JS/Quartz to generate an interactive web-like representation of friends in the community. I could see, for example, how many friends friend each other back, or whether there are any noticeable sub-communities (e.g. the regular ATers).
Also, I want to get a thumbs-up from the Scratch Team because I have a feeling this might be a burden on the servers.
I might run this a little bit.
Well, it's certainly a very interesting problem.
I don't know how much computation you need; I'm not sure what algorithm one uses. From a little Wikipediaring, I think this might be called the "all-pairs shortest path problem": finding the shortest path between every pair of nodes in a graph, where the nodes are people and the edges represent "friend" relationships. I suppose it's also a directed graph, because (as I understand it) you can add me as a friend without me "friending" you.
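For a single pair, a plain breadth-first search over friend lists would find the shortest chain. Here's a minimal sketch, assuming a hypothetical get_friends(user) function that returns a list of usernames:

from collections import deque

def degrees_between(start, goal, get_friends):
    # get_friends is an assumed helper, not a real API call
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        user, depth = queue.popleft()
        if user == goal:
            return depth
        for friend in get_friends(user):
            if friend not in seen:
                seen.add(friend)
                queue.append((friend, depth + 1))
    return None  # no friendship chain connects them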
Somehow I imagine that you might end up downloading all the friend lists for each user anyway. After all, if it's true that any two people are separated by at most six degrees of separation, finding your friends' friends to six degrees would include everyone!
I'd definitely ask the ST first... Or maybe just download the data very slowly, over a long period of time?
Hardmath123 wrote:
I want to try to set up a shared computing system...
Again, this depends on the algorithms you use and how they work — but I'm not sure it'll be very easy to parallelise.
The analysis sounds fascinating if you can do it, and a JS visualiser sounds cool!
Last edited by blob8108 (2012-09-08 11:42:07)
Thanks blob. My main concerns are parallelizing and making sure I don't crash Scratch. I'll definitely ask the Scratch Team before I do it. You have a point that I'll end up downloading all the friends lists eventually. There are 354,694 Scratchers with projects, so...
Hardmath123 wrote:
There are 354,694 Scratchers with projects, so...
xD
How much of a hit is each API request, performance-wise? (download size, response time)
It's actually just a colon-separated list of user-ids I'm downloading, but the server takes a worryingly long time.
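For reference, a quick way to measure the hit per request, both download size and response time (the username is just an example):

import time, urllib

start = time.time()
data = urllib.urlopen("http://scratch.mit.edu/api/getfriendsbyusername/Hardmath123").read()
print len(data), "bytes in", round(time.time() - start, 2), "seconds"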
Interesting idea! Can you break it up so that you are only pulling data for a limited number of people at a run? And maybe run it in the off-hours?
Paddle2See wrote:
Interesting idea! Can you break it up so that you are only pulling data for a limited number of people at a run? And maybe run it in the off-hours?
Sure. I'm thinking about how to break up the data downloads and how to minimize them, and about implementing small measures so that the friends of a user are only downloaded once, and only if the user has created projects.
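Roughly, something like this sketch (get_friends and has_projects stand in for the relevant API calls):

def crawl(start, get_friends, has_projects):
    seen = set()
    frontier = [start]
    friends_of = {}
    while frontier:
        user = frontier.pop()
        if user in seen or not has_projects(user):
            continue  # skip repeats and project-less (likely spam) accounts
        seen.add(user)
        friends_of[user] = get_friends(user)  # each user is fetched exactly once
        frontier.extend(friends_of[user])
    return friends_of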
Looks like the only information you really need for this is the users' IDs. We know how many users there are, so why not simply loop over all the IDs and make one request per user?
That's going to be about half a million to a million requests (we don't really know who has uploaded projects), which you could split up into batches of a few hundred requests, with a site keeping a list of which batches have been run so far (clients ask the server what to download, then upload the data).
Then you could crunch the numbers on your computer: that shouldn't be so bad!
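A rough sketch of what each client could do (the /next_batch and /submit endpoints are hypothetical, and fetch_friends_raw stands in for the actual API call):

import urllib

SERVER = "http://example.com"  # placeholder for the shared server

def run_one_batch(fetch_friends_raw):
    # ask the server which range of user ids to crunch next
    batch = urllib.urlopen(SERVER + "/next_batch").read()
    lo, hi = map(int, batch.strip().split("-"))
    lines = ["%d:%s" % (uid, fetch_friends_raw(uid)) for uid in range(lo, hi)]
    # POST the results back so the batch is marked as done
    urllib.urlopen(SERVER + "/submit", "\n".join(lines))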
But then I'd have to do it for all Scratchers, even those who haven't been friended by anyone else. It's simpler to go recursively: look at my friends, then their friends, and so on. We'd have to store the crunched Scratchers on a server so that all participating computers know what has been done. In the end, simply put, I want every Scratcher's friends list; I can then work with that offline.
A thing to note: there are over 1 million Scratchers. Let's say there are 1,500,000 to be safe. Each has an average of, say, 30 friends, and each friend id is about 6 characters, so in total we'd have to download about 260 MB of data (1,500,000 × 30 × 6 bytes ≈ 257 MiB).
EDIT: I take that back... maybe it is better to just brute-force through all Scratchers...
EDIT2: I'd like to get the collection done in about two months. So let's try to get a basic shared computing system and all that stuff ready by September 20th, and then get the actual downloading done over the following two months. That works out to 25,000 Scratchers per day, or about one every 3.5 seconds, running 24x7. Obviously that's impractical for just me, so we should aim for 10 volunteers; each volunteer then has to crunch 150,000 Scratchers in two months, which is around 2,500 per day, or just under two a minute, a much more realistic target. So, who'd like to volunteer?
Last edited by Hardmath123 (2012-09-09 06:35:10)
I'll volunteer, and I 'might' put it on my server that I run 24/7.
bobbybee wrote:
I'll volunteer, and I 'might' put it on my server that I run 24/7.
Thanks! I think all the server needs to do is accept a one-line message and write it to a text file which I can download later.
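Something like this minimal sketch would do (Python 2 standard library, to match the urllib code I'm using; the port and filename are arbitrary):

from BaseHTTPServer import BaseHTTPRequestHandler, HTTPServer

class CollectHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # read the one-line message a volunteer POSTs to us
        length = int(self.headers.getheader("content-length") or 0)
        line = self.rfile.read(length).strip()
        with open("results.txt", "a") as f:  # append it for later download
            f.write(line + "\n")
        self.send_response(200)
        self.end_headers()

HTTPServer(("", 8000), CollectHandler).serve_forever()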
If you're using Python, definitely have a look at using:
* Twisted: great for networking! Seriously, you don't want to do stuff without it. Deferreds take a bit of getting used to.
* the multiprocessing module: lets you split tasks across multiple cores. Normal Python threads won't run in parallel on multiple cores (read about the GIL); there's a quick sketch after this list.
* Stackless Python.
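For example, a minimal multiprocessing sketch (fetch_friends is a stand-in based on the kind of urllib code you'd be writing anyway):

import urllib
from multiprocessing import Pool

def fetch_friends(username):
    url = "http://scratch.mit.edu/api/getfriendsbyusername/" + username
    return urllib.urlopen(url).read().strip()

if __name__ == "__main__":
    pool = Pool(4)  # four worker processes fetching in parallel
    for result in pool.map(fetch_friends, ["Hardmath123", "blob8108"]):
        print result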
Stackless is really cool — it lets you run hundreds/thousands of lightweight micro-threads, called "tasklets". It implements co-operative multitasking: periodically call stackless.schedule() from each tasklet to yield to the others. It has channels to communicate between tasklets. They're also pickleable, so I think you can pause a running tasklet (i.e. a Python function), save it to a string/file, send it across the network, and then resume it on a different machine! (I'm sure I managed to make this last one work at least once...)
Stackless might be useful for designing the algorithm to perform the actual computation. You can do things like getting around Python's recursion limit of 1000, and you could have, say, a microthread for each Scratcher and use channels to communicate between them. Or something. It's fun, anyway!
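A toy sketch of the tasklet/channel idea (this requires Stackless Python, and everything here is illustrative):

import stackless

def worker(name, results):
    stackless.schedule()  # co-operatively yield to the other tasklets
    results.send(name + " crunched")  # blocks until the collector receives

def collector(results, n):
    for _ in range(n):
        print results.receive()

names = ["Hardmath123", "blob8108", "Paddle2See"]
channel = stackless.channel()
for name in names:
    stackless.tasklet(worker)(name, channel)
stackless.tasklet(collector)(channel, len(names))
stackless.run()  # run tasklets until they all finish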
And there are ways of making Twisted and Stackless play nice, too.
Anyway, have a look and see what you think.
Last edited by blob8108 (2012-09-09 08:37:05)
blob8108 wrote:
If you're using Python, definitely have a look at using Twisted, the multiprocessing module, or Stackless Python. [...] Stackless is really cool — it lets you run hundreds/thousands of lightweight micro-threads, called "tasklets". [...] They're also pickleable, so I think you can pause a running tasklet (i.e. a Python function), save it to a string/file, send it across the network, and then resume it on a different machine! [...]
That sounds cool. If there's one thing I never truly understood, it's how to make threading work right (JS is miserable with setTimeout). Pickleable threads sound plain awesome.
I got the API networking working wonderfully with urllib (that's the right way to go, right?), and I can at this point get all the friends of a user, which was my intention all along. I'll most probably assign intervals of 100,000 or so to each volunteer. Maybe if I'm lucky, I can get Paddle to run it on MIT's servers and get it all at once as part of a nightly cron job.
Do you think you could run some of the computation for me? Just leave it on overnight and post what you get in the morning.
Hardmath123 wrote:
I got the API-networking-thing working wonderfully with urllib (that's the right way to go, right?)
httplib2 is very good: it has various wonderful things like proper cookies and caching, as well as a nice interface. But if what you have works, then that sounds fine.
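For example (the cache directory name is arbitrary):

import httplib2

h = httplib2.Http(".cache")  # responses get cached on disk here
response, content = h.request("http://scratch.mit.edu/api/getfriendsbyusername/Hardmath123")
print content.strip()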
However, if you're going to be doing other networking-y stuff in the same script, like communicating with a central server, and those things need to happen simultaneously at all -- then definitely rewrite it in Twisted. It's only the HTTP request code that you'd need to change, and I can send you some boilerplate code for receiving the body content.
Hardmath123 wrote:
Maybe if I'm lucky, I can get Paddle to run it on MIT's servers and get it all at once as part of a nightly cron job.
Heehee
Hardmath123 wrote:
Do you think you could run some of the computation for me? Just leave it on overnight and post what you get in the morning.
Probably -- assuming, of course, that shared computing is right for your problem, which I'm still not convinced of. It'd still be pretty cool, either way.
blob8108 wrote:
which I'm still not convinced of
...now you tell me...
What do you think is the answer to my problem?
Here's what I have:
import urllib

def getUser(id):
    return urllib.urlopen("http://scratch.mit.edu/api/getusernamebyid/" + id).read()[3:]

def getFriends(user):
    return map(getUser, urllib.urlopen("http://scratch.mit.edu/api/getfriendsbyusername/" + user).read()[3:].split(":"))

getUser("1")
getFriends("Hardmath123")
Hardmath123 wrote:
Here's what I have:
Just being picky, but I find this more readable:
import urllib

def get_user_by_id(id):
    url = "http://scratch.mit.edu/api/getusernamebyid/" + str(id)
    return urllib.urlopen(url).read().strip()

def get_friends(user):
    url = "http://scratch.mit.edu/api/getfriendsbyusername/" + user
    friends = urllib.urlopen(url).read().strip().split(":")
    return map(get_user_by_id, friends)

get_user_by_id(1)
get_friends("Hardmath123")
You can probably just use strip() to get rid of the whitespace, since spaces aren't valid in usernames or ids.
It does take a worryingly long time -- is there an API method to get friends by user id (or friends as usernames)? At least you only have to look up the username once for each user id...
Hardmath123 wrote:
blob8108 wrote:
which I'm still not convinced of
...now you tell me...
What do you think is the answer to my problem?
Well, as I said:
blob8108 wrote:
it depends on the algorithms you use and how they work — but I'm not sure it'll be very easy to parallelise.
I think you need to figure out what algorithm you're going to use to find the length of the path (friendship chain?) between each pair of Scratchers.
Collecting the data is a separate issue: are you more worried about the load on the server, or on your computer? Is it even worth parallelising?
I'm going to do all the analysis locally; it's just the collection I want to distribute. To be honest, I'm worried about both the server and my computer: it's a tremendous computation. I'll look up the APIs for better friend fetching; you're right about it taking worryingly long.
A slightly less ambitious goal might be to graph the sub-network of every friend within 3-5 iterations, and display it in some way, in order to get an insight into how connected certain micro-communities on Scratch are.
For example: (me) DarthPickley, DrSuper, dapontes, mathjp, s_federici, ffred, & icampeao.
The chart of friend relationships ([row] has [col] on their friends list) is:

      dar  DrS  dap  mjp  s_f  ffr  icam
dar    x    x    x    x    x    x    x
DrS    x    -    x    x    x    x    -
dap    x    x    -    x    x    x    x
mjp    -    x    x    -    x    x    -
s_f    x    x    x    -    -    x    -
ffr    x    x    x    x    x    -    x
icam   x    x    -    x    x    x    -
As you can see, it is very closely connected.
Also, take into account the range of the number of friends people have. You could assume about 75 friends per user, and that these are random selections from some pool of active, moderately well-known Scratchers (Paddle2See has almost 1,300 friends, so let's say a pool of 1,300 users), then try those assumptions and see what you get as an answer...
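A quick simulation under exactly those assumptions (75 friends drawn at random from a pool of 1,300; all the numbers are just the guesses above) shows how often two such users would share at least one friend:

import random

POOL, FRIENDS = 1300, 75  # assumed pool size and friends per user

def shared_friend_fraction(trials=2000):
    hits = 0
    for _ in range(trials):
        a = set(random.sample(range(POOL), FRIENDS))
        b = set(random.sample(range(POOL), FRIENDS))
        if a & b:  # at least one friend in common
            hits += 1
    return float(hits) / trials

print shared_friend_fraction()  # comes out very close to 1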
Finally, I think it would be interesting to do a regression on users to see if there is any association between # of friends and # of projects.
Last edited by DarthPickley (2012-09-09 17:12:43)
Ask if you can download the friend database. Then you don't have to call their servers.
I was thinking about this months ago, but I also had the server load worry.
scimonster wrote:
Ask if you can download the friend database. Then you don't have to call their servers.
I was thinking about this months ago, but i also had the server load worry.
... how did I not think of that?!
Last edited by Hardmath123 (2012-09-10 12:19:14)
Hardmath123 wrote:
scimonster wrote:
Ask if you can download the friend database. Then you don't have to call their servers.
I was thinking about this months ago, but I also had the server load worry.
... how did I not think of that?!
Well, I doubt they'd accept.
What would really be nice is a PHP script running on the Scratch servers that queried the MySQL database directly, without needing to go through the HTTP server.
Of course, that's nearly as impossible as them giving you the friends database. Which I frankly don't find very logical, since apparently they'd rather you risk slowing down their servers...