IR 9.0 Official tag: ir9

October 6th, 2008

The Internet Research 9.0 conference begins in Copenhagen next week. For those attending (or interested in following from afar), please use the tag “ir9” everywhere tagging is allowed: Flickr, Del.icio.us, your blog, walls, people, etc.



Search Engine Society on Amazon

October 4th, 2008

Via Twitter:

cshirky: Twitter office pool: How many books will come out in 2008 with ‘Google’ in the title? Round your answer to nearest dozen…

Well, not mine :). It’s on Amazon, now, with book preview, which means it must be real. However, the release date listed by Amazon is December 31. If you were planning on lining up late, don’t worry, I think that means that it will come out the midnight before New Year’s Eve, so you don’t have to cancel any New Year’s Eve plans.

Drat, I thought it would be out sooner. It’s certainly not obsolete (yet), but the facts on the ground (as Rumsfeld might say) are moving quickly. I think it may come out earlier in the UK (?).



Redirect Loop in Wordpress

October 4th, 2008

In the “for what it’s worth” department…

I’ve been periodically getting a “Redirect Loop” error in Wordpress. I wasn’t sure what the source was. I tried the things you are supposed to do (deleting cookies, especially), and no luck. I went in via FTP and temporarily swapped all my plugins out of the plugins directory. This let me log in, which suggests that one of the many plugins is the culprit. No, I don’t know which one. I’ll try adding them in again one-by-one to track it down.

For now, if you are having a frustrating Redirect Loop problem, look to the plugins. That is all.



[the making of, pt. 6] Are you experienced?

October 2nd, 2008

This is the sixth in a series of posts about the piece of research I am doing on Digg. You can read it from the beginning if you are interested. In the last section I showed a correlation between how much of a response people got from their comments and their propensity to contribute future comments to the community. In this section, I question whether we can observe some form of “learning” or “training” over time among Digg commenters. Do they figure out how to garner more Diggs, either by learning on an individual basis, or by attrition?

Are later comments better Dugg?

You will recall that we have numbered the comments for each user in our sample from the earliest to the most recent. If people are learning to become more acceptable to the community, we should see a significant difference in responses (Diggs and replies) between people’s first posts and their 25th posts.

Loading all the data into R, I find a fairly strong correlation between post number and upward diggs (.28), as well as with downward diggs (.11), and replies (.08). I’d like to show this as a boxplot, so you can clearly see the growing abilities of users, but R is giving me problems. The issue is simple enough: although I can “turn off” the plotting of outliers outside of the boxes, it still makes the size of the chart based on these. Since one of the comments received over 1,500 diggs up, it means my boxes (which have median values in the twos and threes) are sitting at the bottom of the graph as little lines. After a little digging in the help file, I figure out how to assign limits to the y axis (with a ylim=c(0,10)), and I generate the figure seen to the right.

But this raises the question of what creates this increase. Like failing public high schools, some of the rise in positive marks might just be because the less capable Diggers are dropping out. We need to figure out if this is messing with our results.

Dropping out the Dropouts

In order to filter out the dropouts, I turn to… nope, not Python this time. I could, but it’s just as easy to sort all the comments in Excel by order, so that all the 30th comments are in one place on the worksheet. I then copy and paste these 812 usernames to a new sheet. In the last column of the main sheet, I write up a function that says, if the username is on that list, and if this comment falls within the user’s first 30, print a 1 in this column; otherwise, leave the cell blank. If you are curious what that function looks like precisely, it’s this:

=IF(I178345<30,IF(ISNA(VLOOKUP(D178345,have30!$A$1:$B$812,1,FALSE)),"",1),"")

I can now sort the list by this new column, and I have all the first 30 comments, by users who have made at least 30 comments, in one place. I pull these into R and rerun the correlations. It turns out that–no surprise–they are reduced. The correlations to buries and responses are near zero, and to diggs are at 0.19.
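For anyone who would rather do this step in a script than in a spreadsheet, here is a minimal Python sketch of the same filter. It is not the Excel approach described above, just an equivalent under stated assumptions: comments are (username, order) pairs, the order number starts at 1, and the cutoff is inclusive. The function name and the sample data are hypothetical.

```python
def first_n_by_regulars(comments, n=30):
    """Keep comments numbered 1..n, but only for users who reached comment n.

    comments: list of (username, order) pairs, with order assumed to be 1-based.
    """
    # Anyone who has a comment numbered n has posted at least n comments.
    regulars = {user for user, order in comments if order == n}
    return [
        (user, order)
        for user, order in comments
        if user in regulars and order <= n
    ]


# Tiny made-up example with n=3 so it is easy to follow by eye.
comments = [
    ("alice", 1), ("alice", 2), ("alice", 3), ("alice", 4),
    ("bob", 1), ("bob", 2),
]
kept = first_n_by_regulars(comments, n=3)
```

In the toy data, bob never reaches a third comment, so all of his comments are dropped, and alice's fourth comment is dropped because it falls past the cutoff.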

I’m actually pretty happy with a 0.19 correlation. It means that there is something going on. But I’m curious as to what reviewers will think. The idea of a strong correlation is a bit arbitrary: it depends on what you are doing. If I designed a program that, over a six month period, correlated at -0.19 with body weight, or crime rates, or whatever, it would be really important. The open question is whether there are other stable factors that can explain this, or if the rest of the variability is due to, say, the fact that humans are not ants and tend to do unpredictable stuff now and again. Obviously, this cries out for some form of factor analysis, but I’m not sure how many of the other factors are measurable, or what they might be.

Hidden in these numbers, I suspected, were trolls: experienced users who were seeking out the dark side, learning to be more and more execrable during their first 30 comments. I wanted to get at the average scores of these folks, so I used the “subtotal” function in Excel (which can give you “subaverages” as well), and did some copying, pasting, and sorting to be able to identify the extreme ends. The average average was a score of about 3. The most “successful” poster managed to get an average score of over 33. She had started out with a bit of a bumpy ride. In fact, the first 24 posts had an average score of less than zero. But she cracked the code by the 25th comment, and received scores consistently in the hundreds for the last five of this chunk of data.

On the other end was someone who had an average score of -11. Among the first thirty entries, only one rose above zero, and the rest got progressively worse ratings, employing a litany of racist and sexist slurs, along with attacks on other sacred cows on Digg. It may have been she was just after the negative attention, and not paying any mind to the quantification of that in the form of a Digg score, but it’s clear that the objective was not to fit in.
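The per-user averaging done with Excel subtotals could also be sketched in Python. This is an illustration only: it assumes a comment's score is its up diggs minus its down diggs, and the tuple layout, function name, and sample values are all made up.

```python
from collections import defaultdict
from statistics import mean


def average_scores(comments):
    """Return {username: mean score} for (username, diggs_up, diggs_down) rows.

    Score is assumed to be diggs_up - diggs_down.
    """
    scores = defaultdict(list)
    for user, up, down in comments:
        scores[user].append(up - down)
    return {user: mean(values) for user, values in scores.items()}


# Made-up rows, not data from the study.
sample = [
    ("alice", 5, 1), ("alice", 40, 0),
    ("bob", 0, 6), ("bob", 1, 8),
    ("carol", 3, 0),
]
averages = average_scores(sample)

# Sort by average so the extremes sit at either end of the list.
ranked = sorted(averages.items(), key=lambda item: item[1])
lowest, highest = ranked[0], ranked[-1]
```

Sorting the averages puts the most heavily buried user at the front and the best-received user at the back, which is the same "extreme ends" view the copying, pasting, and sorting produced.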

Enough with the numbers!

I wanted to balance out the questions of timing and learning with at least an initial look at content. I always like to use mixed methods, even though it tends to make things harder to publish. At some point I really need to learn the lesson of Least Publishable Units, and split my work into multiple papers, but I’m not disciplined enough to do that yet. So, in the next sections I take on the question of what kinds of content seem to affect ratings.



[the making of, pt. 5] Rat in a cage

September 29th, 2008

[This is the fifth in a series of posts about a piece of research I am doing on Digg. If you like, you can start at the beginning. At this point, we have the data, and are manipulating it in various ways to try to show some relationships.]

Extracting some variables

One of the things we want to see is whether diggs (either up or down) affect the amount of time it takes to comment again. We can also look at the case of the last comment: does a bad rating make it more likely to quit? The two of these are related.

We have a hint from an earlier study uncovered in the lit review. Lampe & Johnson suggest that on Slashdot, any rating (down or up) was likely to encourage participation, at least for newcomers. Persistent bad ratings seemed to drum people out. So, we want to see whether there is a relationship between the ratings for a comment, and the latency before the next comment is posted.

We have some of the variables we need. We have the number of up and down Diggs. Although we don’t have the total rating, that’s easy enough to derive–heck we can do it in our stats program or Excel if we want to. We also have replies, and because we have it, we’re going to look at it. If it turns out there is a relationship worth reporting there, we can go back and include it in the report. Wasn’t really part of the initial plan, but it’s there, and a relationship seems plausible, so we should check it.

The main thing that isn’t there is a clear amount of time between the post under consideration and the subsequent post. As the last section suggested, we have a lot of people who posted nothing, or only had a single post, but for those that had multiple posts, we need to (surprise!) write a script that will run through and figure out latencies.

This is actually the most complex script so far. It needs to find all the comments posted by a given user, then put them in chronological order. It then needs to find the difference in time between each pair of comments. In each case, Digg provides the time of the comment in Unix format–that is, seconds since January 1, 1970. So, we can generate the difference in seconds. Obviously, if we don’t have a post following a given post, we can’t find such a latency. In that case, we fill that slot with a -1 to indicate that this was a “final comment” for the user. That may mean the user has quit, or simply that she hasn’t posted a comment again before our period of collection.

Also, for reasons that will become clear later on, we store an indicator of the order of the comments. It will make it easier to find the first, fifth, or twelfth post when we want to later on.
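The script itself isn't shown here, so what follows is only a minimal sketch of the steps just described: group comments by user, sort by Unix timestamp, take the difference to the next comment, and mark the final comment with -1. The function name, the dictionary keys, and the choice to number comments from 1 are assumptions.

```python
from collections import defaultdict


def add_latencies(comments):
    """Compute per-comment order and latency to the same user's next comment.

    comments: iterable of (username, unix_time) pairs.
    Returns a list of dicts; latency is in seconds, or -1 for a user's
    final comment in the collection period.
    """
    by_user = defaultdict(list)
    for user, posted_at in comments:
        by_user[user].append(posted_at)

    rows = []
    for user, times in by_user.items():
        times.sort()  # chronological order
        for i, posted_at in enumerate(times):
            is_last = i == len(times) - 1
            latency = -1 if is_last else times[i + 1] - posted_at
            rows.append({
                "user": user,
                "time": posted_at,
                "order": i + 1,  # 1 = earliest comment (assumed numbering)
                "latency": latency,
            })
    return rows


# Made-up timestamps, deliberately out of order.
sample = [("alice", 1000), ("bob", 500), ("alice", 400), ("alice", 1600)]
rows = add_latencies(sample)
```

Keeping the order number alongside the latency means later steps can pull out, say, every user's fifth comment with a simple filter rather than re-sorting.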

Is there a correlation worth looking at?

Our first step here is to look to see if there is an obvious correlation in a scatterplot of some of the variables. Why bother with a scatterplot? It would be convenient to make use of Pearson’s r to see whether there is a significant correlation, but it assumes a normal distribution of the variables. It was pretty obvious from the outset that the data were not normally distributed (ranking posts, etc.), and so I knew I would be using non-parametric tests (MWW and Spearman’s ρ), but it’s helpful to get a handle on the data.
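To make the difference between the two statistics concrete, here is a small pure-Python sketch of Spearman's ρ, computed as Pearson's r on ranks, with tied values sharing their average rank. The analysis itself was run in R; the function names and the toy numbers are assumptions for illustration only.

```python
def average_ranks(values):
    """Rank values from 1, giving tied values the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of 1-based positions in the run
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Plain Pearson product-moment correlation."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman's rho: Pearson's r applied to the ranks of each variable."""
    return pearson(average_ranks(xs), average_ranks(ys))


# One huge value, like a single comment with a very large digg count.
post_number = [1, 2, 3, 4]
diggs = [10, 20, 30, 1000]
```

With the toy data, the relationship is perfectly monotonic, so the rank-based measure treats it as a perfect correlation, while Pearson's r is pulled down by the one extreme value. That sensitivity to outliers is part of why a rank-based test suits heavily skewed data like digg counts.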

I didn’t want to look at all the cases: some of the posts were mere seconds after one another (something strange there), so I tossed it into Excel, sorted by the latency, and chopped off anything shorter than 5 minutes. From there, I could just copy and paste items into R as I needed to figure out whether there were some relationships.

I was disappointed to find that the correlations between latency and diggs were fairly weak. It turns out that if you get more diggs, you may return to post a little bit faster, but not much. When you look only at a comparison of posts that had some diggs (including being dugg down!) with those that had none, there is a fairly significant gap. The standard deviation is also extremely high, but with over a hundred thousand cases, we can still say with confidence that there is a difference in averages.

I also took a look at the comments that were “mid stream” in a user’s Digg career, as compared to those that had no following comment. Now, the trailing comment might just mean that (like me!) they have taken a break from Digg for a while–not that they have quit. But they also include those who posted once or twice and gave up. Here, the differences were even more stark: any sort of feedback increases the likelihood of people coming back.

Note that I’ve just committed the cardinal sin of correlation, above, and it’s easy to commit. It may be that it isn’t that lack of feedback causes attrition, but rather that those who aren’t very into Digg don’t produce content that gets a lot of feedback. In either case, we can say with some certainty that low Diggs tend to go along with less frequent participation by the commenter in the future.

Coming up

In the next segment, I try to see whether experience actually plays a role in how many Diggs you get.



Three QU students arrested. Chronicle?

September 27th, 2008

Three undergrads have been arrested with a drug stash in their dorm. Given the trouble our campus has had with drinking, you’d think they might actually encourage something a bit less corrosive. (I’m only half kidding–security turned a fairly blind eye to marijuana use by students at some of the west coast universities I know.)

So, the independent paper that the administration has declared part of the Axis of Evil has a story on the bust. Still waiting on the administration-backed paper, The Chronicle. I hope the delay in publishing is simply because they are using their direct access to the campus to do some hard-hitting investigative reporting.



[the making of, pt. 4] Basic descriptions of the sample

September 26th, 2008

This is the fourth in a series of posts about a paper I am writing, breaking down the process old-school. It started here. So, in part 3, I talked about how I got the sample of the users (and waved my hands a bit about the sample of the comments). Now, I want to tell my audience (and know myself) the basic structure of the sample.

Counting it up

I can say, for example, how many I collected (30,000), and what the oldest of these accounts is (December of 2004) and what the newest accounts are (cut off at May of 2008). I also want to say something about how many of these post comments, and how many comments I have.

The latter is pretty easy: I dump my comma-delimited file of comments into a plain text editor. I use Notepad2 for this sort of thing, because it has line numbering (making my job easier), and doesn’t–like the original Notepad–crash a Windows system when you try to open very large files. In total, 197,658 comments.

Distribution

So, on average, that’s a lot of comments. But we know that it’s unlikely many people post the “average” number of comments, or even that it is distributed normally around that average. Far more likely is that you have a large number of people who post never or infrequently, and a handful of enthusiasts posting every two minutes or so. What we need to do is count up the comments by user.

So we turn to Python again, and write up a quick script that goes through the 197K comments and counts up how many each user makes. In practice, the program doesn’t find all 30,000 users in our sample, because 23,532 have not posted a comment. The result is a comma delimited file with the user name and number of comments. Now we can construct a histogram.
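The quick script isn't reproduced here, but a minimal sketch of that kind of per-user count might look like the following. It assumes the comment file has a header row with a column named `user`; that column name, the function names, and the sample rows are hypothetical.

```python
import csv
import io
from collections import Counter


def count_comments(csv_file, user_column="user"):
    """Count how many comment rows each user has in a comma-delimited file."""
    reader = csv.DictReader(csv_file)
    return Counter(row[user_column] for row in reader)


def write_counts(counts, out_file):
    """Write 'username,count' lines, most prolific commenters first."""
    writer = csv.writer(out_file)
    for user, n in counts.most_common():
        writer.writerow([user, n])


# An in-memory stand-in for the real comment file.
sample_csv = io.StringIO("user,diggs_up\nalice,3\nbob,0\nalice,12\n")
counts = count_comments(sample_csv)

out = io.StringIO()
write_counts(counts, out)
```

As in the real run, users who never commented simply don't appear in the output, since the script only ever sees rows for comments that exist.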

Histogram

I am a big fan of Excel, and we could use it to create the histogram, but I always seem to spend about 15 minutes figuring out how to do histograms in Excel, relearning it each time. The obvious choice is SPSS, but for a change of pace, I’m going to use a free piece of mathematics software called R.

The reason is simple enough. A quick run through a regular histogram shows that this is a heavily “powered” Pareto distribution. When I plot it as a regular histogram, it comes out as two lines along the axes, and a tiny curve at the origin. One person actually made 6,598 comments, and I had to check the site to make sure there hadn’t been an error. Another posted over 4,000 comments.

So, what we need is a log-log histogram. Although I’m sure there is a function that will do this for me neatly (and I have to admit ignorance when it comes to doing this in SPSS, but I suspect it’s just a matter of checking a box), I’m once again going to turn to Python to write a script that comes up with frequencies (i.e., how many people posted once, twice, … n times). I could “bin” these frequencies and come up with something that looks like a regular histogram, but since folks are not as used to seeing log-log bar charts, I decided to do it without the bins. The resulting file is just a number on each line, starting with the number of people with one comment, the next line is the number of users with 2 comments, and so on. I drop this file into Notepad2 to take a look, and (CTRL-C) copy all the data.
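A minimal sketch of such a frequency script, assuming the input is simply a list of comment counts, one per user. The function name and the sample list are made up for illustration.

```python
from collections import Counter


def frequency_table(comments_per_user):
    """Return a list where position n-1 holds the number of users with n comments."""
    tally = Counter(comments_per_user)
    if not tally:
        return []
    longest = max(tally)
    # Fill gaps with 0 so that line n of the output always means "n comments".
    return [tally.get(n, 0) for n in range(1, longest + 1)]


# Three users with one comment, two with two, one with five.
per_user = [1, 1, 1, 2, 2, 5]
freqs = frequency_table(per_user)

# One number per line, ready to paste elsewhere.
lines = "\n".join(str(n) for n in freqs)
```

Comment counts that nobody hit come out as 0 so the line positions stay aligned. Zero has no logarithm, so those points cannot be drawn on a log axis.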

I open up R, and first execute this command:

x <- type.convert(readClipboard())

This loads all of the data I just copied into a “vector” called x. If you are unfamiliar with the format of R commands, note that the <- is an assignment symbol: it says put the stuff on the right into the box on the left. The readClipboard function–shockingly–reads whatever is on the Windows clipboard. The type.convert function converts strings into integers, since the clipboard just assumes whatever you are copying is a string (or character) rather than a number. Now we have all this stuff in the vector x.

Next, I issue the following command:

plot(x, log="xy", xlab="log(number of comments)", ylab="log(number of users)")

which produces the plot shown to the right. It should be pretty clear what each of the options there does, creating a log-log plot of the vector x, with labels for each axis.

Next: Hypothesis testing!

Now we have some basic descriptions of the data, enough to give the reader a feel for what we are working with. Time to rearrange the data a few more times and take measurements that will help us answer questions about the relationship of feedback scores to posting behavior, in part 5.