CemWEB Research Project

Recent Entries

6/23/05 05:23 pm - Top 100 users by LiveRank

Data current as of December 2004

Look up your own results!

Name  Friends count  LiveRank  Overall Ranking  Broadcaster-free Ranking
[info]quizdiva 724 102 1 1
[info]hipstomp 113 80 2 2
[info]status 0 69 3 none
[info]kim_jong_il__ 1 65 4 3
[info]imjinnie 736 64 5 4
[info]doctor_livsy 69 49 6 5
[info]teh_indy 0 47 7 none
[info]mistersleepless 105 45 8 6
[info]thegraybook 202 44 9 7
[info]k_richardson 0 43 10 8
[info]dimkin 302 42 11 9
[info]rcr 1 42 12 10
[info]hyujin 0 40 13 11
[info]patiencekills 601 39 14 12
[info]mcrjournal 0 37 17 none
[info]omg_iconz_ 1 37 16 14
[info]throwingstardna 305 37 15 13
[info]worthlessunited 1 36 18 15
[info]studio3dom 0 35 19 16
[info]docbrite 50 33 20 17
[info]_hdcomic 0 33 22 none
[info]cassieclaire 24 33 21 18
[info]jessichrissy 1 32 24 20
[info]pottersues 22 32 23 19
[info]theferrett 248 30 27 23
[info]dolboeb 743 30 25 21
[info]cleolinda 634 30 29 25
[info]ficbitches 2 30 26 22
[info]sam 1 30 28 24
[info]fredryk 74 27 30 26
[info]avva 535 26 31 27
[info]with_gusto 674 26 34 30
[info]jwz 129 26 33 29
[info]vadimus 0 26 32 28
[info]ladyjaida 750 24 35 31
[info]seviet 130 23 38 34
[info]ginmar 268 23 37 33
[info]theformat 65 23 36 32
[info]drugoi 602 22 40 36
[info]chingizid 159 22 39 35
[info]sexwax 1 22 41 37
[info]cmart 343 21 47 43
[info]andrewkendall 682 21 45 41
[info]5signs 1 21 44 40
[info]8mm 1 21 46 42
[info]ds_flashback 1 21 43 39
[info]the_bitchcave 3 21 42 38
[info]kompressorpower 713 20 55 51
[info]fif 2219 20 49 45
[info]zloebu4ka 601 20 48 44
[info]akuaku 728 20 57 53
[info]kostia_inochkin 728 20 56 52
[info]brad 180 20 54 50
[info]slg_news 26 20 53 49
[info]bookshop 492 20 52 48
[info]prehistoric 151 20 51 47
[info]yukipon 730 20 50 46
[info]mozgovaya 720 19 60 56
[info]polumrak 76 19 59 55
[info]katechkina 243 19 58 54
[info]nedorazumenie 471 19 68 64
[info]neivid 298 19 67 63
[info]copperbadge 72 19 66 62
[info]ishotversace 0 19 65 61
[info]zoe_trope 189 19 64 60
[info]opportunitygrrl 741 19 63 59
[info]hardartist 2 19 62 58
[info]riksowden 0 19 61 57
[info]josienutter 724 18 78 74
[info]robont 744 18 75 71
[info]horsepucky 345 18 77 73
[info]spiritrover 737 18 76 72
[info]hypnox 664 18 74 70
[info]switchknife 347 18 73 69
[info]kore 710 18 71 67
[info]ana 637 18 70 66
[info]icemaiden 0 18 72 68
[info]canticle 0 18 69 65
[info]mandelion 1 17 86 81
[info]murdershack 340 17 85 80
[info]tsl_colourbars 726 17 83 79
[info]nl 55 17 81 77
[info]krylov 681 17 80 76
[info]romochka 13 17 89 84
[info]sarahtales 134 17 87 82
[info]clixnwhistles 0 17 84 none
[info]cmpunk 7 17 82 78
[info]hitlerhitler 0 17 79 75
[info]hawthornehts 17 17 88 83
[info]ljmatch 0 16 96 none
[info]goblin_gaga 699 16 95 90
[info]maccolit 269 16 94 89
[info]olshansky 741 16 93 88
[info]apazhe 345 16 92 87
[info]p0grebizhskaya 744 16 90 85
[info]anniesj 287 16 100 94
[info]infinite_icons 722 16 99 93
[info]imomus 724 16 98 92
[info]muskrat_john 99 16 97 91
[info]cmpriest 604 16 91 86

5/4/05 12:55 pm - Cemcom Blog RSS Feed on LiveJournal

My research group's new blog is now syndicated on LiveJournal. Add [info]cemcom_rss to your friends list if you want to get your friends page flooded with cool articles about the intersection of technology and culture, as well as relevant CFPs and maybe (every now and then) original work from one of our members (including myself). The blog is fairly active, averaging about one post a day, but most entries are short. Also feel free to comment on the blog itself!

3/16/05 09:58 pm - Just for Fun

Using [info]researcher2's friends graph mentioned in this post on [info]lj_research, [info]jofish22 and I have come up with some useless statistics about LiveJournal usernames.

[info]jofish22 enumerated the most common two characters to start your LJ name with here. I made a list of the frequency with which various numbers appear in LJ usernames here. Both of these are posts in [info]lj_research.

We actually hope to have useful research to show soon too.

2/1/05 12:42 am - Workshop Paper Accepted

A paper using this project as a basis has been accepted to the Beyond Threaded Conversation workshop @ CHI 2005.

It is available here: Implicit Links in Asynchronous Communication Spaces (Medynskiy, 2005).

Abstract
Online asynchronous communication spaces are rich in implicit relationships that are constructed through the collective activity of participants in these spaces. Mapping such relationships to edges between actor-nodes in the spaces often results in a graph structure with great potential to inform the design of and for these spaces. In this paper I present examples of implicit relationships in USENET, the blogosphere, and the LiveJournal community space. Further, I discuss design implications of the visualization and analysis of graph structures resulting from such links.

12/20/04 10:10 pm - Results!

This post will summarize my research results for the Fall 2004 semester.

After getting data on approximately 11,000 LiveJournal communities (approximately 1/20th of all communities), I was hoping to compare two representations/visualizations of the LiveJournal community-space. One representation is based on 'member of' links between communities (explicit links), and the other is based on members that communities share in common (implicit links). As per my last post, however, it turns out that the crawling code I was using was keeping an incomplete set of 'member of' relationships between communities. Thus, the only correct visualizations I have are based on common members between two given communities. Nonetheless, I think the visualizations are rather interesting and I'll spend some time describing them. In the next post, I will talk about future directions.

First, I will outline the type of data collected:
Using the ljspider.pl script (described here), I gathered data on approximately 11,000 LiveJournal communities. Among the data I collected was a collection of membership lists of all crawled communities -- this is the only data set relevant to this post. It is very important to note that LiveJournal does not provide membership lists for communities with more than 500 members, which constituted 11.8% of all communities I crawled. For the purposes of the following analysis, these communities were dropped from the dataset. To refine the available data, I used the activecomm.pl script (described here) to highlight communities that seem to be active at least monthly (n = 30 in activecomm.pl) and only included those in my final data set. Finally, I computed the number of shared members between every pair of communities in the data set and logged all pairs of communities sharing at least 5 members. The Java code to do this final step will become available on this blog shortly.

The collected data can be thought of as representing a graph: every community is a vertex and the number of members two communities have in common is the weight of the undirected edge between the two nodes representing those communities.
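The pair-counting step can be sketched as follows. This is a minimal Python illustration, not the Java code mentioned above; the function name `shared_member_edges` and the dict-of-lists input format are assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations


def shared_member_edges(memberships, min_shared=5):
    """Return {(comm_a, comm_b): shared_count} for pairs sharing >= min_shared members.

    memberships maps a community name to an iterable of its member names
    (a hypothetical in-memory form of the crawled membership lists).
    """
    # Invert the mapping: member -> set of communities the member belongs to.
    communities_of = defaultdict(set)
    for community, members in memberships.items():
        for member in set(members):
            communities_of[member].add(community)

    # Each member adds one shared member to every pair of their communities.
    counts = defaultdict(int)
    for communities in communities_of.values():
        for a, b in combinations(sorted(communities), 2):
            counts[(a, b)] += 1

    # Keep only the pairs that meet the threshold; these are the weighted edges.
    return {pair: n for pair, n in counts.items() if n >= min_shared}
```

Going through members rather than through all pairs of communities avoids comparing the many pairs that share nobody, which matters with thousands of communities.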

I've been using Graphviz, a free open-source graph visualizer from AT&T Research, to visualize my data. Other visualization tools I looked at, like Walrus or Pajek, were not able to produce acceptable results -- Walrus cannot deal with graphs that are not trees, and I could not get Pajek to arrange the nodes of the graph in any sort of logical order. However, Graphviz has a hard time working with large graphs, so in order to be able to use it for visualization I needed to constrain my original data set in the following ways:

  • Communities with fewer than 100 members were dropped

  • Communities that were not active at least monthly, as judged by their last five posts, were dropped (see post about activecomm.pl here)

  • Edges representing fewer than 50 common members between two communities were dropped


Any edges pointing to a dropped community vertex, and any community vertex without non-dropped outgoing/incoming edges, were also dropped. I created a DOT file representing the remaining community vertices and edges and visualized it with both the dot and neato utilities in the Graphviz package. The dataset included a lot of 2-, 3-, and 4-node subgraphs that I did not think were worth the time to analyze, so I removed any subgraph with fewer than 5 community nodes.
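The pruning steps above can be sketched roughly as follows. This is an illustrative Python sketch, not the code actually used; the names `filter_graph` and `to_dot`, the input dictionaries, and the use of the Graphviz `weight` edge attribute to carry the shared-member count are all assumptions.

```python
from collections import defaultdict


def filter_graph(sizes, active, edges,
                 min_members=100, min_shared=50, min_component=5):
    """sizes: {community: member_count}; active: set of active communities;
    edges: {(a, b): shared_member_count}. Returns (kept_nodes, kept_edges)."""
    # Drop small or inactive communities, then weak edges and edges touching
    # a dropped community.
    keep = {c for c, n in sizes.items() if n >= min_members and c in active}
    kept_edges = {(a, b): w for (a, b), w in edges.items()
                  if w >= min_shared and a in keep and b in keep}

    # Build adjacency lists. A community with no surviving edge never
    # appears here, so isolated vertices are dropped implicitly.
    adjacency = defaultdict(set)
    for a, b in kept_edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    # Find connected components with an iterative depth-first search and
    # keep only those with at least min_component communities.
    kept_nodes, seen = set(), set()
    for start in adjacency:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        if len(component) >= min_component:
            kept_nodes |= component

    kept_edges = {e: w for e, w in kept_edges.items() if e[0] in kept_nodes}
    return kept_nodes, kept_edges


def to_dot(nodes, edges):
    """Serialize the undirected graph in DOT syntax for dot or neato."""
    lines = ["graph communities {"]
    lines += [f'  "{n}";' for n in sorted(nodes)]
    lines += [f'  "{a}" -- "{b}" [weight={w}];'
              for (a, b), w in sorted(edges.items())]
    lines.append("}")
    return "\n".join(lines)
```

The output of `to_dot` could be written to a file and passed to neato; node size and color attributes are left out of the sketch.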

I reran neato to generate a visualization of this final dataset, and tightened the image up (removed a lot of whitespace) with Photoshop. A scaled-down version of the image is behind the lj-cut and the full-size image is available by clicking on it.

Figure 1: Visualization of LiveJournal community space using shared-members links

The visualization shows various communities-of-communities in the LiveJournal community space that are tightly connected by sharing at least 50 members between some of their member communities. The community vertices themselves have a size representative of the number of total members, and are colored based on activity levels. Note that the n in the activity metric is the number of days given to the activecomm.pl script. Coarsely, however, we may say that red communities receive posts on average once between every month and every week, yellow communities once between every week and every day, and blue communities are posted to daily or more frequently.

One of the strengths of the shared-members visualization lies in its ability to capture and make salient the strength of inter-community ties, and possible paths of information flow between communities (e.g. through cross-posting or linking). Let's examine a small portion of the full visualization more carefully.

Figure 2: Blow-up of a section of Figure 1 showing three communities-of-communities subgraphs

Note the five connected, yellow communities in the top-left corner of Figure 2, for instance. The four communities that form a clique in the graph are players in LiveJournal's fake hair user-community (fake hair is popular with individuals identifying themselves with the goth, cybergoth, and nightclub lifestyles). There is a 'market' community where users may sell or buy fake hair and accessories, a 'pix' community where users post pictures of themselves wearing fake hair, and two other fake hair general-interest communities. All of these communities are active at least on a weekly basis (n = 7) and share between 50 and 100 members in each pair, indicating strong inter-community ties and most likely pointing to a high rate of information flow through the communities (e.g. through cross-posting). In the same group of communities, the 'cyberwarez' community is least connected to the rest, hinting at a somewhat different set of interests than the rest of the communities in the graph. Indeed, I found the 'cyberwarez' community to be one where users could buy and sell club and party clothing for the goth/cybergoth/nightclub scene -- related to, but distinct from, the fake hair communities that made up the rest of the graph.

The other two subgraphs present in Figure 2 can be analyzed similarly. The large, mostly-blue subgraph represents a large, active, and somewhat tightly-knit community of buy/sell/exchange communities dedicated to used clothing, accessories, and similar goods. On the fringes of this group of communities are marketplace communities for a less well-defined range of goods ('marketplace', 'trade_stuff', 'subcultauctions') as well as a community specializing in hand-made goods ('__handmade'). The last subgraph seems to show an interesting link between the user-community of music producers and user-community of LiveJournal users interested in aspects of the music industry. The connection seems to be mediated through the link between the 'audioeng' and the 'musicbiz' communities.

Showing mediation between communities-of-communities with similar (or not-so similar) overarching interests seems to be a potent ability of the shared-member representation (see Figure 3).

Figure 3: Example of mediation in LiveJournal community-space

Note how the ‘europe_history’ and especially the very active ‘middle_ages’ community in Figure 3 seem to mediate the general-interest history community-group (‘askahistorian’, ‘history’, and ‘historystudents’) and the ancient Greco-Roman history community-group (‘roma_antiqua’ and ‘classics’). I fully expect that as the dataset becomes less constrained (e.g. the cut-off value of shared number of members at which edges are dropped is lowered), such mediatory relationships will grow in number and diversity.

In my next post, I will draw some final conclusions about this method of visualizing/representing the LiveJournal community-space, and discuss further paths of research.

12/19/04 09:14 pm - Serious bug in ljspider.pl

Rather unfortunately, it turns out that ljspider.pl has a serious problem with regard to harvesting 'member of' links between communities. Basically, once a community is seen, no further 'member of' links pointing to it will be logged. This explains why all of the 'member of' graphs I've been generating are trees (every node has a maximum of one edge leading into it).

The fixed code is here: ljspider.pl.
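To illustrate the kind of bug described in this entry, here is a small Python sketch. It is not taken from ljspider.pl; the function name, the `member_of` dictionary, and the `buggy` flag are hypothetical and exist only to contrast the two behaviours.

```python
from collections import deque


def log_member_of_edges(member_of, start, buggy=False):
    """Breadth-first crawl over 'member of' links, returning the logged edges.

    member_of maps a community to the list of communities it is a member of.
    """
    seen, queue, edges = {start}, deque([start]), []
    while queue:
        community = queue.popleft()
        for target in member_of.get(community, []):
            first_visit = target not in seen
            if first_visit:
                seen.add(target)
                queue.append(target)
            # Buggy behaviour: the edge is logged only the first time the
            # target is seen, so every node ends up with at most one incoming
            # edge. Correct behaviour: always log the edge, and use the
            # 'seen' set only to decide whether to queue the target.
            if first_visit or not buggy:
                edges.append((community, target))
    return edges
```

With `buggy=True` the logged graph has no node with more than one incoming edge, which matches the tree-shaped 'member of' graphs described above.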

12/13/04 11:54 pm - Another paper for Related Readings

While writing up and looking for references for my LIFE Undergraduate Research Report, I came across this short paper out of Microsoft Research: A Matter of Life or Death: Modeling Blog Mortality by Gina Venolia. The paper tries to fit a simple model of blog "life and death" to some raw data from LJ stats (http://www.livejournal.com/stats/stats.txt). More interestingly, however, the paper provides some nice graphs of LJ new-blog-per-day and new-post-per-day activity (including finding a relatively interesting weekly cycle), so instead of trying to come up with graphs of this data on my own, I'll just cite this from now on. Yay!

Added to Related Readings post too.

12/13/04 01:18 pm - UCHS Exemption

I have just received an email stating that this research has been reviewed by the Cornell UCHS Administrator and is exempt from the federal regulations for the protection of human subjects.

What a fast turn-around!

12/12/04 09:42 pm - Human Subjects Approval and Plan for Paper

I have finally submitted my UCHS (Human Subjects) initial approval form, and hopefully they'll be able to get back to me before the end of the semester. I feel this research easily deserves expedited approval, since I'm collecting only publicly available data and no attempt is made to extract personally-identifiable data.

I'm going to try and submit a paper about this research to this workshop at the CHI 2005 conference (being held this coming April in Portland, OR). The workshop title is Beyond Threaded Conversation, and so I'd have to bill this research as some sort of conversation-visualization or tracking system... Sounds quite doable, especially considering the last round of analysis and visualizations I've been doing.

Soon I'll be posting graphs and visualizations I made over the last few days. I think they're rather neat, if not too novel/exciting. For now, I'm concentrating on writing a report for the LIFE people (who gave me a research grant at the beginning of the summer) and something I can give to Dan/Phoebe so that they can see what happened with this project over the semester/give me a final grade.

12/4/04 09:58 pm - A note on ljspider.pl -- it's a *feature* not a bug (:

It turns out some communities that ljspider.pl thinks it has crawled successfully actually fail because of non-response of the LJ servers. This leaves an empty file as their cache, and thus these communities are easy to single out and rerun. You can put these communities into a new queue_community (saving the old one), delete their _info files (so that the script knows to hit n, run the script until all the unprocessed communities are taken care of, break, and then combine the old queue_community and the new one. Since seen_community remains, no duplicate communities should be listed (same for seen_user).

I'm not going to fix the ljspider.pl code right now, since the workaround is rather straightforward and the problem doesn't seem to happen very often. I may get around to fixing it at some point down the line, however.
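Singling out the failed communities could look something like the Python sketch below. It assumes, hypothetically, that the empty cache files are the ones ending in `_info` and that the part of the file name before that suffix is the community name; adjust the suffix if a different cache file is the one left empty.

```python
from pathlib import Path


def empty_cache_communities(cache_dir, suffix="_info"):
    """List communities whose cache file in cache_dir is zero bytes long."""
    names = []
    for path in sorted(Path(cache_dir).glob(f"*{suffix}")):
        if path.is_file() and path.stat().st_size == 0:
            # Strip the suffix to recover the community name.
            names.append(path.name[: -len(suffix)])
    return names
```

The returned names could then be written one per line into the new queue file described above.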

11/23/04 08:33 pm - Looking Forward (Side Projects)

Beyond my plans for the main portion of this research, which will hopefully be done by the end of the semester, I have a few small/side projects that I would like to at least think about, if not do, at some point. It seems that with the BOOM '05 Poster Session, and my technical writing requirement writings, I'll be continuing with this project through winter break and into the Spring semester.

These side projects are (in order of importance):
1. Analyze interesting features of communities that specify their location. For example, go through a community's member list, and see how many of the users actually list themselves in the same location. List users who are 'outcasts' of a community because they list a location drastically different from the majority of other members in the community. It seems that interesting critical design work could be done here.
2. As per this post, I was collecting community posting activity for a number of different communities a bit before, during, and for a couple of days after the Presidential Election. It would be interesting to visualize this data and see what kinds of increases in activity (if any) can be seen in political, semi-political, and apolitical communities.
3. Automatic friend-list cutters. This doesn't necessarily have much to do with the main ideas behind this research, but would make interesting critical designs. A script could algorithmically determine which users are worth keeping on your friends list, letting you distance yourself from people who might get upset if you manually cut them from your friends list. "Best friends" and similar scripts/memes have been circulating on LJ for a while, and it might be interesting to (a) invert their actions by actually cutting/dropping friends, and (b) examine the cultural implications such a script may have.

11/23/04 02:24 am - Looking Forward (Main Project)

The following is an outline of the major elements of this research project I would like to have completed before the end of this semester:

Current Status
I have about 11,000 LiveJournal communities crawled, with all Community Info pages cached locally. The information about these communities that is already parsed is: number of members in each community; membership lists for every community with fewer than 500 members; list of interests for every community; every community a given community has a 'member of' link with; location of community, if provided; whether the community is 'active' (see here).

To Do (updated)
1. Create a graph representation of the community network with communities as nodes and 'member of' links as directed edges. (Graph A)
2. Create a graph representation of the community network with communities as nodes and undirected edges between the nodes if two communities share members. Edges would be weighted by the number of overlapping members (possibly divided by the product of the sizes of the two communities). (Graph B)
3. Visualize Graphs A and B, analyze their structure, and compare various properties/attributes of the graphs, both qualitatively and quantitatively. For example it would be interesting to compare the denseness of the edge lists, comment on connectedness properties, and examine some other interesting, defining characteristics that may jump out. Graph B should exhibit some clustering with high-weighting edges from groups of users who join similar communities. An arguably more interesting feature to look for would be unexpected groups of clustered communities -- those that seem to share no obviously similar interests, but are tightly connected by groups of common users anyway.
4. Sort the edge list of Graph B by weight and see if the 10, 20, or 100 most user-sharing pairs of communities exhibit any interesting properties.
5. Use either a clustering or graph-cut algorithm to separate the graph into some number of logical chunks. From these it might be possible to extract 'representative' communities and interests, and in general see how the graph gets split as the number of cuts/clusters increases. (E.g. does it split by national/language boundaries? political ones? hierarchies of interests? etc.)
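The edge weighting in item 2 and the ranking in item 4 can be sketched as below. This is only an illustration in Python; the function names and the input dictionaries are assumptions, and dividing by the product of the community sizes is just the one normalization floated above.

```python
def normalized_weight(shared, size_a, size_b):
    """Shared-member count divided by the product of the two community sizes."""
    return shared / (size_a * size_b)


def top_edges(edges, sizes, k=10, normalize=False):
    """Return the k heaviest community pairs of Graph B, heaviest first.

    edges: {(a, b): shared_member_count}; sizes: {community: member_count}.
    """
    def weight(item):
        (a, b), shared = item
        if normalize:
            return normalized_weight(shared, sizes[a], sizes[b])
        return shared

    ranked = sorted(edges.items(), key=weight, reverse=True)
    return [pair for pair, _ in ranked[:k]]
```

The normalization can change the ranking substantially: two 10-member communities sharing all 10 members outrank two 1000-member communities sharing 50, even though the raw count is lower.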

11/21/04 08:46 pm - LIFE-Funded Computer Delivered!

The Dell Inspiron we bought with our LIFE funding was finally delivered to my desk at the HCI Group in the Information Science Building. Specs: P4 2.8 GHz, 1 GB RAM, 150 GB disk. The Computer Facilities staff have already installed Red Hat Enterprise Edition on it, making a single 100 MB boot partition, a 2 GB swap partition, and putting the rest into a single ext3 root partition. I'm tempted to install Gentoo on this machine, and make a couple of separate partitions. So far I haven't had problems with ext3, but I am putting thousands (and thousands) of little files on the disk any time I do an LJ crawl. It seems ReiserFS could handle these better, and be faster as well. I'm going to go track down a FS comparison of some sort, and if a Gentoo install seems like it would go fast, slap that onto the system.

Edit: From http://www.gentoo.org/doc/en/handbook/handbook-x86.xml?part=1&chap=4#doc_chap4 ---- ReiserFS is a B*-tree based filesystem that has very good overall performance and greatly outperforms both ext2 and ext3 when dealing with small files (files less than 4k), often by a factor of 10x-15x. [emphasis mine] Sounds excellent! There are a few reports of ReiserFS misbehaving on RedHat systems, so I'm not going to chance it and will go for the Gentoo install.

11/8/04 04:41 pm - Bug in ljspider.pl

There was an unfortunate bug in the previously posted version of ljspider.pl. Membership lists were not culled from certain community info pages, due to a small inconsistency in HTML between different info pages. This bug has been fixed, but if you used the script, you should make sure any communities appearing in your commXmems file with fewer than 500 members are actually logged in commXmem. If you have your community info pages, the ones that were not properly scraped are those returned by running the following command in the directory with all the _info pages:
grep -l "Members</a>:</b></td><td colspan='1'><b>.*:</b>" *_info | perl -pe 's/_info$//;'

You can then put these into a new queue_community (saving the old one), run the script until all the unprocessed communities are taken care of, break, and then combine the old queue_community and the new one. Since seen_community remains, no duplicate communities should be listed (same for seen_user).

The new version of ljspider.pl can be found here.

11/8/04 01:27 pm - Election-time Community Activity

Starting midday November 1st, I have been collecting RSS feeds from a number of political and non-political communities to see if and how the US election affects posting patterns. The communities I'm looking at in particular are: [info]infojunkies, [info]radicalnyc, [info]conservatism, [info]politicsforum, [info]israel_arab, [info]worldpolitics, [info]libertarianism, [info]rightist, [info]socialists, [info]greenparty, [info]debate, [info]sos_usa, [info]electronicmusic, [info]nycnobody, [info]lj_dev, [info]cornell_u, [info]1962, [info]random_things, [info]dork_power, [info]gardening. It would be interesting to see if the election somehow affected apolitical communities as well as politics and debate communities.

I've been collecting the RSS feeds between every 12 hours and every 48 hours, but as each RSS feed contains dates for the last 24 posts, no posts should have been missed. I will continue collecting data for the next few days to see what kinds of baselines the communities return to.
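Combining the overlapping feed fetches into per-day post counts could be done along these lines. This is a hedged Python sketch; it assumes each fetch has already been parsed into (post id, timestamp) pairs, with the post id being whatever unique identifier the feed provides, and the function name is made up for the example.

```python
from collections import Counter


def posts_per_day(snapshots):
    """Merge feed fetches and count distinct posts per calendar day.

    snapshots: iterable of lists of (post_id, datetime) pairs, one list
    per fetch of a single community's feed.
    """
    posted_at = {}
    for snapshot in snapshots:
        for post_id, when in snapshot:
            # A post that appears in several fetches collapses onto one entry.
            posted_at[post_id] = when
    return Counter(when.date() for when in posted_at.values())
```

Note that the merge only recovers every post if fewer posts than the feed holds were made between two consecutive fetches; for a very busy community during the election, that is worth checking.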

I'm thinking about and looking for suggestions as to interesting ways to present or interpret this data.

11/2/04 12:48 am - More scripts

I am making two other scripts available too:
locationparser.pl: Given a list of communities, this script will output a list of comma-separated values corresponding to their locations, if available. It can use cached data (from the LJ Spider script described in the previous post) or go to the LiveJournal server for fresh data (which it will then cache).
Sample use:
perl ./locationparser.pl < communities > locations
Note that a list of communities can be generated from commXmems (which is produced by ljspider.pl, discussed in the previous post) with:
perl -pe '@a = split(/, /); $_ = $a[0] . "\n";' < commXmems > communities

activecomm.pl: Given a list of communities, this script will output a list of communities it identifies as active. Read the header of the script for use instructions, but here's how I ran it:
perl ./activecomm.pl -c -m -n 30 < allcommunities > activecommunities
This will return communities whose last 5 posts are all within 30 days of each other and whose last post is within 30 days of today. Communities with fewer than 5 posts will not be considered active (-m), and cached data will be used before attempting to hit the server (-c). This script uses the RSS feed of a community's posts. Also note that active but private communities will be marked inactive, since their post histories cannot be accessed.
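The activity test described above can be written out as a short Python sketch. This is one reading of the rule, not a port of activecomm.pl: "within n days of each other" is taken to mean that the oldest and newest of the last five posts are at most n days apart, and the function name and arguments are assumptions.

```python
from datetime import timedelta


def is_active(post_times, now, n_days=30, min_posts=5):
    """Return True if the community passes the activity test.

    post_times: list of datetime objects, one per post, in any order.
    now: the datetime to treat as "today".
    """
    # Too few posts to judge (the -m behaviour described above).
    if len(post_times) < min_posts:
        return False
    recent = sorted(post_times)[-min_posts:]
    window = timedelta(days=n_days)
    # The last min_posts posts must span at most n_days, and the newest
    # of them must be at most n_days old.
    return recent[-1] - recent[0] <= window and now - recent[-1] <= window
```

Running the same test with n_days set to 1, 7, and 30 gives the coarse daily, weekly, and monthly activity levels.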

Please use the comments section for questions or feedback.

11/2/04 12:24 am - Another Crawl

Two days ago I conducted another, much shorter, crawl of LiveJournal, this time starting with [info]conservatism, since last time I started with [info]infojunkies, which led me towards more liberal communities, at least initially.

The LiveJournal crawling script I wrote is available here. The script will explore communities in breadth-first order, first taking communities the initial community is a 'member of', and then communities that are members. When those are exhausted (which didn't happen in my initial crawl, but did happen in this second one), it will go and start parsing info of initial community members, adding communities they are friends or members of to the community queue. The following information is stored, in the following files:
[snip]
# Data logs:
my $commXmem_log = "commXmem"; # community x member
my $interest_log = "interest"; # (member | community) x interest
my $userXfriend_log = "userXfriend"; # user x friend
my $commXmems_log = "commXmems"; # community x number_of_members
my $commXcomm_log = "commXcomm"; # community x other_community_it_is_member_of
[/snip]

All files are comma-separated and hopefully ready for importing into a database.
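The crawl order described in this entry can be sketched in Python as follows. This is a simplified illustration rather than the logic of the actual script: the four input dictionaries stand in for data that the real crawler fetches from LiveJournal, all names are hypothetical, and members of every visited community (not only the initial one) are kept as a fallback.

```python
from collections import deque


def crawl_order(start, member_of, member_communities, member_users,
                user_communities):
    """Return the order in which communities would be visited.

    member_of: community -> communities it is a member of
    member_communities: community -> communities that are its members
    member_users: community -> personal accounts that are its members
    user_communities: user -> communities the user is a friend/member of
    """
    seen, queue, order = {start}, deque([start]), []
    seen_users, pending_users = set(), deque()

    def enqueue(community):
        if community not in seen:
            seen.add(community)
            queue.append(community)

    while queue or pending_users:
        if not queue:
            # Community queue exhausted: fall back to a member's communities.
            user = pending_users.popleft()
            for community in user_communities.get(user, []):
                enqueue(community)
            continue
        community = queue.popleft()
        order.append(community)
        # 'Member of' links first, then member communities.
        for other in member_of.get(community, []):
            enqueue(other)
        for other in member_communities.get(community, []):
            enqueue(other)
        for user in member_users.get(community, []):
            if user not in seen_users:
                seen_users.add(user)
                pending_users.append(user)
    return order
```

On real data the loop would need a stopping condition (a time or count limit), since the reachable part of LiveJournal is very large.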

The program will also attempt to exit in a consistent state (which sometimes takes a bit of time, and can be overridden), and save its queue of communities and users, as well as a list of communities/users it has seen. On next run, the script will restart from where it had previously left off. Also note that the script will store local copies of all files it downloads (including fdata and user info pages). This can quickly become a huge number of files (especially problematic with FAT32 filesystems), and can indeed take up a lot of space too, especially if crawls last a few days.
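Saving and restoring crawl state can be sketched like this. It is a minimal Python illustration under assumed conventions: one name per line, and the file names `queue` and `seen` are placeholders rather than the names the script uses.

```python
from pathlib import Path


def save_state(state_dir, queue, seen):
    """Write the pending queue (in order) and the seen set to state_dir."""
    state_dir = Path(state_dir)
    state_dir.mkdir(parents=True, exist_ok=True)
    (state_dir / "queue").write_text("".join(f"{c}\n" for c in queue))
    (state_dir / "seen").write_text("".join(f"{c}\n" for c in sorted(seen)))


def load_state(state_dir, start):
    """Return (queue, seen); begin from `start` if no state was saved."""
    state_dir = Path(state_dir)
    queue_file, seen_file = state_dir / "queue", state_dir / "seen"
    if not queue_file.exists():
        return [start], {start}
    queue = queue_file.read_text().split()
    # Without a seen file, fall back to treating the queued names as seen.
    seen = set(seen_file.read_text().split()) if seen_file.exists() else set(queue)
    return queue, seen
```

Keeping the seen set separate from the queue is what lets a new queue be swapped in without re-listing communities that were already crawled.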

A note: To run a crawl, and then start again with a new community, but without removing the results of the previous crawl, move old queue_communities to queue_communities.old. Then `echo [new starting community] > queue_communities` . Start crawl anew. When it finishes, cat the queue_communities and queue_communities.old together.

Please post comments to this entry if you have questions, need any help, or have suggestions. Also please read the source code.

10/27/04 12:42 am - Using this research to fulfill the School of Engineering's Technical Writing Requirement

I'm going to try to have this research count as my Technical Writing Requirement in the College of Engineering at Cornell, as part of a new program offered to students conducting LIFE-sponsored independent research. Though this may not sound very relevant to the specifics of the research, it is going to dictate how and where I will present results, and thus directions I may focus on.

The idea behind the program is for me to do four different "communication" projects, two dictated by the program and two of my own choosing. After those are completed, my Technical Writing Requirement is met. The great thing is that these projects don't have to be completed by the end of the semester.

A proposal I submitted lists the following projects I would like to undertake:

  • Final Paper -- Dictated by the program. Basically what a lot of my grade for this class will be based on, and will also be sent to the LIFE people (with possible modifications) to justify the execution of this research.

  • Poster -- Dictated by the program. I will create a poster to be exhibited at the 2005 BOOM (Bits On Our Minds) poster session here at Cornell. This poster may later be modified if I want to submit it to other sessions.

  • Electronic Research Journal -- Proposed by me. This page! If this proposal is accepted, the updates on this page will become more comprehensive and I will talk more about the social implications of this research, further directions Phoebe, Dan, and I are working on, my experiences programming and crawling LiveJournal, etc. Basically, I would convert this page into a useful resource for researchers interested in projects similar to this one (whether they involve LiveJournal, online community formation, or some other facet of this research). I would also post script/code updates and copies of the data I mined (which I fully intend to do anyway).

  • Academic Paper OR Verbal Presentation -- Depending on how interesting my results and critical designs are, I will try to write and submit a paper to some academic conference or journal (maybe CHI2006 or DIS2006 if I'm really good...). If this doesn't quite go through, I will instead opt for giving some sort of verbal presentation on this project. This might be in the form of a basic technical-intro presentation for interested sociology/communication students, a presentation focusing on theory application for CS/IS students, or a general interest presentation for either interested faculty/students or maybe for local high school students interested in cross-disciplinary research.

10/19/04 01:29 pm - Data Collection Update

This weekend my LJ spider gathered data on 4239 communities. In particular, the following information was collected:
(1) Number of members of every checked community
(2) For any checked community with fewer than 500 members, the members of that community (including other communities)
(3) The interests of every checked community
(4) The 'member of' relations of every checked community

Interestingly, I found that there were only 211075 unique personal-account members in the 4239 communities.

The script I used to gather this data is available from me, if you email the address listed in the userinfo for this account... I'll be uploading it somewhere at some future time, but not before I deal with some issues that cropped up during the crawl, such as getting it to be better at saving state if the user breaks execution... It's pretty usable as is, however.

The data is available by request.

10/14/04 11:31 pm - LJ Crawl In Progress

At this moment, my laptop at home is crawling LJ for community memberships. This is a test run to see how the software performs, if it has bugs, etc. Unless it crashes, it should be running all night at which point I will also be able to tell how much of LJ I can crawl in reasonable amounts of time (in my last correspondence with LJ devs, I was told there are about 203k community journals).

When the bugs are ironed out, I will post the script (a heavy, heavy modification of [info]emmastrange's ljbot.pl script found here) and a breakdown of the logic behind the crawl.

Right now all data is being saved in comma-separated text files, and I'm caching all friend-data and user/community-info pages I get. Once the new Dell system arrives, I will import the text files directly into a MySQL database for easy access.