After getting data on approximately 11,000 LiveJournal communities (approximately 1/20th of all communities), I was hoping to compare two representations/visualizations of the LiveJournal community-space. One representation is based on 'member of' links between communities (explicit links), and the other is based on the members that communities share in common (implicit links). As per my last post, however, it turns out that the crawling code I was using was keeping an incomplete set of 'member of' relationships between communities. Thus, the only correct visualizations I have are based on common members between two given communities. Nonetheless, I think the visualizations are rather interesting and I'll spend some time describing them. In the next post, I will talk about future directions.
First, I will outline the type of data collected:
Using the ljspider.pl script (described here), I gathered data on approximately 11,000 LiveJournal communities. Among the data I collected were the membership lists of all crawled communities -- this is the only data set relevant to this post. It is very important to note that LiveJournal does not provide membership lists for communities with more than 500 members, which constituted 11.8% of all communities I crawled. For the purposes of the following analysis, these communities were dropped from the dataset. To refine the available data, I used the activecomm.pl script (described here) to highlight communities that seem to be active at least monthly (n = 30 in activecomm.pl) and only included those in my final data set. Finally, I computed the number of shared members between every pair of communities in the data set and logged all pairs of communities sharing at least 5 members. The Java code to do this final step will become available on this blog shortly.
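The pairwise counting step can be sketched roughly as follows. This is not the Java code mentioned above, only a minimal Python illustration. It assumes the membership lists are already loaded as a mapping from community name to a set of user names; the function name `shared_member_pairs` and that input format are hypothetical.

```python
from collections import Counter
from itertools import combinations


def shared_member_pairs(memberships, min_shared=5):
    """Count the members shared by each pair of communities.

    memberships: dict mapping community name -> set of user names
    (an assumed in-memory form of the crawled membership lists).
    Returns {(comm_a, comm_b): count} for pairs with count >= min_shared,
    with comm_a < comm_b so each undirected pair appears only once.
    """
    # Invert the data: for each user, list the communities they belong to.
    communities_of = {}
    for community, members in memberships.items():
        for user in members:
            communities_of.setdefault(user, []).append(community)

    # Each user adds 1 to every pair of communities they are a member of.
    counts = Counter()
    for communities in communities_of.values():
        for pair in combinations(sorted(communities), 2):
            counts[pair] += 1

    # Log only the pairs that meet the threshold (5 in this post).
    return {pair: n for pair, n in counts.items() if n >= min_shared}
```

Iterating over users instead of over every pair of communities means that pairs with no members in common are never touched, which matters when most of the roughly 11,000 communities do not overlap at all.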
The collected data can be thought of as representing a graph: every community is a vertex and the number of members two communities have in common is the weight of the undirected edge between the two nodes representing those communities.
I've been using Graphviz, a free open-source graph visualizer from AT&T Research, to visualize my data. Other visualization tools I looked at, like Walrus and Pajek, were not able to produce acceptable results -- Walrus cannot deal with graphs that are not trees, and I could not get Pajek to arrange the nodes of the graph in any sort of logical order. However, Graphviz has a hard time working with large graphs, so in order to use it for visualization I needed to constrain my original data set in the following ways:
- Communities with fewer than 100 members were dropped
- Communities that were not active at least monthly, as judged by their last five posts, were dropped. (see post about activecomm.pl here)
- Edges representing fewer than 50 common members between two communities were dropped
Any edges pointing to a dropped community vertex, and any community vertex left without non-dropped incoming/outgoing edges, were also dropped. I created a DOT file representing the remaining community vertices and edges and visualized it with both the dot and neato utilities in the Graphviz package. The dataset included a lot of 2-, 3-, and 4-node subgraphs that I did not think were worth the time to analyze, so I removed any subgraph with fewer than 5 community nodes.
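The pruning described above could be sketched like this in Python. It is an illustration under stated assumptions, not the code I actually ran: the function name `prune_graph` and its input dictionaries are hypothetical, and the monthly-activity filter is assumed to have been applied before this step.

```python
def prune_graph(member_counts, edges, min_members=100, min_shared=50,
                min_component=5):
    """Apply the size, edge-weight, and subgraph-size constraints.

    member_counts: {community: number of members}
    edges: {(comm_a, comm_b): number of shared members}
    Returns (kept_nodes, kept_edges).
    """
    # Keep an edge only if it is heavy enough and both endpoints are
    # large enough; edges to dropped communities disappear here.
    kept_edges = {
        (a, b): w
        for (a, b), w in edges.items()
        if w >= min_shared
        and member_counts.get(a, 0) >= min_members
        and member_counts.get(b, 0) >= min_members
    }

    # Build adjacency sets. A community with no surviving edge never
    # appears in this mapping, so isolated vertices are dropped too.
    neighbors = {}
    for a, b in kept_edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    # Find connected subgraphs with an iterative depth-first search and
    # keep only those with at least min_component communities.
    kept_nodes, seen = set(), set()
    for start in neighbors:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbors[node] - component)
        seen |= component
        if len(component) >= min_component:
            kept_nodes |= component

    # Both endpoints of an edge lie in the same subgraph, so checking
    # one endpoint is enough.
    kept_edges = {e: w for e, w in kept_edges.items() if e[0] in kept_nodes}
    return kept_nodes, kept_edges
```

The default thresholds mirror the numbers in this post (100 members, 50 shared members, 5 nodes per subgraph); lowering them gives a less constrained graph at the cost of a larger input for Graphviz.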
I reran neato to generate a visualization of this final dataset, and tightened the image up (removed a lot of whitespace) with Photoshop. A scaled-down version of the image is behind the lj-cut and the full-size image is available by clicking on it.
Figure 1 Visualization of LiveJournal community space using shared-members links
The visualization shows various communities-of-communities in the LiveJournal community space that are tightly connected by sharing at least 50 members between some of their member communities. The community vertices themselves have a size representative of the total number of members, and are colored based on activity levels. Note that the n in the activity metric is the number of days given to the activecomm.pl script. Coarsely, however, we may say that red communities receive posts on average somewhere between once a month and once a week, yellow communities somewhere between once a week and once a day, and blue communities are posted to daily or more frequently.
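To make the encoding concrete, here is a minimal Python sketch of how a DOT file with sized and colored vertices could be generated. Several details are assumptions made for illustration rather than what I actually used: the function name `to_dot`, the square-root scaling of vertex width, and the use of n = 30, 7, and 1 as keys for the red, yellow, and blue tiers (only the first two values appear in this post).

```python
import math

# Assumed mapping from activity tier (n in days) to a fill color.
ACTIVITY_COLORS = {30: "red", 7: "yellow", 1: "blue"}


def to_dot(member_counts, activity, edges):
    """Return an undirected Graphviz DOT graph as a string.

    member_counts: {community: number of members}
    activity: {community: n}, the tier (30, 7, or 1) assigned to it
    edges: {(comm_a, comm_b): number of shared members}
    """
    lines = [
        "graph communities {",
        "  node [shape=circle, style=filled, fixedsize=true];",
    ]
    for name in sorted(member_counts):
        # Width (in inches) grows with the square root of membership so
        # that circle area tracks community size (an assumed scaling).
        width = round(math.sqrt(member_counts[name]) / 10, 2)
        color = ACTIVITY_COLORS[activity[name]]
        lines.append(f'  "{name}" [width={width}, fillcolor={color}];')
    for a, b in sorted(edges):
        # "--" is the DOT operator for an undirected edge.
        lines.append(f'  "{a}" -- "{b}";')
    lines.append("}")
    return "\n".join(lines)
```

Writing the returned string to a file such as `communities.dot` and running `neato -Tpng communities.dot -o communities.png` would then produce a layout in the same spirit as Figure 1.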
One of the strengths of the shared-members visualization lies in its ability to capture and make salient the strength of inter-community ties, and possible paths of information flow between communities (e.g. through cross-posting or linking). Let's examine a small portion of the full visualization more carefully.
Figure 2 Blow-up of a section of Figure 1 showing three communities-of-communities subgraphs
Note the five connected, yellow communities in the top-left corner of Figure 2, for instance. The four communities that form a clique in the graph are players in LiveJournal's fake hair user-community (fake hair is popular with individuals identifying themselves with the goth, cybergoth, and nightclub lifestyles). There is a 'market' community where users may sell or buy fake hair and accessories, a 'pix' community where users post pictures of themselves wearing fake hair, and two other fake hair general-interest communities. All of these communities are active at least on a weekly basis (n = 7) and share between 50 and 100 members in each pair, indicating strong inter-community ties and most likely pointing to a high rate of information flow through the communities (e.g. through cross-posting). In the same group of communities, the 'cyberwarez' community is the least connected, hinting at a somewhat different set of interests than the rest of the communities in the subgraph. Indeed, I found the 'cyberwarez' community to be one where users could buy and sell club and party clothing for the goth/cybergoth/nightclub scene - related to, but distinct from, the fake hair communities that made up the rest of the subgraph.
The other two subgraphs present in Figure 2 can be analyzed similarly. The large, mostly-blue subgraph represents a large, active, and somewhat tightly-knit group of buy/sell/exchange communities dedicated to used clothing, accessories, and similar goods. On the fringes of this group are marketplace communities for a less well-defined range of goods ('marketplace', 'trade_stuff', 'subcultauctions') as well as a community specializing in hand-made goods ('__handmade'). The last subgraph seems to show an interesting link between the user-community of music producers and the user-community of LiveJournal users interested in aspects of the music industry. The connection seems to be mediated through the link between the 'audioeng' and the 'musicbiz' communities.
Showing mediation between communities-of-communities with similar (or not-so similar) overarching interests seems to be a potent ability of the shared-member representation (see Figure 3).
Figure 3 Example of mediation in LiveJournal community-space
Note how the ‘europe_history’ community and especially the very active ‘middle_ages’ community in Figure 3 seem to mediate between the general-interest history community-group (‘askahistorian’, ‘history’, and ‘historystudents’) and the ancient Greco-Roman history community-group (‘roma_antiqua’ and ‘classics’). I fully expect that as the dataset becomes less constrained (e.g. as the cut-off number of shared members below which edges are dropped is lowered), such mediatory relationships will grow in number and diversity.
In my next post, I will draw some final conclusions about this method of visualizing/representing the LiveJournal community-space, and discuss further paths of research.