NanoHistory Tools: Webs

NanoHistory's use of graph or network models lends itself immediately to the force-directed network representations we've grown accustomed to over the past decade or so. For the in-house network visualization tool, which I'm calling 'webs' for lack of a better name, I've opted to adapt D3's well-known force-directed example. I've mashed it up with some later versions by other D3 designers and tweaked it for our use.

Two issues related to scale affected the development of this core tool. The first was building an effective query engine that would let users create visualizations on demand from the overall NanoHistory collection. The second was handling the dataset and the force direction during rendering, both in terms of 'ticks' and scaling.

We haven't deployed a network or graph database solution like Neo4j yet, as it's not clear how Neo4j would relate to the other kinds of data representations we'd like to create, or to the overall processing and infrastructure of the platform. It's slated for definite review. In the meantime, however, we needed an effective means of creating queries that would do the following:

  • Allow users to create webs using one or more primary entities as central or core nodes, extending to a set distance from each.
  • Limit the types of entities visualized, say just people, places, or things.
  • Limit or define the types of verbs used to create the web, either using verbs themselves, or pre-defined verb groups for specific kinds of networks.
  • Include or exclude the queried central or core nodes in the result.
  • Include or exclude verbs pertaining to common states (created, born, etc.) and composition (mentioned, cited, etc., since all entities are supposed to have a clear evidence or provenance chain) from the result, to allow for greater precision.
  • Select all entities from a particular source.
  • Select all entities that match keyword terms.
  • Limit the results by dates, either by greedy or restricted matching.
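
The first few options in this list can be pictured as a small filter-and-traverse routine. What follows is a minimal sketch in Python, not the NanoHistory implementation: the `Statement` and `WebQuery` structures, their field names, the sample data, and the choice of a breadth-first expansion are all assumptions made for illustration.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Statement:
    """Hypothetical record: two entities linked by a verb."""
    subject: str
    verb: str
    obj: str


@dataclass
class WebQuery:
    """Hypothetical query options mirroring the list above."""
    core: set                            # ids of the central or core nodes
    distance: int = 1                    # maximum hops from any core node
    entity_types: Optional[set] = None   # e.g. {"person"}; None means all
    verbs: Optional[set] = None          # allowed verbs; None means all
    include_core: bool = True            # keep core nodes in the result?


def build_web(query, statements, entity_type):
    """Expand outwards from the core nodes, honouring the filters."""
    # Index statements by entity, dropping verbs the query does not allow.
    neighbours = {}
    for s in statements:
        if query.verbs is not None and s.verb not in query.verbs:
            continue
        neighbours.setdefault(s.subject, []).append((s.obj, s))
        neighbours.setdefault(s.obj, []).append((s.subject, s))

    def allowed(node):
        # Core nodes are always traversable; others must match the type filter.
        if node in query.core:
            return True
        return query.entity_types is None or entity_type.get(node) in query.entity_types

    depth = {c: 0 for c in query.core}
    queue = deque(sorted(query.core))
    edges = set()
    while queue:
        node = queue.popleft()
        if depth[node] >= query.distance:
            continue  # do not expand past the requested distance
        for other, s in neighbours.get(node, []):
            if not allowed(other):
                continue
            edges.add(s)
            if other not in depth:
                depth[other] = depth[node] + 1
                queue.append(other)

    nodes = set(depth)
    if not query.include_core:
        nodes -= query.core
        edges = {s for s in edges if s.subject in nodes and s.obj in nodes}
    return nodes, edges


# Made-up sample data, purely for illustration.
statements = [
    Statement("ada", "corresponded_with", "babbage"),
    Statement("babbage", "lived_in", "london"),
    Statement("ada", "born", "london"),
    Statement("babbage", "corresponded_with", "herschel"),
]
types = {"ada": "person", "babbage": "person", "london": "place", "herschel": "person"}

nodes, edges = build_web(
    WebQuery(core={"ada"}, distance=2, entity_types={"person"}), statements, types
)
```

With the sample data, the query starts at one person, follows any verb for two hops, and keeps only people. A real implementation would also need the source, keyword, and date options, which are left out here.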

While software like Neo4j would allow us to do this much faster, for the time being we're using a job-based system that lets users create a query, return to monitor its status, and re-use the results as many times as they wish. Despite taking a bit longer, this has several advantages. First, we can create queries with default settings via buttons on other pages (e.g. when looking at a particular record, you can just click on the webs link, and it creates a query job that can be accessed at a later date without disrupting the workflow). Second, because every query is treated as a job, users end up creating draft datasets that can be saved for later, returned to when needed, or used as snapshots to monitor research development. On the back end, the benefit is simply reduced computational overhead: running through thousands of nodes and possible edges to create the dataset is intensive, and running queries threaded through the web browser ended up overwhelming the entire site. So things had to change.

At the moment, a job is defined by a user, and a separate cron job checks for the next unprocessed job and runs it locally, dumping the results back into the database, and updating the status as needed. Users can then access the dataset through the webs interface when needed. Since the result is stored as data, it's easy to export, and to port to different viewers. We have a D3 version of the webs, and a draft SigmaJS version.
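
To make the job cycle concrete, here is a minimal sketch of a database-backed queue with a worker that a cron entry could call. It is an illustration under assumptions, not the NanoHistory code: the `jobs` table, its columns, the status names, and the use of SQLite and JSON are all hypothetical.

```python
import json
import sqlite3


def create_schema(db):
    # Hypothetical table: one row per query job.
    db.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        " id INTEGER PRIMARY KEY,"
        " query TEXT NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'pending',"
        " result TEXT)"
    )


def submit_job(db, query):
    """Store a query as a pending job and return its id."""
    cur = db.execute("INSERT INTO jobs (query) VALUES (?)", (json.dumps(query),))
    db.commit()
    return cur.lastrowid


def run_next_job(db, run_query):
    """Process the oldest pending job; meant to be called from cron."""
    row = db.execute(
        "SELECT id, query FROM jobs WHERE status = 'pending' ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return None  # nothing to do on this run
    job_id, query = row
    db.execute("UPDATE jobs SET status = 'running' WHERE id = ?", (job_id,))
    db.commit()
    try:
        result = run_query(json.loads(query))
        db.execute(
            "UPDATE jobs SET status = 'done', result = ? WHERE id = ?",
            (json.dumps(result), job_id),
        )
    except Exception as exc:
        # Record the failure so the user can see it when checking status.
        db.execute(
            "UPDATE jobs SET status = 'failed', result = ? WHERE id = ?",
            (str(exc), job_id),
        )
    db.commit()
    return job_id


def job_status(db, job_id):
    """Return (status, stored result text) for a job."""
    return db.execute(
        "SELECT status, result FROM jobs WHERE id = ?", (job_id,)
    ).fetchone()
```

Because the result is written back as plain data, the same row can feed any viewer or export. A worker running more than one process at a time would need some form of locking, which this sketch omits.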

Both the D3 and SigmaJS visualizations present problems when it comes to layout in the browser. Our adaptation of the D3 force-directed graph works extremely well for smaller data sets, but ends up stalling browsers once a web exceeds roughly 2,000 nodes. The main issue is the example's use of 'tick' to relay and handle the force direction: each movement requires recalculation of each node's X and Y coordinates, which is too heavy for a browser's memory and a client's system. The solution was to count the number of nodes returned in the result and use that count to settle on a number of 'ticks' large enough to create an effective layout but small enough not to overwhelm the browser. In short, the D3 version does a few ticks, and then stops.
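
The idea of tying the tick count to the size of the result can be sketched as a simple budget calculation. This is an assumed formula written in Python for illustration, not the ratio used in the webs tool: the function name, the `work_budget` figure, and the minimum and maximum bounds are all hypothetical.

```python
def tick_budget(node_count, work_budget=300_000, min_ticks=10, max_ticks=300):
    """Return how many layout ticks to run for a graph of `node_count` nodes.

    Assumption: the cost of one tick grows roughly with the number of nodes,
    so a fixed amount of work is divided by the node count and then clamped.
    """
    if node_count <= 0:
        return 0
    ticks = work_budget // node_count
    # Small graphs get the full cap; huge graphs still get a few ticks.
    return max(min_ticks, min(max_ticks, ticks))


print(tick_budget(100))     # small web: capped at max_ticks
print(tick_budget(2000))    # mid-sized web: budget divided by node count
print(tick_budget(50000))   # very large web: floor of min_ticks
```

In the browser, a number like this could drive a loop that advances the D3 layout a fixed number of times and then stops it, rather than letting it run until it settles on its own.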

The other issue with the D3 visualization is zoom, or scaling. I've managed to get a slider working in Firefox that allows users to zoom in or out on a web as needed. During rendering I can preset the slider, thereby giving the visualization a different starting zoom level depending on the number of nodes in the result. Users can also drag and reposition the visualization as needed. The slider, however, fails to work properly in Chrome.
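
One way to choose the preset is to shrink the view as the node count grows, so that each node keeps roughly the same share of the screen. The sketch below is an assumption about how such a preset might be computed, not the rule the webs tool uses: the function name, the `comfortable_nodes` threshold, and the zoom bounds are hypothetical.

```python
import math


def initial_zoom(node_count, comfortable_nodes=100, min_zoom=0.1, max_zoom=1.0):
    """Return a starting zoom factor for a web of `node_count` nodes.

    Assumption: screen area per node should stay roughly constant, so the
    linear scale falls with the square root of the node count.
    """
    if node_count <= comfortable_nodes:
        return max_zoom
    scale = math.sqrt(comfortable_nodes / node_count)
    return max(min_zoom, min(max_zoom, scale))
```

The returned factor would then be written to the slider before the first render, leaving users free to adjust it afterwards.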

The SigmaJS viewer is not as advanced. Without a server-side implementation of Gephi, there is no effective way to handle the layout and positioning that SigmaJS needs to create a clean visualization. At the moment, the position of each node is randomized within the dimensions of the viewer's container <div> element.
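
The randomized placement amounts to drawing an x and y coordinate for each node inside the container's bounds. A minimal sketch follows, in Python for illustration only: the function name, the `margin` parameter, and the dictionary shape are assumptions rather than the viewer's actual code.

```python
import random


def random_layout(node_ids, width, height, margin=10, seed=None):
    """Assign each node a random position inside a width x height box.

    `margin` (hypothetical) keeps nodes away from the container's edge;
    `seed` makes the layout repeatable when one is supplied.
    """
    rng = random.Random(seed)
    return {
        node_id: {
            "x": rng.uniform(margin, width - margin),
            "y": rng.uniform(margin, height - margin),
        }
        for node_id in node_ids
    }


positions = random_layout(["ada", "babbage", "london"], width=800, height=600, seed=1)
```

A random layout guarantees only that every node is visible; it says nothing about which nodes are related, which is why a proper layout step is still needed.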

Revision Plans

Querying

  • Implement dates
  • Implement keyword searching
  • Refine / tweak 'source' based searching

Rendering

  • Explore a Neo4j implementation and how it would affect the rest of the core infrastructure of NanoHistory
  • Explore NodeJS for some aspects of the rendering