Archive for the ‘Programming’ Category

Complex Joins in CouchDB

Thursday, July 29th, 2010

By nature of all so-called „No-SQL“ databases, there is no query language which offers the same expressiveness as SQL. In particular, there is no notion like “joins”. In most cases, this is not an issue since the more flexible structure of a document makes it unnecessary to distribute information to different tables, i.e. there is simply nothing to join.

However, if you need to do something similar to joins, there is a solution: view collations. The original idea has already been described by Christopher Lenz back in 2007. With CouchDB version 0.11, the include_docs parameter improves this approach.

Combining view collation with appropriate list functions provides an easy way to combine several documents into one complex JSON document which can be queried.

Let’s look at an example of a many-to-many relationship based on the data model I use for the Haynes catalog. A “work” represents a piece of music which has been written by one or more “composers”. Needless to say, a composer may have written several works. Users can add comments to works. For various reasons I would like to keep these three types of documents normalized, i.e. works, composers and comments are stored in different components.

In a CouchApp, I would like to render a work including composer information and the comments. Since the result should be search-engine friendly, I would like to avoid javascript to load the data asynchronously.

Ideally, I would like to have an enhanced work document which includes an array of composers and comments.

To see how this can be done, let’s look at an example:

{_id: “work_1”, type: ”work”, title: “Title of work 1”, composer_ids: [“composer_1”, “composer_2”] }
{_id: “composer_1”, type: ”composer”, name: “Name of composer 1” }
{_id: “composer_2”, type: ”composer”, name: “Name of composer 2” }
{_id: “comment_1”, type: “comment”, text: “Comment for work 1”, work_id: “work_1” }
{_id: “comment_2”, type: “comment”, text: “Another comment for work 2”, work_id: “work_1” }

As you can see, there is one work written by two composers and having two comments.

I would like to create the following document which could be used in my template to render html:

{_id: “work_1”, type: ”work”, title: “Title of work 1”,
    composers: [
        {_id: “composer_1”, type: ”composer”, name: “Name of composer 1” },
        {_id: “composer_2”, type: ”composer”, name: “Name of composer 2” }],
    comments: [
        {_id: “comment_1”, type: “comment”, text: “Comment for work 1”, work_id: “work_1”},
        {_id: “comment_2”, type: “comment”, text: “Another comment for work 2”, work_id: “work_1”}]
}

Wouldn’t it be great if I could use this document in may template and simply write <%= work.composers[0].name %> instead of ajax calls?

Combining the known techniques from above with a list function, gives you exactly this possibility. Let’s start with the map function for the view called “works”:

function (doc) {
	if (doc.type == "work") {
		emit([doc._id], {_id:doc._id});
		for (var ix in doc.composer_ids) {
			emit([id, "composers", ix], {_id: doc.composer_ids[ix]});
		}
	}
	if (doc.type == "comment") {
		emit([doc.work_id, "comments", doc._id], {_id: doc._id});
	}
}

There is no reduce function for this view. Actually there is an approach to use reduce functions to merge the documents into one document, but as you can see in the comments to this article this is not advised for larger data sets.

Instead, we use a list function, let’s call it “join”:

function(head, req) {
    provides("json", function() {
        var data = [];
        var dataItem = {};
        var row, key;
        while (row = getRow()) {
            if (row.key.length == 1) {
                if (dataItem._id && dataItem._id != row.value._id) {
                    data.push(dataItem);
                }
                dataItem = row.doc;
            } else {
                object = dataItem;
                for (var i = 1; i < row.key.length - 1; i++) {
                    if (!(row.key[i] in object))
                        object[row.key[i]] = [];
                    object = object[row.key[i]];
                }
                object[row.key[row.key.length - 1]] = row.doc;
            }
        }
        data.push(dataItem);
        return toJSON(data);
    });
}

Before we go into detail what happens in this method, let’s look at the result of the query http://localhost:5984/db_name/_design/design_name/_list/join/works?include_docs=true&startkey=[“work_1”]&endkey=[“work_1”,{}]:

[{_id: “work_1”, type: ”work”, title: “Title of work 1”,
    composers: [
        {_id: “composer_1”, type: ”composer”, name: “Name of composer 1” },
        {_id: “composer_2”, type: ”composer”, name: “Name of composer 2” }],
    comments: [
        {_id: “comment_1”, type: “comment”, text: “Comment for work 1”, work_id: “work_1”},
        {_id: “comment_2”, type: “comment”, text: “Another comment for work 2”, work_id: “work_1”}]
}]

The result is an array of the enhanced work documents. In a real example, the “data” array or rather only the first element would be used in a template to provide html instead of json, but you get the idea.

Now let’s have a closer look at what happens in the list function. If the key is an array with only one element (row.key.length == 1), we store the corresponding document. If we encounter a new one, we push the old one as this is obviously completely processed.

If the key has more than 1 element, the additional elements contain the property names of the final document which should contain the document. For example the key [“work_1”, “composers”, 0] means that work[“composers”][0] (=work.composer[0]) should contain the document for this key. The for loop makes sure that the properties exist (if not, they are initialised as []). Finally the document can be stored in the property.

As a result, you can create arbitrarily complex compound objects joining several separate documents. As you will typically join only a relative small number of documents, the list function performs well.

Proposal for graph readers in JGraphT

Thursday, July 8th, 2010

One part of my development on Hypergraph which I recently started again after a long break is to replace my own graph implementation with JGraphT. JGrapht is a nice framework to create graph structures in Java and it offers various algorithms to process graphs. For the hyperbolic graph visualisation I do with Hypergraph, this is a very good alternative to my self-written “graph library”. It also includes the possibility to serialise graphs using for example the GraphViz dot format or the GraphML XML format. However, it doesn’t offer any possibility to read these graph formats.

The following is a proposal how to create readers for JGraphT. Please feel free to leave a comment, an additional proposal etc.

The GraphReader interface

The parsing of a file and the creation of the graph should be done by a instances of a common GraphReader interface. This interface could be:

public interface GraphReader<V, E> {
    public void setComponentAttributeProvider(ComponentAttributeProvider<?> attributeProvider);

    public Graph<V, E> read(String filename,
        VertexFactory<V> vertexFactory, EdgeFactory<V, E> edgeFactory);

    public Graph<V, E> read(InputStream inputStream,
        VertexFactory<V> vertexFactory, EdgeFactory<V, E> edgeFactory);

    public Graph<V, E> read(URL url,
        VertexFactory<V> vertexFactory, EdgeFactory<V, E> edgeFactory);

}

The ComponentAttributeProvider can be provided before creating the graph; in this case, attributes like weight, color or other information provided in the file will be stored in the ComponentAttributeProvider. This is not mandatory – if not available, the attributes are simply ignored. For a discussion about storing the attributes, see below.

The read methods return a Graph instance; the actual type of the graph depends on the data in the file. For example the graph format can define a graph as directed/undirected (for example “digraph” in the dot format); the created graph may hence be a DirectedGraph or an UndirectedGraph.

Questions:

  1. Which read methods should be implemented? Are the three methods I mentioned above enough?
  2. Which parameters should the read methods have? Shold we offer read methods without an EdgeFactory, which default to ClassBasedEdgeFactory?
  3. How should exceptions (IOException etc) be handled? This includes parsing errors (wrong format).

Generic Reader

There should be a generic class which hides the complexity of choosing the correct parser. When this generic reader is used, the only information passed to the reader might be a file name or even only an input stream. The reader should derive information about the format by analysing for example the file name, the mime type or the format of the actual data, based on some heuristics.

Some example for heuristics:

  1. if the file name is given and ends with “.xml” or “.dot” => the file is an xml or dot file and the corresponding reader is used
  2. if the mime type is available, use xml reader for the xml mime type
  3. if the first bytes “look like” a dot file (i.e. the file starts according to the dot grammar [strict] (graph | digraph) ), use the dot reader

Example how to use such a generic reader:

    GraphReader<V, E> reader = new GenericReader<V, E>();
    ComponentAttributeProvider attributeProvider = new DefaultComponentAttributeProvider();
    reader.setAttributeProvider(attributeProvider);
    Graph<V, E> graph = reader.read(“some_file.dot”, vertexFactory);

In this case, the read(..) methods would assume that the file has the dot format because of the file name. It would instantiate a corresponding reader (say DOTReader) and delegate the actual parsing to this reader.

Storing attributes on vertices and edges

Most graph formats include attributes for vertices and edges, for example a weight or color information. There must be a way to store these attributes either on the graph elements (i.e. vertices or edges) themselves or by maintaining a map for each element which stores the attribute keys and corresponding values. I prefer the second approach for two reasons:

  1. There exists already a ComponentAttributeProvider which has been introduced to support attributes when dot files are created.
  2. Storing the attributes on the elements would imply that the elements implement some interface, probably an interface which defines something like “Object getAttribute(String key)”. This would contradict the philosophy of JgraphT since the JgraphT code makes no assumption on the vertices and edges at all.

Hence the readers should store any attributes using ComponentAttributeProvider (probably in a slightly enhanced version) and should make no assumptions on the elements. Even then, there are different ways to work with the attributes:

Variant 1:
The vertex and edge classes don’t know about the attributes; to access the attributes, the program has to access the ComponentAttributeProvider.
Variant 2:
The elements are instances of a special class which “knows” about the attribute provider. The vertex/edge class has accessors to for example a label (i.e. public String getLabel()) and the implementation of this method delegates to the attribute provider. In particular, the elements don’t store the actual attribute values.
Variant 3:
The elements store the actual attribute values and offer appropriate accessors. The attribute provider knows the type of the element and when getAttributes() is called, the attribute provider creates a map and initialises it with the attributes read from the vertex.

Both variant 2 and 3 assume that there is a special vertex or edge class; this is possible because the Reader accepts a VertexFactory and EdgeFactory.

Reanimating Hypergraph

Monday, July 5th, 2010

After a break of several years, I decided to continue the Hypergraph project. For those of you who don’t know Hypergraph: this is an open source Java framework which provides ways to model hyperbolic geometry and visualises graphs and trees using hyperbolic geometry. There has been some research on hyperbolic trees which shows that especially hierarchical data can be view very well using hyperbolic trees.

Every few weeks I received a mail asking for the current status of the project, whether it’s completely dead or whether certain enhancements can be made. There were also various (legitimate) complaints about the missing documentation of the project.

Hence I thought I should continue to work on the project at least if time allows it. Here I would like to give an overview about the next development steps and the reasons for this. I’m afraid, I can’t give a time estimation as I’m not working on this project on a regular basis.

Well, the next steps will be:

  1. Finish the refactoring to become independent of the UI (support at least AWT/Sing and SWT)
  2. Get rid of the graph data structure and algorithms; replace it by JgraphT
  3. Tidy up the existing graph layout algorithms
  4. Tidy up the example applications and add new examples
  5. Enhance the Javadoc
  6. Relaunch the website, including a better documentation

As you can see, I start with the parts which are more fun to me and end up with the (usually boring, but necessary) documentation.

Independence of the UI framework

The existing Hypergraph code consists of three parts: packages related to the graph structure, a package for the hyperbolic geometry and a package for the visualisation of the hyperbolic graphs and hyperbolic trees in a JPanel. At first sight, this sounds like a reasonable architecture, but if you look at the code in detail, you will find that this is not the case, here some examples:

  • The colour attribute for edges and vertices are stored using java.awt.Colour. If you use this graph in frameworks which don’t use java.awt.Colour, this causes a problem.
  • The projector which is responsible for projecting the hyperbolic plane to the screen coordinates depends on JComponent
  • The animator which is responsible for smoothly translating the hyperbolic plane uses javax.swing.Timer.

If the code shall be used by different UI frameworks, these dependencies have to be resolved, which is the first aim. Large parts are already done and committed to the SWT branch, but there is still something to do.

Replacement of the Graph API by JGraphT

When I started in 2002/2003, I decided to write my own graph API. By now, there are some established projects written in Java which offer the functionality I need and I decided to use JGraphT.

The advantage is that I don’t need to worry about the (correct) implementation of any algorithms or data structure – there is no need to reinvent the wheel.

There are disadvantes though: I have to rewrite any code that depend on the existing graph data structure, which is quite a lot by the nature of the project. Also there doesn’t seem to be any reader for GraphML or GraphXML files, so need to implement them.

Tidying up the layout algorithms

The most striking issue with the hyperbolic layout algorithms is the poor quality of the code. I have to confess that I can’t read the code at all…

But there are some other deficiencies: the layout algorithms are graph layout algorithms. Well, that sounds tautologic, but for example the force directed layout is actually a generic multi dimensional scaling algorithm which could also be applied to other data than graphs. It will be interesting to see whether hyperbolic geometry can also be used for other visualisation besides hyperbolic trees and graphs.

This will happen more or less at the same time as the introduction of JGraphT as this is related.

Tidying up the example applications

Similar to the layout algorithms, the code quality of the applications is quite poor – the programs are a mixture of examples and proof of concept. Even the main applet to show hyperbolic trees should be tidied up.

I haven’t thought yet about which examples and which actual applications I should include in the distribution. If you have any proposals or if you would like to have an example for a specific concept in Hypergraph, please add a comment.

Enhancement of  the Javadoc and relaunching the website

The need to enhance the Javadoc should be obvious to anybody who has worked with the code. The same applies to the website: it’s not complete at all.

Currently, it’s difficult or at least not comfortable to add new content to the website due to the way the html files are generated and deployed.

As I have also spent some time working with CouchDB, I might decide to use CouchDB as a storage and web server and create some mixture of blog and wiki for the new website.

Conclusion

As you can see, there is a lot to do and I hope to provide a new release soon which covers at least the first two issues. I will also try to write about how the issues are resolved, i.e. describe the new architecture of Hypergraph.

As a last remark, let me mention that I’m not very glad about the project’s name Hypergraph. When I created the project, I thought this is a nice combination of hyperbolic geometry and graphs. However, I completely ignored that hypergraph is actually a fixed notion in graph theory which can lead to confusion. I wonder whether there is some other name which hints at the main focus of the project: data visualisation using hyperbolic geometry. If you have any ideas, please add a comment.