Thursday, February 21, 2008

DAG construction

After talking with Mike, I got the dag.k file. I then implemented the construction of a DAG workflow from all the jobs in a workflow and the dependencies between them. In other words, all this information is combined into one large Karajan workflow by using the dag element. Note that this work is done on the client side, not on the server side. Once the large workflow has been constructed, it can be submitted to the server just like a small job, and the id of the newly submitted workflow is returned. With this workflow id, a user can query the state of the workflow. In addition, the user can access the output files of the workflow with an ordinary HTTP GET request.

More details about DAG construction:
Assume that we have five jobs in a workflow: job1, job2, job3, job4, job5.
Their dependencies are:
job1 -> job2 (this means job2 depends on job1)
job1 -> job3
job3 -> job4

job3 -> job5
job1 -> job5
These dependencies are represented in the following graph:
[Figure: sample_workflow]
The constructed DAG workflow then looks like this:
<project>
<include file="cogkit.k"/>
<include file="dag.k"/>
<discard>
    <dag>
        <node>
            <string>job1</string>            <!-- the name of the job -->
            <element>
                <quotedlist/>
                content of job1
            </element>
            <edges>
                <string>job2</string>         <!-- job1 is a prerequisite of job2, job3 and job5 -->
                <string>job3</string>
                <string>job5</string>
            </edges>
        </node>
        <node>
            <string>job2</string>
            <element>
                <quotedlist/>
                content of job2
            </element>
        </node>
        <node>
            <string>job3</string>
            <element>
                <quotedlist/>
                content of job3
            </element>
            <edges>
                <string>job4</string>
                <string>job5</string>
            </edges>
        </node>
        <node>
            <string>job4</string>
            <element>
                <quotedlist/>
                content of job4
            </element>
        </node>
        <node>
            <string>job5</string>
            <element>
                <quotedlist/>
                content of job5
            </element>
        </node>
    </dag>
</discard>
</project>
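
As an illustration only, the client-side construction described above could be sketched in Java roughly as follows. The class DagWorkflowBuilder and its method names are hypothetical and are not taken from the actual client code; the sketch assumes that the content of each job is already a Karajan XML fragment that can be inserted verbatim, and it does no XML escaping of job names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper that assembles a <dag> workflow from jobs and dependencies. */
final class DagWorkflowBuilder {
    // job name -> content of the job, kept in insertion order
    private final Map<String, String> jobs = new LinkedHashMap<>();
    // job name -> names of the jobs that depend on it
    private final Map<String, List<String>> edges = new LinkedHashMap<>();

    DagWorkflowBuilder addJob(String name, String content) {
        jobs.put(name, content);
        return this;
    }

    /** Records "from -> to", meaning that job "to" depends on job "from". */
    DagWorkflowBuilder addDependency(String from, String to) {
        if (!jobs.containsKey(from) || !jobs.containsKey(to)) {
            throw new IllegalArgumentException("unknown job in " + from + " -> " + to);
        }
        edges.computeIfAbsent(from, k -> new ArrayList<>()).add(to);
        return this;
    }

    /** Emits one <node> per job; <edges> is written only for jobs that have dependents. */
    String build() {
        StringBuilder sb = new StringBuilder();
        sb.append("<project>\n");
        sb.append("<include file=\"cogkit.k\"/>\n");
        sb.append("<include file=\"dag.k\"/>\n");
        sb.append("<discard>\n    <dag>\n");
        for (Map.Entry<String, String> job : jobs.entrySet()) {
            sb.append("        <node>\n");
            sb.append("            <string>").append(job.getKey()).append("</string>\n");
            sb.append("            <element>\n                <quotedlist/>\n");
            sb.append("                ").append(job.getValue()).append('\n');
            sb.append("            </element>\n");
            List<String> dependents = edges.get(job.getKey());
            if (dependents != null) {
                sb.append("            <edges>\n");
                for (String d : dependents) {
                    sb.append("                <string>").append(d).append("</string>\n");
                }
                sb.append("            </edges>\n");
            }
            sb.append("        </node>\n");
        }
        sb.append("    </dag>\n</discard>\n</project>\n");
        return sb.toString();
    }
}
```

With this sketch, the sample workflow would be built by calling addJob once per job, then addDependency("job1", "job2") and so on for each dependency, and finally build() to get the text that is submitted to the server.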

Karajan Workflow Formats: .k and .xml

CoG Kit supports two formats for Karajan workflows: .k and .xml.
Personally, I prefer XML because it is widespread and open, and there are many handy tools that can process XML documents in various ways. Unfortunately, XML support in Karajan is not comprehensive. In my recent programming, I needed the dag element, which supports Directed Acyclic Graphs.
(1) Karajan does support DAGs, but it does not give users a dag.xml file to import; only dag.k is provided. So I needed a way to convert dag.k to dag.xml. Thanks to Mike and Gregor for helping me out of this struggle. The following command does the job:

cog-workflow -intermediate dag.k

Although the following error pops up, I still get the file I need (dag.xml).

Execution failed:
Variable not found: #channel#defs
        kernel:export @ dag.k, line: 44

Note: what is actually generated is dag.kml, not dag.xml. For now, I assume they are the same because I have no better choice.

(2) In the documentation I could find, the sample DAG workflows are written in the .k format.
This is the only resource I found helpful: http://wiki.cogkit.org/index.php/Java_CoG_Kit_Workflow_Guide#Direct_Acyclic_Graphs.
Once again, I used the cog-workflow command to convert them. But the generated .kml (XML) file is lengthy and not suitable for human readers, so I decided to convert it by hand. I found this useful link (http://wiki.cogkit.org/index.php/Java_CoG_Kit_Karajan_Workflow_Reference_Manual_4.1.5), which describes both formats in detail.

How to access results?

Now it is time to consider how users can access the results of their workflows easily and conveniently. There are several questions here:
(1) How to track the output files of Karajan workflows?

The first option is to analyze the content of the Karajan workflow to figure out the output files. For example, in the execute element, the stdout attribute gives the name of the output file.
<execute executable="/bin/date" stdout="thedate" host="gf1.ucs.indiana.edu" provider="GT2" redirect="false"/>
However, tracking all output files this way is difficult and time-consuming, because many different elements may generate output files and we would have to capture the possible output files of every one of them.
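
To make the first option concrete, here is a minimal sketch that collects the stdout attribute of every execute element in a workflow given in the XML format. The class name StdoutScanner is hypothetical. The sketch covers only the execute element shown above; any other element that produces files would need its own handling, which is exactly the drawback just described.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical scanner for output files named by <execute stdout="..."/>. */
final class StdoutScanner {
    static List<String> stdoutFiles(String workflowXml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(
                            workflowXml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("workflow is not well-formed XML", e);
        }
        // Look at every <execute> element, at any depth in the document.
        NodeList executes = doc.getElementsByTagName("execute");
        List<String> files = new ArrayList<>();
        for (int i = 0; i < executes.getLength(); i++) {
            Element execute = (Element) executes.item(i);
            if (execute.hasAttribute("stdout")) {
                files.add(execute.getAttribute("stdout"));
            }
        }
        return files;
    }
}
```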
The other option I can think of is a bit of a trick. Each newly submitted workflow is executed in a newly created directory. After execution, all files in that directory except the workflow file itself are output files. This is the method I use in my implementation.
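
A minimal sketch of the directory-based approach, under the assumption that the workflow file sits directly in the execution directory and that output files are not nested in subdirectories. The class name OutputCollector and the parameter names are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper: every regular file except the workflow file counts as output. */
final class OutputCollector {
    static List<String> outputFiles(Path workDir, String workflowFileName) {
        // Files.list is not recursive, so subdirectories are ignored in this sketch.
        try (Stream<Path> entries = Files.list(workDir)) {
            return entries
                    .filter(Files::isRegularFile)
                    .map(p -> p.getFileName().toString())
                    .filter(name -> !name.equals(workflowFileName))
                    .sorted()
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```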
(2) How to organize the output files?
Workflows can be categorized by different criteria, for example by the date on which they were submitted or the date on which they completed. I would like to use the workflow id and the user id instead. All workflows submitted by a user belong to the same group, which that user can access. Within the group, the workflow id identifies a specific workflow; every workflow id is unique among the workflows of one user.
So the directory layout may look like this:
users/user1/workflow_122/output_file1
users/user1/workflow_122/output_file2
users/user2/workflow_1/output_file1
...
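
The layout above can be captured in a small path helper. This is only a sketch: the class name OutputLayout, the root directory parameter, and the rule for which characters are allowed in a path segment are assumptions made for the example, not part of the actual implementation.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical mapping from (user id, workflow id, file name) to the layout above. */
final class OutputLayout {
    /** Returns root/userId/workflow_workflowId. */
    static Path workflowDir(Path root, String userId, String workflowId) {
        return root.resolve(segment(userId)).resolve("workflow_" + segment(workflowId));
    }

    /** Returns root/userId/workflow_workflowId/fileName. */
    static Path outputFile(Path root, String userId, String workflowId, String fileName) {
        return workflowDir(root, userId, workflowId).resolve(segment(fileName));
    }

    // Assumed rule: only letters, digits, '_', '-' and '.' are accepted, and "." and ".."
    // are rejected, so that a request cannot escape from the user's directory.
    private static String segment(String s) {
        if (s == null || !s.matches("[A-Za-z0-9_.-]+") || s.equals(".") || s.equals("..")) {
            throw new IllegalArgumentException("invalid path segment: " + s);
        }
        return s;
    }
}
```

For example, OutputLayout.outputFile(Paths.get("users"), "user1", "122", "output_file1") gives the first path of the listing above.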
(3) How can users access the output files?
After talking with Marlon, I would like to provide a RESTful interface through which users can retrieve output files. In my implementation, the URLs for accessing output files look like this:
http://domain:port/resources/user_name/workflow_id/ This retrieves the list of all output files of the corresponding workflow.
http://domain:port/resources/user_name/workflow_id/output_file This retrieves the specified output file directly.
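
On the client side, these URLs can be built and fetched with nothing more than the standard library. The following is a sketch under stated assumptions: the class name ResourceUrls and the baseUrl parameter (standing for http://domain:port) are hypothetical, and the format of the returned file list is not specified here, so the response body is simply read as text.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/** Hypothetical client helper for the two resource URLs shown above. */
final class ResourceUrls {
    /** URL that lists all output files of a workflow. */
    static URI fileList(String baseUrl, String userName, String workflowId) {
        return URI.create(trim(baseUrl) + "/resources/"
                + encode(userName) + "/" + encode(workflowId) + "/");
    }

    /** URL of one specific output file. */
    static URI file(String baseUrl, String userName, String workflowId, String outputFile) {
        return URI.create(fileList(baseUrl, userName, workflowId) + encode(outputFile));
    }

    /** Plain HTTP GET that returns the response body as UTF-8 text. */
    static String get(URI uri) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) uri.toURL().openConnection();
        conn.setRequestMethod("GET");
        try (InputStream in = conn.getInputStream();
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } finally {
            conn.disconnect();
        }
    }

    private static String trim(String baseUrl) {
        return baseUrl.endsWith("/") ? baseUrl.substring(0, baseUrl.length() - 1) : baseUrl;
    }

    // URLEncoder produces form encoding, so '+' is turned into %20 for use in a path.
    private static String encode(String segment) {
        try {
            return URLEncoder.encode(segment, "UTF-8").replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }
}
```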

Friday, February 15, 2008

RESTful web services in Java

Recently, I read some articles about RESTful web services, and I am looking for Java libraries with good support for REST. Axis2 supports REST, but the support is very limited. First, the documentation (http://ws.apache.org/axis2/1_3/rest-ws.html) of REST support on the official web site is horrible. It is so brief that it left me with more questions and confusion than it resolved. REST support in Axis2 relies on a new feature of WSDL 2.0, the HTTP binding. Here is a good article about it: http://www.ibm.com/developerworks/webservices/library/ws-rest1/. However, the HTTP binding does not enable programmers to implement a fully REST-style system. I did not dig deeper into WSDL 2.0 to learn more about its HTTP binding. Here (http://wso2.org/blog/footballsoccerpainting/949) is an article by someone else who complains about Axis2.
Then I found that the JCP has published a Java API specification for RESTful web services: JSR-311 (http://jcp.org/en/jsr/detail?id=311). That looks pretty good, because we now have a standard for using REST in Java. Jersey (https://jersey.dev.java.net/) is the reference implementation of the specification. Besides Jersey, Restlet (http://www.restlet.org/) is another implementation, which provides more features. Note that the specification itself is still in the beta phase.

I decided to try Jersey. The first step is to download and unpack it.
Sample code:

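// Written against the draft JSR-311 API of early 2008.
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.ProduceMime;
import javax.ws.rs.UriParam;
import javax.ws.rs.core.HttpContext;
import javax.ws.rs.core.UriInfo;

@Path("/")
public class RESTTest {
    // Injected by the runtime; gives access to the request URI.
    @HttpContext UriInfo uriInfo;

    // Handles GET requests on the root path.
    @GET
    @ProduceMime("text/plain")
    public String getUserAll() {
        return "You want to retrieve information about all users.";
    }

    // Sub-resource locator: requests under /{user} are delegated to UserResource (not shown).
    @Path("{user}")
    public UserResource getUserInfoAsText(@UriParam("user") String userid) {
        return new UserResource(userid);
    }
}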
Most readers will have noticed that Java annotations are used heavily. I find this approach handy and convenient.

Thursday, February 07, 2008

Workflow Queue Support

In the previous implementation, I separated jobs from workflows. A job is a small task which is currently written in the Karajan workflow language. This is somewhat confusing, because the Karajan workflow language is used here to represent a job. A workflow consists of a variable number of jobs, and a workflow queue consists of a number of workflows. The workflows in a workflow queue can be executed in arbitrary order, which means that they are totally independent of each other. The jobs in a workflow, however, are usually related and must be executed in a certain order. Essentially, there is no fundamental difference between a workflow queue and a workflow, because one can be converted into the other; the difference is at the logical level.
Workflow queues are now supported.
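
To illustrate what "a certain order" means for the jobs of one workflow, here is a small topological sort (Kahn's algorithm). It is only an illustration of the ordering constraint and says nothing about how Karajan itself schedules jobs. The class name JobOrder is hypothetical, and the sketch assumes that every job named in a dependency is also in the job list.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper that orders jobs so that prerequisites come first. */
final class JobOrder {
    /** edges maps a prerequisite job to the jobs that depend on it. */
    static List<String> executionOrder(List<String> jobs, Map<String, List<String>> edges) {
        // Number of unfinished prerequisites per job.
        Map<String, Integer> pending = new LinkedHashMap<>();
        for (String job : jobs) {
            pending.put(job, 0);
        }
        for (List<String> dependents : edges.values()) {
            for (String d : dependents) {
                pending.merge(d, 1, Integer::sum);
            }
        }
        // Jobs without prerequisites can start right away.
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : pending.entrySet()) {
            if (e.getValue() == 0) {
                ready.add(e.getKey());
            }
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String job = ready.remove();
            order.add(job);
            // Finishing a job releases each dependent whose last prerequisite it was.
            for (String d : edges.getOrDefault(job, Collections.<String>emptyList())) {
                if (pending.merge(d, -1, Integer::sum) == 0) {
                    ready.add(d);
                }
            }
        }
        if (order.size() != pending.size()) {
            throw new IllegalStateException("dependencies contain a cycle");
        }
        return order;
    }
}
```

The workflows of a workflow queue have no such edges between them, so any order is valid there.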

Workflow management panel
[Screenshot: wf_management]

When a new workflow is added, a new tab is created.
When an existing workflow is removed, the associated tab is removed from the user interface.
[Screenshot: wf_management2]

When a user creates a new job, he/she can add it to any existing workflow. Because we have created a workflow called "new_workflow1", this workflow appears in the drop-down box.
[Screenshot: wf_management3]

When a user switches to the workflow panel, all jobs belonging to the workflow are displayed on a canvas. The relationships between the jobs are displayed as well.
[Screenshot: wf_management4]

When a user wants to see the details of all jobs in a workflow, he/she can click the "click to see all jobs" button, and a pop-up window shows the detailed information.
[Screenshot: wf_management5]