Distributed Capture & Document Capture

Capture is only a part of the ECM universe, but a crucial part nonetheless. Once a document is captured into an Enterprise Content Management system, it must be stored, perhaps put into a workflow process, archived, and made available for retrieval. Retrieval is in many ways the main thrust of an ECM system (no point putting it in there if you can’t ever see it again); retrieval is dependent on the index values associated with the document, which brings us back to capture.

Capture is the process of getting documents (and their data) into the system. Distributed Capture is the mechanism by which documents from a variety of locations (near and far) enter the system. The easiest way to do this is to utilize the file system. When different offices (or locations — work from home, anyone?) of a company are on the same network, specific locations on the shared file system can be designated for various purposes. Different directories can be used to input different kinds of documents.
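As a rough illustration of the directory approach, here is a minimal Python sketch that infers a document type from the folder a file was dropped into. The folder paths and type names are made up for the example; a real deployment would define its own taxonomy.

```python
from pathlib import PurePosixPath

# Hypothetical mapping from drop folders on the shared file system
# to document types. These names are assumptions for illustration.
DROP_FOLDERS = {
    "/capture/invoices": "Invoice",
    "/capture/contracts": "Contract",
    "/capture/hr": "HR Record",
}


def classify(path: str) -> str:
    """Return the document type implied by the folder a file was dropped into."""
    parent = PurePosixPath(path).parent
    # Walk up the tree so files in sub-folders (say, one per office)
    # inherit the type of the nearest designated drop folder.
    for folder in (parent, *parent.parents):
        doc_type = DROP_FOLDERS.get(str(folder))
        if doc_type is not None:
            return doc_type
    return "Unclassified"
```

With this layout, a file saved under a per-office sub-folder of `/capture/invoices` would still be treated as an invoice, while anything outside the designated folders falls through to `"Unclassified"` for manual review.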

I thought we were going to be paperless by now

This type of taxonomy works okay for existing electronic documents (Word files, spreadsheets, PDFs, etc.), but what about hard copy, the seemingly ubiquitous paper that still exists in our so-called paperless office? Well, it needs to be scanned in. You want documents classified in a consistent manner, and the metadata (index values and other interesting info about the document) to be as accurate and as consistent as possible.

Consistency is key. When setting up a company-wide ECM system, it is a key success indicator that everybody follows the same set of procedures and guidelines for getting documents into the system. This can be accomplished by having a distributed capture system available.

The company I work for makes and sells a distributed capture system today. As we go through our roadmap discussions for where we want to take the product to solve customers’ future problems, we developers have to grapple with some fundamental issues, chiefly: what is the best technology to use as a platform?

It’s easy to imagine using the web to provide distributed document capture throughout your enterprise. You have centrally managed web servers. Everyone has a web browser on their computer (and cell phone, for that matter). In fact, anyone who’s ever attached a document using an HTML-based email program has already exercised the base technology necessary for a distributed capture system. One key advantage of Distributed Capture is that you get rid of paper at the source; take a moment to think about the implications of that. It’s okay, I’ll wait.

What else is needed…
There are two main improvements to make over simply uploading a document by way of a web page. One is the acquisition of the paper document; the other is the user experience and business process built into the hosting program. I’ll go into the physical acquisition in a later post, but the user experience of a distributed capture system has to provide two things to be successful: it must be Dead Simple to Use, and it must provide the functionality necessary to get good data into the system.

Our checking with users shows again and again that a single button is an attractive interface, with more functionality exposed as needed. One key question developers raise is which technology to build the interface in.

Technology | Pros | Cons
HTML | Standards compliant, supported by all browsers. | Primarily a static user interface. AJAX can add some Zing to the interface, but is problematical in certain situations (back-button, anybody?)
Flash | Ubiquitous; Flash player in something like 90% of all browsers. | Began life as an animation scripting language, although ActionScript 3.0 is more sophisticated. IDE support is poor. Hard to get my head wrapped around the timeline model.
Silverlight | Microsoft integration and toolset. Microsoft has an army of developers working on tools and technologies; big changes in how Microsoft handles internet computing are emerging. | Current market adoption is a little slow. Microsoft talks the big talk about cross-platform now, but has a history of embracing, extending, then co-opting technology (in my opinion)
JavaFX | Ubiquitous. Many very good VM’s out there. Java itself is well suited to backend, server-side development. | UI is not Java’s strong-suit; AWT ring a bell?
Platform Specific Code | Leverage native functionality, look and feel. | Lots of code bases to implement and maintain. Cross-platform toolkits and libraries tend to dumb-down the functionality to the lowest-common denominator.

I’m sure anybody reading this has ideas of their own about the pros and cons of the platforms listed out, and perhaps other ideas to add to the list. I welcome your comments.


Windows SharePoint Services Tips

If you are working with the WSS 3 SDK, you may have noticed that some of the method names are confusing.  The reason has something to do with the legacy naming from the old SharePoint Team Services, but you can avoid the confusion by keeping the following terminology in mind when you are reading the WSS documentation.

  • Site  = Site Collection
  • Web = Site
  • RootWeb =  Top Level Site

Below is some sample code that will hopefully clear this up.

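// Requires: using System; using Microsoft.SharePoint; using Microsoft.SharePoint.Administration;
SPWebService webService = SPWebService.ContentService;
SPWebApplicationCollection webAppColl = webService.WebApplications;

foreach (SPWebApplication webApp in webAppColl)
{
    Console.WriteLine("Web App Name = " + webApp.Name);
    SPSiteCollection siteColl = webApp.Sites;    // "Sites" = site collections

    foreach (SPSite site in siteColl)            // each SPSite is one site collection
    {
        try
        {
            SPWeb web = site.RootWeb;            // RootWeb = top-level site
            Console.WriteLine("Top Level Site Title = " + web.Title);
        }
        finally
        {
            // SPSite objects handed out by the collection should be disposed;
            // disposing the site also cleans up its RootWeb.
            site.Dispose();
        }
    }
}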

Hope this helps.

 

Phong Hoang

Development Manager

ImageSource, Inc.


One API To Rule Them All!

I have spent the last six years customizing, integrating and extending a menagerie of different ECM systems.   Each system has its own features, drawbacks, pitfalls and of course, APIs.  The APIs for the systems I work with come in all sorts of flavors: COM, .NET, Java and Web Services, to name a few.  Every once in a while I ask myself, would it be possible to write a single library that encompasses all of these ECM systems, or at least, the most common ones?

Let’s look at this in more detail. For starters, we need to define the common features found in most ECM systems.

  • Create content.
  • Search for content.
  • Get content item.
  • Update content item.
  • Update content metadata.
  • Delete content item.

Every system implements each of these features a bit differently. For instance, to create a new revision of a content item, some systems require an explicit API call to do so.  Other systems will automatically create a revision of a content item if its metadata values match an existing item.  These nuances need to be considered and fleshed out.
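To make the problem concrete, here is a minimal Python sketch of what a common interface over the six features might look like, with a toy in-memory backend that can mimic either revision behavior. Every name here (`ContentRepository`, `InMemoryRepository`, `auto_revision`) is hypothetical and does not come from any vendor API.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class ContentRepository(ABC):
    """Hypothetical common interface covering the six features listed above."""

    @abstractmethod
    def create(self, data: bytes, metadata: Dict[str, str]) -> str: ...

    @abstractmethod
    def search(self, criteria: Dict[str, str]) -> List[str]: ...

    @abstractmethod
    def get(self, item_id: str) -> bytes: ...

    @abstractmethod
    def update(self, item_id: str, data: bytes) -> None: ...

    @abstractmethod
    def update_metadata(self, item_id: str, metadata: Dict[str, str]) -> None: ...

    @abstractmethod
    def delete(self, item_id: str) -> None: ...


class InMemoryRepository(ContentRepository):
    """Toy backend. auto_revision=True imitates systems that create a
    revision when new content's metadata matches an existing item."""

    def __init__(self, auto_revision: bool = False) -> None:
        self.auto_revision = auto_revision
        self._items: Dict[str, dict] = {}
        self._next_id = 1

    def create(self, data: bytes, metadata: Dict[str, str]) -> str:
        if self.auto_revision:
            matches = self.search(metadata)
            if matches:
                # Matching metadata: store as a new revision of the first match.
                self._items[matches[0]]["revisions"].append(data)
                return matches[0]
        item_id = str(self._next_id)
        self._next_id += 1
        self._items[item_id] = {"metadata": dict(metadata), "revisions": [data]}
        return item_id

    def search(self, criteria: Dict[str, str]) -> List[str]:
        return [
            item_id
            for item_id, item in self._items.items()
            if all(item["metadata"].get(k) == v for k, v in criteria.items())
        ]

    def get(self, item_id: str) -> bytes:
        return self._items[item_id]["revisions"][-1]  # latest revision

    def update(self, item_id: str, data: bytes) -> None:
        # Explicit call: the style used by systems without auto-revisioning.
        self._items[item_id]["revisions"].append(data)

    def update_metadata(self, item_id: str, metadata: Dict[str, str]) -> None:
        self._items[item_id]["metadata"].update(metadata)

    def delete(self, item_id: str) -> None:
        del self._items[item_id]

    def revision_count(self, item_id: str) -> int:
        return len(self._items[item_id]["revisions"])
```

Even in this toy version, the same `create` call produces one item with two revisions on one backend and two separate items on the other, so callers cannot treat the backends as interchangeable without knowing which policy applies.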

What language/technology should this uber library be implemented in?  The most flexible approach would be web services. This way we can make calls into our library from just about any modern language.  However, web services have some drawbacks as well.  Uploading large (10MB+) files to a web service can be complicated.  What happens when you want to upload or download a 100MB or 1GB file?
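One common way around the large-file problem is to split the upload into fixed-size chunks and send one chunk per request. The sketch below shows the idea in Python under stated assumptions: `send_chunk` is a hypothetical stand-in for whatever web service call would actually receive a chunk, and the 4 MB chunk size is an arbitrary illustrative value.

```python
from typing import BinaryIO, Callable, Iterator, Tuple

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MB per request; an arbitrary choice


def iter_chunks(stream: BinaryIO, chunk_size: int = CHUNK_SIZE) -> Iterator[Tuple[int, bytes]]:
    """Yield (offset, chunk) pairs so the whole file is never held in memory."""
    offset = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield offset, chunk
        offset += len(chunk)


def upload_in_chunks(
    stream: BinaryIO,
    send_chunk: Callable[[int, bytes], None],
    chunk_size: int = CHUNK_SIZE,
) -> int:
    """Send the stream chunk by chunk and return the number of bytes sent.

    send_chunk(offset, data) is a hypothetical callback standing in for
    one web service request; the offset lets the server reassemble the
    file and lets a client resume after a failed request.
    """
    total = 0
    for offset, chunk in iter_chunks(stream, chunk_size):
        send_chunk(offset, chunk)
        total += len(chunk)
    return total
```

This keeps each request small, but it also means the library now has to deal with partial uploads, retries and reassembly on the server side, which is part of why large files complicate a web service design.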

How should our library deal with licensing?  Some ECM systems use per-seat licensing, while others use per-processor licensing.  If you are dealing with a system that uses per-seat licensing, you will need to manage your number of connections.  If you use up too many licenses, end users could find themselves locked out of the system!
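For the per-seat case, one possible approach is to cap the number of concurrent sessions the library will open. Here is a minimal Python sketch using a semaphore; `LicensePool` and its `max_seats` parameter are hypothetical names, and a real version would hand out an actual vendor connection rather than nothing.

```python
import threading
from contextlib import contextmanager


class LicensePool:
    """Caps concurrent ECM sessions at the number of seats set aside
    for the library, leaving the remaining seats free for end users."""

    def __init__(self, max_seats: int) -> None:
        self._seats = threading.BoundedSemaphore(max_seats)

    @contextmanager
    def session(self, timeout: float = 5.0):
        # Wait for a free seat rather than grabbing one more license.
        if not self._seats.acquire(timeout=timeout):
            raise TimeoutError("no ECM license seat became available")
        try:
            yield  # a real version would yield a vendor connection object here
        finally:
            # Always give the seat back, even if the caller's work failed.
            self._seats.release()
```

Callers would wrap each unit of work in `with pool.session():` so that a seat is held only as long as it is needed. This only limits what the library itself consumes; it cannot see seats taken by other clients of the same ECM system.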

At this point I have only scratched the surface of my original question.  There are many more considerations to take into account before deciding to write a single library to access all of your ECM systems.  The conclusion I have come to is that it is simply too complicated to wrap more than one ECM system’s API into a single library.

Tyson Magney
Senior Developer
ImageSource, Inc.
