Building Out Distributed Apps (Big Data)

Yesterday, I attended a webinar by O’Reilly on how to reduce the pain of building out distributed applications. The focus was on scalability, which makes sense, since this is why you would want to distribute your applications.

Apart from the host’s unfortunate resemblance to Little Lord Fauntleroy, there were some interesting observations to be made. To wit:

Engineers versus Ops

When there’s an issue affecting your customer in large systems, it is most likely an engineering issue, especially in emerging products. You need to staff up on Engineering talent for your projects at a much greater rate than Ops.

Data is not always relational

Data these days is more than OLAP stuff. Things being captured and crunched include data graphs, key-value pairs, etc. So, something non-SQL based might be called for as a datastore. Only a handful of SQL features are used in most large data projects. As the data sets get larger, SQL gets less useful.

Real-time versus Batch Processing

Something to consider: how is your data being created? In one-sy/two-sy fashion online, or in large grabs of data? The answer will affect your basic understructure.
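To make the difference concrete, here is a minimal Python sketch, assuming the work to be done is nothing fancier than keeping an average. The names StreamingMean and batch_mean are made up for illustration and were not part of the webinar. The real-time version updates its state as each record trickles in; the batch version waits for the whole grab of data and crunches it in one pass.

```python
from typing import Iterable


class StreamingMean:
    """Real-time style: update state as each record arrives."""

    def __init__(self) -> None:
        self.count = 0
        self.total = 0.0

    def add(self, value: float) -> None:
        # Only the running count and total are kept, never the full history.
        self.count += 1
        self.total += value

    @property
    def mean(self) -> float:
        return self.total / self.count if self.count else 0.0


def batch_mean(values: Iterable[float]) -> float:
    """Batch style: process an entire collected data set in one pass."""
    collected = list(values)
    return sum(collected) / len(collected) if collected else 0.0


if __name__ == "__main__":
    readings = [3.0, 5.0, 10.0]

    stream = StreamingMean()
    for reading in readings:
        stream.add(reading)  # an answer is available after every record

    print(stream.mean, batch_mean(readings))
```

Both arrive at the same answer; what differs is when the answer is available and how much data has to be held at once, which is exactly what drives the understructure decision.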

Cost of Research

It is very easy to under-estimate the cost of research when moving into a new area. Executive management wants hard numbers to be able to plan and manage costs, but anybody who’s developed new systems knows that costs tend to be unpredictable because you just don’t know what you don’t know yet.

What is your experience involving Big Data and Distributed Applications?

Distributed Capture & Document Capture

Capture is only a part of the ECM universe, but a crucial part nonetheless. Once a document is captured into an Enterprise Content Management system, it must be stored, perhaps put into a workflow process, archived, and made available for retrieval. Retrieval is in many ways the main thrust of an ECM system (no point putting it in there if you can’t ever see it again); retrieval is dependent on the index values associated with the document, which brings us back to capture.

Capture is the process of getting documents (and their data) into the system. Distributed Capture is the mechanism by which documents from a variety of locations (near and far) enter the system. The easiest way to do this is to utilize the file system. When different offices (or locations — work from home, anyone?) of a company are on the same network, specific locations on the shared file system can be designated for various purposes. Different directories can be used to input different kinds of documents.
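As a rough illustration of the directory-per-document-type idea, here is a small Python sketch. The folder names, the DOCUMENT_TYPES mapping, and the scan_drop_folders function are all hypothetical; a real capture product would also have to deal with file locking, retries, and moving files once they are processed.

```python
from pathlib import Path
from typing import Dict, List

# Hypothetical mapping of drop-folder names to document types.
DOCUMENT_TYPES: Dict[str, str] = {
    "invoices": "Invoice",
    "contracts": "Contract",
    "hr": "HR Record",
}


def scan_drop_folders(root: Path) -> List[Dict[str, str]]:
    """Return one capture record per file found in a known drop folder."""
    records: List[Dict[str, str]] = []
    for folder_name, doc_type in sorted(DOCUMENT_TYPES.items()):
        folder = root / folder_name
        if not folder.is_dir():
            continue  # this office has no folder for that document type
        for path in sorted(folder.iterdir()):
            if path.is_file():
                records.append(
                    {
                        "file_name": path.name,
                        "document_type": doc_type,  # implied by the folder
                        "source_path": str(path),
                    }
                )
    return records


if __name__ == "__main__":
    # Scans the current directory; point it at a shared drive in practice.
    for record in scan_drop_folders(Path(".")):
        print(record)
```

The point of the design is that the folder a document lands in supplies its first piece of metadata for free; files in folders the mapping does not know about are simply ignored.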

I thought we were going to be paperless by now

This type of taxonomy works okay for existing electronic documents (Word files, spreadsheets, PDFs, etc); but what about hard-copy? The seemingly ubiquitous paper which exists in our so-called paperless office? Well, it needs to be scanned in. You want documents classified in a consistent manner, and the metadata (index values and other interesting info about the document) as accurate and as consistent as possible.

Consistency is key. When setting up a company-wide ECM system, it is a key success indicator that everybody follows the same set of procedures and guidelines for getting documents into the system. This can be accomplished by having a distributed capture system available.

The company I work for makes and sells a distributed capture system today. As we go through our roadmap discussions for where we want to take the product to solve customers’ future problems, we developers have to grapple with some fundamental issues, mainly: what is the best technology to use as a platform?

It’s easy to imagine using the web to provide distributed document capture throughout your enterprise. You have centrally managed web servers. Everyone has a web browser on their computer (and cell phone, for that matter). In fact, anyone who’s ever attached a document using an HTML-based email program has already exercised the base technology necessary for a distributed capture system. One key advantage of Distributed Capture is that you get rid of paper at the source; take a moment to think about the implications of that. It’s okay, I’ll wait.

What else is needed…
There are two main improvements to make over simply uploading a document by way of a web page. One is the acquisition of the paper document; the other is the user experience and business process built into the hosting program. I’ll go into the physical acquisition in a later post, but the user experience of a distributed capture system has to provide two things to be successful. It must be Dead Simple to Use, and it must provide the functionality necessary to get good data into the system.

Our checking with users shows again and again that a single button is an attractive interface, with more functionality exposed as needed. One key question developers raise is what technology to build the interface in?

  • HTML
    Pros: Standards compliant, supported by all browsers.
    Cons: Primarily a static user interface. AJAX can add some Zing to the interface, but is problematical in certain situations (back-button, anybody?)
  • Flash
    Pros: Ubiquitous; Flash player in something like 90% of all browsers.
    Cons: Began life as an animation scripting language, although ActionScript 3.0 is more sophisticated. IDE support is poor. Hard to get my head wrapped around the timeline model.
  • Silverlight
    Pros: Microsoft integration and toolset. Microsoft has an army of developers working on tools and technologies; big changes in how Microsoft handles internet computing are emerging.
    Cons: Current market adoption is a little slow. Microsoft talks the big talk about cross-platform now, but has a history of embracing, extending, then co-opting technology (in my opinion)
  • JavaFX
    Pros: Ubiquitous. Many very good VM’s out there. Java itself is well suited to backend, server-side development.
    Cons: UI is not Java’s strong-suit; AWT ring a bell?
  • Platform Specific Code
    Pros: Leverage native functionality, look and feel.
    Cons: Lots of code bases to implement and maintain. Cross-platform toolkits and libraries tend to dumb-down the functionality to the lowest-common denominator.

I’m sure anybody reading this has ideas of their own about the pros and cons of the platforms listed out, and perhaps other ideas to add to the list. I welcome your comments.

Make It Dead Simple

The whole point of ECM (Enterprise Content Management) is to manage electronic content, meaning you have to have a way to put information in and to get it back out. You will also need a way to control (restrict or grant) access to the data. The data going into storage must be findable again.

The success of your ECM solution is predicated on the validity of the metadata which goes in with the content. Simply put, metadata describes the content you are storing in a way which allows you to find it again.

Back in the day, my fellow propeller-heads and I used to joke about Write-Only storage, meaning that data could be written to a disk but never read, which of course renders it useless. Just as useless as Write-Only storage is content which is unfindable or, just as bad, content which matches too many search criteria. Either way, it is hard to use.

When getting content into the system, it is imperative that good, solid metadata is entered into the system along with the data.

What’s the best way to get good metadata?

  • Automated capture
    Grabbing data directly off the content being inserted. This can be scanned-in images, using Capture Software. Today, this software is getting quite sophisticated and can read handwritten data as well as recognize printed text, specialized bar codes, and images.
    If data is being inserted in electronic form, such as through web services, there is likely already metadata associated with the content.
  • User Input
    Sometimes you must let the users enter the data; in this case, you must keep it dead simple to be effective.
    Keep the user interface simple: only ask for the data you actually need.
    Provide lookups for data to restrict the domain of possible results.
  • Validation
    This adds a separate step, but can greatly increase the accuracy of the input.
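The user-input and validation points above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names, the REQUIRED_FIELDS and LOOKUPS tables, and the validate_metadata function are hypothetical and not taken from any particular ECM product.

```python
from typing import Dict, List

# Hypothetical rules: ask only for the fields actually needed...
REQUIRED_FIELDS = ("document_type", "customer_id")

# ...and restrict the domain of values with lookups where possible.
LOOKUPS = {
    "document_type": {"Invoice", "Contract", "HR Record"},
}


def validate_metadata(metadata: Dict[str, str]) -> List[str]:
    """Return a list of problems; an empty list means the metadata is usable."""
    errors: List[str] = []

    # Every required field must be present and non-blank.
    for field in REQUIRED_FIELDS:
        if not metadata.get(field, "").strip():
            errors.append(f"missing required field: {field}")

    # Fields backed by a lookup must hold one of the allowed values.
    for field, allowed in LOOKUPS.items():
        value = metadata.get(field, "").strip()
        if value and value not in allowed:
            errors.append(f"invalid value for {field}: {value}")

    return errors


if __name__ == "__main__":
    print(validate_metadata({"document_type": "Invoice", "customer_id": "C-100"}))
    print(validate_metadata({"document_type": "Memo"}))
```

Returning every problem at once, rather than stopping at the first, lets the person doing the input fix the whole record in one go, which keeps the extra validation step from becoming a nuisance.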

Your suggestions?

 

Martin O. Waldron
Program Manager, SW Development
ImageSource, Inc.
