Humphrey Sheil

Moving to humphreysheil.com

2015-06-05T14:48:00.000+01:00

After 11 (crikey!) years on Blogger, I'm moving to a new home. Hopefully this refresh will encourage more blog posting..

Named Entity Recognition - short tutorial and sample business application

2014-10-06T16:23:00.002+01:00

A latent theme is emerging quite quickly in mainstream business computing - the inclusion of Machine Learning to solve thorny problems in very specific problem domains. For me, Machine Learning is the use of any technique where system performance improves over time by the system either being trained or learning.

In this short article, I will quickly demonstrate how an off the shelf Machine Learning package can be used to add significant value to vanilla Java code for language parsing, recognition and entity extraction. In this example, adopting an advanced, yet easy to use, Natural Language Parser (NLP) combined with Named Entity Recognition (NER), provides a deeper, more semantic and more extensible understanding of natural text commonly encountered in a business application than any non-Machine Learning approach could hope to deliver.

Machine Learning is one of the oldest branches of Computer Science. From Rosenblatt's perceptron in 1957 (and even earlier), Machine Learning has grown up alongside other subdisciplines such as language design, compiler theory, databases and networking - the nuts and bolts that drive the web and most business systems today. But by and large, Machine Learning is not straightforward or clear-cut enough for a lot of developers and until recently, its application to business systems was seen as not strictly necessary. For example, we know that investment banks have put significant effort into applying neural networks to market prediction and portfolio risk management, and the efforts of Google and Facebook with deep learning (the third generation of neural networks) have been widely reported in the last three years, particularly for image and speech recognition. But mainstream business systems do not display the same adoption levels..

Aside: accuracy is important in business / real-world applications.. the picture below shows why you now have Siri / Google Now on your iOS or Android device. Until 2009 - 2010, accuracy had flat-lined for almost a decade, but the application of the next generation of artificial neural networks drove the error rates down to a usable level for millions of users (graph drawn from Yoshua Bengio's ML tutorial at KDD this year).

Dramatic reduction in error rate on Switchboard data set post introduction of deep learning techniques.

Luckily you don't need to build a deep neural net just to apply Machine Learning to your project! Instead, let's look at a task that many applications can and should handle better - mining unstructured text data to extract meaning and inference.

Natural language parsing is tricky. There are any number of seemingly easy sentences which demonstrate how much context we subconsciously process when we read. For example, what if someone comments on an invoice: "Partial invoice (€100,000, so roughly 40%) for the consignment C27655 we shipped on 15th August to London from the Make Believe Town depot. INV2345 is for the balance.. Customer contact (Sigourney) says they will pay this on the usual credit terms (30 days)."

Extracting tokens of interest from an arbitrary String is pretty easy. Just use a StringTokenizer, use space (" ") as the separator character and you're good to go.. But code like this has a high maintenance overhead, needs a lot of work to extend and is fundamentally only as good as the time you invest into it. Think about stemming, checking for ',','.',';' characters as token separators and a whole slew more of plumbing code hoves into view.
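To make the limits of that approach concrete, here is a minimal sketch of the space-delimited StringTokenizer idea applied to part of the sample comment. The class and method names are mine, purely for illustration. Notice that "depot." keeps its trailing full stop and that "Make Believe Town" comes back as three unrelated tokens, which is exactly the plumbing you would then have to write by hand.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class NaiveTokenizerDemo {

    // Splits on the space character only, as the "easy" approach suggests.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(text, " ");
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String comment = "for the consignment C27655 we shipped on 15th August "
                + "to London from the Make Believe Town depot.";
        // Punctuation stays glued to words and multi-word names are split apart.
        for (String token : tokenize(comment)) {
            System.out.println(token);
        }
    }
}
```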

How can Machine Learning help?

Natural Language Parsing (NLP) is a mature branch of Machine Learning. There are many NLP implementations available; the one I will use here is the CoreNLP / NER framework from the language research group at Stanford University. CoreNLP is underpinned by a robust theoretical framework, has a good API and reasonable documentation. It is slow to load though.. so make sure you use a Factory + Singleton pattern combo in your code to load it once and share the instance (it has been thread-safe since ~2012). An online demo of a 7-class (recognises seven different things or entities) trained model is available at http://nlp.stanford.edu:8080/ner/process where you can submit your own text and see how well the classifier / tagger does. Here's a screenshot of the default model on our sample sentence:

Output from a trained model without the use of a supplementing dictionary / gazette.
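On the Factory + Singleton advice above: one possible shape for it is a small lazy holder that builds the expensive object once and hands the same instance to every caller. This is only a sketch. PipelineHolder is a hypothetical name of mine, and the holder is generic so that it compiles without CoreNLP on the classpath. In the article's setting the supplier would be something like () -> new StanfordCoreNLP(props).

```java
import java.util.function.Supplier;

public final class PipelineHolder<T> {

    private final Supplier<T> factory;
    private volatile T instance;

    // The factory is only invoked on the first call to get().
    public PipelineHolder(Supplier<T> factory) {
        this.factory = factory;
    }

    // Double-checked locking: cheap volatile read on the hot path,
    // synchronisation only while the instance is being created.
    public T get() {
        T local = instance;
        if (local == null) {
            synchronized (this) {
                local = instance;
                if (local == null) {
                    local = factory.get();
                    instance = local;
                }
            }
        }
        return local;
    }
}
```

Sharing one instance like this only makes sense because the pipeline is safe to call from multiple threads; with a non-thread-safe object you would need a pool or per-thread instances instead.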

You will note that "Make Believe Town" is classified (incorrectly in this case) as an ORGANIZATION. Ok, so let's give this "out of the box" model a bit more knowledge about the geography our company uses to improve its accuracy. Note: I would have preferred to use the gazette feature in Stanford NER (I felt it was a more elegant solution), but as the documentation stated, gazette terms are not set in stone, behaviour that we require here.

So let's create a simple tab-delimited text file as follows:

Make Believe Town	LOCATION

(make sure you don't have any blank lines in this file - RegexNER really doesn't like them!)

Save this one line of text into a file named locations.txt and place it in a location available to your classloader at runtime. I have also assumed that you have installed the Stanford NLP models and required jar files into the same location.
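Since a stray blank line or a space where the tab should be is easy to miss in an editor, a quick sanity check of the file before handing it to RegexNER can save some head-scratching. The sketch below is my own helper (RegexNerMappingCheck is not part of CoreNLP) and only checks the two things mentioned above: no blank lines, and at least a phrase and a label separated by a tab.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class RegexNerMappingCheck {

    // Returns a list of human-readable problems; an empty list means the lines look fine.
    public static List<String> findProblems(List<String> lines) {
        List<String> problems = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i);
            int lineNo = i + 1;
            if (line.trim().isEmpty()) {
                problems.add("line " + lineNo + ": blank line");
                continue;
            }
            // Keep trailing empty fields so "phrase<TAB>" is caught as well.
            String[] fields = line.split("\t", -1);
            if (fields.length < 2 || fields[0].trim().isEmpty() || fields[1].trim().isEmpty()) {
                problems.add("line " + lineNo + ": expected <phrase><TAB><label>");
            }
        }
        return problems;
    }

    public static void main(String[] args) throws IOException {
        // Assumes locations.txt is in the working directory.
        List<String> problems = findProblems(Files.readAllLines(Paths.get("locations.txt")));
        if (problems.isEmpty()) {
            System.out.println("locations.txt looks OK");
        } else {
            problems.forEach(System.out::println);
        }
    }
}
```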

Now re-run the model, but this time asking CoreNLP to add the regexner to the pipeline.. You can do this by running the code below and changing the value of the useRegexner boolean flag to examine the accuracy with and without our small dictionary.

Hey presto! Our default 7-class model now has a better understanding of our unique geography, adding more value to this data mining tool for our company (check out the output below vs the screenshot from the default model above)..

Code

package phoenix;

import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

import org.junit.Test;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.ling.CoreAnnotations.NamedEntityTagAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TextAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

/**
* Some simple unit tests for the CoreNLP NER (http://nlp.stanford.edu/software/CRF-NER.shtml) short
* article.
*
* @author hsheil
*
*/
public class ArticleNlpRunner {

  private static final Logger LOG = LoggerFactory.getLogger(ArticleNlpRunner.class);

  @Test
  public void basic() {
    LOG.debug("Starting Stanford NLP");

    // creates a StanfordCoreNLP object with tokenization, sentence splitting, POS tagging,
    // lemmatization and NER, plus RegexNER when useRegexner is true
    Properties props = new Properties();
    boolean useRegexner = true;
    if (useRegexner) {
      props.put("annotators", "tokenize, ssplit, pos, lemma, ner, regexner");
      props.put("regexner.mapping", "locations.txt");
    } else {
      props.put("annotators", "tokenize, ssplit, pos, lemma, ner");
    }
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

    // We're interested in NER for these things (jt->loc->sal)
    String[] tests =
        {
          "Partial invoice (€100,000, so roughly 40%) for the consignment C27655 we shipped on 15th August to London from the Make Believe Town depot. INV2345 is for the balance.. Customer contact (Sigourney) says they will pay this on the usual credit terms (30 days)."
        };
    List<EmbeddedToken> tokens = new ArrayList<>();

    for (String s : tests) {

      // run all Annotators on the passed-in text
      Annotation document = new Annotation(s);
      pipeline.annotate(document);

      // these are all the sentences in this document
      // a CoreMap is essentially a Map that uses class objects as keys and has values with
      // custom types
      List<CoreMap> sentences = document.get(SentencesAnnotation.class);
      StringBuilder sb = new StringBuilder();

      //I don't know why I can't get this code out of the box from StanfordNLP, multi-token entities
      //are far more interesting and useful..
      //TODO make this code simpler..
      for (CoreMap sentence : sentences) {
        // traversing the words in the current sentence, "O" is a sensible default to initialise
        // tokens to since we're not interested in unclassified / unknown things..
        String prevNeToken = "O";
        String currNeToken = "O";
        boolean newToken = true;
        for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
          currNeToken = token.get(NamedEntityTagAnnotation.class);
          String word = token.get(TextAnnotation.class);
          // Strip out "O"s completely, makes code below easier to understand
          if (currNeToken.equals("O")) {
            // LOG.debug("Skipping '{}' classified as {}", word, currNeToken);
            if (!prevNeToken.equals("O") && (sb.length() > 0)) {
              handleEntity(prevNeToken, sb, tokens);
              newToken = true;
            }
            continue;
          }

          if (newToken) {
            prevNeToken = currNeToken;
            newToken = false;
            sb.append(word);
            continue;
          }

          if (currNeToken.equals(prevNeToken)) {
            sb.append(' ').append(word);
          } else {
            // We're done with the current entity - print it out and reset
            // TODO save this token into an appropriate ADT to return for useful processing..
            handleEntity(prevNeToken, sb, tokens);
            // the current word is the first word of the next entity, so keep it
            sb.append(word);
          }
          prevNeToken = currNeToken;
        }
        // flush an entity that runs right up to the end of the sentence
        if (sb.length() > 0) {
          handleEntity(prevNeToken, sb, tokens);
        }
      }

      //TODO - do some cool stuff with these tokens!
      LOG.debug("We extracted {} tokens of interest from the input text", tokens.size());
    }
  }

  private void handleEntity(String inKey, StringBuilder inSb, List<EmbeddedToken> inTokens) {
    LOG.debug("'{}' is a {}", inSb, inKey);
    inTokens.add(new EmbeddedToken(inKey, inSb.toString()));
    inSb.setLength(0);
  }

}
class EmbeddedToken {

  private final String name;
  private final String value;

  public EmbeddedToken(String name, String value) {
    this.name = name;
    this.value = value;
  }

  public String getName() {
    return name;
  }

  public String getValue() {
    return value;
  }
}

Output

16:01:15.465 [main] DEBUG phoenix.ArticleNlpRunner - Starting Stanford NLP

Adding annotator tokenize

TokenizerAnnotator: No tokenizer type provided. Defaulting to PTBTokenizer.

Adding annotator ssplit

edu.stanford.nlp.pipeline.AnnotatorImplementations:

Adding annotator pos

Reading POS tagger model from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [1.5 sec].

Adding annotator lemma

Adding annotator ner

Loading classifier from edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz ... done [6.6 sec].

Loading classifier from edu/stanford/nlp/models/ner/english.muc.7class.distsim.crf.ser.gz ... done [3.1 sec].

Loading classifier from edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz ... done [8.6 sec].

sutime.binder.1.

Initializing JollyDayHoliday for sutime with classpath:edu/stanford/nlp/models/sutime/jollyday/Holidays_sutime.xml

Reading TokensRegex rules from edu/stanford/nlp/models/sutime/defs.sutime.txt

Reading TokensRegex rules from edu/stanford/nlp/models/sutime/english.sutime.txt

Oct 06, 2014 4:01:37 PM edu.stanford.nlp.ling.tokensregex.CoreMapExpressionExtractor appendRules

INFO: Ignoring inactive rule: null

Oct 06, 2014 4:01:37 PM edu.stanford.nlp.ling.tokensregex.CoreMapExpressionExtractor appendRules

INFO: Ignoring inactive rule: temporal-composite-8:ranges

Reading TokensRegex rules from edu/stanford/nlp/models/sutime/english.holidays.sutime.txt

Adding annotator regexner

TokensRegexNERAnnotator regexner: Read 1 unique entries out of 1 from locations.txt, 0 TokensRegex patterns.

16:01:38.077 [main] DEBUG phoenix.ArticleNlpRunner - '$ 100,000' is a MONEY

16:01:38.080 [main] DEBUG phoenix.ArticleNlpRunner - '40 %' is a PERCENT

16:01:38.080 [main] DEBUG phoenix.ArticleNlpRunner - '15th August' is a DATE

16:01:38.080 [main] DEBUG phoenix.ArticleNlpRunner - 'London' is a LOCATION

16:01:38.080 [main] DEBUG phoenix.ArticleNlpRunner - 'Make Believe Town' is a LOCATION

16:01:38.080 [main] DEBUG phoenix.ArticleNlpRunner - 'Sigourney' is a PERSON

16:01:38.081 [main] DEBUG phoenix.ArticleNlpRunner - '30 days' is a DURATION

16:01:38.081 [main] DEBUG phoenix.ArticleNlpRunner - We extracted 7 tokens of interest from the input text

There are some caveats though - your dictionary needs to be carefully selected so that it does not override the better "natural" performance of Stanford NER using its Conditional Random Field (CRF)-inspired logic augmented with Gibbs Sampling. For example, if you have a customer company called Make Believe Town Limited (unlikely, but not impossible), then Stanford NER will mis-classify Make Believe Town Limited as the location Make Believe Town. However, with careful dictionary population and a good understanding of the target raw text corpus, this is still a very fruitful approach.
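One cheap way to catch clashes like the Make Believe Town Limited case before they reach production is to compare each dictionary phrase against the longer names you already know about (customers, suppliers and so on). The sketch below is a hypothetical helper of mine, not a CoreNLP feature. It flags a dictionary phrase whenever it appears as a run of whole words inside a longer known name.

```java
import java.util.ArrayList;
import java.util.List;

public class DictionaryOverlapCheck {

    // Reports "phrase -> name" for every dictionary phrase that sits inside a longer known name.
    public static List<String> findOverlaps(List<String> dictionaryPhrases, List<String> knownNames) {
        List<String> overlaps = new ArrayList<>();
        for (String phrase : dictionaryPhrases) {
            // Pad with spaces so only whole-word matches count ("Town" must not match "Townsend").
            String paddedPhrase = " " + phrase.toLowerCase() + " ";
            for (String name : knownNames) {
                String paddedName = " " + name.toLowerCase() + " ";
                if (!name.equalsIgnoreCase(phrase) && paddedName.contains(paddedPhrase)) {
                    overlaps.add(phrase + " -> " + name);
                }
            }
        }
        return overlaps;
    }
}
```

Anything this reports is a candidate for either dropping from the dictionary or adding the longer name as its own entry with the right label.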

Summary

In summary, a robust natural language parser with integrated Named Entity Recognition like the Stanford NLP libraries used here provides a strong base to build from for business applications needing more powerful text analysis, particularly in conjunction with approaches like gazettes that allow the overlay of business terms to improve the accuracy of the vanilla model.

SCEA study guide errata

2012-09-30T12:39:00.000+01:00

Thanks to Kristiyan Marinov who sent in this comment calling out three typos in the book. I figure this is interesting enough to other readers of the book currently working through the exam to re-publish the comment in full below along with my reply..

Original comment

Hi,

I recently read your book and find it an interesting and helpful read. I found a couple of mistakes during some of the test-yourself questions though.
Since I didn't manage to get a hold of you or Mark Cade any other way, I'll be posting my notes in this comment.

Typo 1: On page 81, question 6. The Answer says D but the explanation below it explains why C is the correct answer (which it is indeed).

Typo 2: On page 95, question 1. The Answer says C and D but the correct answers are B and C, as is given in the explanation.

Typo 3: On page 148, question 5. The Answer says B, C and F while the actual answers are B, D and F.

Thanks for the otherwise great book!

Kristiyan

My reply

Hi Kristiyan

Thanks for the comments. These are all typos that need to go into the errata for the book. I'll reach out to the publishers to see when the next run is and also create a blog post to call out these three for other readers. Thanks!

Humphrey

Fixing the Google Analytics API (v3) examples

2012-07-08T20:59:00.002+01:00

I'm currently working on consuming data from the Google Analytics API for one of the data sources we're using at Eysys.

So far, so normal. But what was strange was the poor state of the Google Analytics API documentation. I don't think I've ever seen one of their APIs be documented so poorly - missing source code, typos in example source code provided, a real rambling tone to the docs pointing off to different areas (a lot of this is to do with the OAuth 2.0 hoops you have to jump through before even starting to pull down the data you want to analyse).

I also couldn't believe how many dependencies the API has - I ended up with 29 jar files in my Eclipse project's lib folder! Surely this could all be a lot leaner, meaner and easier - it's just an API returning JSON data at the end of the day..

Anyway, if you're interested in getting up and running with the API examples, here's what to fix.

First of all, there is actually missing source code in the distro itself (or at least the one I used - google-api-services-analytics-v3-rev10-1.7.2-beta.zip), so you need to get LocalServerReceiver, OAuth2Native and VerificationCodeReceiver directly from the Google code repo.

LocalServerReceiver uses an old version of Jetty, so we need to migrate it to use the latest Jetty (I used 8.1.4.v20120524) which now has an org.eclipse.* package structure. So we need to update the imports as follows:

import org.eclipse.jetty.server.Connector;
import org.eclipse.jetty.server.Request;
import org.eclipse.jetty.server.Server;
import org.eclipse.jetty.server.handler.AbstractHandler;

Refreshing Jetty will also necessitate upgrading the handle(..) method to fit in with the new signature, as follows:

@Override
public void handle(String target, Request baseRequest,
    HttpServletRequest request, HttpServletResponse response)
    throws IOException, ServletException {
  if (!CALLBACK_PATH.equals(target)) {
    return;
  }
  writeLandingHtml(response);
  response.flushBuffer();
  // mark the request as handled on the Jetty Request that is passed in, rather than casting
  baseRequest.setHandled(true);
  String error = request.getParameter("error");
  if (error != null) {
    System.out.println("Authorization failed. Error=" + error);
    System.out.println("Quitting.");
    System.exit(1);
  }
  code = request.getParameter("code");
  synchronized (LocalServerReceiver.this) {
    LocalServerReceiver.this.notify();
  }
}

There's also a small mod to be made in getRedirectUri(), change this line:

server.addHandler(new CallbackHandler());

to this instead:

server.setHandler(new CallbackHandler());

The logic of this class also seems pretty flawed to me - generating a random port for the redirect URI that OAuth calls back to at the end of authentication every time it's run, which by definition you won't be able to put into the APIs console. So I commented out the getUnusedPort() method and simply hard-coded one.
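For what it's worth, the hard-coded version amounts to something like the sketch below. The host, port and callback path values here are placeholders of mine, not the values from the Google sample, so substitute whatever you registered in the APIs console.

```java
public class FixedRedirectUri {

    // Placeholder values for illustration only; they must match the redirect URI
    // registered in the APIs console.
    static final String HOST = "localhost";
    static final int PORT = 8080;
    static final String CALLBACK_PATH = "/Callback";

    // Builds the same redirect URI on every run, so it can be registered once.
    public static String redirectUri() {
        return "http://" + HOST + ":" + PORT + CALLBACK_PATH;
    }

    public static void main(String[] args) {
        System.out.println(redirectUri());
    }
}
```

The trade-off is that the chosen port has to be free on the machine running the code, which is presumably what the random port was trying to avoid.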

And after these mods, hey presto it works! :-)

Writing a new book about software development!

2012-05-11T14:56:00.000+01:00

Following a respectable interval of time after the launch of the official Enterprise Architect study guide (which was absolutely necessary to allow painful memories of the writing and editing process to fade :-) ), I've teamed up with my editor from that book - Greg Doench from Pearson on a new book about software development. I can't believe that Greg is signing up for round two with me, and am grateful to have him on board again!

The central premise of this new book stems from an observation that I have seen time and time again - a lot of smart people in business that I work with just don't get software at any kind of meaningful level - the coders who program it, their unique culture, the actual process of designing and writing software, and most importantly - why things (inevitably) go wrong and how to fix them when they do (and for "wrong", you can substitute any value of "late", "over-budget" or "doesn't do what it should" or all three that floats your boat).

This disconnect would be ok if it weren't for the fact that these same smart business people almost always end up in a position where software projects are a key part of what they need to achieve - they become customers, or key stakeholders. If they rise high enough, they become actual budget holders - then it gets interesting! Simply put, it's rapidly becoming a career-limiting move in business to say that "I'm not technical". And motivated business people who want to become conversant with their software projects are finding a gap in readable, digestible content that helps them to bridge their gap in understanding. That's where this book comes in.

The book structure itself is pretty new - while the chapters are designed to be read together (though not in a regimented order), Greg is encouraging me to write the chapters so that they will also read "standalone". There's a strong chance then that individual chapters will be available well before the book is scheduled to complete in Q2 of 2013.

What the book is:

* A guide to software development for people who are not technical by background and want to learn
* A map to navigate a software project by - regardless of programming language used or target application
* A guide that should stand the test of time - it's not about buzzwords, it's about the core building blocks that make up software projects

What the book isn't:

* An idiot's guide to software development - you will be stretched intellectually by the content we're planning to put into the book
* A technology-specific guide - I'll be consciously writing the book with a view to covering all technologies, and concrete examples will be provided across the spectrum of commonly-used programming languages

Time to start writing!

Awesome Java developers wanted!

2012-03-21T15:34:00.000+00:00

We're hiring! If you've got a hankering to work for a startup operating in stealth mode based in Cardiff, Wales, using the latest Java frameworks and techniques to build the coolest ecommerce platform around, read on!

ABOUT THE OPPORTUNITY

Once in a while, a chance comes up to be part of something special. We are looking for excellent developers to join our team and help build the next-generation global ecommerce platform incorporating big data mining and analysis, machine learning, cloud computing (EC2 and App Engine) and the latest advances in online commerce.

WHO WE ARE LOOKING FOR

* Our platform is primarily Java-based, so strong Java and OOD skills are an absolute must

* A willingness to pro-actively research and use new libraries and projects as needed to add new platform capabilities to complete our roadmap

* Experience with Linux and MySQL is also advantageous

* You will have a solid grounding in how web based server side applications and databases work

* Be comfortable working in a rapid iteration development cycle moving from prototype to production while engineering to a high level of quality, using leading automated testing techniques

* Enjoy / understand the importance of working in all layers of the platform architecture - UI, business logic and persistence

* Understand how to describe and design a system in terms of data structures and algorithms, in order to participate effectively in core design workshops

* Be totally committed to writing the most efficient, scalable and robust code possible, and to continuously improve your ability in this area

* Prior experience with Bayesian techniques and artificial neural networks is beneficial, but is not strictly necessary

* Prior experience with Hadoop and HBase is beneficial, but is not strictly necessary

* Most important of all.. where you don't know something, be happy and ready to roll your sleeves up and learn it!

ABOUT US

Our culture is to work hard using the latest and most relevant technologies and to have lots of fun while doing it! We believe passionately in building and delivering truly game-changing software to our customers. Our ideal candidates are self-starting, good communicators, love coding and work well in a team.

For more information and to submit a CV, please email careers@eysys.com.

To all recruitment agencies: eysys does not accept agency CVs. Please do not forward CVs to our jobs alias, eysys employees or any other company location. eysys is not responsible for any fees related to unsolicited CVs.

NoSQL / NewSQL / SQL - future-proofing your persistence architecture (part one)

2011-08-07T22:59:00.001+01:00

Although it's been a few years in the making, the noise / buzz around NoSQL has now reached fever pitch. Or to be more precise, the promise of something better / faster / cheaper / more scalable than standard RDBMSs has sucked in a lot of people (plus getting to use MapReduce in an application even if it's not needed is a temptation very hard to resist..). And pretty recently, the persistence hydra has grown another head - NewSQL. NewSQL adherents essentially believe that NoSQL is a design pig and that a better approach is to fix relational databases. In turn, NewSQL claims have been open to counter-claim on the constraints inherent in the NewSQL approach. It's all very fascinating (props for working Lady Gaga into a technical article as well..).

As it turns out, traditional RDBMSs are sometimes slow for valid reasons, and while you can certainly speed things up by relaxing constraints or optimising heavily for a specific use case, that's not a panacea or global solution to the problem of a generic, fast way to store and access structured data. On the other hand, the assertion that Oracle, MySQL and SQL Server have become fat and inefficient because of backwards compatibility requirements definitely strikes a chord with me personally.

The sheer variety of NoSQL candidates (this web page lists ~122!) is evidence that the space is still immature. I don't have a problem with that (every technology goes through the same cycle), but it does raise one nasty problem: what happens if you back a candidate now that has disappeared by 2015?

The current NoSQL marketplace demands a defensive architecture approach - it's reasonable to expect that over the next three years some promising current candidates will lose momentum and support, others will merge and still others will be bought up by a commercial RDBMS vendor, and become quite costly to license.

What we need is a good, implementation-independent abstraction layer to model the reading and writing from and to a NoSQL store. No hard coding of specific implementation details into multiple layers of your application - instead segregate that reading and writing code into a layer that is written with change in mind - we're talking about pluggable modules, sensible use of interfaces and design patterns to make the replacement of your current NoSQL squeeze as low-pain as possible if and when that replacement is ever needed.

If the future shows that the current trade-offs made in the NoSQL space (roughly summed up as a weaker take on A(tomicity), C(onsistency), I(solation) or D(urability), plus your own favourite blend of Brewer's CAP theorem) are rendered unnecessary by software and hardware advances (as is very likely to be the case), then the API should ideally insulate our application code from this change.
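To give a flavour of the kind of abstraction layer described above, here is a deliberately tiny sketch. It is not the part-two code, and every name in it (KeyValueStore, InMemoryStore, InvoiceNotes) is invented for illustration. The point is only that application classes depend on an interface, and the concrete store behind it is a pluggable module that can be swapped out.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

public class StoreAbstractionSketch {

    /** The only persistence type the rest of the application is allowed to see. */
    public interface KeyValueStore {
        void put(String key, String value);
        Optional<String> get(String key);
        void delete(String key);
    }

    /** Stand-in module; a real one would wrap a specific NoSQL client library. */
    public static class InMemoryStore implements KeyValueStore {
        private final Map<String, String> data = new ConcurrentHashMap<>();

        @Override
        public void put(String key, String value) {
            data.put(key, value);
        }

        @Override
        public Optional<String> get(String key) {
            return Optional.ofNullable(data.get(key));
        }

        @Override
        public void delete(String key) {
            data.remove(key);
        }
    }

    /** Application code: knows about the interface, never about a concrete store. */
    public static class InvoiceNotes {
        private final KeyValueStore store;

        public InvoiceNotes(KeyValueStore store) {
            this.store = store;
        }

        public void saveNote(String invoiceId, String note) {
            store.put("note:" + invoiceId, note);
        }

        public Optional<String> findNote(String invoiceId) {
            return store.get("note:" + invoiceId);
        }
    }
}
```

Replacing the store then means writing one new KeyValueStore implementation and changing the line that constructs it, rather than touching every class that reads or writes data.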

There are interesting moves afoot that demonstrate that the community is actively thinking about this, specifically the very recent announcement of UnQL (the NoSQL equivalent to SQL - i.e. a unified NoSQL Query Language). That's good, but UnQL is young enough to shrivel and die just like any of the NoSQL implementations themselves. Also, we know that what has inspired UnQL - SQL - is itself fragmented, with vendor-specific extensions like T-SQL from Microsoft and PL/SQL from Oracle.

So then, in part one of this two-parter, I've worked to justify what's coming in part two - a minimal set of Java classes and interfaces to provide a concrete implementation of the abstract ideas discussed above.

New Google Analytics location report doesn't like Connacht so much..

2011-07-31T19:27:00.000+01:00

The new UI for Google Analytics has a distinctly Cromwellian vibe to it, as the screenshot below shows. Is this just my GA account, or does everyone else see Galway and Sligo a bit more surrounded by the Atlantic than normal?

Umbraco on Azure - take 1.5 (not 2)!

2011-07-06T14:24:00.001+01:00

Back in August of last year I wrote a step-by-step article on how to get Umbraco running on Windows Azure (the Microsoft cloud computing platform). It got a lot of hits from people looking to do just exactly that.

There were a few loose ends in that piece, notably using VM-local rather than shared storage (shared storage is needed to allow for Umbraco clustering) and using the .NET 3.5 runtime rather than .NET 4.0 (4.0 was a recent addition to Azure in Aug 2010 and it just didn't work out of the box - missing sections in the machine.config).

So a follow-up article has been on my to-do list for a while now to tie up these loose ends, and then I found this - the Umbraco Accelerator for Windows Azure.

I have no idea how good / bad it is, but it's a great idea and well worth a road test if you're looking to use Umbraco in production with Azure.

It seems to be still active since its initial release in Oct 2010, with a point release put out there in mid-June.

It also appears to be part of a wider plan to standardise how ASP.NET applications can be moved to Azure in a standard way (the Windows Azure Accelerators project), again a good thing IMHO.

Let me know how it works for you.

Google IO 2011 Day Two recap

2011-05-14T02:38:00.000+01:00

Oh the perils of making predictions when there is still a conference keynote to go!

It turns out that Chrome OS and the associated hardware haven't been read the last rites after all. Rather, v1.0 is almost ready for primetime (scheduled for release in mid-June - about a month away). You have to imagine over time though that Google will want one code base for phones, tablets and chromebooks. At the very least, they will want to make it as easy as possible for developers to write their applications once and have them "just work" on devices with radically different screen sizes and input methods, something that Android developers today are already doing. Nonetheless, a very brave play, especially in targeting the enterprise space, where significant replacement costs exist. If it pays off, it will be huge.

Moving on from Chrome, a couple of sessions I attended yesterday were really interesting, specifically two - Full Text Search and Smart App Design.

Full Text Search is Google's take on Lucene / Solr and integrated into the App Engine Datastore as well, so it will be compelling for developers who just want to start indexing and scoring documents quickly. The "fully automatic" mode of operation with the Datastore should also be a timesaver.

Smart App Design covered material of a completely different color. I had already read about the Prediction API in the blogosphere but I hadn't realised exactly what it did until this session. Essentially, Google offers the discerning developer the ability to add machine learning techniques to their application by leveraging a cloud-based service.

At first glance, I had thought that the API gave access to the same model that Google uses to predict search terms, and I guess that is one use case. But Google has done much more than that - they have effectively white-labelled their machine learning technology and made it available to non-Google developers to use with their own data, i.e. learn what's important for their application / business.

As with all machine-learning techniques, the nub of the matter remains the correct selection and efficient representation of the key attributes in the training set, and that is quite simply a problem that requires deep domain knowledge. One announcement yesterday was quite interesting however, in that Google are now allowing good model authors to sell their models to others. So if I come up with a model that predicts shopping basket behavior on leisure travel websites and a tour operator used that to bump their online conversion rate by 33%, then that model has a lot of value and it's a win-win situation for the model author and the model user.

So an API with a lot of promise. But also with two potential flies in the ointment, one commercial and one cultural:

(a) Commercial - Google are trying to charge for use of the API from day one, this will stymie adoption in the earliest stage

(b) Cultural - an endemic problem with a lot of machine learning techniques is their black box nature. As someone who spent a fair bit of time working with artificial neural networks at university, I know that quite often a machine learning approach will yield the correct answer but the researcher can't exactly explain why! That's not a Google-specific weakness, but what is Google-specific is that the models you access via the Prediction API (the man behind the curtain if you will) are not made open at all, so can a company really invest time in building, training and using models that they don't really understand and can never hope to? Only time will tell.

So to recap then, Google IO was definitely worth attending this year - and not just for the hardware gifts! The main items on my research list post the event are:

1. Google Go running on App Engine

2. The Prediction API

3. Full Text Search enhancements / module for App Engine

4. Adding my own hooks and content into Google Maps and Street View to greatly enhance what the end user sees when they access Maps from my site

5. Fusion tables + Charting - a good / cheap way to rapidly slice and dice data and provide good interactive widgets to visualize same to end users.

Google IO 2011 (#io2011) - day one recap

2011-05-11T05:36:00.001+01:00

The official Google code site has the lowdown on all of the announcements that came thick and fast today (some 11 major items last time I checked and plenty of API revs and upgrades) and I won't replay them all here.

Specific announcements that interested me today:

Google Go is about to become an officially supported language on App Engine, alongside Python and Java (it's currently in "Trusted Tester" mode).

Rhetorical question: what value does a complete end-to-end technology stack with no overhanging IPR issues or blockers have to Google as a potential insurance policy in case the Oracle lawsuit does not go in their favor or is not settled reasonably? Two things I heard today convinced me that there is now serious engineering investment going into Go (as opposed to a small, talented team cranking things out as they work down the list):

(a) The aforementioned App Engine support (this won't have been trivial to implement - for one thing, Go is the first compiled language to run on App Engine)

(b) The info that a "comprehensive" Go library for ultimately all of the Google APIs is in development and will be with us "soon".

Go is a very nice language to write in, and the App Engine support announced today addresses one of the major gaps I identified when I took a look at Go when it was first released in Nov 2009.

Three final comments on day one:

1. Press articles I read in March / April this year about the +1 button being a make or break deal for Google to compete with Facebook seem overblown. The +1 button has merited just one session so far and apart from that you wouldn't even know Google had it. Either that or the memo didn't make it to the IO organisers in time.

2. It's instructive to watch Google see the mistake that companies like Sun Microsystems made and impressive to watch how they studiously avoid it. It's not enough to develop great code / software / hardware - you have to have people **using** it. Google's continued push into content ensures that usage. Google is not just the place you go to find content on the web, it's also where you consume that content (first YouTube, but now books, movies and music too). I'm glad Google don't have a social network offering in their portfolio of services - they would be simply too powerful if they did.

3. Google IO seems to be **all** about Android so far - it's absolutely everywhere you look and consumed the entire keynote this morning (Ice Cream Sandwich in Q4 that unifies tablet and phone, Futures (Android @ Home), open accessories etc.). Barring some crazy and unforeseen announcement tomorrow, I'd say Chrome OS has been given the last rites internally. But then again, who knows what day two will bring?

A vision for big data in leisure travel ecommerce

2011-03-06T20:10:00.000+00:00

[This is an article for people working in leisure travel technology / ecommerce online conversion who visit this blog, although many of the take-home points are transferable to other industry verticals.]

Data is big, and getting bigger. The more we track and log, the more storage is needed to warehouse it, and the more CPU horsepower is needed to mine it to answer questions posed by the business. As an aside, everyone is facing this issue and it's sink or swim, with the swimmers sure to get a competitive advantage over the sinkers. In this article, I'll examine the main data feeds that matter in leisure travel, and propose an architecture to collect, manage and mine them for business benefit. The end goal is to propose a vision, explaining why and how to collect data to better inform and drive business decisions that improve ecommerce performance.

But why now - hasn't this always been an issue? Yes, but now more than ever, leisure travel is poised on the cusp of another big game-changer. Companies like Google and Microsoft are clearly already focusing more on travel as a segment, and their data gathering and mining capabilities are considerable. But tour operators and online travel agencies (OTAs) have a significant competitive advantage over pure play technology companies as we'll see a little later.

Important data sources in leisure travel ecommerce

First, let's examine the primary data sources that affect leisure travel ecommerce. There are some obvious entries in the table that follows, and some less so.

ID	Name	Internal / External	Controllable	Purpose / Comment
1	Availability (internal)	Internal	Yes	Stock (internal, at-risk / committed inventory) available to sell, down to room type / meal plan / cabin and fare class
2	Pricing (internal)	Internal	Yes	Pricing for internal stock. Entire teams stay focused on this source, ensuring it is (a) competitive, and (b) profitable
3	Availability (external)	External	Yes	Stock available to sell contracted through third parties (usually not committed stock), down to room type / meal plan / cabin and fare class. Usually used to plug gaps in internal stock (resort coverage, star rating, price band etc.). Sources include GDSs, bed banks, car rental companies etc.
4	Pricing (external)	External	Yes	Pricing for third party stock.
5	Rich content (internal)	Internal	Yes	Provide compelling, unique, accurate text, images and video to convince the consumer to buy
6	Rich content (external)	External	Usually	Provide compelling, unique, accurate text, images and video to convince the consumer to buy. Needs to be differentiated otherwise your search engine ranking score will suffer due to duplicate content penalties.
7	Attributes	Both	Yes	Attributes (aka facets) are becoming increasingly important - star rating, price bands, family-friendly (has a creche, rooms are adjoining), "has a", "is a", "is close to" - attributes provide consumers with a more intelligent and targeted search capability
8	User generated content	External	No	Tripadvisor is the poster child here, but user generated content (UGC) can be in-house too - but it must be perceived as unbiased by the consumer, otherwise it becomes a negative.
9	Meta data	Both	Yes	Every business tags its own data - timestamps, version numbers, # revisions, author, approver, when last yielded. The more meta data you have the merrier - it often helps to tie disparate data sources together and enriches the overall data pool
10	Search, cost, book funnel	Internal	Yes	Traditionally the core of any ecommerce strategy - measures the complete search, cost and book journey. Needs to be fully instrumented to collect data so that A/B and multivariate testing can be used to fine-tune performance over time. Google Analytics does this very, very well.
11	Offline (shop) interactions	Internal	Yes	Few businesses try to tie shop activity back to online activity, but for a bricks and mortar plus clicks business, this is an opportunity missed
12	Online advertising (SEO)	Internal	Partially	SEO can be thought of as PPC you don't pay for! Critical to making cost of acquisition online as efficient as possible. Only partially controllable due to businesses being at the mercy of search engine scoring (which both Google and Microsoft (Bing) keep as a black box algorithm)
13	Online advertising (PPC)	Internal	Yes	Where Google makes its money!.. PPC has pride of place in every well-constructed ecommerce campaign, but the cost and effectiveness should be continuously monitored, challenged and tuned. CSV exports out of AdWords provide a good way to do this
14	Personalisation	Internal	Yes	Personalisation - both anonymous and known, is a great way to learn what kind of holiday / vacation people want to buy from you and how they want to find and buy it. Just don't try to build personalisation before you have (10) working well - personalisation needs a really solid foundation to work well..
15	Social media	External	No	The rising star that no-one really knows how to handle. The Facebook API contains a lot of potential for travel ecommerce
16	Offline / traditional advertising	External	Yes	The efficacy (or not) of ad spend must extend to traditional / offline as well as the more easily measurable online variant, otherwise you don't know where all of your marketing £s / $s / €s are going
17	Post-booking interactions	Internal	Yes	Not traditionally seen as an ecommerce data source, but savvy businesses are now looking at post-booking amendments, cancellation rates etc. to identify patterns that can feed back into the search experience
18	Customer Relationship Management (CRM)	Internal	Yes	Both pre and post travel - it's key to have a good view of what the customer experiences on holiday and feed that back into what holidays are sold going forward. Is that picture of the pool misleading - change it! If the service is great, promote it more!

Table 1. A proposed taxonomy of data sources that impact and influence leisure travel ecommerce.

Two important characteristics of data are whether you control it or not (and hence can change it if you need to) and whether it is sourced from an internal system or an external system (and thus how trustworthy / accurate the data is and whether it is unique to you or if other business entities can see it too). We have added these two characteristics to the table above for clarity.

What should be obvious to the reader is that a holistic picture of ecommerce performance requires multiple data sources, some of which traditionally would not be seen as impacting the effectiveness of a leisure travel ecommerce system. Gone are the days of simply looking at the web logs to see how effective (or leaky) the conversion funnel is! In fact, there are probably some sources that I've inadvertently omitted, and indeed as new systems come on stream, new sources will be added to this table / taxonomy.

Finally, it's interesting from a barrier to entry perspective to note that only the well-placed tour operator or OTA actually has the wherewithal and access to collate data from all of the sources noted in the table. Other new entrants simply do not have access to many of the sources listed. The data itself is now a valuable commodity (and is increasing in value), and an asset that leisure travel businesses would do well to guard jealously.

What we need - Systems and Data working together

At present, I contend that the average tour operator / OTA is collecting some, but not all of the data sources identified, and that no tour operator or OTA has yet constructed a system that provides a holistic, joined-up view of the data back to the business function to inform decision-making activities. Why not? Because it's not easy to do! The IT estate behind these data sources is fragmented (core res system, yielding system, multiple content management systems, external systems, separate booking repositories / agency management systems, Google Analytics, Google AdWords, Excel spreadsheets), is often owned by different companies and wasn't designed to provide the kind of view that is now needed. Ominously, new entrants into the space do not have a lot of the legacy baggage that incumbents do, meaning their velocity of implementation and ongoing change creates a hard-to-ignore imperative for all sellers of leisure travel to innovate quickly and learn from their data, or be left behind.

The technical challenge is four-fold:

1. Collection and storage - gather and store as much data as possible for each data source in the table, with that data being as clean and structured as possible (and in the real world, every data set will have some noise to it)

2. Build a holistic, joined-up data set - identify ways to link the data sources together - version number, unique keys, foreign keys, link backs, tagging etc. The more your data sources are joined up, the more holistic a view of the business you are building (and can provide back to the business). Conversely, disconnected data sets (data islands) are of much less value to the business and introduce the risk of an incomplete / inaccurate view of what's really happening now being used to influence what's going to happen next

3. Answering the questions - provide a mechanism to answer questions over this corpus of data in near real-time to allow the business to modify its behaviour and focus to maximise profits, yield and margin

4. Suggesting the questions - once the above three points have been implemented to a mature and repeatable level, the final logical step is for the data function to actually suggest areas of improvement and further exploration based on emergent patterns in the data, using techniques such as artificial neural network and self-organising maps (SOM) analysis
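Challenge 2 (joining the data up) is the one most easily shown in code. What follows is a minimal sketch, not a production design: it assumes every feed can be reduced to records sharing one key. The field name hotel_id, the sample records and both helper functions are hypothetical, invented purely for illustration.

```python
from collections import defaultdict


def join_on_key(key, *sources):
    """Merge records from several sources that share the same key field."""
    joined = defaultdict(dict)
    for source in sources:
        for record in source:
            joined[record[key]].update(record)
    return dict(joined)


def find_islands(key, *sources):
    """Return key values that are missing from at least one source."""
    key_sets = [{record[key] for record in source} for source in sources]
    return set.union(*key_sets) - set.intersection(*key_sets)


# Hypothetical sample feeds (sources 1, 2 and 10 in Table 1).
availability = [
    {"hotel_id": "H1", "rooms_left": 4},
    {"hotel_id": "H2", "rooms_left": 0},
]
pricing = [
    {"hotel_id": "H1", "price_gbp": 499},
    {"hotel_id": "H2", "price_gbp": 650},
]
funnel = [
    {"hotel_id": "H1", "searches": 1200, "bookings": 18},
]

holistic = join_on_key("hotel_id", availability, pricing, funnel)
islands = find_islands("hotel_id", availability, pricing, funnel)
```

Here holistic holds one merged record per hotel, while islands flags H2 as a hotel with no funnel data, i.e. exactly the kind of disconnected data set that gives the business an incomplete picture.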

Putting it all together - a suggested framework

There are many ways to construct a view over the data sources identified in the previous section. And in fact, multiple views are encouraged depending on the goal of the business. Here, however, a hybrid of time and business function is used to give a reasonable framework to hold the data. This framework is depicted in the following diagram.

Figure 1. High-level schematic of the big data system for leisure travel ecommerce.

A concrete implementation of the framework

The question naturally arises - how would this system be constructed, not just initially but also maintained and extended going forward?

Some natural candidates already exist, chief among them Cassandra and Hadoop. In the author's opinion, a hybrid architecture that couples Cassandra's data storage, innate simplicity and high availability with the MapReduce framework from Hadoop offers the best blend of performance, scalability, availability / resilience, querying and extensibility. A separate follow-on instalment to this article is warranted to provide a detailed technical treatise on the underpinnings of the system outlined here.
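For readers who have not met MapReduce before, the shape of a job can be sketched in a few lines. This is an in-memory illustration of the map / shuffle / reduce pattern only: it does not use the Hadoop or Cassandra APIs, and the log format ("event,destination") and function names are assumptions made up for the example.

```python
from collections import defaultdict


def map_phase(log_lines):
    """Emit a (destination, 1) pair for every search event in the log."""
    for line in log_lines:
        event, destination = line.split(",")
        if event == "search":
            yield destination, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


# Hypothetical funnel log: one "event,destination" entry per line.
log = ["search,Majorca", "book,Majorca", "search,Crete", "search,Majorca"]
searches_per_destination = reduce_phase(shuffle(map_phase(log)))
```

In a real deployment the map and reduce functions would run in parallel across many nodes, with the framework handling the shuffle. The point of the sketch is only that a business question ("which destinations are searched most?") decomposes naturally into those two functions.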

Conclusion

This article identified, named and classified the dominant data sources that impact the effectiveness of a leisure travel ecommerce strategy. Developing this classification further, a model was used to create a framework to house the data sources, and a concrete implementation was suggested.

About the author: Humphrey is the Chief Technology Officer for Comtec Group, a company that specializes in leisure travel technology.

JDK 7 preview and JEE 7 planning

2011-03-02T20:44:00.000+00:00

We got two interesting developments in Java land this week:

1. Oracle released the developer preview of the Java 7 Development Kit (JDK)

2. Oracle have started talking publicly about what JEE 7 (and beyond - JEE 8) will look like in Q3 2012 and Q4 2013.

(1) has been a long time coming and it's good to see the log jam moving. Simply shipping JDK 7 is good in its own right but it also means that the team will move onto working on JDK 8, which contains some key language features omitted from JDK 7 so that the team could JGIOTFD (Just Get It Out The (reader exercise to complete the acronym)).

(2) looks to be Oracle really making the JEE stack cloud-based / cloud-friendly by default rather than a technology stack that merely facilitates cloud computing. This dynamic should see Oracle formalising exactly what constitutes "JEE in the cloud" via a JSR and thus wresting that intellectual responsibility back from Google's App Engine platform, which is pretty much the de facto standard for "JEE in the cloud" at present.

Looking beyond JEE 7, JEE 8 looks to be embracing Big Data / NoSQL systems like Hadoop and Cassandra, although we can expect to have seen significant consolidation in this space by 2013, making the integration and platform support task easier to accomplish.

All in all, two nice moves, and good news for the Java ecosystem / economy. You might or might not like Oracle, but they are getting stuff out the door in a way that Sun kind of forgot how to do.

Oracle Certified Enterprise Architect - JEE 6 refresh update

2011-02-22T08:38:00.000+00:00

The JEE 6 SCEA exam / certification

Following on from my earlier post requesting input into my next post, here's an often-requested update: what's happening with the Oracle Certified Master, Java EE 5 Enterprise Architect exam / certification update to JEE 6 standard?

In a nutshell, here's where it is (covering each of the three parts in turn):

Parts two and three of the exam (the practical elements) will remain very similar to how they operate today - these elements test your ability to design and document (part two) a solution to a well-defined business problem using the JEE platform and then challenge you (part three) to self-critique and justify key design decisions taken, especially on how non-functional requirements will be adequately satisfied. Parts two and three are pretty much independent of the current JEE revision, because the candidate is given a good degree of latitude in how they use JEE to solve the problem. Were you to use J2EE 1.4 features let's say, then the examiner is going to question the logic of that decision closely, but that's about it. Writing Ruby code and then having it compile to Java bytecodes at runtime using JRuby is also not recommended (don't laugh, someone did ask..)!

Part one of the exam (the multiple-choice exam) **will** change for JEE 6 - it has to because part one is more tightly coupled to a specific JEE revision - currently JEE 5 (with ~5% of J2EE 1.4 content).

The last time we revised part one, ~ten architects got together in Broomfield, CO for a week to design and critique the corpus of questions used. After that, Sun Microsystems (as they were then), brought in some external testing folks to benchmark the exam and to critique the overall marking strategy we intended to employ. That was an intense week and overall a fairly involved process, because you want to write difficult, tricky questions that will challenge an architect but at the same time, be fair. Part one of the architect exam is also not allowed to test your ability to memorize APIs or specifications - that is the primary task for the lower certifications. You very quickly find that a lot of difficult / tricky questions in JEE revolve around the APIs and specifications!

I think with the benefit of hindsight we erred on the side of fairness over toughness. I think we'll look to toughen up the questions for JEE 6.

I don't expect Oracle to reconvene the team of architects to do this refresh - the last refresh of the exam was a major refresh whereas we would consider this refresh to be more minor. Therefore the time taken to update should be shorter. Once the part one refresh is scheduled in, I'll post again on this topic. For now, the JEE 5 architect exam remains the most current and up to date architect exam you can take.

What would you like to read about next?

2011-01-23T20:49:00.000+00:00

I've been pondering what next to write about and thought - why not ask the readers? So here's your chance!

Turns out that most people visiting / watching this blog fall into four camps (in no order of priority):

* Want to know more about Enterprise Java architecture / software architecture in general

* Want to know more about the Oracle Certified Enterprise Architect exam for the Java platform (I'm a co-author of the study guide for this exam as well as a co-lead assessor)

* Want to know more about .NET (especially running Umbraco on Windows Azure and / or MVC 3)

* Want to know more about ecommerce tracking (measuring, then improving online conversions)

At least, that's what the web tracking software gods say! There's a great mix of visitors too from all corners of the globe, but the next post will be in English I'm afraid.

So if there's a specific topic relating to the categories above that you'd really like to see covered, drop me a note at hsheilblog@gmail.com and I'll do my best to cover it - and may the best suggestion win!

Ecommerce: online conversion - simple model and toolset

2011-01-09T21:27:00.001+00:00

Readers of this blog can wax lyrical on how to build a great B2C ecommerce site - either in JEE or .NET. First we get the technology stack right, then frameworks using that technology stack, comprehensive functional and technical specs, testing plans, coding standards + reviews with daily scrum meetings, hardware / cloud estimation and then load / penetration testing - this is bread and butter to the software architect.

What a lot of software architects don't understand (or underestimate) is what needs to happen to their site after it goes live. After the go-live of a B2C ecommerce site, a whole other team (which is fairly non-technical) takes it over. This team is really exercised by and focused on three core goals:

1. Get qualified visitors to the site as cost-effectively as possible

2. Enable those visitors to find the product they want quickly and easily

3. Convert the visitor into a customer - convince them to buy on your site

These goals are completely measurable in monetary terms, and hence you will find senior management taking a serious interest in them as well.

I work in leisure travel, and there are some very specific nuances to achieving these goals in my industry sector (every industry sector will have its own nuances). But there is also a generic model to be found and some very useful (and free!) tools that you can use to put the model in place.

Turns out the model is pretty simple. Essentially it consists of three components:

1. Analytics - where we measure what's happening on our target site - how is the user interacting with the site and can we infer what they do and don't like based on measuring and studying those interactions

2. Hypothesis testing (aka A/B and / or multivariate testing) - Analytics will give us lots of data to generate ideas on how to improve interactions, therefore we need a mechanism to test out hypotheses in a semi-automated way (if I change X, I bet the conversion rate will increase by Y%)

3. Efficient prospect capture - we want the best native SEO score possible on all of the search engines and when we spend money on ad campaigns, we want the best return for that investment.

So that's the high-level model - it's pretty simple.
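To make component 2 a little more concrete, here is a minimal sketch of the arithmetic behind an A/B test: a standard two-proportion z-test on conversion rates. It is not how any particular testing tool works internally, and the visitor and conversion counts are invented for illustration.

```python
import math


def relative_uplift(conv_a, n_a, conv_b, n_b):
    """Relative change in conversion rate of variant B over control A."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    return (rate_b - rate_a) / rate_a


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rate."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate under the null hypothesis that A and B convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical test: control page A versus changed page B.
uplift = relative_uplift(200, 10_000, 260, 10_000)
z, p_value = two_proportion_z_test(200, 10_000, 260, 10_000)
```

With these made-up numbers the hypothesis "if I change X, the conversion rate will increase by Y%" comes out at an uplift of about 30%, and the small p-value suggests the difference is unlikely to be noise. With far fewer visitors per variant the same uplift would not be distinguishable from chance, which is why the testing needs to run long enough to gather a decent sample.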

Many companies (and especially Google) make an awful lot of money around online ecommerce. And that's where the "free!" I noted above comes in. It makes sense for Google to give away the tools enabling Analytics (1) and Hypothesis testing (2) for free, as they make so much revenue on selling ad campaigns in Efficient Prospect Capture (3). Unkind souls might claim that if you spend any kind of money with Google AdWords at all, then you're not really getting (1) or (2) for free, but you won't find a nefarious cheap shot like that on this blog.

Let's look at how we can implement the model then:

1. Analytics - use Google Analytics. Brian Clifton's book is an excellent treatise on the application, and the online training videos are of a high standard as well. It's well worth having a couple of developers on your team get Analytics certified to understand what the tool can do - it really is very powerful

2. Hypothesis (A/B, multivariate) testing - use Google Website Optimizer. There's less information about this tool, I guess because it's a bit simpler than Analytics, but a good overview is available. Being able to change content and see the impact on the fly is a key part of the model - that's why we use a CMS like Umbraco!

3. Efficient prospect capture - SEO, SEO and more SEO. The Art of SEO is a great read. My opinion here is that as long as you're doing a great job on your own SEO, you should begrudge a search engine every penny. By using tagging in conjunction with Google Analytics (make sure you associate your AdWords account with your Analytics account to get all this done for you automagically), you can continually check that your ROI on ad campaigns is worth the spend, and stop buying terms that don't make money.
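The "stop buying terms that don't make money" check in step 3 boils down to simple arithmetic per keyword. The sketch below assumes you have already exported cost and attributed revenue per keyword; the field names, the sample figures and the zero-ROI threshold are all hypothetical, and in practice you would probably want margin rather than raw revenue.

```python
def keyword_roi(rows):
    """Return {keyword: ROI}, where ROI = (revenue - cost) / cost."""
    return {
        row["keyword"]: (row["revenue"] - row["cost"]) / row["cost"]
        for row in rows
        if row["cost"] > 0  # skip unpaid terms to avoid dividing by zero
    }


def keywords_to_stop(rows, min_roi=0.0):
    """List keywords whose ROI falls below the chosen threshold."""
    roi = keyword_roi(rows)
    return sorted(keyword for keyword, value in roi.items() if value < min_roi)


# Hypothetical campaign data, e.g. assembled from an ad platform export.
campaign = [
    {"keyword": "cheap holidays", "cost": 500.0, "revenue": 350.0},
    {"keyword": "majorca villas", "cost": 200.0, "revenue": 900.0},
    {"keyword": "last minute cruise", "cost": 300.0, "revenue": 300.0},
]

losers = keywords_to_stop(campaign)
```

Here "cheap holidays" loses money and is flagged, "last minute cruise" only breaks even, and "majorca villas" clearly earns its keep. Raising min_roi lets you demand a minimum return before a term stays in the campaign.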

And that's pretty much it. A three-component generic model for online ecommerce, followed by the simplest (with zero cost) way to implement that model for your B2C site. I intimated that each industry sector has its own quirks and foibles above and beyond this base model, and I'll focus on the leisure travel industry in more detail in a future post or two. For now, enjoy!

What I do

2010-12-28T17:33:00.003+00:00

(I can be contacted by email or Twitter).

I design and build software solutions that address business needs in the simplest possible way. I'm comfortable operating at the nexus of technology and commerce - bridging the gap between the software / hardware teams and the business drivers and key stakeholders, right up to board level.

Currently I'm the Chief Technology Officer for Eysys (we're hiring by the way!). At Eysys we're using big data combined with machine learning to build a next generation ecommerce platform, with baked-in intelligence to optimise conversion and make efficient use of marketing spend.

In a previous life, I was the Head of Data Engineering and Infrastructure at the Thomas Cook Online Travel Agency, using Master Data Management and big data analysis to drive platform conversion and performance.

Previous to that (sheesh!), I was the Chief Technology Officer for Comtec Group - building end to end systems for clients in the leisure travel industry, primarily in the UK and US. I led the definition and construction of our travel suite, from fast loading of inventory (e.g. Hotel, Air, Transfers etc.) through GDS selection and with a particular focus on ecommerce. In the ecommerce world we helped our customers to measure and increase online conversion rates, optimize PPC spend, increase SEO scores and overall consumer engagement. We leveraged analytics, A/B with multivariate testing and personalization techniques, to name just a few tools and techniques in the kit bag.

Before Comtec I worked for a financial services company as a software architect and before that again I worked as a consultant for a well-known business and IT consulting company.

In 2000, I became an external examiner and subject matter expert for the Java Enterprise Architect accreditation from Sun Microsystems - now Oracle. I have presented at JavaOne and written numerous articles on many different aspects of software engineering. In 2010, I co-authored the definitive official study guide to the SCEA exam itself.

I am deeply rooted in Computer Science - I have a particular interest in distributed systems and hold a B.Sc (1998 - First Class Honours) and M.Sc (2002) in Computer Science from University College Dublin. My M.Sc. thesis focused on building a high-throughput grid-like compute engine using Java and Artificial Neural Networks to solve a well-known bioinformatics problem (protein secondary structure prediction).

Early Xmas Cloud presents from Microsoft, Google..

2010-12-03T00:06:00.000+00:00

Just about 48 hours apart, Microsoft and Google have released significant updates for their Azure and App Engine cloud offerings just in time for Christmas.

The 1.4.0 App Engine SDK addresses some long-criticised weaknesses, in particular not being able to keep an instance ready to rock and roll at all times plus the ability to execute long-running requests (> ten seconds). The ole App Engine has been getting a bit of a kicking recently in the blogosphere so this is a timely release (assuming the unplanned outages have been sorted out in parallel with this). There's nothing in the release notes about a more SQL-like persistence store like SQL Azure, so you still need to wrap your head around Google's Datastore and the pros and cons it gives you.

The 1.3 Azure SDK also addresses some weaknesses in Azure, in particular now allowing developers to actually RDP onto their Azure boxen in the cloud, a really big improvement on the current state of affairs (basically you get a headless box with non-straightforward access to log files via the Windows Azure Diagnostics service).

It's interesting how these SDK releases are solidifying the differences between these two cloud offerings - Google are zeroing in on providing a PaaS model, where you have to code in a supported programming language (currently either Java or Python - wonder when they will support Google Go?) against a locked-down set of APIs, whereas Microsoft are moving more towards an IaaS model where you do what you like cos it's more or less your box. Both approaches have their strengths and weaknesses, and the overall ecosystem is stronger for having both.

The curious case of Oracle, the JDK and plan B (aka the prune juice plan)

2010-09-27T10:39:00.002+01:00

Mark Reinhold (Chief Architect of the Java Platform Group at Oracle), posted a Plan A and B approach (just like a classic A/B ecommerce conversion test eh?!) for the JDK roadmap in advance of the annual Java love fest that is JavaOne in San Francisco last week. For me, this was the biggest item I was looking for - the time gap between JDK 6 and 7 has been ridiculous.

From his "Re-thinking JDK 7" post, the options proposed are:

<snip>

Plan A: JDK 7 (as currently defined) Mid 2012

Plan B: JDK 7 (minus Lambda, Jigsaw, and part of Coin) Mid 2011
JDK 8 (Lambda, Jigsaw, the rest of Coin, ++) Late 2012

</snip>

I am firmly in favour of the option eventually selected - option B. It's clear that the JDK has a huge feature log jam. Selecting option B is like giving the JDK release schedule a big dose of prune juice - you know something's gonna start moving.

So to understand what Plan B means for you as a Java architect, I suggest that it can be broken down into these four steps.

1. Read the negative comment to a further post by Mark announcing the decision - this comment represents why you would be unhappy with Plan B. I reproduce it here for the lazy reader (not you, the other guy):

"Hi Mark,

To me, "JDK 7 minus Lambda, Jigsaw and part of Coin" doesn't sound much like "Getting Java moving again" :-(

This schedule is very disappointing.

Posted by Cedric on September 08, 2010 at 10:06 AM PDT"

2. Read the response to the negative comment to understand what Plan B entails. Again, reproduced here:

"JDK 7 - (Lambda + Jigsaw + part of Coin) = Most of Coin + NIO.2 (JSR 203) +
InvokeDynamic (JSR 292) + "JSR 166y" (fork/join, etc.) + most everything else
on the current feature list (http://openjdk.java.net/projects/jdk7/features/) +
possibly a few additional features TBD.

Posted by Mark Reinhold on September 08, 2010 at 10:26 AM PDT"

The TBD bit is a tad ambiguous - let's ignore it by assuming nothing major is going to get in now, given the sheer volume of regression and platform testing needed before a JDK hits gold / GA status.

3. So now you know Project Coin is the biggie for JDK 7 - therefore you need to download the presentation from this year's JavaOne session on Coin (119 slides, but a lot of these are just slides bitching about how hard it is to do, seminal slides are 10 and 23 - 66). Try-with-resources (Automatic Resource Management) looks great - equivalent to C#'s using keyword. Enhanced exception handling will enable better code as well.

4. [Optional, for the dedicated reader] Some more light bedtime reading - follow the links from the JDK 7 roadmap, especially for Project Lambda (closures) and Jigsaw (modular Java). This will then get JDK 8 on your forward-looking radar.

Now the **real** question is what will JEE 7 look like?!

Umbraco CMS - complete install on Windows Azure (the Microsoft cloud)

2010-08-29T14:50:00.015+01:00

We use the Umbraco CMS a lot at work - it's widely regarded as one of (if not the) best CMSs out there in the .NET world. We've also done quite a bit of R&D work on Microsoft's Azure cloud offering and this blog post shares a bit of that knowledge (all of the other guides out there appear to focus on getting the Umbraco database running on SQL Azure, but not how to get the Umbraco server-side application itself up and running on Azure). The cool thing is that Umbraco comes up quite nicely on Azure, with only config changes needed (no code changes).

So, first let's review the toolset / platforms I used:

* Umbraco 4.5.2, built for .NET 3.5

* Latest Windows Azure Guest OS (1.5 - Release 201006-01)

* Visual Studio 2010 Professional

* Azure SDK 1.2

* SQL Express 2008 Management Studio

* .NET 3.5 sp1

Step one is simply to get Umbraco running happily in VS 2010 as a regular ASP.NET project. The steps to achieve this are well documented here. Test your work by firing up Umbraco locally, accessing the admin console and generating a bit of content (XSLTs / Macros / Documents etc.) before progressing further. (The key to working efficiently with Azure is to always have a working case to fall back on, instead of wondering what bit of your project is not cloud-friendly).

Then use these steps to make your Umbraco project "Azure-aware". Again, test your installation by deploying to the Azure Dev Compute and Storage Fabric on your local machine and checking that Umbraco works as it should before going to production. The Azure Dev environment is by no means perfect (see below), nor is it a true stand-in for Azure Production, but it's a good check nonetheless.

Now we need to use the SQL Azure Migration Wizard tool to migrate the Umbraco SQL Express database. I used v3.3.6 (which worked fine with SQL Express, contrary to some of the comments on the site) to convert the Umbraco database to its SQL Azure equivalent - the only thing the migration tool has to change is to add a clustered index on one of the tables (dbo.umbracoUserLogins) as follows - everything else migrates over to SQL Azure easily:

CREATE CLUSTERED INDEX [ci_azure_fixup_dbo_umbracoUserLogins] ON [dbo].[umbracoUserLogins]
(
    [userID]
) WITH (IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF)

Then create a new database in SQL Azure and re-play the script generated by AzureMW into it to create the db schema and standing data that Umbraco expects. To connect to it, you'll replace a line like this in the Umbraco web.config:

with a line like this:

<add key="umbracoDbDSN" value="server=tcp:<<youraccountname>>.database.windows.net;database=umbraco;user id=<<youruser>>@<<youraccount>>;password=<<yourpassword>>" />

So we now have the Umbraco database running in SQL Azure, and the Umbraco codebase itself wrapped using an Azure WebRole and deployed to Azure as a package. If we do this using the Visual Studio tool set, we get:

19:27:18 - Preparing...

19:27:19 - Connecting...

19:27:19 - Uploading...

19:29:48 - Creating...

19:31:12 - Starting...

19:31:52 - Initializing...

19:31:52 - Instance 0 of role umbraco452_net35 is initializing

19:38:35 - Instance 0 of role umbraco452_net35 is busy

19:40:15 - Instance 0 of role umbraco452_net35 is ready

19:40:16 - Complete.

Note the total time taken - Azure is deploying a new VM image for you when it does this, not just deploying a web app to IIS, so the time taken is always ~13 minutes, give or take. I wish it was quicker..

Final comments

If you deploy and it takes longer than ~13 minutes, then double check the common Azure gotchas. In my experience they are:

1. Missing assemblies in production - so your project runs fine on the Dev Fabric and just hangs in Production on deploy - for Umbraco you need to make sure that Copy Local is set to true for cms.dll, businesslogic.dll and of course umbraco.dll so that they get packaged up.

2. Forgetting to change the default value of DiagnosticsConnectionString in ServiceConfiguration.cscfg (by default it wants to persist to local storage, which is inaccessible in production). You'll need to use an Azure storage service and update the connection string to match, e.g. your ServiceConfiguration.cscfg should look something like this:

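<?xml version="1.0"?>
<ServiceConfiguration serviceName="UmbracoCloudService" xmlns="http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceConfiguration">
  <Role name="umbraco452_net35">
    <Instances count="1" />
    <ConfigurationSettings>
      <Setting name="DiagnosticsConnectionString" value="DefaultEndpointsProtocol=https;AccountName=<<yourstorageaccount>>;AccountKey=<<yourstoragekey>>" />
    </ConfigurationSettings>
  </Role>
</ServiceConfiguration>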

You also need to run Umbraco in full-trust mode, otherwise you will get a security exception when Umbraco tries to read files that are not inside its own "local store" as defined by the .NET CAS (Code Access Security) sub system running on the production Azure VM. In other words, you need the enableNativeCodeExecution property set to true in your ServiceDefinition.csdef like so:

<?xml version="1.0" encoding="utf-8"?>
<ServiceDefinition name="UmbracoCloudService" xmlns="http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceDefinition">
  <WebRole name="umbraco452_net35" enableNativeCodeExecution="true">
    <InputEndpoints>
      <InputEndpoint name="HttpIn" protocol="http" port="80" />
    </InputEndpoints>
    <ConfigurationSettings>
      <Setting name="DiagnosticsConnectionString" />
    </ConfigurationSettings>
  </WebRole>
</ServiceDefinition>

The Azure development tools (Fabric etc.) are quite immature in my opinion - very slow to start up (circa one minute) and they simply crash when you've done something wrong rather than give a meaningful error message and then exit. For example, when trying to access a local SQL Server Express database (which is wrong - fair enough), the loadbalancer simply crashed with a System.Net.Sockets.SocketException{"An existing connection was forcibly closed by the remote host"}. I have the same criticism of the Azure production system - do a search to see how many people spin their wheels waiting for their roles to deploy with no feedback as to what is going / has gone wrong. Azure badly needs more dev-friendly logging output.

I couldn't get the .NET 4.0 build of Umbraco to work (and it should, .NET 4.0 is now supported on Azure). The problem appears to lie in missing sections in the machine.config file on my Azure machine that I haven't had the time or inclination to dig into yet.

You'll also find that the following directories do not get packaged up into your Azure deployment package by default: xslt, css, scripts, masterpages. To get around this quickly, I just put an empty file in each directory to force their inclusion in the build. If these directories are missing, you will be unable to create content in Umbraco.
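If you'd rather script that workaround than do it by hand, a minimal sketch might look like the following. It is written in Java purely for illustration (any scripting language would do); the placeholder file name and the class name are my own invention, and I'm assuming the four directories sit directly under the web project root. You will probably still need to include the new files in the Visual Studio project so that they end up in the package.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

class PlaceholderSketch {

    // The directories named in the post.
    static final String[] DIRECTORIES = {"xslt", "css", "scripts", "masterpages"};

    // Hypothetical file name - any empty file should do.
    static final String PLACEHOLDER_NAME = "placeholder.txt";

    // Ensures each directory exists and contains an empty placeholder file.
    // Returns only the placeholder files that were newly created.
    static List<Path> addPlaceholders(Path siteRoot) {
        List<Path> created = new ArrayList<>();
        try {
            for (String name : DIRECTORIES) {
                Path dir = Files.createDirectories(siteRoot.resolve(name));
                Path placeholder = dir.resolve(PLACEHOLDER_NAME);
                if (Files.notExists(placeholder)) {
                    Files.createFile(placeholder);
                    created.add(placeholder);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return created;
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("usage: PlaceholderSketch <path-to-web-project-root>");
            return;
        }
        for (Path p : addPlaceholders(Paths.get(args[0]))) {
            System.out.println("created " + p);
        }
    }
}
```

Running it twice is harmless: the second run finds the placeholders already in place and creates nothing.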

Exercises for the reader

* Convert the default InProc session state used by Umbraco to SQLServer mode (otherwise you will have a problem once you scale out beyond one instance on Azure). Starting point is this article - http://blogs.msdn.com/b/sqlazure/archive/2010/08/04/10046103.aspx, but google for errata to the script - the original script supplied does not work out of the box.

* Use an Azure XDrive or similar to store content in one place and cluster Umbraco.

Using Ninject as your Dependency Injection container in ASP.NET MVC 3

2010-08-18T22:36:00.004+01:00

MVC 3 Preview 1 has been available for a few weeks now from Microsoft, with Preview 2 scheduled for release sometime next month.

As a web development framework, MVC 3 is pretty cool - simple to set up and start using, with a terse, clean syntax courtesy of the new Razor view engine. Coupled with Entity Framework 4 (supporting both code-first generation of database schemas and wrapping existing database schemas), MVC 3 + EF 4 has the makings of a very good web development stack.

If you're interested in using Ninject as the Dependency Injection (DI) container in MVC 3, then you'll find the code below interesting - I couldn't find this anywhere else on the web so ended up writing it. It's the required implementation of the System.Web.Mvc.IMvcServiceLocator that gets instantiated and used in the Application_Start method in Global.asax.cs.

Using DI with MVC 3 makes a lot of sense - we use it to decouple concrete implementations from the interface that we code against so that we can quickly swap in alternate implementations, e.g. a quick, self-contained in-memory database for unit testing using Moq or similar.
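Stripped of any container, the idea looks something like the sketch below. It is plain Java rather than C# and deliberately avoids Ninject, so treat it as an illustration of the principle only; SpecialRepository, InMemorySpecialRepository and SpecialsController are hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// The interface that the rest of the application codes against.
interface SpecialRepository {
    List<String> findAll();

    void add(String name);
}

// A self-contained in-memory implementation, handy for unit tests.
final class InMemorySpecialRepository implements SpecialRepository {
    private final List<String> items = new ArrayList<>();

    @Override
    public List<String> findAll() {
        return Collections.unmodifiableList(new ArrayList<>(items));
    }

    @Override
    public void add(String name) {
        items.add(name);
    }
}

// The consumer receives its dependency through the constructor and never
// names a concrete repository class.
final class SpecialsController {
    private final SpecialRepository repository;

    SpecialsController(SpecialRepository repository) {
        this.repository = repository;
    }

    String index() {
        return "Specials: " + String.join(", ", repository.findAll());
    }
}

class DiSketch {
    public static void main(String[] args) {
        // A DI container would normally make this choice; here it is wired by hand.
        SpecialRepository repository = new InMemorySpecialRepository();
        repository.add("Monday lunch");
        repository.add("Friday dinner");
        System.out.println(new SpecialsController(repository).index());
    }
}
```

Swapping in a database-backed implementation means changing one line of wiring; the controller itself does not change. A container just moves that wiring into one central place.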

This link from Brad Wilson shows how to set up Microsoft Unity as the dependency injection container and this presentation from Phil Haack gives a fleeting, tantalising glimpse of how the Ninject equivalent might look but there's nowhere to get the complete code you need to get it working!

So I put the two together in order to use Ninject as my DI container. Here's the code (with zero comments as per my normal coding standard):

using System;
using System.Collections.Generic;
using System.Web.Mvc;
using Ninject;

namespace AdminApp.Models
{
    public class NinjectMvcServiceLocator : IMvcServiceLocator
    {
        public IKernel Kernel { get; private set; }

        public NinjectMvcServiceLocator(IKernel kernel)
        {
            Kernel = kernel;
        }

        public object GetService(Type serviceType)
        {
            try
            {
                return Kernel.Get(serviceType);
            }
            catch (Ninject.ActivationException e)
            {
                throw new System.Web.Mvc.ActivationException("PAK", e);
            }
        }

        public IEnumerable<TService> GetAllInstances<TService>()
        {
            try
            {
                return Kernel.GetAll<TService>();
            }
            catch (Ninject.ActivationException e)
            {
                throw new System.Web.Mvc.ActivationException("PAK", e);
            }
        }

        public IEnumerable<object> GetAllInstances(Type serviceType)
        {
            try
            {
                return Kernel.GetAll(serviceType);
            }
            catch (Ninject.ActivationException e)
            {
                throw new System.Web.Mvc.ActivationException("PAK", e);
            }
        }

        public TService GetInstance<TService>()
        {
            try
            {
                return Kernel.Get<TService>();
            }
            catch (Ninject.ActivationException e)
            {
                throw new System.Web.Mvc.ActivationException("PAK", e);
            }
        }

        public TService GetInstance<TService>(string key)
        {
            try
            {
                return Kernel.Get<TService>(key);
            }
            catch (Ninject.ActivationException e)
            {
                throw new System.Web.Mvc.ActivationException("PAK", e);
            }
        }

        public object GetInstance(Type serviceType)
        {
            try
            {
                return Kernel.Get(serviceType);
            }
            catch (Ninject.ActivationException e)
            {
                throw new System.Web.Mvc.ActivationException("PAK", e);
            }
        }

        public object GetInstance(Type serviceType, string key)
        {
            try
            {
                return Kernel.Get(serviceType, key);
            }
            catch (Ninject.ActivationException e)
            {
                throw new System.Web.Mvc.ActivationException("PAK", e);
            }
        }

        public void Release(object instance)
        {
            try
            {
                Kernel.Release(instance);
            }
            catch (Ninject.ActivationException e)
            {
                throw new System.Web.Mvc.ActivationException("PAK", e);
            }
        }
    }
}

And here's how to instantiate and use it in Global.asax.cs:

var kernel = new StandardKernel(new NinjectRegistrationModule());
var locator = new NinjectMvcServiceLocator(kernel);
MvcServiceLocator.SetCurrent(locator);

Finally, here's a sample NinjectRegistrationModule which maps the implementation I want onto the generic interface that my code consumes:

using Ninject.Modules;
using AdminApp.Controllers;

namespace AdminApp
{
    class NinjectRegistrationModule : NinjectModule
    {
        public override void Load()
        {
            Bind<ISpecialRepository>().To<DbSpecialRepository>().InRequestScope();
        }
    }
}

Effect of the SCEA study guide on the exam

2010-07-23T14:50:00.003+01:00

The SCEA study guide book - especially chapter nine - is already having an effect on the exam. And that effect is interesting, mostly positive but with some negatives as well.

In general, it is fair to say that the overall standard of submissions has improved, and a lot of submissions clearly contain cues from chapter nine of the book - naming conventions, diagram layout, adoption of the server A and B spec approach for the deployment diagram - it's all there in a lot of submissions.

The book has made some of the submissions more anodyne / bland / standardized, which in turn makes me a little sentimental for the past. There's nothing like trying to traverse a crazy class diagram late at night for keeping your brain sharp!

In my opinion, a small but not insignificant percentage of candidates (a bit less than 10%) actually end up submitting a **worse** assignment under the influence of the book, and for a very interesting reason. If you buy the book and read it and aren't an architect, then you will have an incomplete understanding of the concepts covered within it. By extension, when you apply the book material to your submission, there is a very good chance that you will make mistakes that are pretty glaring. So the book will make your submission worse, not better.

As a corollary, if you buy the book and really get the material, your application of that new-found material on top of your already substantial knowledge and skills will result in a strong submission.

In summary then, the book is not a magic book.

The interesting medium / long-term question is whether or not the exam should always have a pass rate of X% and a fail rate of Y% or if it is acceptable to have X approach 100% as a result of the book (that's not happening but clearly it could).

Book - feedback so far

2010-03-27T16:42:00.002+00:00

The book has just gone back to the printers for a second run. Apparently the first print run (a few thousand I think?) was chewed up by Amazon and direct pre-orders. It's fantastic for that many people to have the book and I really hope it helps you in preparing for the exam.

So, the feedback so far: the reviews on Amazon (both .com and .co.uk) are for the old book, not the new one. Amazon just copied the reviews across (the last one was written two years before the new book was published).

So all I've got to go on are comments that I've received directly. Broadly speaking, reviewers fall into two camps:

1. Those who like the ~200 page guide / map to a much larger body of research material (happy);

2. Those who want / expect to find all of the revision material in one book (not so happy).

Our goal was always to write a book that did not replicate the reams of material that exist for the JEE platform. We simply saw no point in doing that. Instead, we wanted to write a book that the candidate could use to:

1. Construct a revision schedule for Part One;

2. Understand how to approach Part Two - constructing your own solution for a given business problem using the JEE platform;

3. Prepare for Part Three - defending your Part Two submission and explaining how your solution satisfies various NFRs (non-functional requirements).

Broadly speaking, I think we've hit the goals we set. There is an errata list that will be sent to the publisher for the second print and will be published here as well for the purchasers of the first run.

SCEA book publication and shipment dates

2010-01-26T20:53:00.006+00:00

The book has gone to the printers! It comes off the press on Monday February 1st and gets to Pearson's warehouse on February 4th. From there it usually takes a week to get to Amazon (in the US). Here are the Amazon US and Amazon UK links. It's also available for the Kindle.

People who placed pre-orders for hard copy editions will receive their shipment first - shipped direct from Pearson's warehouse next week.

As far as the online edition goes, the Rough Cut disappears after the final update (which matches the printed book) and it then becomes part of the regular Safari Library and is accessible to all subscribers.

If these dates change, I'll put out another update. It will be fantastic to see the book finally out there!

Book - chapter nine available for download

2009-11-29T21:52:00.002+00:00

I've put a PDF copy of chapter nine up on www.box.net for download here. Chapter nine is two things - a seminal chapter in terms of the exam content it covers (Parts II and III of the SCEA exam) and also one that Safari Rough Cuts (SRC) keeps missing when it updates the book. The version of the book on SRC is a lot older than this version. I'll keep this download link live until SRC is updated with the latest version. Enjoy.