Sunday, March 06, 2011

A vision for big data in leisure travel ecommerce

[This is an article for people working in leisure travel technology / ecommerce online conversion who visit this blog, although many of the take-home points are transferable to other industry verticals.]

Data is big, and getting bigger. The more we track and log, the more storage is needed to warehouse it, and the more CPU horsepower is needed to mine it to answer questions posed by the business. As an aside, everyone is facing this issue and it's sink or swim, with the swimmers sure to get a competitive advantage over the sinkers. In this article, I'll examine the main data feeds that matter in leisure travel, and propose an architecture to collect, manage and mine them for business benefit. The end goal is to propose a vision, explaining why and how to collect data to better inform and drive business decisions that improve ecommerce performance.

But why now - hasn't this always been an issue? Yes, but now more than ever, leisure travel is on the cusp of another big game-changer. Companies like Google and Microsoft are clearly already focusing more on travel as a segment, and their data gathering and mining capabilities are considerable. But tour operators and online travel agencies (OTAs) have a significant competitive advantage over pure-play technology companies, as we'll see a little later.

Important data sources in leisure travel ecommerce

First, let's examine the primary data sources that affect leisure travel ecommerce. There are some obvious entries in the table that follows, and some less so.

ID | Name | Internal / External | Controllable | Purpose / Comment
1 | Availability (internal) | Internal | Yes | Stock (internal, at-risk / committed inventory) available to sell, down to room type / meal plan / cabin and fare class
2 | Pricing (internal) | Internal | Yes | Pricing for internal stock. Entire teams stay focused on this source, ensuring it is (a) competitive, and (b) profitable
3 | Availability (external) | External | Yes | Stock available to sell contracted through third parties (usually not committed stock), down to room type / meal plan / cabin and fare class. Usually used to plug gaps in internal stock (resort coverage, star rating, price band etc.). Sources include GDSs, bed banks, car rental companies etc.
4 | Pricing (external) | External | Yes | Pricing for third party stock
5 | Rich content (internal) | Internal | Yes | Provide compelling, unique, accurate text, images and video to convince the consumer to buy
6 | Rich content (external) | External | Usually | Provide compelling, unique, accurate text, images and video to convince the consumer to buy. Needs to be differentiated, otherwise your search engine ranking score will suffer due to duplicate content penalties
7 | Attributes | Both | Yes | Attributes (aka facets) are becoming increasingly important - star rating, price bands, family-friendly (has a creche, rooms are adjoining), "has a", "is a", "is close to". Attributes provide consumers with a more intelligent and targeted search capability
8 | User generated content | External | No | Tripadvisor is the poster child here, but user generated content (UGC) can be in-house too. It must be perceived as unbiased by the consumer, otherwise it becomes a negative
9 | Meta data | Both | Yes | Every business tags its own data - timestamps, version numbers, # revisions, author, approver, when last yielded. The more meta data you have the merrier - it often helps to tie disparate data sources together and enriches the overall data pool
10 | Search, cost, book funnel | Internal | Yes | Traditionally the core of any ecommerce strategy - measures the complete search, cost and book journey. Needs to be fully instrumented to collect data so that A/B and multivariate testing can be used to fine-tune performance over time. Google Analytics does this very, very well
11 | Offline (shop) interactions | Internal | Yes | Few businesses try to tie shop activity back to online activity, but for a bricks and mortar plus clicks business, this is an opportunity missed
12 | Online advertising (SEO) | Internal | Partially | SEO can be thought of as PPC you don't pay for! Critical to making cost of acquisition online as efficient as possible. Only partially controllable because businesses are at the mercy of search engine scoring (which both Google and Microsoft (Bing) keep as a black box algorithm)
13 | Online advertising (PPC) | Internal | Yes | Where Google makes its money! PPC has pride of place in every well-constructed ecommerce campaign, but the cost and effectiveness should be continuously monitored, challenged and tuned. CSV exports out of AdWords provide a good way to do this
14 | Personalisation | Internal | Yes | Personalisation, both anonymous and known, is a great way to learn what kind of holiday / vacation people want to buy from you and how they want to find and buy it. Just don't try to build personalisation before you have (10) working well - personalisation needs a really solid foundation
15 | Social media | External | No | The rising star that no-one really knows how to handle. The Facebook API contains a lot of potential for travel ecommerce
16 | Offline / traditional advertising | External | Yes | The efficacy (or not) of ad spend must extend to traditional / offline as well as the more easily measurable online variant, otherwise you don't know where all of your marketing £s / $s / €s are going
17 | Post-booking interactions | Internal | Yes | Not traditionally an ecommerce data source, but savvy businesses are now looking at post-booking amendments, cancellation rates etc. to identify patterns that can feed back into the search experience
18 | Customer Relationship Management (CRM) | Internal | Yes | Both pre and post travel - it's key to have a good view of what the customer experiences on holiday and feed that back into what holidays are sold going forward. Is that picture of the pool misleading? Change it! If the service is great, promote it more!

Table 1. A proposed taxonomy of data sources that impact and influence leisure travel ecommerce.

Two important characteristics of data are whether you control it or not (and hence can change it if you need to) and whether it is sourced from an internal system or an external system (and thus how trustworthy / accurate the data is and whether it is unique to you or if other business entities can see it too). We have added these two characteristics to the table above for clarity.
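To make those two characteristics concrete, here is a minimal Python sketch that holds a few rows of Table 1 as records and filters on them. The class name, field names and helper functions are my own illustrative choices rather than part of any existing system, and only a handful of rows are copied across.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSource:
    id: int
    name: str
    origin: str        # "Internal", "External" or "Both"
    controllable: str  # "Yes", "No", "Partially" or "Usually"


# A handful of rows copied from Table 1; the rest follow the same shape.
SOURCES = [
    DataSource(1, "Availability (internal)", "Internal", "Yes"),
    DataSource(6, "Rich content (external)", "External", "Usually"),
    DataSource(8, "User generated content", "External", "No"),
    DataSource(10, "Search, cost, book funnel", "Internal", "Yes"),
    DataSource(12, "Online advertising (SEO)", "Internal", "Partially"),
    DataSource(15, "Social media", "External", "No"),
]


def not_fully_controllable(sources):
    """Sources the business cannot simply change at will."""
    return [s for s in sources if s.controllable != "Yes"]


def by_origin(sources, origin):
    """Sources whose origin column matches exactly (e.g. "External")."""
    return [s for s in sources if s.origin == origin]


if __name__ == "__main__":
    for s in not_fully_controllable(SOURCES):
        print(s.id, s.name, s.origin, s.controllable)
```

Even a toy structure like this makes it easy to ask which sources need extra scrutiny because they are neither owned nor controllable.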

What should be obvious to the reader is that a holistic picture of ecommerce performance requires multiple data sources, some of which traditionally would not be seen as impacting the effectiveness of a leisure travel ecommerce system. Gone are the days of simply looking at the web logs to see how effective (or leaky) the conversion funnel is! In fact, there are probably some sources that I've inadvertently omitted, and indeed as new systems come on stream, new sources will be added to this table / taxonomy.

Finally, it's interesting from a barrier to entry perspective to note that only the well-placed tour operator or OTA actually has the wherewithal and access to collate data from all of the sources noted in the table. Other new entrants simply do not have access to many of the sources listed. The data itself is now a valuable commodity (and is increasing in value), and an asset that leisure travel businesses would do well to guard jealously.

What we need - Systems and Data working together

At present, I contend that the average tour operator / OTA is collecting some, but not all, of the data sources identified, and that no tour operator or OTA has yet constructed a system that provides a holistic, joined-up view of the data back to the business function to inform decision-making activities. Why not? Because it's not easy to do! The IT estate behind these data sources is fragmented (core res system, yielding system, multiple content management systems, external systems, separate booking repositories / agency management systems, Google Analytics, Google AdWords, Excel spreadsheets), is often owned by different companies, and wasn't designed to provide the kind of view that is now needed. Ominously, new entrants into the space do not have a lot of the legacy baggage that incumbents do, meaning their velocity of implementation and ongoing change creates a hard-to-ignore imperative for all sellers of leisure travel to innovate quickly and learn from their data, or be left behind.

The technical challenge is four-fold:

1. Collection and storage - gather and store as much data as possible for each data source in the table, with that data being as clean and structured as possible (and in the real world, every data set will have some noise to it)

2. Build a holistic, joined-up data set - identify ways to link the data sources together - version number, unique keys, foreign keys, link backs, tagging etc. The more your data sources are joined up, the more holistic a view of the business you are building (and can provide back to the business). Conversely, disconnected data sets (data islands) are of much less value to the business and introduce the risk of an incomplete / inaccurate view of what's really happening now being used to influence what's going to happen next

3. Answering the questions - provide a mechanism to answer questions over this corpus of data in near real-time to allow the business to modify its behaviour and focus to maximise profits, yield and margin

4. Suggesting the questions - once the above three points have been implemented to a mature and repeatable level, the final logical step is for the data function to actually suggest areas of improvement and further exploration based on emergent patterns in the data, using techniques such as artificial neural networks and self-organising map (SOM) analysis
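As a small illustration of points 2 and 3, the Python sketch below links two otherwise separate data sets (PPC clicks and bookings) on a shared key and then answers one business question: what did each campaign cost, and what margin did it bring in? All records, field names (`session_id`, `campaign`, `margin`) and the function name are hypothetical; a real estate would need to establish such a shared key first, which is exactly the hard part.

```python
from collections import defaultdict

# Hypothetical, hand-made records; field names are illustrative only.
ppc_clicks = [
    {"session_id": "s1", "campaign": "majorca-summer", "cost": 0.40},
    {"session_id": "s2", "campaign": "majorca-summer", "cost": 0.45},
    {"session_id": "s3", "campaign": "ski-alps", "cost": 0.90},
]
bookings = [
    {"booking_ref": "B100", "session_id": "s2", "margin": 120.0},
    {"booking_ref": "B101", "session_id": "s3", "margin": 200.0},
]


def margin_by_campaign(clicks, bookings):
    """Join clicks to bookings on session_id, then total spend and margin per campaign."""
    # Index bookings by the shared key so each click can be looked up directly.
    margin_for_session = defaultdict(float)
    for b in bookings:
        margin_for_session[b["session_id"]] += b["margin"]

    totals = defaultdict(
        lambda: {"clicks": 0, "converted_sessions": 0, "spend": 0.0, "margin": 0.0}
    )
    for c in clicks:
        t = totals[c["campaign"]]
        t["clicks"] += 1
        t["spend"] += c["cost"]
        # Membership test does not create a key in the defaultdict.
        if c["session_id"] in margin_for_session:
            t["converted_sessions"] += 1
            t["margin"] += margin_for_session[c["session_id"]]
    return dict(totals)


if __name__ == "__main__":
    for campaign, t in margin_by_campaign(ppc_clicks, bookings).items():
        print(campaign, t)
```

Without the shared `session_id`, the two lists are data islands: spend is known and margin is known, but nothing ties one to the other.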

Putting it all together - a suggested framework

There are many ways to construct a view over the data sources identified in the previous section, and in fact multiple views are encouraged depending on the goal of the business. Here, however, a hybrid of time and business function is chosen to give a reasonable framework to hold the data. This framework is depicted in the following diagram.

Figure 1. High-level schematic of the big data system for leisure travel ecommerce.

A concrete implementation of the framework

The question naturally arises - how would this system be constructed, not just initially but also maintained and extended going forward?

Some natural candidates already exist, chief among them Cassandra and Hadoop. In the author's opinion, a hybrid architecture of Cassandra's data storage and innate simplicity and high availability, coupled with the MapReduce framework from Hadoop offers the best blend of performance, scalability, availability / resilience, querying and extensibility. A separate follow-on instalment to this article is warranted to provide a detailed technical treatise on the underpinnings of the system outlined here.
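Ahead of that instalment, the shape of a MapReduce job can be sketched in a few lines of plain Python. To be clear, this is not the Hadoop API and it runs in a single process; it only mimics the map, shuffle and reduce phases. The log records, field names and function names are all made up for illustration, and in the proposed system the input would come from the data store rather than an in-memory list.

```python
from collections import defaultdict

# Hypothetical search-log records; illustrative data only.
search_log = [
    {"destination": "Majorca", "converted": False},
    {"destination": "Majorca", "converted": True},
    {"destination": "Tenerife", "converted": False},
    {"destination": "Majorca", "converted": False},
    {"destination": "Tenerife", "converted": True},
]


def map_phase(record):
    """Emit one (destination, (searches, bookings)) pair per log record."""
    yield record["destination"], (1, 1 if record["converted"] else 0)


def shuffle(pairs):
    """Group emitted values by key, as the framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Sum the per-record counts for one destination and derive a conversion rate."""
    searches = sum(v[0] for v in values)
    bookings = sum(v[1] for v in values)
    return key, {
        "searches": searches,
        "bookings": bookings,
        "conversion": bookings / searches,
    }


def run_job(records):
    """Run map, shuffle and reduce in sequence over an iterable of records."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


if __name__ == "__main__":
    for destination, stats in run_job(search_log).items():
        print(destination, stats)
```

The appeal of the pattern is that the map and reduce functions hold no shared state, so the same logic can be spread across many machines as the logs grow.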


In summary, the dominant data sources that impact the effectiveness of a leisure travel ecommerce strategy have been identified, named and classified. Building on this classification, a framework to house the data sources has been proposed and a concrete implementation suggested.

About the author: Humphrey is the Chief Technology Officer for Comtec Group, a company that specializes in leisure travel technology.

Wednesday, March 02, 2011

JDK 7 preview and JEE 7 planning

We got two interesting developments in Java land this week:

1. Oracle released the developer preview of the Java 7 Development Kit (JDK)

2. Oracle have started talking publicly about what JEE 7 (and beyond it, JEE 8) will look like, in Q3 2012 and Q4 2013 respectively.

(1) has been a long time coming and it's good to see the log jam moving. Simply shipping JDK 7 is good in its own right, but it also means that the team will move on to JDK 8, which contains some key language features omitted from JDK 7 so that the team could JGIOTFD (Just Get It Out The (reader exercise to complete the acronym)).

(2) looks to be Oracle really making the JEE stack cloud-based / cloud-friendly by default rather than a technology stack that merely facilitates cloud computing. This dynamic should see Oracle formalising exactly what constitutes "JEE in the cloud" via a JSR and thus wresting that intellectual responsibility back from Google's App Engine platform, which is pretty much the de facto standard for "JEE in the cloud" at present.

Looking beyond JEE 7, JEE 8 looks to be embracing Big Data / NoSQL systems like Hadoop and Cassandra, although we can expect to have seen significant consolidation in this space by 2013, making the integration and platform support task easier to accomplish.

All in all, two nice moves, and good news for the Java ecosystem / economy. You might or might not like Oracle, but they are getting stuff out the door in a way that Sun kind of forgot how to do.