Monday, October 06, 2014

Named Entity Recognition - short tutorial and sample business application

A latent theme is emerging quite quickly in mainstream business computing - the inclusion of Machine Learning to solve thorny problems in very specific problem domains. For me, Machine Learning is any technique where system performance improves over time, by the system either being trained or learning for itself.

In this short article, I will quickly demonstrate how an off-the-shelf Machine Learning package can be used to add significant value to vanilla Java code for language parsing, recognition and entity extraction. In this example, adopting an advanced, yet easy to use, Natural Language Processing (NLP) toolkit combined with Named Entity Recognition (NER) provides a deeper, more semantic and more extensible understanding of natural text commonly encountered in a business application than any non-Machine Learning approach could hope to deliver.

Machine Learning is one of the oldest branches of Computer Science. From Rosenblatt's perceptron in 1957 (and even earlier), Machine Learning has grown up alongside other subdisciplines such as language design, compiler theory, databases and networking - the nuts and bolts that drive the web and most business systems today. But by and large, Machine Learning is not straightforward or clear-cut enough for a lot of developers, and until recently its application to business systems was seen as not strictly necessary. We know, for example, that investment banks have put significant effort into applying neural networks to market prediction and portfolio risk management, and the work of Google and Facebook with deep learning (the third generation of neural networks) has been widely reported in the last three years, particularly for image and speech recognition. But mainstream business systems do not display the same adoption levels.

Aside: accuracy is important in business / real-world applications. The picture below shows why you now have Siri / Google Now on your iOS or Android device. Until 2009 - 2010, accuracy had flat-lined for almost a decade, but the application of the next generation of artificial neural networks drove the error rates down to a usable level for millions of users (graph drawn from Yoshua Bengio's ML tutorial at KDD this year).

Dramatic reduction in error rate on Switchboard data set post introduction of deep learning techniques.


Luckily you don't need to build a deep neural net just to apply Machine Learning to your project! Instead, let's look at a task that many applications can and should handle better - mining unstructured text data to extract meaning and inference.

Natural language parsing is tricky. There are any number of seemingly easy sentences which demonstrate how much context we subconsciously process when we read. For example, what if someone comments on an invoice: "Partial invoice (€100,000, so roughly 40%) for the consignment C27655 we shipped on 15th August to London from the Make Believe Town depot. INV2345 is for the balance.. Customer contact (Sigourney) says they will pay this on the usual credit terms (30 days).".

Extracting tokens of interest from an arbitrary String is pretty easy. Just use a StringTokenizer, use space (" ") as the separator character and you're good to go. But code like this has a high maintenance overhead, needs a lot of work to extend and is fundamentally only as good as the time you invest in it. Think about stemming, then checking for ',', '.' and ';' characters as token separators, and a whole slew more of plumbing code heaves into view.
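To make that maintenance burden concrete, here is a minimal sketch of the hand-rolled approach. The invoice sentence is the one from above; the extraction rules (regexes for consignment and invoice numbers) are hypothetical ones I made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class NaiveExtractor {

  // Hand-rolled extraction: split on spaces plus a few punctuation
  // characters, then pattern-match tokens we think are interesting.
  public static List<String> extract(String text) {
    List<String> hits = new ArrayList<>();
    StringTokenizer st = new StringTokenizer(text, " ,;()");
    while (st.hasMoreTokens()) {
      String token = st.nextToken();
      // Every new kind of entity is another branch to write and maintain
      if (token.matches("C\\d{5}") || token.matches("INV\\d+")) {
        hits.add(token);
      }
    }
    return hits;
  }

  public static void main(String[] args) {
    String s = "Partial invoice (€100,000, so roughly 40%) for the consignment "
        + "C27655 we shipped on 15th August to London from the Make Believe Town "
        + "depot. INV2345 is for the balance..";
    System.out.println(extract(s)); // → [C27655, INV2345]
  }
}
```

Note what this can never reach without a lot more plumbing: the multi-word location "Make Believe Town", the money amount, the percentage, the date and the credit terms - exactly the things the NER approach below extracts for free.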

How can Machine Learning help?

Natural Language Processing (NLP) is a mature branch of Machine Learning. There are many NLP implementations available; the one I will use here is the CoreNLP / NER framework from the language research group at Stanford University. CoreNLP is underpinned by a robust theoretical framework, has a good API and reasonable documentation. It is slow to load, though - make sure you use a Factory + Singleton pattern combo in your code, which is safe because the framework has been thread-safe since roughly 2012. An online demo of a 7-class trained model (it recognises seven different kinds of entity) is available at http://nlp.stanford.edu:8080/ner/process where you can submit your own text and see how well the classifier / tagger does. Here's a screenshot of the default model on our sample sentence:
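The Factory + Singleton combo mentioned above can be as simple as the initialisation-on-demand holder idiom. Here's a sketch using a stand-in class in place of the expensive-to-construct StanfordCoreNLP object (the PipelineHolder name and the stand-in Pipeline class are mine, not part of the Stanford API - in real code the holder would build a StanfordCoreNLP from the same Properties):

```java
import java.util.Properties;
import java.util.concurrent.atomic.AtomicInteger;

public class PipelineHolder {

  // Counts constructions so we can prove the expensive load happens once
  static final AtomicInteger constructions = new AtomicInteger();

  // Stand-in for the real, slow-to-load StanfordCoreNLP pipeline
  static class Pipeline {
    Pipeline(Properties props) {
      constructions.incrementAndGet(); // simulates the multi-second model load
    }
  }

  // Initialisation-on-demand holder: the JVM guarantees the Holder class is
  // initialised at most once, on first use, and class init is thread-safe -
  // so INSTANCE is built lazily, exactly once, with no explicit locking.
  private static class Holder {
    static final Pipeline INSTANCE = new Pipeline(defaultProps());
  }

  static Properties defaultProps() {
    Properties props = new Properties();
    props.put("annotators", "tokenize, ssplit, pos, lemma, ner");
    return props;
  }

  public static Pipeline get() {
    return Holder.INSTANCE;
  }

  public static void main(String[] args) {
    Pipeline a = PipelineHolder.get();
    Pipeline b = PipelineHolder.get();
    System.out.println(a == b);              // same instance every time
    System.out.println(constructions.get()); // built exactly once
  }
}
```

Every caller then pays the ~20 second model-loading cost at most once per JVM, no matter how many threads ask for the pipeline.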

Output from a trained model without the use of a supplementing dictionary / gazette.

You will note that "Make Believe Town" is classified (incorrectly in this case) as an ORGANIZATION. OK, so let's give this "out of the box" model a bit more knowledge about the geography our company uses to improve its accuracy. Note: I would have preferred to use the gazette feature in Stanford NER (I felt it was a more elegant solution), but as the documentation states, gazette terms are treated only as suggestions to the classifier rather than hard rules - and a hard rule is exactly the behaviour we require here.

So let's create a simple tab-delimited text file as follows:

Make Believe Town LOCATION

(make sure you don't have any blank lines in this file - RegexNER really doesn't like them!)

Save this one line of text into a file named locations.txt and place it in a location available to your classloader at runtime. I have also assumed that you have installed the Stanford NLP models and required jar files into the same location.

Now re-run the model, but this time asking CoreNLP to add regexner to the pipeline. You can do this by running the code below and changing the value of the useRegexner boolean flag to examine the accuracy with and without our small dictionary.

Hey presto! Our default 7-class model now has a better understanding of our unique geography, adding more value to this data mining tool for our company (check out the output below versus the screenshot from the default model above).

Code


package phoenix;

import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

import org.junit.Test;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.ling.CoreAnnotations.NamedEntityTagAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TextAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;


/**
 * Some simple unit tests for the CoreNLP NER (http://nlp.stanford.edu/software/CRF-NER.shtml) short
 * article.
 * 
 * @author hsheil
 *
 */
public class ArticleNlpRunner {

  private static final Logger LOG = LoggerFactory.getLogger(ArticleNlpRunner.class);

  @Test
  public void basic() {
    LOG.debug("Starting Stanford NLP");

    // creates a StanfordCoreNLP object with POS tagging, lemmatisation, NER and
    // (optionally) RegexNER in the pipeline
    Properties props = new Properties();
    boolean useRegexner = true;
    if (useRegexner) {
      props.put("annotators", "tokenize, ssplit, pos, lemma, ner, regexner");
      props.put("regexner.mapping", "locations.txt");
    } else {
      props.put("annotators", "tokenize, ssplit, pos, lemma, ner");
    }
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

    // We're interested in NER for these things (jt->loc->sal)
    String[] tests =
        {
            "Partial invoice (€100,000, so roughly 40%) for the consignment C27655 we shipped on 15th August to London from the Make Believe Town depot. INV2345 is for the balance.. Customer contact (Sigourney) says they will pay this on the usual credit terms (30 days)."
        };
    List<EmbeddedToken> tokens = new ArrayList<>();

    for (String s : tests) {

      // run all Annotators on the passed-in text
      Annotation document = new Annotation(s);
      pipeline.annotate(document);

      // these are all the sentences in this document
      // a CoreMap is essentially a Map that uses class objects as keys and has values with
      // custom types
      List<CoreMap> sentences = document.get(SentencesAnnotation.class);
      StringBuilder sb = new StringBuilder();
      
      //I don't know why I can't get this code out of the box from StanfordNLP, multi-token entities
      //are far more interesting and useful..
      //TODO make this code simpler..
      for (CoreMap sentence : sentences) {
        // traversing the words in the current sentence, "O" is a sensible default to initialise
        // tokens to since we're not interested in unclassified / unknown things..
        String prevNeToken = "O";
        String currNeToken = "O";
        boolean newToken = true;
        for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
          currNeToken = token.get(NamedEntityTagAnnotation.class);
          String word = token.get(TextAnnotation.class);
          // Strip out "O"s completely, makes code below easier to understand
          if (currNeToken.equals("O")) {
            // LOG.debug("Skipping '{}' classified as {}", word, currNeToken);
            if (!prevNeToken.equals("O") && (sb.length() > 0)) {
              handleEntity(prevNeToken, sb, tokens);
              newToken = true;
            }
            continue;
          }

          if (newToken) {
            prevNeToken = currNeToken;
            newToken = false;
            sb.append(word);
            continue;
          }

          if (currNeToken.equals(prevNeToken)) {
            sb.append(" " + word);
          } else {
            // We're done with the previous entity - flush it out and start a
            // new entity beginning with the current word
            handleEntity(prevNeToken, sb, tokens);
            sb.append(word);
          }
          prevNeToken = currNeToken;
        }
        // flush any entity still pending when the sentence ends
        if (sb.length() > 0) {
          handleEntity(prevNeToken, sb, tokens);
        }
      }
      
      //TODO - do some cool stuff with these tokens!
      LOG.debug("We extracted {} tokens of interest from the input text", tokens.size());
    }
  }
  private void handleEntity(String inKey, StringBuilder inSb, List<EmbeddedToken> inTokens) {
    LOG.debug("'{}' is a {}", inSb, inKey);
    inTokens.add(new EmbeddedToken(inKey, inSb.toString()));
    inSb.setLength(0);
  }


}
class EmbeddedToken {

  private String name;
  private String value;

  public String getName() {
    return name;
  }

  public String getValue() {
    return value;
  }

  public EmbeddedToken(String name, String value) {
    super();
    this.name = name;
    this.value = value;
  }
}


Output

16:01:15.465 [main] DEBUG phoenix.ArticleNlpRunner - Starting Stanford NLP
Adding annotator tokenize
TokenizerAnnotator: No tokenizer type provided. Defaulting to PTBTokenizer.
Adding annotator ssplit
edu.stanford.nlp.pipeline.AnnotatorImplementations:
Adding annotator pos
Reading POS tagger model from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [1.5 sec].
Adding annotator lemma
Adding annotator ner
Loading classifier from edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz ... done [6.6 sec].
Loading classifier from edu/stanford/nlp/models/ner/english.muc.7class.distsim.crf.ser.gz ... done [3.1 sec].
Loading classifier from edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz ... done [8.6 sec].
sutime.binder.1.
Initializing JollyDayHoliday for sutime with classpath:edu/stanford/nlp/models/sutime/jollyday/Holidays_sutime.xml
Reading TokensRegex rules from edu/stanford/nlp/models/sutime/defs.sutime.txt
Reading TokensRegex rules from edu/stanford/nlp/models/sutime/english.sutime.txt
Oct 06, 2014 4:01:37 PM edu.stanford.nlp.ling.tokensregex.CoreMapExpressionExtractor appendRules
INFO: Ignoring inactive rule: null
Oct 06, 2014 4:01:37 PM edu.stanford.nlp.ling.tokensregex.CoreMapExpressionExtractor appendRules
INFO: Ignoring inactive rule: temporal-composite-8:ranges
Reading TokensRegex rules from edu/stanford/nlp/models/sutime/english.holidays.sutime.txt
Adding annotator regexner
TokensRegexNERAnnotator regexner: Read 1 unique entries out of 1 from locations.txt, 0 TokensRegex patterns.
16:01:38.077 [main] DEBUG phoenix.ArticleNlpRunner - '$ 100,000' is a MONEY
16:01:38.080 [main] DEBUG phoenix.ArticleNlpRunner - '40 %' is a PERCENT
16:01:38.080 [main] DEBUG phoenix.ArticleNlpRunner - '15th August' is a DATE
16:01:38.080 [main] DEBUG phoenix.ArticleNlpRunner - 'London' is a LOCATION
16:01:38.080 [main] DEBUG phoenix.ArticleNlpRunner - 'Make Believe Town' is a LOCATION
16:01:38.080 [main] DEBUG phoenix.ArticleNlpRunner - 'Sigourney' is a PERSON
16:01:38.081 [main] DEBUG phoenix.ArticleNlpRunner - '30 days' is a DURATION
16:01:38.081 [main] DEBUG phoenix.ArticleNlpRunner - We extracted 7 tokens of interest from the input text

There are some caveats though - your dictionary needs to be carefully selected so that it does not overwrite the better "natural" performance of Stanford NER using its Conditional Random Field (CRF)-inspired logic augmented with Gibbs sampling. For example, if you have a customer company called Make Believe Town Limited (unlikely, but not impossible), then the dictionary entry will cause Stanford NER to mis-classify Make Believe Town Limited as the location Make Believe Town, rather than recognising the full company name. However, with careful dictionary population and a good understanding of the target raw text corpus, this is still a very fruitful approach.

Summary


In summary, a robust natural language parser with integrated Named Entity Recognition, like the Stanford NLP libraries used here, provides a strong base to build on for business applications that need more powerful text analysis - particularly in conjunction with approaches like gazettes / dictionaries that allow business terms to be overlaid to improve the accuracy of the vanilla model.


Sunday, September 30, 2012

SCEA study guide errata

Thanks to Kristiyan Marinov, who sent in this comment calling out three typos in the book. I figure this is interesting enough to other readers of the book currently working through the exam to re-publish the comment in full below, along with my reply.

Original comment

Hi,


I recently read your book and find it an interesting and helpful read. I found a couple of mistakes during some of the test-yourself questions though.
Since I didn't manage to get a hold of you or Mark Cade any other way, I'll be posting my notes in this comment.

Typo 1: On page 81, question 6. The Answer says D but the explanation below it explains why C is the correct answer (which it is indeed).

Typo 2: On page 95, question 1. The Answer says C and D but the correct answers are B and C, as is given in the explanation.

Typo 3: On page 148, question 5. The Answer says B, C and F while the actual answers are B, D and F.

Thanks for the otherwise great book!

Kristiyan

My reply

Hi Kristiyan

Thanks for the comments. These are all typos that need to go into the errata for the book. I'll reach out to the publishers to see when the next run is and also create a blog post to call out these three for other readers. Thanks!

Humphrey


Sunday, July 08, 2012

Fixing the Google Analytics API (v3) examples


I'm currently working on consuming data from the Google Analytics API for one of the data sources we're using at Eysys.

So far, so normal. But what was strange was the poor state of the Google Analytics API documentation. I don't think I've ever seen one of their APIs documented so poorly - missing source code, typos in the example source code provided, and a real rambling tone to the docs, pointing off to different areas (a lot of this is to do with the OAuth 2.0 hoops you have to jump through before even starting to pull down the data you want to analyse).

I also couldn't believe how many dependencies the API has - I ended up with 29 jar files in my Eclipse project's lib folder! Surely this could all be a lot leaner, meaner and easier - it's just an API returning JSON data at the end of the day.

Anyway, if you're interested in getting up and running with the API examples, here's what to fix.

First of all, there is actually missing source code in the distro itself (or at least the one I used - google-api-services-analytics-v3-rev10-1.7.2-beta.zip), so you need to get LocalServerReceiver, OAuth2Native and VerificationCodeReceiver directly from the Google code repo.

LocalServerReceiver uses an old version of Jetty, so we need to migrate it to use the latest Jetty (I used 8.1.4.v20120524) which now has an org.eclipse.* package structure. So we need to update the imports as follows:

import org.eclipse.jetty.server.Connector;
import org.eclipse.jetty.server.Request;
import org.eclipse.jetty.server.Server;
import org.eclipse.jetty.server.handler.AbstractHandler;

Refreshing Jetty will also necessitate upgrading the handle(..) method to fit in with the new signature, as follows:

@Override
public void handle(String target, Request arg1,
    HttpServletRequest request, HttpServletResponse response)
    throws IOException, ServletException {
  if (!CALLBACK_PATH.equals(target)) {
    return;
  }
  writeLandingHtml(response);
  response.flushBuffer();
  ((Request) request).setHandled(true);
  String error = request.getParameter("error");
  if (error != null) {
    System.out.println("Authorization failed. Error=" + error);
    System.out.println("Quitting.");
    System.exit(1);
  }
  code = request.getParameter("code");
  synchronized (LocalServerReceiver.this) {
    LocalServerReceiver.this.notify();
  }
}


There's also a small mod to be made in getRedirectUri(), change this line:

server.addHandler(new CallbackHandler());

to this instead:

server.setHandler(new CallbackHandler());


The logic of this class also seems pretty flawed to me - generating a random port for the redirect URI that OAuth calls back to at the end of authentication every time it's run, which by definition you won't be able to put into the APIs console. So I commented out the getUnusedPort() method and simply hard-coded one.
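The contrast is easy to demonstrate. Below is a sketch of the two approaches: what getUnusedPort() effectively does (bind to port 0 and let the OS pick), versus a pinned port that you can register once in the APIs console. The class name and the port number 8090 are arbitrary choices of mine - use whatever port you actually registered:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.ServerSocket;

public class CallbackPort {

  // Fixed port, so the redirect URI (e.g. http://localhost:8090/Callback)
  // can be registered once in the APIs console; 8090 is an arbitrary choice.
  static final int CALLBACK_PORT = 8090;

  // What getUnusedPort() effectively does: bind to port 0 and the OS hands
  // back any free ephemeral port. Fine for a throwaway local server; useless
  // for an OAuth redirect URI that must be pre-registered.
  static int randomUnusedPort() {
    try (ServerSocket s = new ServerSocket(0)) {
      return s.getLocalPort();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println("random: " + randomUnusedPort()); // different every run
    System.out.println("fixed:  " + CALLBACK_PORT);      // stable, registrable
  }
}
```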

And after these mods, hey presto it works! :-)

Friday, May 11, 2012

Writing a new book about software development!


Following a respectable interval of time after the launch of the official Enterprise Architect study guide (which was absolutely necessary to allow painful memories of the writing and editing process to fade :-) ), I've teamed up with my editor from that book - Greg Doench from Pearson - on a new book about software development. I can't believe that Greg is signing up for round two with me, and I am grateful to have him on board again!

The central premise of this new book stems from an observation that I have seen time and time again - a lot of smart people in business that I work with just don't get software at any kind of meaningful level - the coders who program it, their unique culture, the actual process of designing and writing software, and most importantly - why things (inevitably) go wrong and how to fix them when they do (and for "wrong", you can substitute any value of "late", "over-budget" or "doesn't do what it should" or all three that floats your boat).

This disconnect would be ok if it weren't for the fact that these same smart business people almost always end up in a position where software projects are a key part of what they need to achieve - they become customers, or key stakeholders. If they rise high enough, they become actual budget holders - then it gets interesting! Simply put, it's rapidly becoming a career-limiting move in business to say that "I'm not technical". And motivated business people who want to become conversant with their software projects are finding a gap in readable, digestible content that helps them to bridge their gap in understanding. That's where this book comes in.

The book structure itself is pretty new - although the chapters are designed to be read together (though not in a regimented order), Greg is encouraging me to write the chapters so that they will also read "standalone". There's a strong chance, then, that individual chapters will be available well before the book is scheduled to complete in Q2 of 2013.

What the book is:

* A guide to software development for people who are not technical by background and want to learn
* A map to navigate a software project by - regardless of programming language used or target application
* A guide that should stand the test of time - it's not about buzzwords, it's about the core building blocks that make up software projects

What the book isn't:

* An idiot's guide to software development - you will be stretched intellectually by the content we're planning to put into the book
* A technology-specific guide - I'll be consciously writing the book with a view to covering all technologies, and concrete examples will be provided across the spectrum of commonly-used programming languages

Time to start writing!

Wednesday, March 21, 2012

Awesome Java developers wanted!

We're hiring! If you've got a hankering to work for a startup operating in stealth mode based in Cardiff, Wales, using the latest Java frameworks and techniques to build the coolest ecommerce platform around, read on!


ABOUT THE OPPORTUNITY

Once in a while, a chance comes up to be part of something special. We are looking for excellent developers to join our team and help build the next-generation global ecommerce platform incorporating big data mining and analysis, machine learning, cloud computing (EC2 and App Engine) and the latest advances in online commerce.

WHO WE ARE LOOKING FOR

* Our platform is primarily Java-based, so strong Java and OOD skills are an absolute must

* A willingness to pro-actively research and use new libraries and projects as needed to add new platform capabilities to complete our roadmap

* Experience with Linux and MySQL is also advantageous

* You will have a solid grounding in how web based server side applications and databases work

* Be comfortable working in a rapid iteration development cycle moving from prototype to production while engineering to a high level of quality, using leading automated testing techniques

* Enjoy / understand the importance of working in all layers of the platform architecture - UI, business logic and persistence

* Understand how to describe and design a system in terms of data structures and algorithms, in order to participate effectively in core design workshops

* Be totally committed to writing the most efficient, scalable and robust code possible, and to continuously improve your ability in this area

* Prior experience with Bayesian techniques and artificial neural networks is beneficial, but is not strictly necessary

* Prior experience with Hadoop and HBase is beneficial, but is not strictly necessary

* Most important of all.. where you don't know something, be happy and ready to roll your sleeves up and learn it!


ABOUT US

Our culture is to work hard using the latest and most relevant technologies and to have lots of fun while doing it! We believe passionately in building and delivering truly game-changing software to our customers. Our ideal candidates are self-starting, good communicators, love coding and work well in a team.

For more information and to submit a CV, please email careers@eysys.com.

To all recruitment agencies: eysys does not accept agency CVs. Please do not forward CVs to our jobs alias, eysys employees or any other company location. eysys is not responsible for any fees related to unsolicited CVs.

Sunday, August 07, 2011

NoSQL / NewSQL / SQL - future-proofing your persistence architecture (part one)

Although it's been a few years in the making, the noise / buzz around NoSQL has now reached fever pitch. Or to be more precise, the promise of something better / faster / cheaper / more scalable than standard RDBMSs has sucked in a lot of people (plus getting to use MapReduce in an application, even if it's not needed, is a temptation very hard to resist). And pretty recently, the persistence hydra has grown another head - NewSQL. NewSQL adherents essentially believe that NoSQL is a design pig and that a better approach is to fix relational databases. In turn, NewSQL claims have been open to counter-claim on the constraints inherent in the NewSQL approach. It's all very fascinating (props for working Lady Gaga into a technical article as well).

As it turns out, traditional RDBMSs are sometimes slow for valid reasons, and while you can certainly speed things up by relaxing constraints or optimising heavily for a specific use case, that's not a panacea or global solution to the problem of a generic, fast way to store and access structured data. On the other hand, the assertion that Oracle, MySQL and SQL Server have become fat and inefficient because of backwards compatibility requirements definitely strikes a chord with me personally.

The sheer variety of NoSQL candidates (this web page lists ~122!) is evidence that the space is still immature. I don't have a problem with that (every technology goes through the same cycle), but it does raise one nasty problem: what happens if the candidate you back now, in 2012, has disappeared by 2015?

The current NoSQL marketplace demands a defensive architecture approach - it's reasonable to expect that over the next three years some promising current candidates will lose momentum and support, others will merge and still others will be bought up by a commercial RDBMS vendor, and become quite costly to license.

What we need is a good, implementation-independent abstraction layer to model the reading and writing from and to a NoSQL store. No hard coding of specific implementation details into multiple layers of your application - instead segregate that reading and writing code into a layer that is written with change in mind - we're talking about pluggable modules, sensible use of interfaces and design patterns to make the replacement of your current NoSQL squeeze as low-pain as possible if and when that replacement is ever needed.
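To make that concrete ahead of part two, here is a minimal sketch of the kind of seam I mean. All the names here (DocumentStore, InMemoryStore) are mine and purely illustrative - the point is that application code names an implementation in exactly one place:

```java
import java.util.HashMap;
import java.util.Map;

// The application codes against this interface only. Swapping NoSQL
// implementations means writing one new class, not touching call sites.
interface DocumentStore {
  void put(String key, String json);

  String get(String key); // returns null if the key is absent
}

// One pluggable implementation - an in-memory stand-in for demonstration.
// A Cassandra-, Mongo- or Riak-backed class would implement the same contract.
class InMemoryStore implements DocumentStore {
  private final Map<String, String> data = new HashMap<String, String>();

  @Override
  public void put(String key, String json) {
    data.put(key, json);
  }

  @Override
  public String get(String key) {
    return data.get(key);
  }
}

public class StoreDemo {
  public static void main(String[] args) {
    // The only line that names a concrete implementation - the seam
    DocumentStore store = new InMemoryStore();
    store.put("order:42", "{\"total\": 100000}");
    System.out.println(store.get("order:42"));
  }
}
```

If your current NoSQL squeeze disappears in 2015, the blast radius is one class plus the single construction site (or the factory / injection config that wraps it).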

If the future shows that the current trade-offs made in the NoSQL space (roughly summed up as - a weaker take on A(tomicity),C(onsistency), I(solation) or D(urability), plus with your own favourite blend of Brewer's CAP theorem) are rendered unnecessary by software and hardware advances (as is very likely to be the case), then the API should ideally insulate our application code from this change.

There are interesting moves afoot that demonstrate that the community is actively thinking about this, specifically the very recent announcement of UnQL (the NoSQL equivalent of SQL - i.e. a unified NoSQL Query Language). That's good, but UnQL is young enough to shrivel and die just like any of the NoSQL implementations themselves. Also, we know that what inspired UnQL - SQL - is itself fragmented, with vendor-specific extensions like T-SQL from Microsoft and PL/SQL from Oracle.

So then, in part one of this two-parter, I've worked to justify what's coming in part two - a minimal set of Java classes and interfaces to provide a concrete implementation of the abstract ideas discussed above.

Sunday, July 31, 2011

New Google Analytics location report doesn't like Connacht so much..

The new UI for Google Analytics has a distinctly Cromwellian vibe to it, as the screenshot below shows. Is this just my GA account, or does everyone else see Galway and Sligo a bit more surrounded by the Atlantic than normal?