Monday, October 31, 2011

Books, tutorials, and talks on Mahout

WatchMaker: Framework for genetic programming

Reading through the Apache Mahout wiki, I ran across the "Genetic Programming" section, which lists WatchMaker. The wiki page it points to, however, has essentially no description of WatchMaker and not even a link to the project. Here's the link to the WatchMaker web page, which tells us that:
The Watchmaker Framework is an extensible, high-performance, object-oriented framework for implementing platform-independent evolutionary/genetic algorithms in Java. The framework provides type-safe evolution for arbitrary types via a non-invasive API. The Watchmaker Framework is Open Source software, free to download and use subject to the terms of the Apache Software Licence, Version 2.0.
To briefly summarize evolutionary algorithms, of which genetic programming is one type (from the WatchMaker User Manual):
Evolutionary algorithms (EAs) are inspired by the biological model of evolution and natural selection first proposed by Charles Darwin in 1859. In the natural world, evolution helps species adapt to their environments. Environmental factors that influence the survival prospects of an organism include climate, availability of food and the dangers of predators.
Species change over the course of many generations. Mutations occur randomly. Some mutations will be advantageous, but many will be useless or detrimental. Progress comes from the feedback provided by non-random natural selection.
Evolutionary algorithms are based on a simplified model of this biological evolution. To solve a particular problem we create an environment in which potential solutions can evolve. The environment is shaped by the parameters of the problem and encourages the evolution of good solutions.
The field of Evolutionary Computation encompasses several types of evolutionary algorithm. These include Genetic Algorithms (GAs), Evolution Strategies, Genetic Programming (GP), Evolutionary Programming and Learning Classifier Systems.
The most common type of evolutionary algorithm is the generational genetic algorithm. We'll cover other EA variants in later chapters but, for now, all of the evolutionary algorithms that we meet will be some kind of generational GA.
The basic outline of a generational GA is as follows (most other EA variants are broadly similar). A population of candidate solutions is iteratively evolved over many generations. Mimicking the concept of natural selection in biology, the survival of candidates (or their offspring) from generation to generation in an EA is governed by a fitness function that evaluates each candidate according to how close it is to the desired outcome, and a selection strategy that favours the better solutions. Over time, the quality of the solutions in the population should improve. If the program is successful, we can terminate the evolution once it has found a solution that is good enough.
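To make that outline concrete, here is a minimal sketch of a generational GA in plain Python. It does not use Watchmaker's Java API; the target string, the function names, and every parameter value (population size, tournament size, mutation rate) are assumptions picked for illustration. The fitness function counts matching characters, and tournament selection stands in for "a selection strategy that favours the better solutions".

```python
import random
import string

ALPHABET = string.ascii_uppercase + " "
TARGET = "HELLO WORLD"  # hypothetical desired outcome, chosen for illustration


def fitness(candidate):
    """Number of positions that match the target (higher is better)."""
    return sum(a == b for a, b in zip(candidate, TARGET))


def tournament(population, rng, size=3):
    """Selection strategy: the fittest of a small random sample."""
    return max(rng.sample(population, size), key=fitness)


def crossover(a, b, rng):
    """Single-point crossover: head of one parent, tail of the other."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]


def mutate(candidate, rng, rate=0.05):
    """Replace each character with a random one with probability `rate`."""
    return "".join(
        rng.choice(ALPHABET) if rng.random() < rate else ch for ch in candidate
    )


def evolve(pop_size=100, max_generations=1000, seed=0):
    """Evolve random strings toward TARGET; return (best, generations used)."""
    rng = random.Random(seed)
    population = [
        "".join(rng.choice(ALPHABET) for _ in TARGET) for _ in range(pop_size)
    ]
    for generation in range(max_generations):
        best = max(population, key=fitness)
        if fitness(best) == len(TARGET):
            return best, generation  # good enough: stop evolving
        next_generation = [best]  # elitism: the best candidate always survives
        while len(next_generation) < pop_size:
            child = crossover(
                tournament(population, rng), tournament(population, rng), rng
            )
            next_generation.append(mutate(child, rng))
        population = next_generation
    return max(population, key=fitness), max_generations


if __name__ == "__main__":
    best, generations = evolve()
    print(best, generations)
```

With these settings the run will usually reach the target well before the generation limit, though nothing guarantees it; that is why the loop also has a hard cap.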
-- Jack Krupansky

Link to Apache Mahout Taste documentation

While reading through the Apache Mahout wiki, I ran across a broken link to the Taste documentation (the broken link) on the Quickstart page. The Apache Mahout Taste documentation is here (the proper link).
Apache Taste is a recommendation engine:
Taste is a flexible, fast collaborative filtering engine for Java. The engine takes users' preferences for items ("tastes") and returns estimated preferences for other items. For example, a site that sells books or CDs could easily use Taste to figure out, from past purchase data, which CDs a customer might be interested in listening to.
Taste provides a rich set of components from which you can construct a customized recommender system from a selection of algorithms. Taste is designed to be enterprise-ready; it's designed for performance, scalability and flexibility. Taste is not just for Java; it can be run as an external server which exposes recommendation logic to your application via web services and HTTP.
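To show what "returns estimated preferences for other items" can mean in practice, here is a rough sketch of user-based collaborative filtering in plain Python. This is not Taste's API or necessarily the algorithm it uses: the ratings data, the function names, and the choice of cosine similarity are all assumptions for illustration.

```python
from math import sqrt

# Hypothetical ratings: user -> {item: preference}
PREFS = {
    "alice": {"cd_a": 5.0, "cd_b": 3.0, "cd_c": 4.0},
    "bob": {"cd_a": 4.0, "cd_b": 3.0, "cd_c": 5.0, "cd_d": 4.0},
    "carol": {"cd_a": 1.0, "cd_b": 5.0, "cd_d": 2.0},
}


def cosine_similarity(a, b):
    """Cosine similarity over the items both users have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = sqrt(sum(a[i] ** 2 for i in shared))
    norm_b = sqrt(sum(b[i] ** 2 for i in shared))
    return dot / (norm_a * norm_b)


def estimate(prefs, user, item):
    """Similarity-weighted average of other users' ratings for `item`."""
    weighted_sum = total_weight = 0.0
    for other, ratings in prefs.items():
        if other == user or item not in ratings:
            continue
        weight = cosine_similarity(prefs[user], ratings)
        weighted_sum += weight * ratings[item]
        total_weight += weight
    return weighted_sum / total_weight if total_weight else None


def recommend(prefs, user, n=3):
    """Top-n items the user has not rated, ordered by estimated preference."""
    all_items = {item for ratings in prefs.values() for item in ratings}
    scored = []
    for item in all_items - set(prefs[user]):
        score = estimate(prefs, user, item)
        if score is not None:
            scored.append((score, item))
    return [item for _, item in sorted(scored, reverse=True)[:n]]


if __name__ == "__main__":
    print(recommend(PREFS, "alice"))
```

Here alice has not rated cd_d, so her estimate for it is pulled toward bob's rating more than carol's, because her tastes on the shared items look more like bob's.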

Leo Breiman's paper on Random Forests

I was reading the Apache Mahout wiki, got to the Breiman Example page, and saw that the link to Leo Breiman's paper (Random Forests) is broken. Here is the correct link to the paper. You can download the PDF as well.
The subject of the paper is classification (and regression) using ensembles of decision trees.

Monday, October 24, 2011

Moving on from Hadoop to Mahout

I've finished reading the Apache Hadoop tutorial (from Yahoo). I didn't do any of the exercises, but at least I have more than a passing familiarity with what Hadoop is all about and how it is well-positioned to cope with Big Data.
Now, I'm moving on to reading up on Apache Mahout. Mahout's goal is to build scalable machine learning libraries for recommendation mining, clustering, classification, frequent itemset mining, and similar purposes. There is actually a book on Mahout (Mahout in Action), but for now I'll focus on "mining" the Mahout wiki, which seems to have a lot of useful info and is likely sufficient for my immediate needs.
Mahout is implemented on top of Hadoop.
In the back of my head I'm thinking about entity extraction, also called named-entity extraction or named-entity recognition (NER). In theory, Mahout greatly facilitates NER.

Monday, October 17, 2011

Looking at Hadoop and Mahout

Since I finished my most recent contract work assignment on Friday, I'm going to spend some time reading up on Hadoop and Mahout.
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.
The Apache Mahout machine learning library's goal is to build scalable machine learning libraries.
Currently Mahout supports mainly four use cases:
Recommendation mining takes users' behavior and from that tries to find items users might like.
Clustering takes e.g. text documents and groups them into groups of topically related documents.
Classification learns from existing categorized documents what documents of a specific category look like and is able to assign unlabelled documents to the (hopefully) correct category.
Frequent itemset mining takes a set of item groups (terms in a query session, shopping cart content) and identifies which individual items usually appear together.
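As a small illustration of the last use case, here is a sketch in plain Python that counts which pairs of items appear together across shopping carts and keeps the pairs seen at least a minimum number of times. This brute-force pair count is only a stand-in to show the idea; it is not Mahout's implementation, and the basket data, function name, and threshold are made up for the example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping carts
BASKETS = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "jam"],
    ["bread", "butter", "milk"],
]


def frequent_pairs(baskets, min_support):
    """Return {(item_a, item_b): count} for pairs in >= min_support baskets."""
    counts = Counter()
    for basket in baskets:
        # De-duplicate and sort so each pair has one canonical ordering.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


if __name__ == "__main__":
    print(frequent_pairs(BASKETS, min_support=3))
    # prints {('bread', 'milk'): 3}
```

Counting every pair does not scale to large catalogs, which is exactly the kind of problem a cluster-based library is meant to address.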
In short, Big Data and cluster-based computing.
Today I'm reading through the Hadoop Tutorial.

-- Jack Krupansky