
A book on Mahout, the Apache machine learning library




Mahout started in 2008 as a subproject of Apache Lucene, the open-source search engine. Lucene needed a machine learning library, and the work on it was eventually split off from Lucene. After absorbing the open-source Taste collaborative filtering project, Mahout became a top-level Apache project in 2010.


Mahout began life in 2008 as a subproject of Apache’s Lucene project, which provides the well-known open source search engine of the same name. Lucene provides advanced implementations of search, text mining, and information-retrieval techniques. In the universe of computer science, these concepts are adjacent to machine learning techniques like clustering and, to an extent, classification. As a result, some of the work of the Lucene committers that fell more into these machine learning areas was spun off into its own subproject. Soon after, Mahout absorbed the Taste open source collaborative filtering project.


As of April 2010, Mahout became a top-level Apache project in its own right....


<Excerpted from the original text of Mahout in Action>


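I first heard about this last year but did not pay much attention to it. I recently bought the book and decided to take a proper look. Searching on Google, I found that the original English edition can be downloaded as a PDF, so readers who are comfortable with English can easily find it that way. Just in case, I am also attaching it here (compressed, because the file is over 10 MB).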


[Mahout.in.Action(2011)].Sean.Owen.zip



Below is the list of algorithms covered in this book (see https://cwiki.apache.org/confluence/display/MAHOUT/Algorithms).


Classification

A general introduction to the most common text classification algorithms can be found at Google Answers (http://answers.google.com/answers/main?cmd=threadview&id=225316). For information on the algorithms implemented in Mahout (or scheduled for implementation), please visit the following pages.



Logistic Regression (SGD)

Bayesian

Support Vector Machines (SVM) (open: MAHOUT-14, MAHOUT-232 and MAHOUT-334)

Perceptron and Winnow (open: MAHOUT-85)

Neural Network (open, but MAHOUT-228 might help)

Random Forests (integrated - MAHOUT-122, MAHOUT-140, MAHOUT-145)

Restricted Boltzmann Machines (open, MAHOUT-375, GSOC2010)

Online Passive Aggressive (integrated, MAHOUT-702)

Boosting (awaiting patch commit, MAHOUT-716)

Hidden Markov Models (HMM) (MAHOUT-627, MAHOUT-396, MAHOUT-734) - Training is done in Map-Reduce


Clustering

Reference Reading


Canopy Clustering (MAHOUT-3 - integrated)

K-Means Clustering (MAHOUT-5 - integrated)

Fuzzy K-Means (MAHOUT-74 - integrated)

Expectation Maximization (EM) (MAHOUT-28)

Mean Shift Clustering (MAHOUT-15 - integrated)

Hierarchical Clustering (MAHOUT-19)

Dirichlet Process Clustering (MAHOUT-30 - integrated)

Latent Dirichlet Allocation (MAHOUT-123 - integrated)

Spectral Clustering (MAHOUT-363 - integrated)

Minhash Clustering (MAHOUT-344 - integrated)

Top Down Clustering (MAHOUT-843 - integrated)



Pattern Mining

Parallel FP Growth Algorithm (Also known as Frequent Itemset mining)



Regression

Locally Weighted Linear Regression (open)



Dimension reduction

Singular Value Decomposition and other Dimension Reduction Techniques (available since 0.3)


Stochastic Singular Value Decomposition with PCA workflow (PCA workflow now integrated)


Principal Components Analysis (PCA) (open)


Independent Component Analysis (open)


Gaussian Discriminative Analysis (GDA) (open)



Evolutionary Algorithms

NOTE: Watchmaker support has been removed as of 0.7.

see also: MAHOUT-56 (integrated)


You will find here information, examples, use cases, etc. related to Evolutionary Algorithms.


Introductions and Tutorials:


Evolutionary Algorithms Introduction

How to distribute the fitness evaluation using Mahout.GA

Examples:


Traveling Salesman

Class Discovery



Recommenders / Collaborative Filtering

Mahout contains both simple non-distributed recommender implementations and distributed Hadoop-based recommenders.


Non-distributed recommenders ("Taste") (integrated)

Distributed Item-Based Collaborative Filtering (integrated)

Collaborative Filtering using a parallel matrix factorization (integrated)

First-timer FAQ
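
To make the idea of a non-distributed, item-based recommender more concrete, here is a minimal sketch in plain Python. It is not Mahout's Taste API: the function names `item_similarities` and `recommend`, the cosine similarity measure, and the toy ratings are all assumptions chosen for illustration. It scores each unseen item as a similarity-weighted average of the ratings the user has already given.

```python
from collections import defaultdict
from math import sqrt


def item_similarities(ratings):
    """Cosine similarity between items, given {user: {item: rating}}.

    Hypothetical helper for illustration; not part of Mahout.
    """
    # Transpose the data into {item: {user: rating}}.
    item_vectors = defaultdict(dict)
    for user, prefs in ratings.items():
        for item, value in prefs.items():
            item_vectors[item][user] = value

    norms = {
        item: sqrt(sum(v * v for v in vec.values()))
        for item, vec in item_vectors.items()
    }

    sims = defaultdict(dict)
    items = sorted(item_vectors)
    for i, a in enumerate(items):
        for b in items[i + 1:]:
            common = item_vectors[a].keys() & item_vectors[b].keys()
            if not common or norms[a] == 0 or norms[b] == 0:
                continue  # no shared users, so no similarity is recorded
            dot = sum(item_vectors[a][u] * item_vectors[b][u] for u in common)
            sims[a][b] = sims[b][a] = dot / (norms[a] * norms[b])
    return sims


def recommend(ratings, user, top_n=3):
    """Rank unseen items by a similarity-weighted average of the user's ratings."""
    sims = item_similarities(ratings)
    seen = ratings[user]
    scores = {}
    for candidate, neighbours in sims.items():
        if candidate in seen:
            continue
        weighted = 0.0
        total = 0.0
        for item, value in seen.items():
            s = neighbours.get(item, 0.0)
            weighted += s * value
            total += s
        if total > 0:
            scores[candidate] = weighted / total
    # Highest score first; ties broken by item name for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


# Toy data (made up for this example).
ratings = {
    "alice": {"A": 5, "B": 3},
    "bob": {"A": 4, "B": 2, "C": 5},
    "carol": {"B": 4, "C": 4},
}
recs = recommend(ratings, "alice")
```

Everything here fits in memory on one machine, which is the situation the non-distributed recommenders target. The distributed variants listed above split the same kind of similarity computation across Hadoop jobs.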



Vector Similarity

Mahout contains implementations that allow one to compare one or more vectors with another set of vectors. This can be useful if one is, for instance, trying to calculate the pairwise similarity between all documents (or a subset of docs) in a corpus.


RowSimilarityJob – Builds an inverted index and then computes distances between items that have co-occurrences. This is a fully distributed calculation.

VectorDistanceJob – Does a map side join between a set of "seed" vectors and all of the input vectors.
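
The inverted-index idea behind RowSimilarityJob can be sketched on a single machine as follows. This is an illustration under assumptions, not Mahout's implementation: the function name `pairwise_cosine`, the choice of cosine similarity, and the sample documents are hypothetical. The point is that only rows sharing at least one column are ever compared, so pairs with no co-occurrence cost nothing.

```python
from collections import defaultdict
from math import sqrt


def pairwise_cosine(rows):
    """Cosine similarity for all row pairs that share at least one column.

    rows: {row_id: {column: weight}} sparse vectors. Row ids must be
    mutually comparable (e.g. all strings) so each pair gets one key.
    Hypothetical helper for illustration; not part of Mahout.
    """
    # Step 1: build the inverted index, column -> [(row_id, weight), ...].
    index = defaultdict(list)
    for row_id, vector in rows.items():
        for column, weight in vector.items():
            if weight != 0:
                index[column].append((row_id, weight))

    # Step 2: accumulate dot products only for rows that co-occur in a column.
    dots = defaultdict(float)
    for postings in index.values():
        for i, (a, wa) in enumerate(postings):
            for b, wb in postings[i + 1:]:
                key = (a, b) if a < b else (b, a)
                dots[key] += wa * wb

    # Step 3: normalise each dot product by the two vector lengths.
    norms = {r: sqrt(sum(w * w for w in v.values())) for r, v in rows.items()}
    return {
        (a, b): dot / (norms[a] * norms[b])
        for (a, b), dot in dots.items()
    }


# Toy term-count vectors (made up for this example).
docs = {
    "d1": {"apache": 1, "mahout": 1},
    "d2": {"mahout": 1, "lucene": 1},
    "d3": {"hadoop": 1},
}
similarities = pairwise_cosine(docs)
```

In this toy corpus only `d1` and `d2` share a term, so they form the only pair that is scored; `d3` never enters a comparison.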



Other

Collocations



Non-MapReduce algorithms

Some algorithms and applications that appeared on the mailing list have not been published in MapReduce form so far. As we do not restrict ourselves to Hadoop-only versions, these proposals are listed here.