BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//MLUNCH//NONSGML v1.0//EN
X-WR-CALNAME:MLUNCH
X-WR-TIMEZONE:America/New_York
BEGIN:VEVENT
DTSTART:20070912T120000
DTEND:20070912T133000
SUMMARY:A Theory of Similarity Functions for Learning and Clustering
DESCRIPTION:Nina Balcan, CMU<br><br>Kernel methods have proven to be powerful tools in 
 machine learning. They perform well in many applications, and there is also
  a well-developed theory of sufficient conditions for a kernel to be useful
  for a given learning problem. However, while a kernel can be thought of as
  just a pairwise similarity function that satisfies additional mathematical
  properties, this theory requires viewing kernels as implicit (and often di
 fficult to characterize) maps into high-dimensional spaces. In this work we
  develop an alternative theory of learning with similarity functions more g
 enerally that does not require reference to implicit spaces, and does not r
 equire the function to be positive semi-definite. Our results also generali
 ze the standard theory in the sense that any good kernel function under the
  usual definition can be shown to also be a good similarity function under 
 our definition.<br><br>An interesting feature of the proposed framework is 
 that it can also be applied to learning from purely unlabeled data, i.e., c
 lustering. In particular, one can ask how much stronger the properties of a
  similarity function should be (in terms of its relation to the unknown des
 ired clustering) so that it can be used to cluster well: to learn well with
 out any label information at all. We find that if we are willing to relax t
 he objective (for example, allow the algorithm to produce a hierarchical cl
 ustering that we will call successful if some pruning is close to the corre
 ct answer), then this question leads to a number of interesting graph-theor
 etic and game-theoretic properties that are sufficient to cluster well. Our
  work can be viewed as an approach to defining a PAC model for clustering w
 ith non-interactive feedback.<br><br>This talk is based on work joint with 
 Avrim Blum and Santosh Vempala.<br><br>Food: Mexican
LOCATION:Levine 512
END:VEVENT
BEGIN:VEVENT
DTSTART:20070926T120000
DTEND:20070926T133000
SUMMARY:Privacy-Preserving Belief Propagation and Sampling
DESCRIPTION:Jenn Wortman, CIS<br><br>Consider a network of human social contacts, and i
 magine that each party would like to compute or estimate their probability 
 of having contracted a contagious disease, which depends on the exposures t
 o the disease of their immediate neighbors in the network. If network parti
 cipants (or their proxy algorithms) engage in standard belief propagation, 
 each party would learn their probability of exposure conditioned on any evi
 dence, but could also learn a great deal more, including information about 
 the exposure probabilities of their neighbors. Obviously such leakage of no
 n-local information is highly undesirable in settings where we regard each 
 party in the network as a self-interested agent, and privacy is paramount.<
 br><br>In this work, we provide privacy-preserving versions of inference an
 d Gibbs sampling.  By privacy-preserving in the example above, we mean that
  each party learns their conditional probability of exposure to the disease
  and absolutely nothing else. More precisely, anything a party can efficien
 tly compute after having participated in the protocol, they could have effi
 ciently computed alone given only the value of their own conditional probab
 ility.  Thus the protocol leaks no additional information beyond its desire
 d outputs.<br><br>Classical and powerful tools from cryptography provide so
 lutions to these problems, but with the significant drawback of entirely ce
 ntralizing the privacy-preserving computation. Our protocols are faithful t
 o the network topology, requiring only the passing of messages between part
 ies separated by one or two hops in the network. Furthermore, our protocols
  largely preserve the algebraic structure of the original algorithms (for i
 nstance, the sum-product structure of belief propagation) and enjoy all the
  correctness guarantees of the originals (such as exact inference in trees 
 for belief prop or convergence of Gibbs sampling to the joint distribution)
 . Our technical methods show how to blend tools from cryptography (secure m
 ultiparty computation and blindable encryption) with local message-passing 
 algorithms in a way that preserves the original computations, but in which 
 all messages appear to be randomly distributed from the viewpoint of any in
 dividual.<br><br>This talk is based on joint work with Michael Kearns and J
 insong Tan. <br><br>Food: Thai
LOCATION:Levine 512
END:VEVENT
BEGIN:VEVENT
DTSTART:20071003T123000
DTEND:20071003T133000
SUMMARY:READING GROUP: Multi-View Regression via Canonical Correlation Analysis
DESCRIPTION:presented by Alex Kulesza, CIS<br><br>Sham M. Kakade, Dean P. Foster<br><br
 >In the multi-view regression problem, we have a regression problem where t
 he input variable (which is a real vector) can be partitioned into two diff
 erent views, where it is assumed that either view of the input is sufficien
 t to make accurate predictions -- this is essentially (a significantly weak
 er version of) the co-training assumption for the regression problem.<br><b
 r>We provide a semi-supervised algorithm which first uses unlabeled data   
 to learn a norm (or, equivalently, a kernel) and then uses labeled   data i
 n a ridge regression algorithm (with this induced norm) to   provide the pr
 edictor. The unlabeled data is used via canonical   correlation analysis (C
 CA, which is a closely related to PCA for two   random variables) to derive
  an appropriate norm over functions. We are   able to characterize the intr
 insic dimensionality of the subsequent   ridge regression problem (which us
 es this norm) by the correlation   coefficients provided by CCA in a rather
  simple expression.   Interestingly, the norm used by the ridge regression 
 algorithm is   derived from CCA, unlike in standard kernel methods where a 
 special   apriori norm is assumed (i.e. a Banach space is assumed). We disc
 uss   how this result shows that unlabeled data can decrease the sample   c
 omplexity.<br><br>Link: http://ttic.uchicago.edu/~sham/papers/ml/regression
 _cca.pdf<br><br>Food: Cosi sandwiches
LOCATION:Levine 512
END:VEVENT
BEGIN:VEVENT
DTSTART:20071010T120000
DTEND:20071010T133000
SUMMARY:Learning Bounds for Domain Adaptation
DESCRIPTION:John Blitzer, CIS<br><br>Empirical risk minimization offers well-known lear
 ning guarantees when training and test data come from the same domain. In t
 he real world, though, we often wish to adapt a classifier from a source do
 main with a large amount of training data to different target domain with v
 ery little training data. In this work we give uniform convergence bounds f
 or algorithms that minimize a convex combination of source and target empir
 ical risk. The bounds explicitly model the inherent trade-off between train
 ing on a large but inaccurate source data set and a small but accurate targ
 et training set. Our theory also gives results when we have multiple source
  domains, each of which may have a different number of instances, and we ex
 hibit cases in which minimizing a non-uniform combination of source risks c
 an achieve much lower target error than standard empirical risk minimizatio
 n.<br><br>Joint work with Koby Crammer, Alex Kulesza, Fernando Pereira, &am
 p; Jenn Wortman.<br><br>Food: Mama's Vegetarian
LOCATION:Levine 512
END:VEVENT
BEGIN:VEVENT
DTSTART:20071017T123000
DTEND:20071017T133000
SUMMARY:READING GROUP: Semi-supervised Learning on Riemannian Manifolds
DESCRIPTION:presented by Ted Sandler, CIS<br><br>Mikhail Belkin, Partha Niyogi<br><br>W
 e consider the general problem of utilizing both labeled and unlabeled data
  to improve classification accuracy. Under the assumption that the data lie
  on a submanifold in a high dimensional space, we develop an algorithmic fr
 amework to classify a partially labeled data set in a principled manner. Th
 e central idea of our approach is that classification functions are natural
 ly defined only on the submanifold in question rather than the total ambien
 t space. Using the Laplace-Beltrami operator one produces a basis (the Lapl
 acian Eigenmaps) for a Hilbert space of square integrable functions on the 
 submanifold. To recover such a basis, only unlabeled examples are required.
  Once such a basis is obtained, training can be performed using the labeled
  data set.<br><br>Our algorithm models the manifold using the adjacency gra
 ph for the data and approximates the Laplace- Beltrami operator by the grap
 h Laplacian. We provide details of the algorithm, its theoretical justifica
 tion, and several practical applications for image, speech, and text classi
 fication.<br><br>Link: http://www.cse.ohio-state.edu/~mbelkin/paper2.pdf<br
 ><br>Food: Sushi
LOCATION:Levine 512
END:VEVENT
BEGIN:VEVENT
DTSTART:20071024T123000
DTEND:20071024T140000
SUMMARY:Finding cohesive clusters for analyzing knowledge communities
DESCRIPTION:Bill Kandylas, CIS<br><br>Documents and authors can be clustered into ``kno
 wledge communities'' based on the overlap in the papers they cite. We intro
 duce a new clustering algorithm, Streemer, which finds cohesive foreground 
 clusters embedded in a diffuse background, and use it to identify knowledge
  communities as foreground clusters of papers which share common citations.
  To analyze the evolution of these communities over time, we build predicti
 ve models with features based on the citation structure, the vocabulary of 
 the papers, and the affiliations and prestige of the authors. Findings incl
 ude that scientific knowledge communities tend to grow more rapidly if thei
 r publications build on diverse information and if they use a narrow vocabu
 lary.<br><br>Food: Chinese
LOCATION:Levine 512
END:VEVENT
BEGIN:VEVENT
DTSTART:20071031T123000
DTEND:20071031T133000
SUMMARY:READING GROUP: The Rademacher Complexity of Co-Regularized Kernel Classes
DESCRIPTION:presented by John Blitzer, CIS<br><br>David Rosenberg and Peter L. Bartlett
 <br><br>In the multi-view approach to semi-supervised learning, we choose o
 ne predictor from each of multiple hypothesis classes, and we `co-regulariz
 e' our choices by penalizing disagreement among the predictors on the unlab
 eled data. In this paper we examine the co-regularization method used in th
 e recently proposed co-regularized least squares (CoRLS) algorithm. In this
  method we have two hypothesis classes, each a reproducing kernel Hilbert s
 pace (RKHS), and we co-regularize by penalizing the average squared differe
 nce in predictions on the unlabeled data. We get our final predictor by tak
 ing the pointwise average of the predictors from each view. We call the set
  of predictors that can result from this procedure the co-regularized hypot
 hesis class. The main result of this paper is a tight bound on the Rademach
 er complexity of the co-regularized hypothesis class in terms of the kernel
  matrices of each RKHS. We find that the co-regularization reduces the Rade
 macher complexity of the hypothesis class by an amount depending on how dif
 ferent the two views are, measured by a data dependent metric. We then use 
 standard techniques to bound the gap between training error and test error 
 for the CoRLS algorithm. Experimentally, we find that the amount of reducti
 on in complexity introduced by co-regularization correlates with the amount
  of improvement that co-regularization gives in the CoRLS algorithm.<br><br
 >Link: http://www.stat.berkeley.edu/~bartlett/papers/rb-rcckc-06.pdf<br><br
 >Food: Shalom's pizza
LOCATION:Levine 512
END:VEVENT
BEGIN:VEVENT
DTSTART:20071107T123000
DTEND:20071107T133000
SUMMARY:READING GROUP: Gaussian mean shift is an EM algorithm
DESCRIPTION:presented by Kuzman Ganchev, CIS<br><br>Miguel A. Carreira-Perpinan<br><br>
 The mean-shift algorithm, based on ideas proposed by Fukunaga and Hostetler
  (1975), is a hill-climbing algorithm on the density defined by a finite mi
 xture or a kernel density estimate. Mean-shift can be used as a nonparametr
 ic clustering method and has attracted recent attention in computer vision 
 applications such as image segmentation or tracking. We show that, when the
  kernel is Gaussian, mean-shift is an expectation maximisation (EM) algorit
 hm, and when the kernel is non-gaussian, mean-shift is a generalised EM alg
 orithm. This implies that mean-shift converges from almost any starting poi
 nt and that, in general, its convergence is of linear order. For Gaussian m
 ean-shift we show: (1) the rate of linear convergence approaches 0 (superli
 near convergence) for very narrow or very wide kernels, but is often close 
 to 1 (thus extremely slow) for intermediate widths, and exactly 1 (sublinea
 r convergence) for widths at which modes merge; (2) the iterates approach t
 he mode along the local principal component of the data points from the ins
 ide of the convex hull of the data points; (3) the convergence domains are 
 nonconvex and can be disconnected and show fractal behaviour. We suggest wa
 ys of accelerating mean-shift based on the EM interpretation.<br><br>Link: 
 http://www.csee.ogi.edu/~miguel/papers/pami07.pdf<br><br>Food: Indian
LOCATION:Levine 512
END:VEVENT
BEGIN:VEVENT
DTSTART:20071114T123000
DTEND:20071114T140000
SUMMARY:Hidden Factor Models and Their Applications in Autonomic Computing
DESCRIPTION:Irina Rish, IBM<br><br>Many important tasks arising in autonomic computing,
  such as problem diagnosis, performance prediction and resource allocation 
 in complex distributed compter systems and networks, require making inferen
 ces about unobserved system performance characteristics from the observed o
 nes, and especially relating global performance measurements such as outcom
 es of the end-to-end transactions (probes) and local measures such as link 
 and node availability and performance (the problem often referred to as net
 work tomography in the network management context). Clearly, there are many
  factors affecting the performance that may not be known and/or explicitly 
 represented in the system's model, thus motivating us to explore hidden fac
 tor models. In this talk, I will discuss several applications of hidden fac
 tor models to autonomic computing, including 1) active learning with linear
  factor models using max-margin matrix factorization (MMMF) and its applica
 tions to server selection in content distribution systems, and related coll
 aborative prediction problems, and 2) blind source separation (BSS) using n
 on-negative matrix factorization for network bottleneck diagnosis in the ab
 sence of dependency (routing) information.<br><br>If time permits, I will a
 lso discuss our most recent work on hidden-factor models for supervised dim
 ensionality reduction (SDR) that combine feature extraction (dimensionality
  reduction) with learning a predictive model in a unified optimization fram
 ework, and generalize to various data types, by combining appropriate gener
 alized linear models (GLMs) for features and class, and exploiting the addi
 tive property of auxiliary functions to derive combined auxiliary functions
  that lead to provably convergent algorithms. <br><br>Food: Sushi
LOCATION:Levine 512
END:VEVENT
BEGIN:VEVENT
DTSTART:20071121T123000
DTEND:20071121T140000
SUMMARY:Blind channel identification for speech dereverberation using L1-norm sparse learning
DESCRIPTION:Yuanqing Lin, ESE<br><br>Speech dereverberation remains an open problem aft
 er more than three decades of research. The most challenging step in speech
  dereverberation is blind channel identification (BCI). Although many BCI a
 pproaches have been developed, their performance is still far from satisfac
 tory for practical applications. The main difficulty in BCI lies in finding
  an appropriate acoustic model, which not only can effectively resolve solu
 tion degeneracies due to the lack of knowledge of the source, but also robu
 stly models real acoustic environments. This paper proposes a sparse acoust
 ic room impulse response (RIR) model for BCI, that is, an acoustic RIR can 
 be modeled by a sparse FIR filter. Under this model, we show how to formula
 te the BCI of a single-input multiple-output (SIMO) system into a L1-norm r
 egularized least squares (LS) problem, which is convex and can be solved ef
 ficiently with guaranteed global convergence. The sparseness of solutions i
 s controlled by L1-norm regularization parameters. We propose a sparse lear
 ning scheme that infers the optimal L1-norm regularization parameters direc
 tly from microphone observations under a Bayesian framework. Our results sh
 ow that the proposed approach is effective and robust, and it yields source
  estimates in real acoustic environments with high fidelity to anechoic cha
 mber measurements.<br><br>This talk will be a 20 minute rehearsal for NIPS 
 2007.<br><br>Food: Chinese
LOCATION:Levine 512
END:VEVENT
BEGIN:VEVENT
DTSTART:20071128T120000
DTEND:20071128T133000
SUMMARY:User Models for Email Activity Management
DESCRIPTION:Mark Dredze, CIS<br><br>A single user activity, such as planning a conferen
 ce trip, typically involves multiple actions. Although these actions may in
 volve several applications, the central point of co-ordination for any part
 icular activity is usually email. Previous work on email activity managemen
 t has focused on combining supervised classifiers to cluster emails by acti
 vity. In this paper, we take a different approach and present an unsupervis
 ed framework for email activity clustering. We combine three information so
 urces---namely, document similarity, message recipients and authors, and th
 read information---to form an unsupervised, non-parametric Bayesian user mo
 del. This approach enables email activities to be inferred without any user
  input. We evaluate this approach using 1175 labeled email messages, showin
 g that our model does well at clustering emails by activity.<br><br>The pur
 pose of this talk will be to solicit feedback and suggestions for future wo
 rk. Specifically, I will talk about what worked, what didn't work, and what
  we can do next. <br><br>Food: Maccabeam
LOCATION:Levine 512
END:VEVENT
BEGIN:VEVENT
DTSTART:20071205T123000
DTEND:20071205T140000
SUMMARY:Bottom-up Recognition and Parsing of the Human Body
DESCRIPTION:Praveen Srinivasan, CIS<br><br>Recognizing humans, estimating their pose an
 d segmenting their body parts are key to high-level image understanding. Be
 cause humans are highly articulated, the range of deformations they undergo
  makes this task extremely challenging. Previous methods have focused large
 ly on heuristics or pairwise part models in approaching this problem. We pr
 opose a bottom-up growing, similar to parsing, of increasingly more complet
 e partial body masks guided by a composition tree. At each level of the gro
 wing process, we evaluate the partial body masks directly via shape matchin
 g with exemplars (and also image features), without regard to how the hypot
 heses are formed. The body is evaluated as a whole, not the sum of its part
 s, unlike previous approaches. Multiple image segmentations are included at
  each of the levels of the growing/parsing, to augment existing hypotheses 
 or to introduce ones. Our method yields both a pose estimate as well as a s
 egmentation of the human. We demonstrate competitive results on this challe
 nging task with relatively few training examples on a dataset of baseball p
 layers with wide pose variation. Our method is comparatively simple and cou
 ld be easily extended to other objects. We also give a learning framework f
 or parse ranking that allows us to keep fewer parses for similar performanc
 e. <br><br>Food: Thai
LOCATION:Levine 512
END:VEVENT
BEGIN:VEVENT
DTSTART:20071212T123000
DTEND:20071212T140000
SUMMARY:DISSERTATION DEFENSE: Domain adaptation of natural language processing systems
DESCRIPTION:John Blitzer, CIS<br><br>Statistical language processing tools are being ap
 plied to an ever wider and more varied range of linguistic data.  Researche
 rs and engineers are using statistical models to organize and understand fi
 nancial news, legal documents, biomedical abstracts, and weblog entries, am
 ong many other domains.  Many of these models are supervised at parameter e
 stimation time -- A human annotator must create a training set of examples 
 for the relevant task.  Because natural language varies so widely, collecti
 ng and curating training sets for each different domain is prohibitively ex
 pensive.  At the same time, differences in vocabulary and writing style acr
 oss domains can cause state-of-the-art supervised models to dramatically in
 crease in error.<br><br>The first part of this thesis describes structural 
 correspondence learning (SCL), a method for adapting linear discriminative 
 models from resource-rich source domains to resource-poor target domains.  
 The key idea is the use of pivot features which occur frequently and behave
  similarly in both the source and target domains.  SCL builds a shared repr
 esentation by searching for a low-dimensional feature subspace that allows 
 us to accurately predict the presence or absence of pivot features on unlab
 eled data.  Because pivot features behave similarly in the source and targe
 t domains, this subspace also provides a good feature representation for do
 main adaptation.<br><br>We demonstrate SCL on two natural language processi
 ng problems.  The first is sentiment classification of product reviews.  We
  use SCL to adapt classifiers trained on reviews of one product type to per
 form well on reviews of a different product type.  The second task is part 
 of speech tagging.  Part of speech taggers are a basic building block for m
 any text processing pipelines, and we show how to use SCL to adapt a tagger
  trained on a financial news corpus to a corpus of MEDLINE abstracts.  For 
 both tasks, SCL significantly improves over state of the art supervised mod
 els using only unlabeled target data. In situations where we have small amo
 unts of labeled target data, SCL can be combined with standard techniques t
 o give even further improvement.<br><br>In the second part of the thesis, w
 e develop a formal framework for analyzing domain adaptation tasks.  Statis
 tical learning theory yields bounds on the expected error of a model which 
 is trained on a sample from a distribution and tested on another sample fro
 m the same distribution.  For domain adaptation, however, the target sample
  is drawn from a different distribution than the source training sample, an
 d the assumptions that underly standard learning theory break down. We begi
 n by stating a generalization bound for the error of a model trained on a s
 ample from a source distribution and tested on a sample from a target distr
 ibution that depends on the divergence between the two distributions.  We d
 evelop a measure of divergence, the $HDeltaH$-divergence, that depends on t
 he hypothesis class $H$ from which we estimate our supervised model.  Unlik
 e standard measures of divergence, such as total variation or Kullback-Leib
 ler, the $HDeltaH$-divergence can be estimated from finite samples of unlab
 eled data.<br><br>We show that the representation learned by SCL results in
  a lower $HDeltaH$-divergence between source and target domains than the st
 andard feature representation.  We also analyze the domain adaptation setti
 ng in which we have a small amount of target labeled data by proving an upp
 er bound on the true target error of a model trained to minimize a convex c
 ombination of empirical source and target errors.  The bound describes an i
 ntuitive tradeoff inherent in training on both the large quantity of biased
  source data and the small quantity of unbiased target data, and we can com
 pute it from finite labeled and unlabeled samples of the source and target 
 distributions under relatively weak assumptions. <br><br>Food: Mama's Veget
 arian
LOCATION:Levine 512
END:VEVENT
BEGIN:VEVENT
DTSTART:20071218T120000
DTEND:20071218T133000
SUMMARY:Generating Summary Keywords for Emails Using Topics
DESCRIPTION:Mark Dredze, CIS<br><br>Joint work with Hanna Wallach, Danny Puller, Fernan
 do Pereira. <br><br>Email summary keywords, used to concisely represent the
  gist of an email, can help users manage and prioritize large numbers of me
 ssages. We develop an unsupervised learning framework for selecting summary
  keywords from emails using latent representations of the underlying topics
  in a user's mailbox. This approach selects words that describe each messag
 e in the context of existing topics rather than simply selecting keywords b
 ased on a single message in isolation. We present and compare four methods 
 for selecting summary keywords based on two well-known models for inferring
  latent topics: latent semantic analysis and latent Dirichlet allocation. T
 he quality of the summary keywords is assessed by generating summaries for 
 emails from twelve users in the Enron corpus. The summary keywords are then
  used in place of entire messages in two proxy tasks: automated foldering a
 nd recipient prediction. We also evaluate the extent to which summary keywo
 rds enhance the information already available in a typical email user inter
 face by repeating the same tasks using email subject lines.<br><br>Food: Ma
 ma's
LOCATION:Towne 225
END:VEVENT
BEGIN:VEVENT
DTSTART:20080130T123000
DTEND:20080130T140000
SUMMARY:Gaussian Process Product Models for Nonparametric Nonstationarity
DESCRIPTION:Ryan Adams, Cambridge<br><br>Gaussian process priors are a powerful tool fo
 r nonlinear Bayesian regression.  Simple stationary covariance functions ar
 e popular, but stationarity is often a strong assumption.  I will discuss t
 he Gaussian process product model, a way to implement nonstationary functio
 n amplitude while maintaining many of the intuitions of stationary regressi
 on.  We use expectation propagation, a variational approximation technique,
  to perform tractable inference in the model and enable hyperparameter sele
 ction.   This is joint work with Oliver Stegle.<br><br>Food: Chinese
LOCATION:Levine 512
END:VEVENT
BEGIN:VEVENT
DTSTART:20080213T123000
DTEND:20080213T140000
SUMMARY:Machine Learning @ Penn webpage discussion
DESCRIPTION:Everyone, <br><br>We'll discuss the organization, design, and upkeep of a n
 ew and awesome machine learning webpage.  Everyone who does stuff related t
 o ML is invited to have some free food and contribute ideas.<br><br>Food: C
 osi sandwiches
LOCATION:Levine 512
END:VEVENT
BEGIN:VEVENT
DTSTART:20080220T123000
DTEND:20080220T140000
SUMMARY:Learning with Locally Linear Feature Regularization
DESCRIPTION:Ted Sandler, CIS<br><br>Machine learning models are successfully being used
  for classification problems in language, vision, and biology that have mil
 lions or tens of millions of features.  A common approach to alleviating th
 e complexity of high dimensional feature spaces is to penalize the L_1 or L
 _2 norm of the parameter vector.  We may be able to design more effective r
 egularizers, though, if we possess external information about which feature
 s behave similarly with respect to the class labels.  For example, word co-
 occurrence statistics or thesauri such as Wordnet can indicate similarities
  between words that prove useful in predicting the topic or sentiment of a 
 document.  Biological databases of gene pathways provide information about 
 which sets of genes are likely to cause disease in concert.  Given such inf
 ormation about feature similarities, it seems reasonable to learn models in
  which similar weights are assigned to similar features.<br><br>Here we pre
 sent a simple regularization framework for doing just this. Similarities be
 tween features are encoded as a feature graph and a classifier is learned i
 n which each feature's coefficient is preferred to be the average of the co
 efficients of its neighboring features. The regularization criterion is clo
 sely related to locally linear embedding, a method for learning low dimensi
 onal embeddings of high-dimensional data (Saul and Roweis, 2001).  Because 
 of this, we call our method locally linear feature regularization (LLFR). E
 xperiments using LLFR for document classification show that regularizing us
 ing feature graphs learned from unlabeled data can yield large increases in
  accuracy.  We also discuss ongoing work at extending this success to other
  domains.<br><br>Joint work with John Blitzer and Lyle Ungar.<br><br>Food: 
 Thai
LOCATION:Levine 512
END:VEVENT
BEGIN:VEVENT
DTSTART:20080305T123000
DTEND:20080305T140000
SUMMARY:READING GROUP: Compressed Sensing
DESCRIPTION:presented by Ben Taskar, CIS<br><br>This week at MLUNCH Reading Group, Ben 
 Taskar will present a tutorial on compressive sensing by Richard Baraniuk, 
 Justin Romberg and Michael Wakin.<br><br>Link: <br><br>Food: Sushi
LOCATION:Levine 512
END:VEVENT
BEGIN:VEVENT
DTSTART:20080319T123000
DTEND:20080319T140000
SUMMARY:Learning from Collective Behavior 
DESCRIPTION:Jenn Wortman, CIS<br><br>Inspired by longstanding lines of research in soci
 ology and related fields, and by more recent large-population human subject
  experiments on the Internet and the Web, we initiate a study of the comput
 ational issues in learning to model collective behavior from observed data.
   We define formal models for efficient learning in such settings, and prov
 ide both general theory and specific learning algorithms for these models.<
 br><br>This talk is based on joint work with Michael Kearns. <br><br>Food: 
 Thai
LOCATION:Levine 512
END:VEVENT
BEGIN:VEVENT
DTSTART:20080326T123000
DTEND:20080326T140000
SUMMARY:Confidence-Weighted Linear Classification
DESCRIPTION:Koby Crammer, CIS<br><br> We introduce confidence-weighted linear classifie
 rs, a new class of algorithms that maintain confidence information about cl
 assifier parameters. Instead of a single weight vector, learned hypotheses 
 are given by Gaussian distributions over weight vectors, with a covariance 
 matrix that represents uncertainty about weights and correlations between d
 ifferent weights. Learning in this framework updates parameters by estimati
 ng weights and increasing model confidence.<br><br>We investigate a new onl
 ine algorithm that maintains a Gaussian distribution over weight vectors, u
 pdating the mean and variance of the model with each instance. Our analysis
  of the algorithm in the mistake-bound model improves over previous bounds 
 for some cases. In this analysis, we demonstrate the intimate connections b
 etween our model and the margin and loss analyses of previous models.<br><b
 r>Empirical evaluation on a range of NLP tasks show that our algorithm impr
 oves over other state of the art online and batch methods, learns faster in
  the online setting, ends itself to better classifier combination after par
 allel training and is suites better for active learning.<br><br>Based on jo
 int work with Mark Dredze and Fernando Pereira. <br><br>Food: Shalom Pizza 
 with the regular toppings
LOCATION:Levine 512
END:VEVENT
BEGIN:VEVENT
DTSTART:20080402T123000
DTEND:20080402T140000
SUMMARY:READING GROUP: The Dantzig selector: statistical estimation  when p is much larger than n
DESCRIPTION:presented by Ben Taskar, CIS<br><br>In many important statistical applicati
 ons, the number of variables or parameters p  is much larger than the numbe
 r of observations n. Suppose then that we have observations y = XB + z , wh
 ere B is a parameter vector of interest, X is a data matrix  with possibly 
 far fewer rows than columns, n is much less than p, and the z_i's are i.i.d
 . N(0,sigma^2). Is it  possible to estimate B reliably based on the noisy d
 ata y?<br><br>To estimate B, we introduce a new estimator--we call the Dant
 zig selector--which is solution to an L1-regularization problem.  We show t
 hat if  X obeys a uniform uncertainty principle (with unit-normed columns) 
 and if the true parameter vector B is sufficiently sparse (which here rough
 ly guarantees that the model is identifiable), then with very large probabi
 lity the estimate is close.  Our results are nonasymptotic.  Even though n 
 may be much smaller than p, our estimator achieves a loss within a logarith
 mic factor of the ideal mean squared error one would achieve with an oracle
  which would supply  perfect information about which coordinates are nonzer
 o, and which were above the noise level. <br><br>In multivariate regression
  and from a model selection viewpoint, our result says that it is possible 
 nearly to select the best subset of variables, by solving a very simple con
 vex program, which in fact can easily be recast as a convenient linear prog
 ram (LP). <br><br>Link: http://www.acm.caltech.edu/~emmanuel/papers/Dantzig
 Selector.pdf<br><br>Food: Thai
LOCATION:Levine 512
END:VEVENT
BEGIN:VEVENT
DTSTART:20080409T123000
DTEND:20080409T140000
SUMMARY:Active Error Correction for Learning Kinship Terms 
DESCRIPTION:Gary Morris, CIS<br><br>Kinship Analysis requires learning definitions of a
 ll kinship terms used in a target culture from data that is gathered increm
 entally through interviews with informants. We exploit a collection of prev
 iously learned kinship definitions from different cultures (logical &quot;d
 omain theories&quot;) to minimize the cost of learning the definitions in a
  new domain theory. We use Transfer Learning and Inductive Logic Programmin
 g (ILP) to learn kinship definitions.<br><br>We propose a novel method for 
 identifying potential errors in the data by comparing our data with definit
 ions in previously learned models of similar cultures. We actively ask info
 rmants to confirm or correct these potential errors. This Active Error Corr
 ection is enhanced by computing the similarity between the target domain th
 eory and those of other cultures, and using these similarities to guide the
  search for definitions . Our similarity-guided Active Error Correction lea
 rns more than 90% of the kin terms, even when training examples are up to 1
 9% incorrect. Suppressing error correction reduces learning accuracy by 29%
 . Because Active Error Correction correctly identifies almost all of the er
 rors in the data for human correction, it reduces the need to collect furth
 er data by 20%. Furthermore, Similarity-Guided Search reduces repeat data r
 equests of the User by 40%.<br><br>The combination of Active Error Correcti
 on and Similarity-Guided Search can be used to reduce data requirements for
  ILP learning problems. As a side benefit, this work has produced a machine
  readable collection of 50 kinship domain theories that may have broader be
 nefits in anthropology, by surfacing interesting anthropological issues.  <
 br><br>Food: Mexican
LOCATION:Levine 512
END:VEVENT
BEGIN:VEVENT
DTSTART:20080416T123000
DTEND:20080416T140000
SUMMARY:The skew spectrum of graphs --- a new class of graph invariants
DESCRIPTION:Risi Kondor, Gatsby<br><br>One of the central issues in representing graph 
 structured data instances in machine learning and other computational field
 s is designing features that are invariant to permuting the labeling of the
  vertices. We present a new, powerful system of relabeling invariant graph 
 features called the skew spectrum of graphs.<br><br>Most of the graph invar
 iants in practical use today are based on either counting local features (s
 ubgraphs, shortest paths, etc) or manipulating the eigenvalues of the graph
  Laplacian. In contrast, the skew spectrum is an algebraic approach based o
 n the action of permutations on the adjacency matrix regarded as a function
  on a specific homogeneous space of the symmetric group.<br><br>For computa
 tional efficiency, we introduce a variant called the reduced skew spectrum 
 which is computable in n3 time. Despite the fact that the reduced skew spec
 trum has only a constant number of scalar components (to be exact, 49), it 
 seems to be surprisingly good at discriminating between graphs.  Complete e
 numeration of simple graphs of small size shows that non-isomorphic graphs 
 having identical skew spectra is very rare. Experiments on standard molecul
 e classification tasks show that the skew spectrum is as good or better tha
 n state of the art graph kernels for representing graphs up to about n=200.
 <br><br>Food: Indian
LOCATION:Levine 512
END:VEVENT
BEGIN:VEVENT
DTSTART:20080423T123000
DTEND:20080423T140000
SUMMARY:Gene expression regulatory networks in human cells
DESCRIPTION:Renuka Nayak, BGS<br><br>Cells establish gene expression regulatory network
 s to orchestrate cellular processes.  Elucidating these gene expression reg
 ulatory networks remains a challenge.  Here, we approximate the gene expres
 sion regulatory network in human cells by integrating multiple types of gen
 ome-wide data.  We first obtained a crude approximation of the gene express
 ion regulatory network by constructing a &quot;co-expression network,&quot;
  using gene expression data from normal, unrelated cell lines (N=148).  To 
 construct this network, connections (i.e. edges) were drawn between any two
  genes that were significantly correlated in expression (|R| &gt; 0.50) acr
 oss unrelated individuals.  Once constructed, structural properties of this
  network were analyzed, such as the number of connections per gene.  Most g
 enes were connected to 50 other genes on average, but a few genes were conn
 ected to 100's of genes.  The most connected gene, with 314 connections, wa
 s NFKB1, a gene known to be essential to these cell lines.  While this co-e
 xpression network depicted correlative relationships between genes, we aime
 d to improve the network by identifying regulatory relationships.  We sough
 t to infer regulatory relationships between genes by incorporating results 
 from analyses performed in our lab as well as using data from external data
 bases. Focusing on NFKB1, we integrated multiple types of data to infer reg
 ulatory relationships between NFKB1 and its neighbors.  After integrating t
 hese data, we performed an experiment to further support a plausible regula
 tory relationship between NFBK1 and a gene which is implicated as a target 
 of NFKB1 in our analyses.   Our findings suggest that examining correlation
  in gene expression under a network framework can identify critical compone
 nts of the regulatory network in human cells, and that integrating multiple
  types of data can identify specific, pairwise regulatory relationships.<br
 ><br>Joint work with Michael Kearns and Vivian Cheung.<br><br>Food: Mexican
LOCATION:Levine 512
END:VEVENT
BEGIN:VEVENT
DTSTART:20080430T123000
DTEND:20080430T140000
SUMMARY:A Rate-Distortion One-Class Model and its Applications to Clustering
DESCRIPTION:Partha Pratim Talukdar, CIS<br><br>We study the problem of one-class classi
 fication, in which we seek a rule to separate a coherent subset of instance
 s similar to a few positive examples from a large pool of instances. We fin
 d that the problem can be formulated naturally in terms of a rate-distortio
 n tradeoff, which can be analyzed precisely and leads to an efficient algor
 ithm that competes well with two previous one-class methods. We also show t
 hat our model can be extended naturally to clustering problems in which it 
 is important to remove background clutter to improve cluster purity.<br><br
 >Joint work with Koby Crammer, Fernando Pereira. <br><br>Food: Indian (kosh
 er, vegetarian)
LOCATION:Levine 512
END:VEVENT
BEGIN:VEVENT
DTSTART:20080507T123000
DTEND:20080507T140000
SUMMARY:Online Methods for Multi-Domain Learning and Adaptation 
DESCRIPTION:Mark Dredze, CIS<br><br>Many NLP tasks require Multi-Domain learning, where
 by the system must learn across multiple data sources or domains, adapting 
 to new domains and sharing learned information across domains. We develop a
  new multi-domain online learning framework based on parameter combination 
 from multiple classifiers. Our algorithms can adapt from multiple source do
 mains to a new target domain, learn across multiple similar domains, and le
 arn across a large number of divergent domains. We evaluate our algorithms 
 on two popular NLP domain adaptation tasks: sentiment classification and sp
 am filtering. <br><br>Food: Shalom Pizza
LOCATION:Levine 512
END:VEVENT
BEGIN:VEVENT
DTSTART:20080514T123000
DTEND:20080514T140000
SUMMARY:Pulling Rank: Inference from Incomplete Data
DESCRIPTION:Ben Recht, IST/Caltech<br><br>How can we infer answers from a survey that i
 s only   partially filled  out? Suppose we ask a large collection of indivi
 duals a series of  questions.  We collect some data but, unfortunately, man
 y questions  are left unanswered. Is it possible for us to make an  educate
 d guess about what the missing answers should be? How can we make such a gu
 ess?  In general, with no additional assumptions, this is impossible.  Howe
 ver, if we assume that we can arrange all of the answers into a low rank ma
 trix, there is often a unique assignment for the missing entries.   The ran
 k of a matrix is equal to the dimension of the span of its columns (or rows
 ).  Matrix rank is often an efficient way  to describe system order, comple
 xity, or dimensionality.  Moreover, matrices of   very low rank can be uniq
 uely determined from very few measurements.  Consequently, rank minimizatio
 n---finding the minimum rank matrix that agrees with some partial measureme
 nts---is a recurring problem in engineering and computer science.  It arise
 s in a diverse set of fields including collaborative filtering, Euclidean e
 mbedding, multi-task learning, system identification, and controller design
 .  Unfortunately, although specific instances can often be solved with spec
 ialized algorithms, the general rank minimization problem is NP-hard.<br><b
 r>In this talk, I will discuss a convex relaxation of the rank minimization
  problem that minimizes the nuclear norm (also known as the  Schatten 1 nor
 m or the trace norm) over the given  constraint set.  I will show that this
  heuristic is guaranteed to find the  minimum rank solution in a variety of
  practical scenarios.  In particular, one can perfectly recover most low-ra
 nk matrices from a small partial subset of entries.  I will subsequently ou
 tline practical implementations of this heuristic that can be efficiently a
 pplied to large scale problems with guaranteed success.  The techniques use
 d in this analysis have strong parallels in the compressed sensing framewor
 k where one seeks to find the vector of smallest cardinality in an affine s
 et. I will  discuss how rank minimization generalizes this pre-existing con
 cept and outline a dictionary relating concepts from cardinality  minimizat
 ion to those of rank minimization.<br><br>Food: Thai
LOCATION:Levine 307
END:VEVENT
BEGIN:VEVENT
DTSTART:20080918T113000
DTEND:20080918T124500
SUMMARY:The &quot;Dinosaur Planet&quot; Approach to the Netflix Prize
DESCRIPTION:David Weiss, CIS<br><br>The Netflix Prize is an international machine learn
 ing competition organized by the Netflix Corporation with the goal of devel
 oping a highly accurate algorithm to predict a given customer's movie prefe
 rences from the collection of all known customer preferences. The Grand Pri
 ze of the competition is a cash award of $1 million, and the competition ha
 s interested thousands of participants across the globe. To facilitate deve
 lopment, Netflix has released a massive dataset comprising 100 million trai
 ning examples, the largest publicly available dataset of its kind.<br><br>I
 n this presentation, we describe the ensemble approach used by our team to 
 achieve a second place finish in the 2006-2007 year of the competition. Our
  approach involves three general phases. During ensemble training, we apply
  a large number of variants of dimensionality reduction and regression algo
 rithms, such as clustering, matrix factorization, restricted boltzmann mach
 ines, and kNN, to the Netflix dataset. To generate a single prediction from
  the ensemble, we then apply a variety of linear regression techniques, and
  we explore a large space of potential interactions between predictors and 
 other aspects of training data (e.g., the support or bias of individual use
 rs or movies.) Finally, we also explore a set of domain-specific pre- and p
 ost-processing methods that we show significantly improve the accuracy of o
 ur algorithms.<br><br>Biography: David Weiss is a new Ph.D student here at 
 CIS. His research interests are in statistical machine learning, with appli
 cations to neuroscience. David received a Bachelor's degree in Computer Sci
 ence with a Certificate in Neuroscience from Princeton University in 2007. 
 Before attending graduate school, he worked as a researcher at the Princeto
 n Computational Memory Lab, developing methods for brain state classificati
 on of functional neuroimaging data. In October of 2006, David started Team 
 Dinosaur Planet, along with fellow Princetonians David Lin and Lester Macke
 y, to compete in the Netflix Prize competition.<br><br>Food: Indian
LOCATION:Levine 315
END:VEVENT
BEGIN:VEVENT
DTSTART:20080925T120000
DTEND:20080925T133000
SUMMARY:Convex relaxations for Markov Random Field MAP estimation
DESCRIPTION:Timothy Cour, GRASP<br><br>Markov Random Fields (MRF) are commonly used in 
 computer vision and maching learning applications to model interactions of 
 interdependant variables. Finding the Maximum Aposteriori (MAP) solution of
  an MRF is in general intractable, and one has to resort to approximate sol
 utions. I will review some of the recent literature on convex relaxations f
 or MAP estimation, which is shown equivalent to a real-valued quadratic pro
 gram. There are two main strategies: (1) optimize a convex upper-bound of t
 he (non-convex) cost function (L2QP, CQP, our spectral relaxation); (2) ref
 ormulate as a linear objective using Lift and Project and optimize over a c
 onvex upper-bound of the (non-convex) feasible set (SDP,SOCP,LP relaxations
 ).<br><br>I will analyse those relaxations according to 5 fundamental crite
 ria:  (1) optimality cases (A: depends on topology: tree/single cycle/plana
 r AND numerical values: submodular ...); (2) relative dominance relationshi
 ps; (3) multiplicative/additive bounds on the quality of the approximation;
  (4) ability to handle arbitrary clique size; (5) space/time complexity and
  convergence guarantees.<br><br>I will show the connection between message-
 passing algorithms (BP, TRW, dual decomposition), reparameterization, and L
 agrangian dual of the LP. If time permits, I will present our recent work o
 n a general Lagrangian dual formulation in terms of redundant quadratic con
 straints, which depending on the set we choose, reduces to L2QP, spectral, 
 CQP, SDP, ... and show how it can potentially lead to tighter relaxations. 
 <br><br>Food: Thai
LOCATION:Levine 512
END:VEVENT
BEGIN:VEVENT
DTSTART:20081002T120000
DTEND:20081002T133000
SUMMARY:READING GROUP: On the Equivalence of Weak Learnability and Linear Separability: New Relaxations and Efficient Boosting Algorithms
DESCRIPTION:presented by Alex Kulesza, CIS<br><br>Boosting algorithms build highly accu
 rate prediction mechanisms from a collection of low accuracy predictors. To
  do so, they employ the notion of weak-learnability. The starting point of 
 this paper is a proof which shows that weak learnability is equivalent to l
 inear separability with L1 margin. While this equivalence is a direct conse
 quence of von Neumann's minimax theorem, we derive the equivalence directly
  using Fenchel duality. We then use our derivation to describe a family of 
 relaxations to the weak-learnability assumption that readily translates to 
 a family of relaxations of linear separability with margin. This alternativ
 e perspective sheds new light on known soft-margin boosting algorithms and 
 also enables us to derive several new relaxations of the notion of linear s
 eparability. Last, we describe and analyze an efficient boosting framework 
 that can be used for minimizing the loss functions derived from our family 
 of relaxations. In particular, we obtain efficient boosting algorithms for 
 maximizing hard and soft versions of the L1 margin.<br><br>Link: http://col
 t2008.cs.helsinki.fi/papers/26-Shwartz.pdf<br><br>Food: TBD
LOCATION:Levine 512
END:VEVENT
BEGIN:VEVENT
DTSTART:20081009T120000
DTEND:20081009T133000
SUMMARY:Regret Minimization with Concept Drift 
DESCRIPTION:Jenn Wortman, CIS<br><br>Consider the classical problem of online classific
 ation.   At each time step, the learner is given a new data instance (for e
 xample, an email) and must output a prediction of its label (for example, &
 quot;spam&quot; or &quot;not spam&quot;).  The true label of the instance i
 s then revealed, and the learner suffers a loss based on both the true labe
 l and its own prediction.  The goal of the learner is generally to achieve 
 an average loss that is &quot;not too big&quot; compared to the loss it wou
 ld have received if it had always chosen to predict according to the best-p
 erforming function from a fixed class F.  It is well-known that as the numb
 er of time steps grows, very simple aggregation algorithms are able to achi
 eve an average error arbitrarily close to the error of the best function in
  retrospect.  Furthermore, such guarantees hold even if the input and outpu
 t pairs are chosen in a fully adversarial manner.<br><br>However, despite t
 he extensive literature on no-regret learning and the impressive guarantees
  that can be made, competing with the best fixed function is not always goo
 d enough.  For example, in many real-world applications, the true target fu
 nction is not fixed, but is slowly changing over time.  Consider a classifi
 er designed to identify news articles related to China.  Over time, the mos
 t relevant topics might drift from exports to the Olympics to finance.  Whe
 n this drift occurs, the classifier itself must shift in order to remain re
 levant. Similarly, the very definition of spam is changing over time as spa
 mmers become more creative and deviant.  Any useful spam filter must evolve
  to keep up with this drift.<br><br>With such applications in mind, we deve
 lop a new theoretical model for regret minimization with concept drift.  In
  this model, the goal of the learner is no longer to compete well with a si
 ngle function, but to maintain an average error close to that of the best s
 lowly changing sequence of functions.  In this talk, I will provide an over
 view of this model and discuss several new algorithmic results for learning
  in the presence of concept drift.<br><br>This talk is based on joint work 
 with Koby Crammer, Eyal Even-Dar, and Yishay Mansour. <br><br>Food: Cosi
LOCATION:Levine 512
END:VEVENT
BEGIN:VEVENT
DTSTART:20081016T120000
DTEND:20081016T133000
SUMMARY:READING GROUP: Large Margin Taxonomy Embedding with an Application to Document Categorization
DESCRIPTION:presented by Ted Sandler, CIS<br><br>Kilian Weinberger and Olivier Chapelle
 <br><br>Applications of multi-class classification, such as document catego
 rization, often appear in cost-sensitive settings. Recent work has signific
 antly improved the state of the art by moving beyond &quot;flat&quot; class
 ification through incorporation of class hierarchies. We present a novel al
 gorithm that goes beyond hierarchical classification and estimates the late
 nt semantic space that underlies the class hierarchy. In this space, each c
 lass is represented by a prototype and classification is done with the simp
 le nearest neighbor rule. The optimization of the semantic space incorporat
 es large margin constraints that ensure that for each instance the correct 
 class prototype is closer than any other. We show that our optimization is 
 convex and can be solved efficiently for large data sets. Experiments on th
 e OHSUMED medical journal data base yield state-of-the-art results on topic
  categorization. <br><br>Link: http://www.weinbergerweb.net/kqw/Publication
 s/Entries/2008/9/9_Large_Margin_Taxonomy_Embedding_with_an_Application_to_D
 ocument_Categorization__files/taxo_nips.pdf<br><br>Food: Sushi
LOCATION:Levine 512
END:VEVENT
BEGIN:VEVENT
DTSTART:20081023T120000
DTEND:20081023T133000
SUMMARY:READING GROUP: A Theory of Learning with Similarity Functions
DESCRIPTION:presented by Ben Taskar, CIS<br><br>Maria-Florina Balcan, Avrim Blum, and N
 athan Srebro<br><br>Kernel functions have become an extremely popular tool 
 in machine learning,  with an attractive theory as well. This theory views 
 a kernel as implicitly mapping data  points into a possibly very high dimen
 sional space, and describes a kernel function as being  good for a given le
 arning problem if data is separable by a large margin in that implicit spac
 e.  However, while quite elegant, this theory does not necessarily correspo
 nd to the intuition of  a good kernel as a good measure of similarity, and 
 the underlying margin in the implicit  space usually is not apparent in &qu
 ot;natural&quot; representations of the data. Therefore, it may be  difficu
 lt for a domain expert to use the theory to help design an appropriate kern
 el for the  learning task at hand. Moreover, the requirement of positive se
 mi-definiteness may rule out  the most natural pairwise similarity function
 s for the given problem domain.<br><br>In this work we develop an alternati
 ve, more general theory of learning with similarity  functions (i.e., suffi
 cient conditions for a similarity function to allow one to learn well)  tha
 t does not require reference to implicit spaces, and does not require the f
 unction to be  positive semi-definite (or even symmetric). Instead, our the
 ory talks in terms of more direct  properties of how the function behaves a
 s a similarity measure. Our results also generalize  the standard theory in
  the sense that any good kernel function under the usual definition  can be
  shown to also be a good similarity function under our definition (though w
 ith some  loss in the parameters). In this way, we provide the first steps 
 towards a theory of kernels  and more general similarity functions that des
 cribes the effectiveness of a given function in  terms of natural similarit
 y-based properties.  <br><br>Link: http://ttic.uchicago.edu/~nati/Publicati
 ons/BBS-ML08.pdf<br><br>Food: Picnic
LOCATION:Levine 512
END:VEVENT
BEGIN:VEVENT
DTSTART:20081030T120000
DTEND:20081030T133000
SUMMARY:READING GROUP: Clustering on the Unit Hypersphere using von Mises-Fisher Distributions
DESCRIPTION:presented by Bill Kandylas, CIS<br><br>Several large scale data mining appl
 ications, such as text categorization and gene expression analysis, involve
  high-dimensional data that is also inherently directional in nature. Often
  such data is L2 normalized so that it lies on the surface of a unit hypers
 phere. Popular models such as (mixtures of) multi-variate Gaussians are ina
 dequate for characterizing such data. This paper proposes a generative mixt
 ure-model approach to clustering directional data based on the von Mises-Fi
 sher (vMF) distribution, which arises naturally for data distributed on the
  unit hypersphere. In particular, we derive and analyze two variants of the
  Expectation Maximization (EM) framework for estimating the mean and concen
 tration parameters of this mixture. Numerical estimation of the concentrati
 on parameters is non-trivial in high dimensions since it involves functiona
 l inversion of ratios of Bessel functions. We also formulate two clustering
  algorithms corresponding to the variants of EM that we derive. Our approac
 h provides a theoretical basis for the use of cosine similarity that has be
 en widely employed by the information retrieval community, and obtains the 
 spherical kmeans algorithm (kmeans with cosine similarity) as a special cas
 e of both variants. Empirical results on clustering of high-dimensional tex
 t and gene-expression data based on a mixture of vMF distributions show tha
 t the ability to estimate the concentration parameter for each vMF componen
 t, which is not present in existing approaches, yields superior results, es
 pecially for difficult clustering tasks in high-dimensional spaces.<br><br>
 Link: http://jmlr.csail.mit.edu/papers/volume6/banerjee05a/banerjee05a.pdf<
 br><br>Food: Chinese
LOCATION:Levine 512
END:VEVENT
BEGIN:VEVENT
DTSTART:20081113T120000
DTEND:20081113T133000
SUMMARY:Approximating the Entropy Using a Sublinear Sample
DESCRIPTION:Mickey Brautbar, CIS<br><br>We consider the problem of approximating the en
 tropy of a discrete distribution P with an alphabet of size q, given access
  to n independent samples from the distribution. It is known that n &gt;= q
  is necessary, in general, for a good additive estimate of the entropy. A p
 roblem of multiplicative entropy estimate was recently addressed by Batu, D
 asgupta, Kumar, and Rubinfeld. They show that n = q^{alpha} suffices for a 
 factor-alpha approximation, alpha &lt; 1.<br><br>We introduce a new paramet
 er of a distribution - its effective alphabet size  q_{ef}(P). This is a mo
 re intrinsic property of the distribution depending only on its entropy mom
 ents. We show q_{ef} &lt;= O(q). When the distribution P is essentially con
 centrated on a small part of the domain, it is shown that q_{ef} &lt;&lt; q
 . We strengthen the result of Batu et al. by showing it holds with q_{ef} r
 eplacing q.<br><br>This has several implications. In particular the rate of
  convergence of the maximum-likelihood entropy estimator (the empirical ent
 ropy) for both finite and infinite alphabets is shown to be dictated by the
  effective alphabet size of the distribution. Several new, and some known, 
 facts about this estimator follow easily.<br><br>Our main result is algorit
 hmic. Though the effective alphabet size is, in general, an unknown paramet
 er of the distribution, we give an efficient procedure (with access to the 
 alphabet size only) that achieves a factor-alpha approximation of the entro
 py with n = O(exp{alpha^{1/4} * log^{3/4} q * log^{1/4} q_{ef} }).<br><br>A
 ssuming (for instance) log q_{ef} &lt;&lt; log q this is smaller than any p
 ower of q.<br><br>Finally, if time permits, I will discuss our results rega
 rding entropy approximation via Lempel-Ziv 78' compression algorithm.<br><b
 r>Join work with Alex Samorodnitski (Hebrew University) <br><br>Food: India
 n
LOCATION:Levine 512
END:VEVENT
BEGIN:VEVENT
DTSTART:20081120T120000
DTEND:20081120T133000
SUMMARY:Apprenticeship and Imitation Learning
DESCRIPTION:Umar Syed, Princeton University<br><br>In supervised learning, a learner re
 ceives training examples labeled by an expert. In reinforcement learning, t
 he training signal is more diffuse, and consists of feedback in the form of
  rewards. In this talk, I'll discuss algorithms for learning problems that 
 have features of both supervised and reinforcement learning. Specifically, 
 I'll describe how one can leverage advice from an expert to learn an optima
 l policy in an environment where the true reward function is only partially
  known.  An interesting consequence of our analysis is a novel performance 
 guarantee for multiplicative weights algorithms. I'll also describe a metho
 d for modeling the behavior of a reward-seeking agent, by using information
  about the reward function to impose constraints on the space of parameters
  searched by an EM algorithm.<br><br>Joint work with Rob Schapire, Michael 
 Bowling (U Alberta), and Jason Williams (AT&amp;T)<br><br>Food: Sushi
LOCATION:Levine 512
END:VEVENT
BEGIN:VEVENT
DTSTART:20081204T120000
DTEND:20081204T133000
SUMMARY:Efficient Feature Selection in the Presence of Multiple Feature Classes
DESCRIPTION:Paramveer Dhillon, CIS<br><br>We present an information theoretic approach 
 to feature selection when the data possesses feature classes. Feature class
 es are pervasive in real data. For example, in gene expression data, the ge
 nes which serve as features may be divided into classes based on their memb
 ership in gene families or pathways. When doing word sense disambiguation o
 r named entity extraction, features fall into classes including adjacent wo
 rds, their parts of speech, and the topic and venue of the document the wor
 d is in. When predictive features occur predominantly in a small number of 
 feature classes, our information theoretic approach significantly improves 
 feature selection. Experiments on real and synthetic data demonstrate subst
 antial improvement in predictive accuracy over the standard L_0 penalty-bas
 ed stepwise and streamwise feature selection methods as well as over Lasso 
 and Elastic Nets, all of which are oblivious to the existence of feature cl
 asses.<br><br>Food: pizza
LOCATION:Levine 512
END:VEVENT
BEGIN:VEVENT
DTSTART:20090326T103000
DTEND:20090326T120000
SUMMARY:READING GROUP: Leaving the Span (Warmuth, Vishwanathan)
DESCRIPTION:presented by Mickey Brautbar, CIS<br><br>&quot;We discuss a simple sparse l
 inear problem that is hard to learn with any algorithm that uses a linear c
 ombination of the training instances as its weight vector. The hardness hol
 ds even if we allow the learner to embed the instances into any higher dime
 nsional feature space (and use a kernel function to define the dot product 
 between the embedded instances). These algorithms are inherently limited by
  the fact that after seeing k instances only a weight space of dimension k 
 can be spanned. Our hardness result is surprising because the same problem 
 can be efficiently learned using the exponentiated gradient (EG) algorithm:
  Now the component-wise logarithms of the weights are essentially a linear 
 combination of the training instances and after seeing k instances. This al
 gorithm enforces additional constraints on the weights (all must be non-neg
 ative and sum to one) and in some cases these constraints alone force the r
 ank of the weight space to grow as fast as 2^k.&quot;<br><br>Link: http://w
 ww.soe.ucsc.edu/~manfred/pubs/C71.pdf<br><br>Food: TBD
LOCATION:Levine 512
END:VEVENT
BEGIN:VEVENT
DTSTART:20090402T103000
DTEND:20090402T120000
SUMMARY:MLUNCH speaker TBD
DESCRIPTION:
LOCATION:Levine 512
END:VEVENT
BEGIN:VEVENT
DTSTART:20090409T100000
DTEND:20090409T120000
SUMMARY:READING GROUP: Spectral Graph Theory
DESCRIPTION:presented by Alex Kulesza, CIS<br><br>We'll be going over the material in C
 hapter 1 of the newly revised edition of Spectral Graph Theory book by Fan 
 Chung:<br><br>&quot;This monograph is an intertwined tale of eigenvalues an
 d their use in unlocking a thousand secrets about graphs. The stories will 
 be told --- how the spectrum reveals fundamental properties of a graph, how
  spectral graph theory links the discrete universe to the continuous one th
 rough geometric, analytic and algebraic techniques, and how, through eigenv
 alues, theory and applications in communications and computer science come 
 together in symbiotic harmony....&quot; <br><br>Link: http://www.math.ucsd.
 edu/~fan/research/revised.html<br><br>Food: Chinese
LOCATION:Levine 512
END:VEVENT
BEGIN:VEVENT
DTSTART:20090416T103000
DTEND:20090416T120000
SUMMARY:MLUNCH speaker TBD
DESCRIPTION:
LOCATION:Levine 512
END:VEVENT
BEGIN:VEVENT
DTSTART:20090423T103000
DTEND:20090423T120000
SUMMARY:Ranking Problems in Machine Learning: Theory and Applications
DESCRIPTION:Shivani Agarwal, Department of Electrical Engineering &amp; Computer Scienc
 e, MIT<br><br>In the last few decades, there has been considerable progress
  in the understanding of binary classification (learning of binary-valued f
 unctions) and regression (learning of real-valued functions), both classica
 l problems in machine learning. Although several questions remain to be ans
 wered, there is a well-developed theory in place for these problems, and pr
 actical successes have been demonstrated in a variety of applications.<br><
 br>Recently, a new class of learning problems, namely ranking problems, hav
 e begun to gain attention. In ranking, one learns a real-valued function th
 at assigns scores to objects, but the scores themselves do not matter; inst
 ead, what is important is the relative ranking of objects induced by those 
 scores. Ranking problems arise in a variety of domains: in information retr
 ieval, one wants to rank documents according to relevance to some topic or 
 query; in user-preference modeling, one wants to rank items according to a 
 user's likes and dislikes; in computational biology, one wants to rank gene
 s according to relevance to some disease. Ranking problems are mathematical
 ly distinct from both classification and regression, and cannot be analyzed
  using existing results for these problems.<br><br>In this talk, I will des
 cribe some recent results in both the theoretical understanding of ranking 
 and its applications. In particular, I will describe generalization bounds 
 for ranking algorithms based on the tools of uniform convergence and algori
 thmic stability, and some preliminary results on the sample complexity of l
 earning ranking functions. I will conclude with some recent applications to
  ranking chemical structures for drug discovery.<br><br>Bio:<br><br>Shivani
  Agarwal is a postdoctoral lecturer in the Department of Electrical Enginee
 ring and Computer Science at MIT. Prior to this, she received her PhD in co
 mputer science from the University of Illinois, Urbana-Champaign, where she
  received the Liu Award for her research. Her research interests include ma
 chine learning and learning theory, in particular the study of ranking and 
 other new learning problems, as well as applications of machine learning me
 thods, particularly in the life sciences. More broadly, she is excited by r
 esearch at the intersection of computer science, mathematics, and statistic
 s, and its applications in scientific discovery.<br><br>Food: TBD
LOCATION:Levine 512
END:VEVENT
BEGIN:VEVENT
DTSTART:20090430T100000
DTEND:20090430T120000
SUMMARY:A Language of Life: Identifying People from Sensor Data Streams
DESCRIPTION:Alexy Khrabrov, UPenn CIS and Thayer School, Dartmouth College<br><br>Mobil
 e devices produce continuous stream of data which is often specific to the 
 person carrying them. Can we identify a person by observing a part of such 
 a data stream? We show how it can be done for cell phone tracks in the MIT 
 Reality data. We treat each person's data as a separate language, and use a
  standard n-gram language model to identify each author. This method achiev
 es an 85% precision, and can also be used for clustering. The language mode
 ls can also be used for prediction -- we show how this can be done, and pro
 pose metrics to measure the accuracy of the predictions. Finally, we develo
 p an alternative method, identifying individuals by counting the subsequenc
 es in a sample which are unique to authors. We do it by building a generali
 zed suffix tree of the training set, achieving comparable precision at a fe
 w more samples, but faster. We present the identification and prediction pr
 oblem as a part of the Dartmouth human behavior modeling framework, outline
  the modeling goals, and show how our methods can be generalized.<br><br>--
 --- Alexy Khrabrov is a CIS Ph.D. candidate and a research associate at Tha
 yer School, Dartmouth College. In between research he was a senior software
  engineer at NECI, Princeton, and Amazon.com.<br><br>Food: TBD
LOCATION:Levine 512
END:VEVENT
BEGIN:VEVENT
DTSTART:20090625T120000
DTEND:20090625T133000
SUMMARY:READING GROUP: An Impossibility Theorem for Clustering
DESCRIPTION:presented by David Weiss, CIS<br><br>Although the study of clustering is ce
 ntered around an intuitively compelling goal, it has been very dicult to d
 evelop a unied framework for reasoning about it at a technical level, and 
 pro- foundly diverse approaches to clustering abound in the research commun
 ity. Here we suggest a formal perspective on the diculty in nding such a 
 unication, in the form of an impossibility theo- rem: for a set of three s
 imple properties, we show that there is no clustering function satisfying a
 ll three. Relaxations of these prop- erties expose some of the interesting 
 (and unavoidable) trade-os at work in well-studied clustering techniques s
 uch as single-linkage, sum-of-pairs, k-means, and k-median.<br><br><br><br>
 Link: http://www.cs.cornell.edu/home/kleinber/nips15.pdf<br><br>Food: Chine
 se
LOCATION:TBA
END:VEVENT
END:VCALENDAR

