Michael Mahoney's new home page

Basic info
I am at ICSI and the Department of Statistics at UC Berkeley, and
I am also in the AMPLab
in the Department of Computer Science.
You can reach me via electronic mail:
mmahoney
is the username, and
stat dot berkeley dot edu
follows the at symbol.
(Of course, my gmail address is still valid.)
Also, some other (somewhat outdated) information can be found at
my old web page
at Stanford and
my older web page
at Yale.

Recent items of interest:
I posted on the arXiv (click here and here, respectively) the Lecture Notes on Spectral Graph Methods and the Lecture Notes on Randomized Linear Algebra, that are based on the classes I taught last year and the year before, respectively. They will both be updated at some point.
Our analysis of the performance of linear algebra in Spark has been posted on the arXiv.
Click here for the code; and
click here for a talk based on it that was given at the 2016 Dato Data Science Summit.
The article on "RandNLA: Randomized Numerical Linear Algebra" by Petros Drineas and me, as well as an associated video synopsis has appeared in the Communications of the ACM.
Congratulations to postdoc Julian Shun who has been awarded ACM's doctoral dissertation award for his 2015 CMU doctoral thesis "SharedMemory Parallelism Can Be Simple, Fast, and Scalable."
More details can be found in the official ACM announcement.
We are running MMDS again!
MMDS 2016
will take place
on the campus of UC Berkeley on Tuesday, June 21 through Friday, June 24, 2016.
See the
main MMDS web page
for more information.
A summer school on the "Mathematics of Data" is being organized by Anna Gilbert, John Duchi, and me, with the Park City Mathematics Institute.
Dates: June 30 to July 20, 2016.
Location: near Park City, Utah.
Deadline for applications: January 15, 2016.
Click
here
for the flyer and
here for the web page.
Two articles on HPC Wire:
click here for a discussion of our collaboration between AMPLab, Cray, and NERSC/LBL, where we will develop and implement randomized linear algebra (RLA) algorithms in Spark on HPC platforms for LBL's scientific data applications; and
click here for the role of RLA in HPC more generally.
We will be running the Gene Golub SIAM Summer School on
"RandNLA: Randomization in Numerical Linear Algebra".
The dates are June 1526, 2015, and the location is Delphi, Greece.
Click
here
for the flyerand note that the deadline for student applications is February 1, 2015.
Update (1/16): click
here or
here
for the SIAM News article about it.
Teaching,
Spring 2015:
"Stat260/CS294: Topics in Spectral Graph Methods".
First meeting is Thursday, January 22, 2015 in 320 Soda Hall.
We ran MMDS again.
MMDS 2014
took place
on the campus of UC Berkeley on Tuesday, June 17 through Friday, June 20, 2014.
See the
main MMDS web page
for more information.
The videos for MMDS 2014 are available
here.
I attended and gave a talk on "BIG Biomedicine and the Foundations of BIG Data Analysis"
at Stanford University at Stanford Medical School, May 23, 2014,
at their "Big Data in BioMedicine" Conference.
Click
"here"
for a video of the talk.
They also got two nice CEOstyle pictures:
here
I am, and
here
I am again.
I'm moving to ICSI and the Department of Statistics at UC Berkeley.
Teaching,
Fall 2013, at UC Berkeley:
Stat260/CS294: Randomized Algorithms for Matrices and Data, which will be an overview of Randomized Linear Algebra.
Our NRC report on the "Frontiers in Massive Data Analysis" is out and
available here.
I'm involved with running the program on "Theoretical Foundations of Big
Data Analysis," to be held at the Simons Institute at UC Berkeley during the
fall of 2013.
Click
here
for details; in particular, note the link at the bottom of the page to
information about fellowships for young researchers to participate.
We ran MMDS again.
MMDS 2012
took place on the campus of Stanford University on July 1013, 2012.
Speaker videos and presentations are now available.
For pdfs and videos of the speaker presentations, go to the main MMDS web
page; or click
here for the entire
video collection.
The overview article on
"Approximate Computation and Implicit Regularization for Very Largescale Data Analysis"
associated with the invited talk at PODS 2012 meeting is on the arXiv
here.
LSRN:
An initial version of our code for LSRN, the randomized leastsquares solver
for parallel environments, is available
here.
The monograph on "Randomized Algorithms for Matrices and Data" is available
in NOW's "Foundations and Trends in Machine Learning" series
here,
and it is also
available on the arXiv here.


Click
here
for information (including the slides and video!) on the Tutorial on
"Geometric Tools for Identifying Structure in Large Social and Information
Networks," given originally at ICML10 and KDD10 and subsequently at many
other places. (The slides are also linked to below.)
The overview chapter on
"Algorithmic and Statistical Perspectives on LargeScale Data Analysis" is
finally on the arXiv
here; the book in which it will
appear is in press; and a video of the associated talk is
here.
Teaching:
Fall 2009:
CS369M: Algorithms for Massive Data Set Analysis
We have made public many of the networks we have used in our "Community
Structure" papers.
Click
here
for the networks and for related information;
older networks are still
here.
Our PNAS paper on "CUR Matrix Decompositions for Improved Data Analysis"
has appeared as
Proc. Natl. Acad. Sci. USA, 106, 697702 (2009).
Please email me if you can't access a copy from the PNAS web site.
Research interests

Algorithmic and statistical aspects of modern largescale data analysis.

Randomized linear algebra and randomized numerical linear algebra.

Implicit regularization and implicit optimization
methods in scalable approximation algorithms.

Graph approximation algorithms and applications to large social and information networks.

Applications to DNA microarray, SNP, astronomical, medical imaging, and other scientific data.
Much of my current research focuses on
Randomized Numerical Linear Algebra, i.e., using random sampling and
random projection methods to solve very large matrixbased problems;
developing
geometric network analysis tools, i.e., using scalable approximation
algorithms with a geometric
or statistical flavor to analyze the structure and dynamics of large
informatics graphs;
developing approximate computation and regularization methods for large
informatics graphs;
applications to community detection, clustering, and information dynamics in
large social and information networks; and
applications to DNA single nucleotide polymorphism (SNP) data,
astronomical and medical imaging data,
and largescale statistical data analysis more generally.
In the past, I developed and analyzed algorithms for large matrix, graph,
and regression problems, and I applied these and related tools to the
statistical data analysis of extremely large scientific and Internet data
sets.
For example, I worked on
largescale web analytics, machine learning, and query log analysis;
applications of graph partitioning algorithms to clustering and
community identification; and
applications of randomized matrix algorithms to hyperspectral medical image
data, DNA microarray data, and DNA SNP data.
In the more distant past, I have also worked on developing and analyzing
Monte Carlo algorithms for performing useful computations on extremely
large matrices,
e.g., the additiveerror and relativeerror CUR matrix decompositions.
Past research has also included work in computational statistical mechanics
on the
development and analysis of the
TIP5P
model of liquid water, as well as work in both computational and experimental
biophysics on proteins and proteinnucleic acid interactions.
Students, postdocs, etc.
Aditya Devarakonda (UC Berkeley)
Julian Shun (postdoc, UC Berkeley)
Kimon Fountoulakis (postdoc, UC Berkeley)
Fred Roosta (postdoc, UC Berkeley)
Alex Gittens (postdoc, UC Berkeley)
Jiyan Yang (Stanford)
Aaron Adcock (Stanford PhD, 2014; at Facebook)
Xiangrui Meng (Stanford PhD, 2014; at LinkedIn, then Databricks)
Lorenzo Orecchia (intern, Yahoo; now at Boston)
Jure Leskovec (intern, Yahoo; now at Stanford)
Hari Narayanan (intern, Yahoo; now at Washington)
LekHeng Lim (intern, Yahoo; now at Chicago)
Boulos Harb (intern, Yahoo; now at Google)
Publications
2016

Mapping the Similarities of Spectra: Global and Locallybiased Approaches to SDSS Galaxy Data,

D. Lawlor, T. Budavari, and M. W. Mahoney

Technical Report, Preprint: arXiv:1609.03932 (2016)
(arXiv),

Accepted for publication in The Astrophysical Journal.

Lecture Notes on Spectral Graph Methods,

M. W. Mahoney,

Technical Report, Preprint: arXiv:1608.04845 (2016)
(arXiv),

Lecture Notes on Randomized Linear Algebra,

M. W. Mahoney,

Technical Report, Preprint: arXiv:1608.04481 (2016)
(arXiv),

An optimization approach to locallybiased graph algorithms,

K. Fountoulakis, D. F. Gleich, M. W. Mahoney,

Technical Report, Preprint: arXiv:1607.04940 (2016)
(arXiv),

Accepted for publication in Proceedings of the IEEE.

DCAR: A Discriminative and Compact Audio Representation to Improve Event Detection,

L. Jing, B. Liu, J. Choi, A. Janin, J. Bernd, M. W. Mahoney, and G. Friedland,

Technical Report, Preprint: arXiv:1607.04378 (2016)
(arXiv),

Accepted to: ACMMM (2016),

Submitted for publication.

Matrix Factorization at Scale: a Comparison of Scientific Data Analytics in Spark and C+MPI Using Three Case Studies,

A. Gittens, A. Devarakonda, E. Racah, M. Ringenburg, L. Gerhardt, J. Kottaalam, J. Liu, K. Maschhoff, S. Canon, J. Chhugani, P. Sharma, J. Yang, J. Demmel, J. Harrell, V. Krishnamurthy, M. W. Mahoney, and Prabhat,

Technical Report, Preprint: arXiv:1607.01335 (2016)
(arXiv),
(code),

Submitted for publication.

Subsampled Newton Methods with Nonuniform Sampling,

P. Xu, J. Yang, F. RoostaKhorasani, C. Re, and M. W. Mahoney,

Technical Report, Preprint: arXiv:1607.00559 (2016)
(arXiv),

Submitted for publication.

Approximating the Solution to Mixed Packing and Covering LPs in parallel
time,

M. W. Mahoney, S. Rao, D. Wang, and P. Zhang,

Proc. of the 43rd ICALP Conference, (2016)
(pdf).

A Simple and StronglyLocal FlowBased Method for Cut Improvement,

N. Veldt, D. F. Gleich, and M. W. Mahoney,

Technical Report, Preprint: arXiv:1605.08490 (2016)
(arXiv),

Proc. of the 33rd ICML Conference 19381947 (2016)
(pdf),
(supp).

RandNLA: Randomized Numerical Linear Algebra,

P. Drineas and M. W. Mahoney,

Communications of the ACM, 59, 8090 (2016)
(pdf).

FLAG: Fast LinearlyCoupled Adaptive Gradient Method,

X. Cheng, F. RoostaKhorasani, P. L. Bartlett, and M. W. Mahoney,

Technical Report, Preprint: arXiv:1605.08108 (2016)
(arXiv),

Submitted for publication.

Parallel Local Graph Clustering,

J. Shun, F. RoostaKhorasani, K. Fountoulakis, and M. W. Mahoney,

Technical Report, Preprint: arXiv:1604.07515 (2016)
(arXiv),

Proceedings of the VLDB Endowment, 9(12) 10411052 (2016)
(pdf).

A multiplatform evaluation of the randomized CX lowrank matrix factorization in Spark,

A. Gittens, J. Kottalam, J. Yang, M. F. Ringenburg, J. Chhugani, E. Racah, M. Singh, Y. Yao, C. Fischer, O. Ruebel, B. Bowen, N. G. Lewis, M. W. Mahoney, V. Krishnamurthy, and Prabhat,

Proc. 5th International Workshop on Parallel and Distributed Computing for Large Scale Machine Learning and Big Data Analytics, at IPDPS,
2016
(pdf).

Mining Large Graphs,

D. F. Gleich and M. W. Mahoney,

In
Handbook of Big Data.
pp. 191220,
edited by
P. Buhlmann, P. Drineas, M. Kane, and M. van de Laan,
Chapman and Hall/CRC Press,
2016
(pdf).

Structural properties underlying highquality Randomized Numerical Linear Algebra algorithms,

M. W. Mahoney and P. Drineas,

In
Handbook of Big Data.
pp. 137154,
edited by
P. Buhlmann, P. Drineas, M. Kane, and M. van de Laan,
Chapman and Hall/CRC Press,
2016
(pdf).

Exploiting Optimization for Local Graph Clustering,

K. Fountoulakis, X. Cheng, J. Shun, F. RoostaKhorasani and M. W. Mahoney,

Technical Report, Preprint: arXiv:1602.01886 (2016)
(arXiv),

Submitted for publication.

SubSampled Newton Methods II: Local Convergence Rates,

F. RoostaKhorasani and M. W. Mahoney,

Technical Report, Preprint: arXiv:1601.04738 (2016)
(arXiv),

SubSampled Newton Methods I: Globally Convergent Algorithms,

F. RoostaKhorasani and M. W. Mahoney,

Technical Report, Preprint: arXiv:1601.04737 (2016)
(arXiv),

RandNLA, Pythons, and the CUR for Your Data Problems: Reporting from G2S3 2015 in Delphi,

E. Gallopoulos, P. Drineas, I. Ipsen, and M. W. Mahoney,

SIAM News 49:1 January/February 2016
(web),
(pdf).
2015

Faster Parallel Solver for Positive Linear Programs via DynamicallyBucketed Selective Coordinate Descent,

D. Wang, M. W. Mahoney, N. Mohan, and S. Rao,

Technical Report, Preprint: arXiv:1511.06468 (2015)
(arXiv).

A Local Perspective on Community Structure in Multilayer Networks,

L. G. S. Jeub, M. W. Mahoney, P. J. Mucha, and M. A. Porter,

Technical Report, Preprint: arXiv:1510.05185 (2015)
(arXiv),

Accepted for publication, Network Science.

Optimal Subsampling Approaches for Large Sample Linear Regression,

R. Zhu, P. Ma, M. W. Mahoney, and B. Yu,

Technical Report, Preprint: arXiv:1509.05111 (2015)
(arXiv).

Unified Acceleration Method for Packing and Covering Problems via Diameter Reduction,

D. Wang, S. Rao, and M. W. Mahoney,

Technical Report, Preprint: arXiv:1508.02439 (2015)
(arXiv),

Proc. of the 43rd ICALP Conference, (2016)
(pdf).

Using local spectral methods to robustify graphbased learning algorithms,

D. F. Gleich and M. W. Mahoney,

Proc. of the 21st Annual SIGKDD, (2015)
(pdf)
(code).

Structured Block Basis Factorization for Scalable Kernel Matrix Evaluation,

R. Wang, Y. Li, M. W. Mahoney, and E. Darve,

Technical Report, Preprint: arXiv:1502.03571 (2015)
(arXiv).

Identifying important ions and positions in mass spectrometry imaging data using CUR matrix decompositions,

J. Yang, O. Rubel, Prabhat, M. W. Mahoney, and B. P. Bowen,

Analytical Chemistry, 87 (9), 46584666 (2015)
(pdf)
(code).

Spectral Gap Error Bounds for Improving CUR Matrix Decomposition and the Nystrom Method,

D. G. Anderson, S. S. Du, M. W. Mahoney, C. Melgaard, K. Wu, and M. Gu,

Proc. of the 18th International Conference on AISTATS, 1927 (2015)
(pdf,
supp)
(code).

Weighted SGD for Lp Regression with Randomized Preconditioning,

J. Yang, Y.L. Chow, C. Re, and M. W. Mahoney,

Technical Report, Preprint: arXiv:1502.03571 (2015)
(arXiv),

Proc. of the 27th Annual SODA, 558569 (2016)
(pdf),

Journal version submitted for publication.

Implementing Randomized Matrix Algorithms in Parallel and Distributed Environments,

J. Yang, X. Meng, and M. W. Mahoney,

Technical Report, Preprint: arXiv:1502.03032 (2015)
(arXiv)
(code),

Proceedings of the IEEE 104(1): 5892 (2016)
(pdf).
2014

Tree decompositions and social graphs,

A. B. Adcock, B. D. Sullivan, and M. W. Mahoney,

Technical Report, Preprint: arXiv:1411.1546 (2014)
(arXiv),
(code).

Internet Mathematics, 12(5), 315361 (2016)
(pdf).

Fast Randomized Kernel Methods With Statistical Guarantees,

A. El Alaoui and M. W. Mahoney,

Technical Report, Preprint: arXiv:1411.0306 (2014)
(arXiv),

Proc. of the 2015 NIPS Conference
(pdf).

A Statistical Perspective on Randomized Sketching for Ordinary LeastSquares,

G. Raskutti and M. W. Mahoney,

Technical Report, Preprint: arXiv:1406.5986 (2014)
(arXiv),

Proc. of the 32nd ICML Conference (2015)
(pdf),

Journal version submitted for publication.

Random Laplace Feature Maps for Semigroup Kernels on Histograms,

J. Yang, V. Sindhwani, Q. Fan, H. Avron, and M. W. Mahoney,

Proc. of the 27th CVPR Conference (2014)
(pdf).

Antidifferentiating Approximation Algorithms: A case study with Mincuts, Spectral, and Flow,

D. F. Gleich and M. W. Mahoney,

Proc. of the 31st ICML Conference, JMLR W&CP 32 (1): 10181025 (2014)
(pdf).
(code, code).
(talk).

QuasiMonte Carlo Feature Maps for ShiftInvariant Kernels,

J. Yang, V. Sindhwani, H. Avron, and M. W. Mahoney,

Proc. of the 31st ICML Conference, JMLR W&CP 32 (1): 485493 (2014)
(pdf),
(code),

Technical Report, Preprint: arXiv:1412.8293 (2014)
(arXiv),

J. Machine Learning Research, 17(120): 138 (2016)
(pdf).

Think Locally, Act Locally: The Detection of Small, MediumSized, and Large Communities in Large Networks,

L. G. S. Jeub, P. Balachandran, M. A. Porter, P. J. Mucha, and M. W. Mahoney,

Technical Report, Preprint: arXiv:1403.3795 (2014)
(arXiv),
(code, code),

Physical Review E, 91, 012821 (2015)
(pdf).

A new spin on an old algorithm: technical perspective on "Communication costs of Strassen's matrix multiplication,"

M. W. Mahoney,

Communications of the ACM, 57(2): 106 (2014)
(pdf).
2013

Treelike Structure in Large Social and Information Networks,

A. B. Adcock, B. D. Sullivan, and M. W. Mahoney,

Proc. of the 2013 IEEE ICDM, 110 (2013)
(pdf).

Objective Identification of Informative Wavelength Regions in Galaxy Spectra,

C.W. Yip, M. W. Mahoney, A. S. Szalay, I. Csabai, T. Budavari, R. F. G. Wyse,
and L. Dobos,

Technical Report, Preprint: arXiv:1312.0637 (2013)
(arXiv),

Astronomical Journal, 147, 110 (2014)
(pdf).

Evaluating OpenMP Tasking at Scale for the Computation of Graph Hyperbolicity,

A. B. Adcock, B. D. Sullivan, O. R. Hernandez, and M. W. Mahoney,

Proc. of the 9th IWOMP, 7183 (2013)
(pdf).

Frontiers in Massive Data Analysis,

Committee on the Analysis of Massive Data, et al. (M. I. Jordan, et al.),

The National Academies Press (2013)
(pdf),
(web).

A Statistical Perspective on Algorithmic Leveraging,

P. Ma, M. W. Mahoney, and B. Yu,

Technical Report, Preprint: arXiv:1306.5362 (2013)
(arXiv),

Proc. of the 31st ICML Conference, JMLR W&CP 32 (1): 9199 (2014)
(pdf),

J. Machine Learning Research, 16, 861911 (2015)
(pdf).

Robust Regression on MapReduce,

X. Meng, and M. W. Mahoney,

Proc. of the 30th ICML Conference, JMLR W&CP 28(3): 888896 (2013)
(pdf).

Quantile Regression for Largescale Applications,

J. Yang, X. Meng, and M. W. Mahoney,

Technical Report, Preprint: arXiv:1305.0087 (2013)
(arXiv),
(code),
(code),

Proc. of the 30th ICML Conference, JMLR W&CP 28(3): 881887 (2013)
(pdf),

SIAM J. Scientific Computing, 36(5), S78S110 (2014)
(pdf).

Revisiting the Nystrom Method for Improved LargeScale Machine Learning,

A. Gittens and M. W. Mahoney,

Technical Report, Preprint: arXiv:1303.1849 (2013)
(arXiv),
(code),

Proc. of the 30th ICML Conference, JMLR W&CP 28(3): 567575 (2013)
(pdf),

J. Machine Learning Research, 17(117): 165 (2016)
(pdf).
2012

Semisupervised Eigenvectors for Largescale Locallybiased Learning,

T. J. Hansen and M. W. Mahoney,

Proc. of the 2012 NIPS Conference
(pdf),
(code),

Technical Report, Preprint: arXiv:1304.7528 (2013)
(arXiv),

J. Machine Learning Research, 15, 36913734 (2014)
(pdf).

Lowdistortion Subspace Embeddings in Inputsparsity Time and Applications to Robust Linear Regression,

X. Meng and M. W. Mahoney,

Technical Report, Preprint: arXiv:1210.3135 (2012)
(arXiv),

Proc. of the 45th STOC, 91100 (2013)
(pdf).

The Fast Cauchy Transform and Faster Robust Linear Regression,

K. L. Clarkson, P. Drineas, M. MagdonIsmail, M. W. Mahoney, X. Meng, and D. P. Woodruff,

Technical Report, Preprint: arXiv:1207.4684 (2012)
(arXiv),

Proc. of the 24th Annual SODA, 466477 (2013)
(pdf),

SIAM J. Computing, 45, 763810 (2016)
(pdf).

rCUR: an R package for CUR matrix decomposition,

A. Bodor, I. Csabai, M. W. Mahoney, and N. Solymosi,

BMC Bioinformatics, 13:103 (2012)
(pdf),
(code).

Approximate Computation and Implicit Regularization for Very Largescale Data Analysis,

M. W. Mahoney,

Technical Report, Preprint: arXiv:1203.0786 (2012)
(arXiv),

Proc. of the 2012 ACM Symposium on Principles of Database Systems, 143154, 2012
(pdf).

On the Hyperbolicity of SmallWorld and TreeLike Random Graphs,

W. Chen, W. Fang, G. Hu, and M. W. Mahoney,

Technical Report, Preprint: arXiv:1201.1717 (2012)
(arXiv),

Proc. of the 23rd ISAAC 278288 (2012)
(pdf),

Internet Mathematics, 9(4), 434491 (2013)
(pdf).
2011

Randomized Dimensionality Reduction for Kmeans Clustering,

C. Boutsidis, A. Zouzias, M. W. Mahoney, and P. Drineas,

Technical Report, Preprint: arXiv:1110.2897 (2011)
(arXiv),

IEEE Transactions on Information Theory, 61(2), 10451062 (2015)
(pdf).

Regularized Laplacian Estimation and Fast Eigenvector Approximation,

P. O. Perry and M. W. Mahoney,

Technical Report, Preprint: arXiv:1110.1757 (2011)
(arXiv),

Proc. of the 2011 NIPS Conference
(pdf).

LSRN: A Parallel Iterative Solver for Strongly Over or UnderDetermined Systems,

X. Meng, M. A. Saunders, and M. W. Mahoney,

Technical Report, Preprint: arXiv:1109.5981 (2011)
(arXiv),
(code),

SIAM J. Scientific Computing, 36(2), C95C118 (2014)
(pdf).

Fast approximation of matrix coherence and statistical leverage,

P. Drineas, M. MagdonIsmail, M. W. Mahoney, and D. P. Woodruff,

Technical Report, Preprint: arXiv:1109.3843 (2011)
(arXiv),

Proc. of the 29th ICML Conference (2012)
(pdf),

J. Machine Learning Research, 13, 34753506 (2012)
(pdf).

Localization on loworder eigenvectors of data matrices,

M. Cucuringu and M. W. Mahoney,

Technical Report, Preprint: arXiv:1109.1355 (2011)
(arXiv).

Efficient Genomewide Selection of PCACorrelated tSNPs for Genotype Imputation,

A. Javed, P. Drineas, M. W. Mahoney, and P. Paschou,

Annals of Human Genetics, 75, 707722 (2011)
(
pdf).

Randomized Algorithms for Matrices and Data,

M. W. Mahoney,

Foundations and Trends in Machine Learning,
NOW Publishers,
Volume 3, Issue 2, 2011
(now),

TR version:
Technical Report, Preprint: arXiv:1104.5557 (2011)
(arXiv).

(Abridged version in:
Advances in Machine Learning and Data Mining for Astronomy,
edited by
M. J. Way, et al.,
pp. 647672,
2012.)
2010

Computation in LargeScale Scientific and Internet Data Applications is a Focus of MMDS 2010,

M. W. Mahoney,

Technical Report, Preprint: arXiv:1012.4231 (2010)
(arXiv),

Appeared in
SIGKDD Explorations,
SIGACT News,
ASASCGN Newsletter,
and IMS Bulletin.

CUR from a Sparse Optimization Viewpoint,

J. Bien, Y. Xu, and M. W. Mahoney,

Technical Report, Preprint: arXiv:1011.0413 (2010)
(arXiv),

Proc. of the 2010 NIPS Conference
(ps,
pdf).

Algorithmic and Statistical Perspectives on LargeScale Data Analysis,

M. W. Mahoney,

Technical Report, Preprint: arXiv:1010.1609 (2010)
(arXiv),

In:
Combinatorial Scientific Computing,
pp. 427469,
edited by
U. Naumann and O. Schenk,
2012.

Implementing regularization implicitly via approximate eigenvector computation,

M. W. Mahoney and L. Orecchia,

Technical Report, Preprint: arXiv:1010.0703 (2010)
(arXiv),

Proc. of the 28th ICML Conference, 121128 (2011)
(pdf).
(talk).

Approximating HigherOrder Distances Using Random Projections,

P. Li, M. W. Mahoney, and Y. She,

Proc. of the 26th UAI Conference, 312321 (2010)
(ps,
pdf),

Technical Report, Preprint: arXiv:1203.3492 (2012)
(arXiv).

Effective Resistances, Statistical Leverage, and Applications to Linear Equation Solving,

P. Drineas and M. W. Mahoney,

Technical Report, Preprint: arXiv:1005.3097 (2010)
(arXiv).

Empirical Comparison of Algorithms for Network Community Detection,

J. Leskovec, K. J. Lang, and M. W. Mahoney,

Technical Report, Preprint: arXiv:1004.3539 (2010)
(arXiv),

Proc. of the 19th International WWW, 631640 (2010)
(ps,
pdf).
2009

A Local Spectral Method for Graphs: with Applications to Improving Graph
Partitions and Exploring Data Graphs Locally,

M. W. Mahoney, L. Orecchia, and N. K. Vishnoi,

Technical Report, Preprint: arXiv:0912.0681 (2009)
(arXiv),

J. Machine Learning Research, 13, 23392365 (2012)
(ps,
pdf).

Unsupervised Feature Selection for the kmeans Clustering Problem,

C. Boutsidis, M. W. Mahoney, and P. Drineas,

Proc. of the 2009 NIPS Conference
(ps,
pdf).

Learning with Spectral Kernels and HeavyTailed Data,

M. W. Mahoney and H. Narayanan,

Technical Report, Preprint: arXiv:0906.4539 (2009)
(arXiv).

Empirical Evaluation of Graph Partitioning Using Spectral Embeddings and Flow,

K. J. Lang, M. W. Mahoney, and L. Orecchia,

Proc. of the 8th International SEA, 197208 (2009)
(ps,
pdf).

CUR Matrix Decompositions for Improved Data Analysis,

M. W. Mahoney and P. Drineas,

Proc. Natl. Acad. Sci. USA, 106, 697702 (2009)
(ps,
pdf).
2008

An Improved Approximation Algorithm for the Column Subset Selection Problem,

C. Boutsidis, M. W. Mahoney, and P. Drineas,

Technical Report, Preprint: arXiv:0812.4293 (2008)
(arXiv),

Proc. of the 20th Annual SODA, 968977 (2009)
(ps,
pdf).

Algorithmic and Statistical Challenges in Modern LargeScale Data Analysis are the Focus of MMDS 2008

M. W. Mahoney, L.H. Lim, and G. E. Carlsson

Technical Report, Preprint: arXiv:0812.3702 (2008)
(arXiv),

Appeared in
SIGKDD Explorations
(ps,
pdf),
SIAM News
(ps,
pdf),
and
ASASCGN Newsletter
(ps,
pdf),
and abridged versions appeared in IMS Bulletin
(ps,
pdf)
and AmStat News.

Community Structure in Large Networks: Natural Cluster Sizes and the Absence of Large WellDefined Clusters,

J. Leskovec, K. J. Lang, A. Dasgupta, and M. W. Mahoney,

Technical Report, Preprint: arXiv:0810.1355 (2008)
(arXiv),

Internet Mathematics, 6(1), 29123 (2009)
(pdf).

Unsupervised Feature Selection for Principal Components Analysis,

C. Boutsidis, M. W. Mahoney, and P. Drineas,

Proc. of the 14th Annual SIGKDD, 6169 (2008)
(ps,
pdf).

Statistical Properties of Community Structure in Large Social and Information Networks,

J. Leskovec, K. J. Lang, A. Dasgupta, and M. W. Mahoney,

Proc. of the 17th International WWW, 695704 (2008)
(ps,
pdf).
2007

Faster Least Squares Approximation,

P. Drineas, M. W. Mahoney, S. Muthukrishnan, and T. Sarlos,

Technical Report, Preprint: arXiv:0710.1435 (2007)
(arXiv),

Numerische Mathematik, 117, 219249 (2011).

PCACorrelated SNPs for Structure Identification in Worldwide Human Populations,

P. Paschou, E. Ziv, E. G. Burchard, S. Choudhry, W. RodriguezCintron, M. W. Mahoney, and P. Drineas,

PLoS Genetics, 3, 16721686 (2007)
(ps,
pdf).

RelativeError CUR Matrix Decompositions,

P. Drineas, M. W. Mahoney, and S. Muthukrishnan,

Technical Report, Preprint: arXiv:0708.3696 (2007)
(arXiv),

SIAM J. Matrix Analysis and Applications, 30, 844881 (2008)
(ps,
pdf).

Feature Selection Methods for Text Classification,

A. Dasgupta, P. Drineas, B. Harb, V. Josifovski, and M. W. Mahoney,

Proc. of the 13th Annual SIGKDD, 230239 (2007)
(ps,
pdf).

Sampling Algorithms and Coresets for Lp Regression,

A. Dasgupta, P. Drineas, B. Harb, R. Kumar, and M. W. Mahoney,

Technical Report, Preprint: arXiv:0707.1714 (2007)
(arXiv),

Proc. of the 19th Annual SODA, 932941 (2008)
(ps,
pdf),

SIAM J. Computing, 38, 20602078 (2009)
(ps,
pdf).

Web Information Retrieval and Linear Algebra Algorithms,

A. Frommer, M. W. Mahoney, and D. B. Szyld (Eds.),

Proc. of Dagstuhl Seminar 07071, (2007)
(web).

Intra and interpopulation genotype reconstruction from tagging SNPs,

P. Paschou, M. W. Mahoney, A. Javed, J. R. Kidd, A. J. Pakstis, S. Gu, K. K. Kidd, and P. Drineas,

Genome Research, 17(1), 96107 (2007)
(ps,
pdf).
2006

Bridging the Gap Between Numerical Linear Algebra, Theoretical Computer Science, and Data Applications,

G. H. Golub, M. W. Mahoney, P. Drineas, and L.H. Lim,

SIAM News 39:8 October 2006
(ps,
pdf).

Randomized Algorithms for Matrices and Massive Data Sets,

P. Drineas and M. W. Mahoney,

Proc. of the 32nd Annual VLDB, 1269 (2006)
(ps,
pdf).

Subspace Sampling and RelativeError Matrix Approximation: ColumnRowBased Methods,

P. Drineas, M. W. Mahoney, and S. Muthukrishnan,

Proc. of the 14th Annual ESA, 304314 (2006)
(ps,
pdf).

Subspace Sampling and RelativeError Matrix Approximation: ColumnBased Methods,

P. Drineas, M. W. Mahoney, and S. Muthukrishnan,

Proc. of the 10th Annual RANDOM, 316326 (2006)
(ps,
pdf).

TensorCUR Decompositions For TensorBased Data,

M. W. Mahoney, M. Maggioni, and P. Drineas,

Proc. of the 12th Annual SIGKDD, 327336 (2006)
(ps,
pdf),

SIAM J. Matrix Analysis and Applications, 30, 957987 (2008)
(ps,
pdf).

Polynomial Time Algorithm for ColumnRowBased RelativeError LowRank Matrix Approximation,

P. Drineas, M. W. Mahoney, and S. Muthukrishnan,

Technical Report, DIMACS TR 200604 March 2006
(ps,
pdf).

Sampling Algorithms for L2 Regression and Applications,

P. Drineas, M. W. Mahoney, and S. Muthukrishnan,

Proc. of the 17th Annual SODA, 11271136 (2006)
(ps,
pdf).
2005

A Randomized Algorithm for a TensorBased Generalization of the Singular Value Decomposition,

P. Drineas and M. W. Mahoney,

Technical Report, YALEU/DCS/TR1327, June 2005
(ps,
pdf),

Linear Algebra and its Applications, 420, 553571 (2007)
(ps,
pdf).

On the Nystrom Method for Approximating a Gram Matrix for Improved KernelBased Learning,

P. Drineas and M. W. Mahoney,

Technical Report, YALEU/DCS/TR1319, April 2005
(ps,
pdf),

Proc. of the 18th Annual COLT, 323337 (2005)
(ps,
pdf),

J. Machine Learning Research, 6, 21532175 (2005)
(ps,
pdf).
2004

Sampling Subproblems of Heterogeneous MaxCut Problems and Approximation Algorithms,

P. Drineas, R. Kannan, and M. W. Mahoney,

Technical Report, YALEU/DCS/TR1283, April 2004
(ps,
pdf),

Proc. of the 22nd Annual STACS, 5768 (2005)
(ps,
pdf),

Random Structures and Algorithms, 32:3, 307333 (2008)
(ps,
pdf).

Fast Monte Carlo Algorithms for Matrices III: Computing an Efficient Approximate Decomposition of a Matrix,

P. Drineas, R. Kannan, and M. W. Mahoney,

Technical Report, YALEU/DCS/TR1271, February 2004
(ps,
pdf),

SIAM J. Computing, 36, 184206 (2006)
(ps,
pdf).

Fast Monte Carlo Algorithms for Matrices II: Computing LowRank Approximations to a Matrix,

P. Drineas, R. Kannan, and M. W. Mahoney,

Technical Report, YALEU/DCS/TR1270, February 2004
(ps,
pdf),

SIAM J. Computing, 36, 158183 (2006)
(ps,
pdf).

Fast Monte Carlo Algorithms for Matrices I: Approximating Matrix Multiplication,

P. Drineas, R. Kannan, and M. W. Mahoney,

Technical Report, YALEU/DCS/TR1269, February 2004
(ps,
pdf),

SIAM J. Computing, 36, 132157 (2006)
(ps,
pdf).
2003

Rapid Mixing of Several Markov Chains for a HardCore Model,

R. Kannan, M. W. Mahoney, and R. Montenegro,

Proc. of the 14th Annual ISAAC, 663675 (2003)
(ps,
pdf).
2001

Quantum, Intramolecular Flexibility, and Polarizability Effects on the Reproduction of the Density Anomaly of Liquid Water by Simple Potential Functions,

M. W. Mahoney and W. L. Jorgensen,

J. Chem. Phys., 115, 1075810768 (2001)
(ps,
pdf).

Rapid Estimation of Electronic Degrees of Freedom in Monte Carlo Calculations for Polarizable Models of Liquid Water,

M. W. Mahoney and W. L. Jorgensen,

J. Chem. Phys., 114, 93379349 (2001)
(ps,
pdf).

Diffusion Constant of the TIP5P Model of Liquid Water,

M. W. Mahoney and W. L. Jorgensen,

J. Chem. Phys., 114, 363366 (2001)
(ps,
pdf).
2000

A FiveSite Model for Liquid Water and the Reproduction of the Density Anomaly by Rigid, Nonpolarizable Potential Functions,

M. W. Mahoney and W. L. Jorgensen,

J. Chem. Phys., 112, 89108922 (2000)
(ps,
pdf).
1997

Repression and Activation of PromoterBound RNA Polymerase Activity by Gal Repressor,

H. E. Choy, R. R. Hanger, T. Aki, M. Mahoney, K. Murakami, A. Ishihama, and S. Adhya,

J. Mol. Biol. 272: 293300, 1997
(ps,
pdf).

Discrete Representations of the Protein Calpha Chain,

X. F. de la Cruz, M. W. Mahoney, and B. K. Lee,

Fold. & Des. 2: 223234, 1997
(ps,
pdf).
MMDS Workshops
I run the MMDS meetings.
We started the
MMDS Workshops on
"Algorithms for Modern Massive Data Sets"
to address
algorithmic and statistical
challenges in modern largescale statistical data analysis.

MMDS 2016
will take place on the campus of UC Berkeley on June 2124, 2016.
See the
main MMDS web page
for more information.

MMDS 2014
took place on the campus of UC Berkeley on June 1720, 2014.
See the
main MMDS web page
for more information.
Click
here for the entire
video collection.

MMDS 2012
took place on the campus of Stanford University on July 1013, 2012.
For pdfs and videos of the presentations, go to the main MMDS web
page; or click
here for the entire
video collection.

MMDS 2010
took place on the campus of Stanford University on June 1518, 2010.
MMDS 2010 addressed computation in largescale scientific and internet data
applications more generally.
See
the MMDS web page
for details, including articles and all the speaker overheads!

MMDS 2008
took place on June 2528, 2008.
MMDS 2008 grew out of our expectation for what
the algorithmic and statistical
foundations of
largescale
data analysis
should look like a generation from now.
Click
here
for an article that appeared in SIGKDD Explorations and
SIAM News about the meeting.

MMDS 2006
took place on June 2124, 2006.
MMDS 2006 was originally motivated by the complementary perspectives
brought by numerical linear algebra and theoretical computer science to
matrix algorithms in largescale data applications.
Click
here
for an article in SIAM News about the meeting.
These MMDS meetings generated intense interdisciplinary interest and were a
big success  so keep an eye out for future MMDSs!
Talks and presentations
Several recent talks:
Scientific Matrix Factorizations in Spark at Scale: Crossplatform performance, scaling, and comparisons with C+MPI
(talk at 2016 Dato Data Science Summit)
(pdf)
Optimization Algorithms for Analyzing Large Datasets
(talk at 2016 PCMI Summer School)
(pdf)
SubSampled Newton Methods
(talk at ITA 2016)
(pdf)
Challenges in Multiresolution Methods for Graphbased Learning
(talk, 3of3, from NIPS15 Workshops)
(pdf)
Using Local Spectral Methods in Theory and in Practice
(talk, 2of3, from NIPS15 Workshops)
(pdf)
Column Subset Selection on Terabytesized Scientific Data
(talk, 1of3, from NIPS15 Workshops)
(pdf)
Several tutorial presentations:
Randomization in Numerical Linear Algebra: Theory and Practice
(2.0 hr version at SIAM ALA Meeting, October 2015)
(pdf)
Past, Present and Future of Randomized Numerical Linear Algebra:
(3.0 hr version at Simons' Institute 2013 Big Data Bootcamp)
(Part I:
pdf,
ppt
and Part II:
pdf,
ppt)
Theory (and some practice) of Randomized Algorithms for Matrices and Data
(tutorial from FOCS 2012 Workshop)
(pdf,
ppt)
Geometric Tools for Identifying Structure in Large Social and Information Networks
(1.5 hr version at SAMSI Opening Workshop 2010, etc.)
(pdf,
ppt)
Geometric Tools for Identifying Structure in Large Social and Information Networks
(2 hr version at ICASSP 2011, etc.)
(pdf,
ppt)
Geometric Tools for Identifying Structure in Large Social and Information Networks
(3 hr version at ICML 2010 and KDD 2010, etc.)
(pdf,
ppt)
(The pdf file in four pieces:
here,
here,
here,
and
here.)
Randomized Algorithms for Matrices and Massive Data Sets
(at SIAMSDM06 2006 and VLDB 2006)
(ppt)
Randomized Algorithms for Matrices and Massive Data Sets
(at ACMSIGKDD 2005)
(ppt)
Several other older talks:
Linear and Sublinear Linear Algebra Algorithms: Preconditioning Stochastic Gradient Algorithms with Randomized Linear Algebra
(DIMACS, Aug 2015)
(pdf)
Overview of RandNLA: Randomized Numerical Linear Algebra
(pdf)
Treelike structure in social graphs
(pdf (big=28MB),
ppt (big=48MB))
Eigenvector localization, implicit regularization, and algorithmic antidifferentiation for largescale graphs and network data
(pdf,
ppt)
Locallybiased and semisupervised eigenvectors
(talk from MMDS 2014)
(pdf,
ppt)
Implicit regularization in sublinear approximation algorithms
(pdf,
ppt)
BIG Biomedicine and the Foundations of BIG Data Analysis (at Big Data in Biomedicine at Stanford's Medical School, 5/23/14)
(pdf,
ppt)
Revisiting the Nystrom Method for Improved LargeScale Machine Learning
(pdf,
ppt)
Implementing Randomized Matrix Algorithms in Parallel and Distributed Environments
(version from Simons Big Data Workshop II)
(pdf)
Inputsparsity Time Algorithms for Embeddings and Regression Problems
(talk from Simons Big Data Workshop I)
(pdf)
Randomized Regression in Parallel and Distributed Environments
(talk from GraphLab 2013)
(pdf)
Extracting insight from large networks: implications of smallscale and largescale structure
(pdf,
ppt)
Implementing Randomized Matrix Algorithms in Parallel and Distributed Environments
(version from MMDS 2012)
(pdf)
Sensors, networks, and massive data
(pdf,
ppt)
Randomized Algorithms for Matrices and Data
(pdf,
ppt)
Approximate computation and implicit regularization in largescale data analysis
(PODS vsn)
(pdf,
ppt)
Approximate computation and implicit regularization in largescale data analysis
(Stats vrsn1)
(pdf,
ppt)
Approximate computation and implicit regularization in largescale data analysis
(Short vrsn)
(pdf,
ppt)
Looking for clusters in your data ... in theory and in practice
(pdf,
ppt)
Fast Approximation of Matrix Coherence and Statistical Leverage
(pdf,
ppt)
Implementing regularization implicitly via approximate eigenvector computation
(pdf,
ppt)
Linear Algebra and Machine Learning of Large Informatics Graphs
(pdf,
ppt)
Geometric Network Analysis Tools
(talk from MMDS 2010)
(pdf,
ppt)
Algorithmic and Statistical Perspectives on LargeScale Data Analysis
(pdf,
ppt)
Community structure in large social and information networks
(newer)
(pdf,
ppt)
Statistical leverage and improved matrix algorithms
(newer and long)
(pdf,
ppt)
Approximation Algorithms as Experimental Probes of Informatics Graphs
(pdf,
ppt)
Community structure in large social and information networks
(talk from MMDS 2008)
(pdf,
ppt)
Community structure in large social and information networks
(older)
(pdf,
ppt)
Statistical leverage and improved matrix algorithms
(older and short)
(pdf,
ppt)
Sampling algorithms and coresets for Lp regression and applications
(pdf,
ppt)
CUR Matrix Decompositions for Improved Data Analysis (talk from MMDS 2006)
(pdf,
ppt)
A RelativeError CUR Decomposition for Matrices and Its Data Applications
(pdf,
ppt)
Sampling Algorithms for L2 Regression and Applications
(talk from SODA 2006)
(pdf,
ppt)
Approximating a Gram Matrix for Improved KernelBased Learning
(talk from COLT 2005)
(ps,
pdf)
Fast Monte Carlo Algorithms for Matrix Operations and Massive Data Set Analysis
(newer)
(
pdf,
ppt)
Fast Monte Carlo Algorithms for Matrix Operations and Massive Data Set Analysis
(older)
(
pdf)
CUR Matrix Decomposition with Applications to Algorithm Design and Massive Data Set Analysis
(pdf)
Fast Monte Carlo Algorithms for Massive Data Sets and Approximating MaxCut
(ps,
pdf)
The TIP5P Water talk:
The Computational Statistical Mechanics of Simple Models of Liquid Water
(
pdf)
Videos of several talks:
"Matrix factorizations at scale: a comparison of scientific data analytics on Spark and MPI"
(at the 2016 Dato Data Science Summit).
"Challenges in Multiresolution Methods for Graphbased Learning"
(at the NIPS Workshops, December 2015).
"Using Local Spectral Methods in Theory and in Practice"
(at the NIPS Workshops, December 2015).
"Linear and Sublinear Linear Algebra Algorithms: Preconditioning Stochastic Gradient Algorithms with Randomized Linear Algebra"
(at DIMACS, August 2015).
"Eigenvector localization and implicit regularization for largescale graphs and networked data"
(at CIMAT, March 2015).
"Locallybiased and semisupervised eigenvectors"
(at MMDS 2014, June 2014).
"Implementing Randomized Matrix Algorithms in Parallel and Distributed Environments"
(at the SF Bay Area DMSIG Meeting, June 2, 2014).
"BIG Biomedicine and the Foundations of BIG Data Analysis"
(at the Big Data in BioMedicine Conference at Stanford University, May 2014).
"Randomized Matrix Algorithms and Largescale Scientific Data Analysis"
(at the 2014 ASE Conference, May 31, 2014).
"Eigenvector localization, implicit regularization, and algorithmic antidifferentiation for largescale graphs and networked data"
(or try also here for another version)
(at ICERM and IMA, May and April 2014, respectively).
"Randomized matrix algorithms and largescale scientific data analysis"
(at the University of Michigan, April 2014).
"Recent Results in Randomized Numerical Linear Algebra"
(or click here)
(at the NIPS 2013 Workshops, December 2013).
"Implementing Randomized Matrix Algorithms in Parallel and Distributed Environments"
(at the Simons Big Data Workshop II, October 2013).
"Inputsparsity Time Algorithms for Embeddings and Regression Problems"
(at the Simons Big Data Workshop I, September 2013).
"Past, Present and Future of Randomized Numerical Linear Algebra:
Part I (PD) and
Part II" (MM)
(at the Simons Big Data Bootcamp, September 2013).
"Randomized Regression in Parallel and Distributed Environments"
(at the GraphLab Workshop, July 2013).
"Sensors, networks, and massive data"
or click
here
(at Kavli FoS, November 2012).
"Implementing Randomized Matrix Algorithms in Parallel and Distributed Environments"
(at MMDS 2012, July 2012).
"Extracting insight from large networks: implications of smallscale and largescale structure"
(at the University of Marlyand, April 2012).
"Fast Approximation of Matrix Coherence and Statistical Leverage"
(at the NIPS Workshops, December 2011).
"Approximate computation and implicit regularization in largescale data analysis"
or click
here
(at the Workshop on Beyond WorstCase Analysis, September 2011).
"Linear Algebra and Machine Learning of Large Informatics Graphs"
(at the NIPS Workshops, December 2010).
"Geometric Tools for Identifying Structure in Large Social and Information Networks"
(90 minute version, tutorial at SAMSI Opening workshop, August 2010).
"Geometric Tools for Graph Mining of Large Social and Information Networks"
(3 hour version, tutorial at KDD 2010, July 2010).
"Community Structure in Large Social and Information Networks"
(at the Newton Institute in Cambridge, June 2010).
"Algorithmic and Statistical Perspectives on LargeScale Data Analysis"
(at the SF Bay Area DMSIG Meeting, February 2010).
"Community Structure in Large Social and Information Networks,"
in avi
or mpg (note: slow to download),
(at the 2009 IIT Kanpur Processing Massive Data Sets Workshop, December 2009).
"Statistical Leverage and Improved Matrix Algorithms"
(at the 2009 ICML Workshops, June 2009).
"Sampling Algorithms and Coresets for Lp Regression and Applications,"
in mpg (note: slow to download),
(at the 2006 IIT Kanpur Data Streams Workshop, December 2006).
"CUR Matrix Decompositions for Improved Data Analysis"
(at Johns Hopkins University, March 2006).
"Fast Monte Carlo Algorithms for Matrix Operations and Massive Data Set Analysis"
(at the 2005 IPAM Summer School, July 2005).
My old web page
at Yale has additional (somewhat outdated) information;
and my less old web page
at Stanford has additional (somewhat less outdated) information;
but you should click on them (or Google search for them) if you are interested in
extremely large homes.
