Michael Mahoney's new home page

Basic info
I am at ICSI and the Department of Statistics at UC Berkeley, and
I am also in the
RISELab
(in the past AMPLab)
in the Department of Computer Science.
You can reach me via electronic mail:
mmahoney
is the username, and
stat dot berkeley dot edu
follows the at symbol.
(Of course, my gmail address is still valid.)
Also, some other (somewhat outdated) information can be found at
my old web page
at Stanford and
my older web page
at Yale.

Students, postdocs, etc.
Most of my work focuses on the theory and practice of what is now called big data, although I was doing it back when it was just massive, and prior to that when it was just large.
On the theory side, we develop algorithms and statistical methods for matrix, graph, regression, optimization, and related problems.
On the practice side, we provide implementations (e.g., on single machine, distributed data system, and supercomputer environments), and we also apply the methods to a range of problems in internet and social media analysis, social network analysis, as well as genetics, mass spec imaging, astronomy, climate, and a range of other scientific applications.
I have been fortunate enough to work with a number of great students and postdocs over the years.
Zhewei Yao (student, UC Berkeley)
Feynman Liang (student, UC Berkeley)
Amir Gholaminejad (postdoc, UC Berkeley)
Kai Rothauge (postdoc, UC Berkeley)
Shusen Wang (postdoc, UC Berkeley)
Peng Xu (student, Stanford)
Aditya Devarakonda (UC Berkeley)
Julian Shun (postdoc, UC Berkeley, now faculty at MIT)
Kimon Fountoulakis (postdoc, UC Berkeley, now faculty at Waterloo)
Fred Roosta (postdoc, UC Berkeley, now faculty at Queensland)
Alex Gittens (postdoc, UC Berkeley; now faculty at RPI)
Di Wang (UC Berkeley PhD, 2017; now at Georgia Tech)
Jiyan Yang (Stanford PhD, 2016; now at Facebook)
Aaron Adcock (Stanford PhD, 2014; now at Facebook)
Xiangrui Meng (Stanford PhD, 2014; at LinkedIn, then Databricks)
Lorenzo Orecchia (intern, Yahoo; now faculty at Boston)
Jure Leskovec (intern, Yahoo; now faculty at Stanford)
Hari Narayanan (intern, Yahoo; faculty at Washington, then TIFR, Mumbai)
Jeff Phillips (intern, Yahoo; now faculty at Utah)
Lek-Heng Lim (intern, Yahoo; now faculty at Chicago)
Boulos Harb (intern, Yahoo; now at Google)
Research interests

Algorithmic and statistical aspects of modern large-scale data analysis.

Randomized linear algebra and randomized numerical linear algebra.

Scalable convex and nonconvex optimization in parallel/distributed environments.

Implicit regularization and implicit optimization
methods in scalable approximation algorithms.

Graph approximation algorithms and applications to large social and information networks.

Applications to DNA microarray, SNP, astronomical, medical imaging, and other scientific data.
A lot of my work has focused on
Randomized Linear Algebra, i.e., using random sampling and
random projection methods to solve very large matrix-based problems;
developing
geometric network analysis tools, i.e., using scalable approximation
algorithms with a geometric
or statistical flavor to analyze the structure and dynamics of large
informatics graphs;
developing approximate computation and regularization methods for large
informatics graphs;
applications to community detection, clustering, and information dynamics in
large social and information networks; and
applications to DNA single nucleotide polymorphism (SNP) data,
astronomical and medical imaging data,
and large-scale statistical data analysis more generally.
In the past, I developed and analyzed algorithms for large matrix, graph,
and regression problems, and I applied these and related tools to the
statistical data analysis of extremely large scientific and Internet data
sets.
For example, I worked on
large-scale web analytics, machine learning, and query log analysis;
applications of graph partitioning algorithms to clustering and
community identification; and
applications of randomized matrix algorithms to hyperspectral medical image
data, DNA microarray data, and DNA SNP data.
In the more distant past, I have also worked on developing and analyzing
Monte Carlo algorithms for performing useful computations on extremely
large matrices,
e.g., the additive-error and relative-error CUR matrix decompositions.
Past research has also included work in computational statistical mechanics
on the
development and analysis of the
TIP5P
model of liquid water, as well as work in both computational and experimental
biophysics on proteins and protein-nucleic acid interactions.
Other items of interest
Software. (See publication list for code to reproduce results on any one paper.)
Alchemist project.
For more details, see the
RISE project page on Alchemist
or the KDD 2018 paper or the CUG 2018 paper.
Subsampled Second-order nonconvex optimization.
For more details, see the
arXiv paper.
GPU-accelerated Subsampled Newton's Method.
For more details, see the
arXiv paper.
Local Graph Clustering.
For more details, see the
PIEEE paper.
Performance of linear algebra in Spark.
For more details, see the arXiv paper,
or the talk at the 2016 Dato Data Science Summit,
or the blog post by Alex Gittens.
LSRN: the randomized least-squares solver for parallel environments.
For more details, see the
LSRN paper.
Teaching.
Spring 2018:
Linear Algebra for Data Science:
A four-unit introductory class on linear algebra, from the perspective of probability and related topics in the mathematics of data.
(Expanded from a two-unit "connector" class on mathematics for matrices/graphs/data that I designed and taught during
Spring 2017 and Spring 2016.)
(I have a detailed but work-in-progress set of course notes for this class.
I will post them at an appropriate point; but, in the meantime, email me if you are interested in them.)
Spring 2015:
"Stat260/CS294: Topics in Spectral Graph Methods", an overview of practical spectral graph theory.
Update: full set of course lectures.
Fall 2013:
"Stat260/CS294: Randomized Algorithms for Matrices and Data", an overview of Randomized Linear Algebra.
Update: full set of course lectures.
Fall 2009:
CS369M: Algorithms for Massive Data Set Analysis
Workshops.
"Foundations of Data Science":
Fall 2018 program at the Simons Institute at UC Berkeley.
MMDS 2016, 2014, ...:
for more information, see the
main MMDS web page,
or the MMDS Video Library.
"Mathematics of Data":
2016 Summer School at the PCMI (Park City Mathematics Institute);
for more information, see the
flyer and the (upcoming) edited volume of lectures.
"RandNLA: Randomization in Numerical Linear Algebra":
2015 Gene Golub SIAM Summer School, June 2015, in Delphi, Greece;
for more information, see the
flyer, or
the SIAM News article about it.
"Theoretical Foundations of Big Data Analysis":
Fall 2013 program at the Simons Institute at UC Berkeley.
FODA/TRIPODS.
I am the PI on the new UC Berkeley FODA (Foundations of Data Analysis) Institute grant, which is part of the NSF TRIPODS program, to deepen the theoretical foundations of data science in a new transdisciplinary institute.
Also involved are co-PIs Bin Yu, Fernando Perez, Michael Jordan, and Dick Karp.
Here is more information about it:
The main FODA web page.
The UC Berkeley press release about it.
The summary article from Datanami about it.
The original NSF announcement.
The report from the planning workshop.
Some press.
RISELab challenges, as described in our
technical report.
Video synopsis of the Communications of the ACM article on "RandNLA: Randomized Numerical Linear Algebra."
Postdoc Julian Shun has been awarded the ACM's doctoral dissertation award for his 2015 CMU doctoral thesis "Shared-Memory Parallelism Can Be Simple, Fast, and Scalable".
AMPLab-Cray-NERSC/LBL collaboration, on HPC Wire, which describes our project to implement randomized linear algebra (RLA) algorithms in Spark on HPC platforms for LBL's scientific data applications.
Role of RLA in HPC, as described in HPC Wire.
Miscellaneous.
Charles Martin gave a talk on our recent work on "Why Deep Learning Works: Self Regularization in Neural Networks" at LBNL, June 2018.
Kai Rothauge gave a talk on our recent work on "Alchemist: An Apache Spark <=> MPI Interface" at the Spark+AI meeting, June 2018.
I gave a talk on our recent work on "Alchemist: An Apache Spark <=> MPI Interface" at the XLDB meeting, May 2018.
I attended and gave a talk on "BIG Biomedicine and the Foundations of BIG Data Analysis"
at the "Big Data in BioMedicine" Conference
at Stanford Medical School, May 23, 2014.
Click
"here"
for a video of the talk.
They also got two nice CEO-style pictures:
here
I am, and
here
I am again.
Click
here
for information (including the slides and video!) on the Tutorial on
"Geometric Tools for Identifying Structure in Large Social and Information
Networks," given originally at ICML10 and KDD10 and subsequently at many
other places. (The slides are also linked to below.)
We have made public many of the networks we have used in our "Community
Structure" papers.
Click
here
for the networks and for related information.
Publications
2018

Alchemist: An Apache Spark <=> MPI Interface,

A. Gittens, K. Rothauge, M. W. Mahoney, S. Wang, L. Gerhardt, Prabhat, J. Kottalam, M. Ringenburg, and K. Maschhoff,

Technical Report, Preprint: arXiv:1806.01270 (2018)
(arXiv),

Accepted for publication, CUG 2018.

Accelerating Large-Scale Data Analysis by Offloading to High-Performance Computing Libraries using Alchemist,

A. Gittens, K. Rothauge, S. Wang, M. W. Mahoney, L. Gerhardt, Prabhat, J. Kottalam, M. Ringenburg, and K. Maschhoff,

Technical Report, Preprint: arXiv:1805.11800 (2018)
(arXiv),

Accepted for publication, KDD 2018.

Group Collaborative Representation for Image Set Classification,

B. Liu, L. Jing, J. Li, J. Yu, A. Gittens, and M. W. Mahoney,

Accepted for publication, International Journal of Computer Vision.

Error Estimation for Randomized Least-Squares Algorithms via the Bootstrap,

M. E. Lopes, S. Wang, M. W. Mahoney,

Technical Report, Preprint: arXiv:1803.08021 (2018)
(arXiv),

Accepted for publication, ICML 2018,

Journal version submitted for publication.

GPU Accelerated Sub-Sampled Newton's Method,

S. B. Kylasa, F. Roosta-Khorasani, M. W. Mahoney, and A. Grama,

Technical Report, Preprint: arXiv:1802.09113 (2018)
(arXiv),
(code),

Submitted for publication.

Hessian-based Analysis of Large Batch Training and Robustness to Adversaries,

Z. Yao, A. Gholami, Q. Lei, K. Keutzer, and M. W. Mahoney,

Technical Report, Preprint: arXiv:1802.08241 (2018)
(arXiv),

Submitted for publication.

Inexact Non-Convex Newton-Type Methods,

Z. Yao, P. Xu, F. Roosta-Khorasani, and M. W. Mahoney,

Technical Report, Preprint: arXiv:1802.06925 (2018)
(arXiv),

Submitted for publication.

Out-of-sample extension of graph adjacency spectral embedding,

K. Levin, F. Roosta-Khorasani, M. W. Mahoney, and C. E. Priebe,

Technical Report, Preprint: arXiv:1802.06307 (2018)
(arXiv),

Accepted for publication, ICML 2018,

Journal version submitted for publication.
2017

Lectures on Randomized Numerical Linear Algebra,

P. Drineas and M. W. Mahoney,

Technical Report, Preprint: arXiv:1712.08880 (2017)
(arXiv),

To appear in: Lectures of the 2016 PCMI Summer School on Mathematics of Data.

Avoiding Synchronization in First-Order Methods for Sparse Convex Optimization,

A. Devarakonda, K. Fountoulakis, J. Demmel, and M. W. Mahoney,

Technical Report, Preprint: arXiv:1712.06047 (2017)
(arXiv),

Accepted for publication, Proc. IPDPS 2018.

Rethinking generalization requires revisiting old ideas: statistical mechanics approaches and complex learning behavior,
(click here for a blog about this paper)

C. H. Martin and M. W. Mahoney,

Technical Report, Preprint: arXiv:1710.09553 (2017)
(arXiv),

LASAGNE: Locality And Structure Aware Graph Node Embedding,

E. Faerman, F. Borutta, K. Fountoulakis, and M. W. Mahoney,

Technical Report, Preprint: arXiv:1710.06520 (2017)
(arXiv),

A Berkeley View of Systems Challenges for AI,

I. Stoica, D. Song, R. A. Popa, D. A. Patterson, M. W. Mahoney, R. H. Katz, A. D. Joseph, M. Jordan, J. M. Hellerstein, J. Gonzalez, K. Goldberg, A. Ghodsi, D. E. Culler, and P. Abbeel,

Technical Report No. UCB/EECS-2017-159, October 2017
(www),

GIANT: Globally Improved Approximate Newton Method for Distributed Optimization,

S. Wang, F. Roosta-Khorasani, P. Xu, and M. W. Mahoney,

Technical Report, Preprint: arXiv:1709.03528 (2017)
(arXiv),
(Spark code),
(Python code),

Submitted for publication.

Second-Order Optimization for Non-Convex Machine Learning: An Empirical Study,

P. Xu, F. Roosta-Khorasani, and M. W. Mahoney,

Technical Report, Preprint: arXiv:1708.07827 (2017)
(arXiv),
(code),

Submitted for publication.

Newton-Type Methods for Non-Convex Optimization Under Inexact Hessian Information,

P. Xu, F. Roosta-Khorasani, and M. W. Mahoney,

Technical Report, Preprint: arXiv:1708.07164 (2017)
(arXiv),

Submitted for publication.

A Bootstrap Method for Error Estimation in Randomized Matrix Multiplication,

M. E. Lopes, S. Wang, and M. W. Mahoney,

Technical Report, Preprint: arXiv:1708.01945 (2017)
(arXiv),

Submitted for publication.

Capacity releasing diffusions for speed and locality,

D. Wang, K. Fountoulakis, M. Henzinger, M. W. Mahoney, and S. Rao,

Technical Report, Preprint: arXiv:1706.05826 (2017)
(arXiv),

Proc. of the 34th ICML Conference 3598-3607 (2017)
(pdf),
(supp).

Scalable Kernel K-Means Clustering with Nystrom Approximation: Relative-Error Bounds,

S. Wang, A. Gittens, and M. W. Mahoney,

Technical Report, Preprint: arXiv:1706.02803 (2017)
(arXiv),

Accepted for publication, J. Machine Learning Research.

Union of Intersections (UoI) for Interpretable Data Driven Discovery and Prediction,

K. E. Bouchard, A. F. Bujan, F. Roosta-Khorasani, S. Ubaru, Prabhat, A. M. Snijders, J.-H. Mao, E. F. Chang, M. W. Mahoney, and S. Bhattacharyya,

Technical Report, Preprint: arXiv:1705.07585 (2017)
(arXiv),

Proc. of the 2017 NIPS Conference
(pdf).

Skip-Gram - Zipf + Uniform = Vector Additivity,

A. Gittens, D. Achlioptas, and M. W. Mahoney,

Proc. of the 55th ACL Meeting 69-76 (2017)
(pdf).

Principles and Applications of Science of Information [Scanning the Issue],

T. Courtade, A. Grama, M. W. Mahoney, and T. Weissman,

Proceedings of the IEEE, 105(2): 183-188 (2017)
(pdf).

Social Discrete Choice Models,

D. Zhang, K. Fountoulakis, J. Cao, M. Yin, M. W. Mahoney, and A. Pozdnoukhov,

Technical Report, Preprint: arXiv:1703.07520 (2017)
(arXiv),

Submitted for publication.

Sketched Ridge Regression: Optimization Perspective, Statistical Perspective, and Model Averaging,

S. Wang, A. Gittens, and M. W. Mahoney,

Technical Report, Preprint: arXiv:1702.04837 (2017)
(arXiv),

Proc. of the 34th ICML Conference 3608-3616 (2017)
(pdf),

J. Machine Learning Research, 18(218): 1-50 (2018)
(pdf).
2016

Avoiding communication in primal and dual block coordinate descent methods,

A. Devarakonda, K. Fountoulakis, J. Demmel, and M. W. Mahoney,

Technical Report, Preprint: arXiv:1612.04003 (2016)
(arXiv),

Submitted for publication.

Feature-distributed sparse regression: a screen-and-clean approach,

J. Yang, M. W. Mahoney, M. A. Saunders, and Y. Sun,

Proc. of the 2016 NIPS Conference
(pdf).

Multi-label learning with semantic embeddings,

L. Jing, M. Cheng, L. Yang, A. Gittens, M. W. Mahoney,

ICLR 2017 OpenReview.net
(www),

Mapping the Similarities of Spectra: Global and Locally-biased Approaches to SDSS Galaxy Data,

D. Lawlor, T. Budavari, and M. W. Mahoney,

Technical Report, Preprint: arXiv:1609.03932 (2016)
(arXiv),

The Astrophysical Journal, 833:1, 26 (2016)
(pdf).

Lecture Notes on Spectral Graph Methods,

M. W. Mahoney,

Technical Report, Preprint: arXiv:1608.04845 (2016)
(arXiv),

Lecture Notes on Randomized Linear Algebra,

M. W. Mahoney,

Technical Report, Preprint: arXiv:1608.04481 (2016)
(arXiv),

An optimization approach to locally-biased graph algorithms,

K. Fountoulakis, D. F. Gleich, M. W. Mahoney,

Technical Report, Preprint: arXiv:1607.04940 (2016)
(arXiv),

Proceedings of the IEEE, 105(2): 256-272 (2017)
(pdf).

DCAR: A Discriminative and Compact Audio Representation to Improve Event Detection,

L. Jing, B. Liu, J. Choi, A. Janin, J. Bernd, M. W. Mahoney, and G. Friedland,

Technical Report, Preprint: arXiv:1607.04378 (2016)
(arXiv),

Proc. of the 2016 ACM Multimedia Conference 57-61 (2016)
(pdf),

IEEE Transactions on Multimedia, 19(12): 2637-2650 (2017)
(pdf).

Matrix Factorization at Scale: a Comparison of Scientific Data Analytics in Spark and C+MPI Using Three Case Studies,

A. Gittens, A. Devarakonda, E. Racah, M. Ringenburg, L. Gerhardt, J. Kottalam, J. Liu, K. Maschhoff, S. Canon, J. Chhugani, P. Sharma, J. Yang, J. Demmel, J. Harrell, V. Krishnamurthy, M. W. Mahoney, and Prabhat,

Technical Report, Preprint: arXiv:1607.01335 (2016)
(arXiv),
(code),

Proc. 2016 IEEE BigData, 204-213 (2016)
(pdf).

Sub-sampled Newton Methods with Non-uniform Sampling,

P. Xu, J. Yang, F. Roosta-Khorasani, C. Re, and M. W. Mahoney,

Technical Report, Preprint: arXiv:1607.00559 (2016)
(arXiv),

Proc. of the 2016 NIPS Conference
(pdf).

Approximating the Solution to Mixed Packing and Covering LPs in parallel Õ(ε^{-3}) time,

M. W. Mahoney, S. Rao, D. Wang, and P. Zhang,

Proc. of the 43rd ICALP Conference, 52:1-52:14 (2016)
(pdf).

A Simple and Strongly-Local Flow-Based Method for Cut Improvement,

N. Veldt, D. F. Gleich, and M. W. Mahoney,

Technical Report, Preprint: arXiv:1605.08490 (2016)
(arXiv),

Proc. of the 33rd ICML Conference 1938-1947 (2016)
(pdf),
(supp).

RandNLA: Randomized Numerical Linear Algebra,

P. Drineas and M. W. Mahoney,

Communications of the ACM, 59, 80-90 (2016)
(pdf).

FLAG n' FLARE: Fast Linearly-Coupled Adaptive Gradient Methods,

X. Cheng, F. Roosta-Khorasani, S. Palombo, P. L. Bartlett, and M. W. Mahoney,

Technical Report, Preprint: arXiv:1605.08108 (2016)
(arXiv),

Proc. of the 21st International Conference on AISTATS, PMLR 84:404-414 (2018)
(pdf,
supp).

Parallel Local Graph Clustering,

J. Shun, F. Roosta-Khorasani, K. Fountoulakis, and M. W. Mahoney,

Technical Report, Preprint: arXiv:1604.07515 (2016)
(arXiv),

Proceedings of the VLDB Endowment, 9(12): 1041-1052 (2016)
(pdf).

A multi-platform evaluation of the randomized CX low-rank matrix factorization in Spark,

A. Gittens, J. Kottalam, J. Yang, M. F. Ringenburg, J. Chhugani, E. Racah, M. Singh, Y. Yao, C. Fischer, O. Ruebel, B. Bowen, N. G. Lewis, M. W. Mahoney, V. Krishnamurthy, and Prabhat,

Proc. 5th International Workshop on Parallel and Distributed Computing for Large Scale Machine Learning and Big Data Analytics, at IPDPS,
2016
(pdf).

Mining Large Graphs,

D. F. Gleich and M. W. Mahoney,

In
Handbook of Big Data.
pp. 191-220,
edited by
P. Buhlmann, P. Drineas, M. Kane, and M. van de Laan,
Chapman and Hall/CRC Press,
2016
(pdf).

Structural properties underlying high-quality Randomized Numerical Linear Algebra algorithms,

M. W. Mahoney and P. Drineas,

In
Handbook of Big Data.
pp. 137-154,
edited by
P. Buhlmann, P. Drineas, M. Kane, and M. van de Laan,
Chapman and Hall/CRC Press,
2016
(pdf).

Variational Perspective on Local Graph Clustering,

K. Fountoulakis, X. Cheng, J. Shun, F. Roosta-Khorasani, and M. W. Mahoney,

Technical Report, Preprint: arXiv:1602.01886 (2016)
(arXiv),

Accepted for publication, Mathematical Programming.

Sub-Sampled Newton Methods II: Local Convergence Rates,

F. Roosta-Khorasani and M. W. Mahoney,

Technical Report, Preprint: arXiv:1601.04738 (2016)
(arXiv),

Journal version submitted for publication.

Sub-Sampled Newton Methods I: Globally Convergent Algorithms,

F. Roosta-Khorasani and M. W. Mahoney,

Technical Report, Preprint: arXiv:1601.04737 (2016)
(arXiv),

Journal version submitted for publication.

RandNLA, Pythons, and the CUR for Your Data Problems: Reporting from G2S3 2015 in Delphi,

E. Gallopoulos, P. Drineas, I. Ipsen, and M. W. Mahoney,

SIAM News 49:1 January/February 2016
(web),
(pdf).
2015

Faster Parallel Solver for Positive Linear Programs via Dynamically-Bucketed Selective Coordinate Descent,

D. Wang, M. W. Mahoney, N. Mohan, and S. Rao,

Technical Report, Preprint: arXiv:1511.06468 (2015)
(arXiv).

A Local Perspective on Community Structure in Multilayer Networks,

L. G. S. Jeub, M. W. Mahoney, P. J. Mucha, and M. A. Porter,

Technical Report, Preprint: arXiv:1510.05185 (2015)
(arXiv),

Network Science, 5(2): 144-163, 2017
(pdf).

Optimal Subsampling Approaches for Large Sample Linear Regression,

R. Zhu, P. Ma, M. W. Mahoney, and B. Yu,

Technical Report, Preprint: arXiv:1509.05111 (2015)
(arXiv).

Unified Acceleration Method for Packing and Covering Problems via Diameter Reduction,

D. Wang, S. Rao, and M. W. Mahoney,

Technical Report, Preprint: arXiv:1508.02439 (2015)
(arXiv),

Proc. of the 43rd ICALP Conference, 50:1-50:13 (2016)
(pdf).

Using local spectral methods to robustify graph-based learning algorithms,

D. F. Gleich and M. W. Mahoney,

Proc. of the 21st Annual SIGKDD, (2015)
(pdf)
(code).

Structured Block Basis Factorization for Scalable Kernel Matrix Evaluation,

R. Wang, Y. Li, M. W. Mahoney, and E. Darve,

Technical Report, Preprint: arXiv:1502.03571 (2015)
(arXiv).

Identifying important ions and positions in mass spectrometry imaging data using CUR matrix decompositions,

J. Yang, O. Rubel, Prabhat, M. W. Mahoney, and B. P. Bowen,

Analytical Chemistry, 87 (9), 4658-4666 (2015)
(pdf)
(code).

Spectral Gap Error Bounds for Improving CUR Matrix Decomposition and the Nystrom Method,

D. G. Anderson, S. S. Du, M. W. Mahoney, C. Melgaard, K. Wu, and M. Gu,

Proc. of the 18th International Conference on AISTATS, PMLR 38:19-27 (2015)
(pdf,
supp)
(code).

Weighted SGD for Lp Regression with Randomized Preconditioning,

J. Yang, Y.-L. Chow, C. Re, and M. W. Mahoney,

Technical Report, Preprint: arXiv:1502.03571 (2015)
(arXiv),

Proc. of the 27th Annual SODA, 558-569 (2016)
(pdf),

J. Machine Learning Research, 18(211): 1-43 (2018)
(pdf).

Implementing Randomized Matrix Algorithms in Parallel and Distributed Environments,

J. Yang, X. Meng, and M. W. Mahoney,

Technical Report, Preprint: arXiv:1502.03032 (2015)
(arXiv)
(code),

Proceedings of the IEEE 104(1): 58-92 (2016)
(pdf).
2014

Tree decompositions and social graphs,

A. B. Adcock, B. D. Sullivan, and M. W. Mahoney,

Technical Report, Preprint: arXiv:1411.1546 (2014)
(arXiv),
(code).

Internet Mathematics, 12(5), 315-361 (2016)
(pdf).

Fast Randomized Kernel Methods With Statistical Guarantees,

A. El Alaoui and M. W. Mahoney,

Technical Report, Preprint: arXiv:1411.0306 (2014)
(arXiv),

Proc. of the 2015 NIPS Conference
(pdf).

Signal Processing for Big Data (Editorial for Special Issue)

G. B. Giannakis, F. Bach, R. Cendrillon, M. Mahoney, and J. Neville,

IEEE Signal Processing Magazine, 31: 15-16 (September 2014)
(pdf).

A Statistical Perspective on Randomized Sketching for Ordinary Least-Squares,

G. Raskutti and M. W. Mahoney,

Technical Report, Preprint: arXiv:1406.5986 (2014)
(arXiv),

Proc. of the 32nd ICML Conference (2015)
(pdf),

J. Machine Learning Research, 17(214): 1-31, 2016
(pdf).

Random Laplace Feature Maps for Semigroup Kernels on Histograms,

J. Yang, V. Sindhwani, Q. Fan, H. Avron, and M. W. Mahoney,

Proc. of the 27th CVPR Conference, 971-978 (2014)
(pdf).

Anti-differentiating Approximation Algorithms: A case study with Min-cuts, Spectral, and Flow,

D. F. Gleich and M. W. Mahoney,

Proc. of the 31st ICML Conference, JMLR W&CP 32 (1): 1018-1025 (2014)
(pdf)
(code, code)
(talk).

Quasi-Monte Carlo Feature Maps for Shift-Invariant Kernels,

J. Yang, V. Sindhwani, H. Avron, and M. W. Mahoney,

Proc. of the 31st ICML Conference, JMLR W&CP 32 (1): 485-493 (2014)
(pdf),
(code),

Technical Report, Preprint: arXiv:1412.8293 (2014)
(arXiv),

J. Machine Learning Research, 17(120): 1-38 (2016)
(pdf).

Think Locally, Act Locally: The Detection of Small, Medium-Sized, and Large Communities in Large Networks,

L. G. S. Jeub, P. Balachandran, M. A. Porter, P. J. Mucha, and M. W. Mahoney,

Technical Report, Preprint: arXiv:1403.3795 (2014)
(arXiv),
(code, code),

Physical Review E, 91, 012821 (2015)
(pdf).

A new spin on an old algorithm: technical perspective on "Communication costs of Strassen's matrix multiplication,"

M. W. Mahoney,

Communications of the ACM, 57(2): 106 (2014)
(pdf).
2013

Tree-like Structure in Large Social and Information Networks,

A. B. Adcock, B. D. Sullivan, and M. W. Mahoney,

Proc. of the 2013 IEEE ICDM, 1-10 (2013)
(pdf).

Objective Identification of Informative Wavelength Regions in Galaxy Spectra,

C.-W. Yip, M. W. Mahoney, A. S. Szalay, I. Csabai, T. Budavari, R. F. G. Wyse,
and L. Dobos,

Technical Report, Preprint: arXiv:1312.0637 (2013)
(arXiv),

Astronomical Journal, 147, 5, 110 (2014)
(pdf).

Evaluating OpenMP Tasking at Scale for the Computation of Graph Hyperbolicity,

A. B. Adcock, B. D. Sullivan, O. R. Hernandez, and M. W. Mahoney,

Proc. of the 9th IWOMP, 71-83 (2013)
(pdf).

Frontiers in Massive Data Analysis,

Committee on the Analysis of Massive Data, et al. (M. I. Jordan, et al.),

The National Academies Press (2013)
(pdf),
(web).

A Statistical Perspective on Algorithmic Leveraging,

P. Ma, M. W. Mahoney, and B. Yu,

Technical Report, Preprint: arXiv:1306.5362 (2013)
(arXiv),

Proc. of the 31st ICML Conference, JMLR W&CP 32 (1): 91-99 (2014)
(pdf),

J. Machine Learning Research, 16, 861-911 (2015)
(pdf).

Robust Regression on MapReduce,

X. Meng, and M. W. Mahoney,

Proc. of the 30th ICML Conference, JMLR W&CP 28(3): 888-896 (2013)
(pdf).

Quantile Regression for Large-scale Applications,

J. Yang, X. Meng, and M. W. Mahoney,

Technical Report, Preprint: arXiv:1305.0087 (2013)
(arXiv),
(code),

Proc. of the 30th ICML Conference, JMLR W&CP 28(3): 881-887 (2013)
(pdf),

SIAM J. Scientific Computing, 36(5), S78-S110 (2014)
(pdf).

Revisiting the Nystrom Method for Improved Large-Scale Machine Learning,

A. Gittens and M. W. Mahoney,

Technical Report, Preprint: arXiv:1303.1849 (2013)
(arXiv),
(code),

Proc. of the 30th ICML Conference, JMLR W&CP 28(3): 567-575 (2013)
(pdf),

J. Machine Learning Research, 17(117): 1-65 (2016)
(pdf).
2012

Semi-supervised Eigenvectors for Large-scale Locally-biased Learning,

T. J. Hansen and M. W. Mahoney,

Proc. of the 2012 NIPS Conference
(pdf),
(code),

Technical Report, Preprint: arXiv:1304.7528 (2013)
(arXiv),

J. Machine Learning Research, 15, 3691-3734 (2014)
(pdf).

Low-distortion Subspace Embeddings in Input-sparsity Time and Applications to Robust Linear Regression,

X. Meng and M. W. Mahoney,

Technical Report, Preprint: arXiv:1210.3135 (2012)
(arXiv),

Proc. of the 45th STOC, 91-100 (2013)
(pdf).

The Fast Cauchy Transform and Faster Robust Linear Regression,

K. L. Clarkson, P. Drineas, M. Magdon-Ismail, M. W. Mahoney, X. Meng, and D. P. Woodruff,

Technical Report, Preprint: arXiv:1207.4684 (2012)
(arXiv),

Proc. of the 24th Annual SODA, 466-477 (2013)
(pdf),

SIAM J. Computing, 45, 763-810 (2016)
(pdf).

rCUR: an R package for CUR matrix decomposition,

A. Bodor, I. Csabai, M. W. Mahoney, and N. Solymosi,

BMC Bioinformatics, 13:103 (2012)
(pdf),
(code).

Approximate Computation and Implicit Regularization for Very Large-scale Data Analysis,

M. W. Mahoney,

Technical Report, Preprint: arXiv:1203.0786 (2012)
(arXiv),

Proc. of the 2012 ACM Symposium on Principles of Database Systems, 143-154, 2012
(pdf).

On the Hyperbolicity of Small-World and Tree-Like Random Graphs,

W. Chen, W. Fang, G. Hu, and M. W. Mahoney,

Technical Report, Preprint: arXiv:1201.1717 (2012)
(arXiv),

Proc. of the 23rd ISAAC 278-288 (2012)
(pdf),

Internet Mathematics, 9(4), 434-491 (2013)
(pdf).
2011

Randomized Dimensionality Reduction for K-means Clustering,

C. Boutsidis, A. Zouzias, M. W. Mahoney, and P. Drineas,

Technical Report, Preprint: arXiv:1110.2897 (2011)
(arXiv),

IEEE Transactions on Information Theory, 61(2), 1045-1062 (2015)
(pdf).

Regularized Laplacian Estimation and Fast Eigenvector Approximation,

P. O. Perry and M. W. Mahoney,

Technical Report, Preprint: arXiv:1110.1757 (2011)
(arXiv),

Proc. of the 2011 NIPS Conference
(pdf).

LSRN: A Parallel Iterative Solver for Strongly Over- or Under-Determined Systems,

X. Meng, M. A. Saunders, and M. W. Mahoney,

Technical Report, Preprint: arXiv:1109.5981 (2011)
(arXiv),
(code),

SIAM J. Scientific Computing, 36(2), C95-C118 (2014)
(pdf).

Fast approximation of matrix coherence and statistical leverage,

P. Drineas, M. Magdon-Ismail, M. W. Mahoney, and D. P. Woodruff,

Technical Report, Preprint: arXiv:1109.3843 (2011)
(arXiv),

Proc. of the 29th ICML Conference (2012)
(pdf),

J. Machine Learning Research, 13, 3475-3506 (2012)
(pdf).

Localization on low-order eigenvectors of data matrices,

M. Cucuringu and M. W. Mahoney,

Technical Report, Preprint: arXiv:1109.1355 (2011)
(arXiv).

Efficient Genome-wide Selection of PCA-Correlated tSNPs for Genotype Imputation,

A. Javed, P. Drineas, M. W. Mahoney, and P. Paschou,

Annals of Human Genetics, 75, 707-722 (2011)
(pdf).

Randomized Algorithms for Matrices and Data,

M. W. Mahoney,

Foundations and Trends in Machine Learning,
NOW Publishers,
Volume 3, Issue 2, 2011
(now),

TR version:
Technical Report, Preprint: arXiv:1104.5557 (2011)
(arXiv).

(Abridged version in:
Advances in Machine Learning and Data Mining for Astronomy,
edited by
M. J. Way, et al.,
pp. 647-672,
2012.)
2010

Computation in Large-Scale Scientific and Internet Data Applications is a Focus of MMDS 2010,

M. W. Mahoney,

Technical Report, Preprint: arXiv:1012.4231 (2010)
(arXiv),

Appeared in
SIGKDD Explorations,
SIGACT News,
ASA-SCGN Newsletter,
and IMS Bulletin.

CUR from a Sparse Optimization Viewpoint,

J. Bien, Y. Xu, and M. W. Mahoney,

Technical Report, Preprint: arXiv:1011.0413 (2010)
(arXiv),

Proc. of the 2010 NIPS Conference
(ps,
pdf).

Algorithmic and Statistical Perspectives on Large-Scale Data Analysis,

M. W. Mahoney,

Technical Report, Preprint: arXiv:1010.1609 (2010)
(arXiv),

In:
Combinatorial Scientific Computing,
pp. 427-469,
edited by
U. Naumann and O. Schenk,
2012.

Implementing regularization implicitly via approximate eigenvector computation,

M. W. Mahoney and L. Orecchia,

Technical Report, Preprint: arXiv:1010.0703 (2010)
(arXiv),

Proc. of the 28th ICML Conference, 121-128 (2011)
(pdf)
(talk).

Approximating Higher-Order Distances Using Random Projections,

P. Li, M. W. Mahoney, and Y. She,

Proc. of the 26th UAI Conference, 312-321 (2010)
(ps,
pdf),

Technical Report, Preprint: arXiv:1203.3492 (2012)
(arXiv).

Effective Resistances, Statistical Leverage, and Applications to Linear Equation Solving,

P. Drineas and M. W. Mahoney,

Technical Report, Preprint: arXiv:1005.3097 (2010)
(arXiv).

Empirical Comparison of Algorithms for Network Community Detection,

J. Leskovec, K. J. Lang, and M. W. Mahoney,

Technical Report, Preprint: arXiv:1004.3539 (2010)
(arXiv),

Proc. of the 19th International WWW, 631-640 (2010)
(ps,
pdf).
2009

A Local Spectral Method for Graphs: with Applications to Improving Graph
Partitions and Exploring Data Graphs Locally,

M. W. Mahoney, L. Orecchia, and N. K. Vishnoi,

Technical Report, Preprint: arXiv:0912.0681 (2009)
(arXiv),

J. Machine Learning Research, 13, 2339-2365 (2012)
(ps,
pdf).

Unsupervised Feature Selection for the k-means Clustering Problem,

C. Boutsidis, M. W. Mahoney, and P. Drineas,

Proc. of the 2009 NIPS Conference
(ps,
pdf).

Learning with Spectral Kernels and Heavy-Tailed Data,

M. W. Mahoney and H. Narayanan,

Technical Report, Preprint: arXiv:0906.4539 (2009)
(arXiv).

Empirical Evaluation of Graph Partitioning Using Spectral Embeddings and Flow,

K. J. Lang, M. W. Mahoney, and L. Orecchia,

Proc. of the 8th International SEA, 197-208 (2009)
(ps,
pdf).

CUR Matrix Decompositions for Improved Data Analysis,

M. W. Mahoney and P. Drineas,

Proc. Natl. Acad. Sci. USA, 106, 697-702 (2009)
(ps,
pdf).
2008

An Improved Approximation Algorithm for the Column Subset Selection Problem,

C. Boutsidis, M. W. Mahoney, and P. Drineas,

Technical Report, Preprint: arXiv:0812.4293 (2008)
(arXiv),

Proc. of the 20th Annual SODA, 968-977 (2009)
(ps,
pdf).

Algorithmic and Statistical Challenges in Modern Large-Scale Data Analysis are the Focus of MMDS 2008

M. W. Mahoney, L.-H. Lim, and G. E. Carlsson

Technical Report, Preprint: arXiv:0812.3702 (2008)
(arXiv),

Appeared in
SIGKDD Explorations
(ps,
pdf),
SIAM News
(ps,
pdf),
and
ASA-SCGN Newsletter
(ps,
pdf),
and abridged versions appeared in IMS Bulletin
(ps,
pdf)
and AmStat News.

Community Structure in Large Networks: Natural Cluster Sizes and the Absence of Large Well-Defined Clusters,

J. Leskovec, K. J. Lang, A. Dasgupta, and M. W. Mahoney,

Technical Report, Preprint: arXiv:0810.1355 (2008)
(arXiv),

Internet Mathematics, 6(1), 29-123 (2009)
(pdf).

Unsupervised Feature Selection for Principal Components Analysis,

C. Boutsidis, M. W. Mahoney, and P. Drineas,

Proc. of the 14th Annual SIGKDD, 61-69 (2008)
(ps,
pdf).

Statistical Properties of Community Structure in Large Social and Information Networks,

J. Leskovec, K. J. Lang, A. Dasgupta, and M. W. Mahoney,

Proc. of the 17th International WWW, 695-704 (2008)
(ps,
pdf).
2007

Faster Least Squares Approximation,

P. Drineas, M. W. Mahoney, S. Muthukrishnan, and T. Sarlos,

Technical Report, Preprint: arXiv:0710.1435 (2007)
(arXiv),

Numerische Mathematik, 117, 219-249 (2011)
(pdf).

PCA-Correlated SNPs for Structure Identification in Worldwide Human Populations,

P. Paschou, E. Ziv, E. G. Burchard, S. Choudhry, W. Rodriguez-Cintron, M. W. Mahoney, and P. Drineas,

PLoS Genetics, 3, 1672-1686 (2007)
(ps,
pdf).

Relative-Error CUR Matrix Decompositions,

P. Drineas, M. W. Mahoney, and S. Muthukrishnan,

Technical Report, Preprint: arXiv:0708.3696 (2007)
(arXiv),

SIAM J. Matrix Analysis and Applications, 30, 844-881 (2008)
(ps,
pdf).

Feature Selection Methods for Text Classification,

A. Dasgupta, P. Drineas, B. Harb, V. Josifovski, and M. W. Mahoney,

Proc. of the 13th Annual SIGKDD, 230-239 (2007)
(ps,
pdf).

Sampling Algorithms and Coresets for Lp Regression,

A. Dasgupta, P. Drineas, B. Harb, R. Kumar, and M. W. Mahoney,

Technical Report, Preprint: arXiv:0707.1714 (2007)
(arXiv),

Proc. of the 19th Annual SODA, 932-941 (2008)
(ps,
pdf),

SIAM J. Computing, 38, 2060-2078 (2009)
(ps,
pdf).

Web Information Retrieval and Linear Algebra Algorithms,

A. Frommer, M. W. Mahoney, and D. B. Szyld (Eds.),

Proc. of Dagstuhl Seminar 07071, (2007)
(web).

Intra- and inter-population genotype reconstruction from tagging SNPs,

P. Paschou, M. W. Mahoney, A. Javed, J. R. Kidd, A. J. Pakstis, S. Gu, K. K. Kidd, and P. Drineas,

Genome Research, 17(1), 96-107 (2007)
(ps,
pdf).
2006

Bridging the Gap Between Numerical Linear Algebra, Theoretical Computer Science, and Data Applications,

G. H. Golub, M. W. Mahoney, P. Drineas, and L.-H. Lim,

SIAM News 39:8 October 2006
(ps,
pdf).

Randomized Algorithms for Matrices and Massive Data Sets,

P. Drineas and M. W. Mahoney,

Proc. of the 32nd Annual VLDB, 1269 (2006)
(ps,
pdf).

Subspace Sampling and Relative-Error Matrix Approximation: Column-Row-Based Methods,

P. Drineas, M. W. Mahoney, and S. Muthukrishnan,

Proc. of the 14th Annual ESA, 304-314 (2006)
(ps,
pdf).

Subspace Sampling and Relative-Error Matrix Approximation: Column-Based Methods,

P. Drineas, M. W. Mahoney, and S. Muthukrishnan,

Proc. of the 10th Annual RANDOM, 316-326 (2006)
(ps,
pdf).

Tensor-CUR Decompositions For Tensor-Based Data,

M. W. Mahoney, M. Maggioni, and P. Drineas,

Proc. of the 12th Annual SIGKDD, 327-336 (2006)
(ps,
pdf),

SIAM J. Matrix Analysis and Applications, 30, 957-987 (2008)
(ps,
pdf).

Polynomial Time Algorithm for Column-Row-Based Relative-Error Low-Rank Matrix Approximation,

P. Drineas, M. W. Mahoney, and S. Muthukrishnan,

Technical Report, DIMACS TR 2006-04 March 2006
(ps,
pdf).

Sampling Algorithms for L2 Regression and Applications,

P. Drineas, M. W. Mahoney, and S. Muthukrishnan,

Proc. of the 17th Annual SODA, 1127-1136 (2006)
(ps,
pdf).
2005

A Randomized Algorithm for a Tensor-Based Generalization of the Singular Value Decomposition,

P. Drineas and M. W. Mahoney,

Technical Report, YALEU/DCS/TR-1327, June 2005
(ps,
pdf),

Linear Algebra and its Applications, 420, 553-571 (2007)
(ps,
pdf).

On the Nystrom Method for Approximating a Gram Matrix for Improved Kernel-Based Learning,

P. Drineas and M. W. Mahoney,

Technical Report, YALEU/DCS/TR-1319, April 2005
(ps,
pdf),

Proc. of the 18th Annual COLT, 323-337 (2005)
(ps,
pdf),

J. Machine Learning Research, 6, 2153-2175 (2005)
(ps,
pdf).
2004

Sampling Subproblems of Heterogeneous Max-Cut Problems and Approximation Algorithms,

P. Drineas, R. Kannan, and M. W. Mahoney,

Technical Report, YALEU/DCS/TR-1283, April 2004
(ps,
pdf),

Proc. of the 22nd Annual STACS, 57-68 (2005)
(ps,
pdf),

Random Structures and Algorithms, 32:3, 307-333 (2008)
(ps,
pdf).

Fast Monte Carlo Algorithms for Matrices III: Computing an Efficient Approximate Decomposition of a Matrix,

P. Drineas, R. Kannan, and M. W. Mahoney,

Technical Report, YALEU/DCS/TR-1271, February 2004
(ps,
pdf),

SIAM J. Computing, 36, 184-206 (2006)
(ps,
pdf).

Fast Monte Carlo Algorithms for Matrices II: Computing Low-Rank Approximations to a Matrix,

P. Drineas, R. Kannan, and M. W. Mahoney,

Technical Report, YALEU/DCS/TR-1270, February 2004
(ps,
pdf),

SIAM J. Computing, 36, 158-183 (2006)
(ps,
pdf).

Fast Monte Carlo Algorithms for Matrices I: Approximating Matrix Multiplication,

P. Drineas, R. Kannan, and M. W. Mahoney,

Technical Report, YALEU/DCS/TR-1269, February 2004
(ps,
pdf),

SIAM J. Computing, 36, 132-157 (2006)
(ps,
pdf).
2003

Rapid Mixing of Several Markov Chains for a Hard-Core Model,

R. Kannan, M. W. Mahoney, and R. Montenegro,

Proc. of the 14th Annual ISAAC, 663-675 (2003)
(pdf).
2001

Quantum, Intramolecular Flexibility, and Polarizability Effects on the Reproduction of the Density Anomaly of Liquid Water by Simple Potential Functions,

M. W. Mahoney and W. L. Jorgensen,

J. Chem. Phys., 115, 10758-10768 (2001)
(pdf).

Rapid Estimation of Electronic Degrees of Freedom in Monte Carlo Calculations for Polarizable Models of Liquid Water,

M. W. Mahoney and W. L. Jorgensen,

J. Chem. Phys., 114, 9337-9349 (2001)
(pdf).

Diffusion Constant of the TIP5P Model of Liquid Water,

M. W. Mahoney and W. L. Jorgensen,

J. Chem. Phys., 114, 363-366 (2001)
(pdf).
2000

A Five-Site Model for Liquid Water and the Reproduction of the Density Anomaly by Rigid, Nonpolarizable Potential Functions,

M. W. Mahoney and W. L. Jorgensen,

J. Chem. Phys., 112, 8910-8922 (2000)
(pdf).
1997

Repression and Activation of Promoter-Bound RNA Polymerase Activity by Gal Repressor,

H. E. Choy, R. R. Hanger, T. Aki, M. Mahoney, K. Murakami, A. Ishihama, and S. Adhya,

J. Mol. Biol. 272: 293-300, 1997
(pdf).

Discrete Representations of the Protein Calpha Chain,

X. F. de la Cruz, M. W. Mahoney, and B. K. Lee,

Fold. & Des. 2: 223-234, 1997
(pdf).
MMDS Workshops
I run the MMDS meetings.
We started the
MMDS Workshops on
"Algorithms for Modern Massive Data Sets"
to address
algorithmic and statistical
challenges in modern large-scale statistical data analysis.

MMDS 2016
took place on the campus of UC Berkeley on June 21-24, 2016.
See the
main MMDS web page
for more information.

MMDS 2014
took place on the campus of UC Berkeley on June 17-20, 2014.
See the
main MMDS web page
for more information.
Click
here for the entire
video collection.

MMDS 2012
took place on the campus of Stanford University on July 10-13, 2012.
For pdfs and videos of the presentations, go to the main MMDS web
page; or click
here for the entire
video collection.

MMDS 2010
took place on the campus of Stanford University on June 15-18, 2010.
MMDS 2010 addressed computation in large-scale scientific and internet data
applications more generally.
See
the MMDS web page
for details, including articles and all the speaker overheads!

MMDS 2008
took place on June 25-28, 2008.
MMDS 2008 grew out of our expectation for what
the algorithmic and statistical
foundations of
large-scale
data analysis
should look like a generation from now.
Click
here
for an article that appeared in SIGKDD Explorations and
SIAM News about the meeting.

MMDS 2006
took place on June 21-24, 2006.
MMDS 2006 was originally motivated by the complementary perspectives
brought by numerical linear algebra and theoretical computer science to
matrix algorithms in large-scale data applications.
Click
here
for an article in SIAM News about the meeting.
These MMDS meetings generated intense interdisciplinary interest and were a
big success, so keep an eye out for future MMDSs!
Talks and presentations
Several recent talks:
Alchemist: An Apache Spark <=> MPI Interface
(Jun 2018)
(pdf)
Biomedicine & the Foundations of Data?
(May 2018)
(pdf)
Scientific Machine Learning with Alchemist (An Apache Spark <=> MPI Interface) and Beyond
(Apr 2018)
(pdf)
Numerically-intensive Machine Learning at Scale
(Fall 2017)
(pdf)
Second-order Machine Learning
(Fall 2017)
(pdf)
Several tutorial presentations:
Randomization in Numerical Linear Algebra: Theory and Practice
(2.0 hr version at SIAM ALA Meeting, October 2015)
(pdf)
Past, Present and Future of Randomized Numerical Linear Algebra:
(3.0 hr version at Simons' Institute 2013 Big Data Bootcamp)
(Part I:
pdf,
ppt
and Part II:
pdf,
ppt)
Theory (and some practice) of Randomized Algorithms for Matrices and Data
(tutorial from FOCS 2012 Workshop)
(pdf,
ppt)
Geometric Tools for Identifying Structure in Large Social and Information Networks
(1.5 hr version at SAMSI Opening Workshop 2010, etc.)
(pdf,
ppt)
Geometric Tools for Identifying Structure in Large Social and Information Networks
(2 hr version at ICASSP 2011, etc.)
(pdf,
ppt)
Geometric Tools for Identifying Structure in Large Social and Information Networks
(3 hr version at ICML 2010 and KDD 2010, etc.)
(pdf,
ppt)
(The pdf file in four pieces:
here,
here,
here,
and
here.)
Randomized Algorithms for Matrices and Massive Data Sets
(at SIAM-SDM06 2006 and VLDB 2006)
(ppt)
Randomized Algorithms for Matrices and Massive Data Sets
(at ACM-SIGKDD 2005)
(ppt)
Several other older talks:
UC Berkeley's FODA Institute: Foundations of Data Analysis
(NSF TRIPODS Kickoff, Oct 2017)
(pdf)
Second-order Machine Learning
(Pre-fall 2017)
(pdf)
Local graph analytics: beyond characterizing community structure
(Spring 2017)
(pdf)
Terabyte-scale Computational Statistics
(talk Fall 2016)
(pdf)
Scientific Matrix Factorizations in Spark at Scale: Cross-platform performance, scaling, and comparisons with C+MPI
(talk at 2016 Dato Data Science Summit and elsewhere)
(pdf)
Optimization Algorithms for Analyzing Large Datasets
(talk at 2016 PCMI Summer School)
(pdf)
Foundations of Data Science
(talk at NSF pre-TRIPODS workshop/meeting, Apr 2016)
(pdf)
Sub-Sampled Newton Methods
(talk at ITA 2016 and elsewhere)
(pdf)
Challenges in Multiresolution Methods for Graph-based Learning
(talk, 3-of-3, from NIPS15 Workshops)
(pdf)
Using Local Spectral Methods in Theory and in Practice
(talk, 2-of-3, from NIPS15 Workshops)
(pdf)
Column Subset Selection on Terabyte-sized Scientific Data
(talk, 1-of-3, from NIPS15 Workshops)
(pdf)
Linear and Sublinear Linear Algebra Algorithms: Preconditioning Stochastic Gradient Algorithms with Randomized Linear Algebra
(DIMACS, Aug 2015)
(pdf)
Overview of RandNLA: Randomized Numerical Linear Algebra
(pdf)
Treelike structure in social graphs
(pdf (big=28MB),
ppt (big=48MB))
Eigenvector localization, implicit regularization, and algorithmic antidifferentiation for large-scale graphs and network data
(pdf,
ppt)
Locally-biased and semi-supervised eigenvectors
(talk from MMDS 2014)
(pdf,
ppt)
Implicit regularization in sublinear approximation algorithms
(pdf,
ppt)
BIG Biomedicine and the Foundations of BIG Data Analysis (at Big Data in Biomedicine at Stanford's Medical School, 5/23/14)
(pdf,
ppt)
Revisiting the Nystrom Method for Improved Large-Scale Machine Learning
(pdf,
ppt)
Implementing Randomized Matrix Algorithms in Parallel and Distributed Environments
(version from Simons Big Data Workshop II)
(pdf)
Input-sparsity Time Algorithms for Embeddings and Regression Problems
(talk from Simons Big Data Workshop I)
(pdf)
Randomized Regression in Parallel and Distributed Environments
(talk from GraphLab 2013)
(pdf)
Extracting insight from large networks: implications of small-scale and large-scale structure
(pdf,
ppt)
Implementing Randomized Matrix Algorithms in Parallel and Distributed Environments
(version from MMDS 2012)
(pdf)
Sensors, networks, and massive data
(pdf,
ppt)
Randomized Algorithms for Matrices and Data
(pdf,
ppt)
Approximate computation and implicit regularization in large-scale data analysis
(PODS vsn)
(pdf,
ppt)
Approximate computation and implicit regularization in large-scale data analysis
(Stats vrsn1)
(pdf,
ppt)
Approximate computation and implicit regularization in large-scale data analysis
(Short vrsn)
(pdf,
ppt)
Looking for clusters in your data ... in theory and in practice
(pdf,
ppt)
Fast Approximation of Matrix Coherence and Statistical Leverage
(pdf,
ppt)
Implementing regularization implicitly via approximate eigenvector computation
(pdf,
ppt)
Linear Algebra and Machine Learning of Large Informatics Graphs
(pdf,
ppt)
Geometric Network Analysis Tools
(talk from MMDS 2010)
(pdf,
ppt)
Algorithmic and Statistical Perspectives on Large-Scale Data Analysis
(pdf,
ppt)
Community structure in large social and information networks
(newer)
(pdf,
ppt)
Statistical leverage and improved matrix algorithms
(newer and long)
(pdf,
ppt)
Approximation Algorithms as Experimental Probes of Informatics Graphs
(pdf,
ppt)
Community structure in large social and information networks
(talk from MMDS 2008)
(pdf,
ppt)
Community structure in large social and information networks
(older)
(pdf,
ppt)
Statistical leverage and improved matrix algorithms
(older and short)
(pdf,
ppt)
Sampling algorithms and coresets for Lp regression and applications
(pdf,
ppt)
CUR Matrix Decompositions for Improved Data Analysis (talk from MMDS 2006)
(pdf,
ppt)
A Relative-Error CUR Decomposition for Matrices and Its Data Applications
(pdf,
ppt)
Sampling Algorithms for L2 Regression and Applications
(talk from SODA 2006)
(pdf,
ppt)
Approximating a Gram Matrix for Improved Kernel-Based Learning
(talk from COLT 2005)
(ps,
pdf)
Fast Monte Carlo Algorithms for Matrix Operations and Massive Data Set Analysis
(newer)
(pdf,
ppt)
Fast Monte Carlo Algorithms for Matrix Operations and Massive Data Set Analysis
(older)
(pdf)
CUR Matrix Decomposition with Applications to Algorithm Design and Massive Data Set Analysis
(pdf)
Fast Monte Carlo Algorithms for Massive Data Sets and Approximating Max-Cut
(ps,
pdf)
The TIP5P Water talk:
The Computational Statistical Mechanics of Simple Models of Liquid Water
(pdf)
Videos of several talks:
"Biomedicine & the Foundations of Data?"
(at the BD2K Guide to the Fundamentals of Data Science seminar series).
"Matrix factorizations at scale: a comparison of scientific data analytics on Spark and MPI"
(at the 2016 Dato Data Science Summit).
"Challenges in Multiresolution Methods for Graph-based Learning"
(at the NIPS Workshops, December 2015).
"Using Local Spectral Methods in Theory and in Practice"
(at the NIPS Workshops, December 2015).
"Linear and Sublinear Linear Algebra Algorithms: Preconditioning Stochastic Gradient Algorithms with Randomized Linear Algebra"
(at DIMACS, August 2015).
"Eigenvector localization and implicit regularization for large-scale graphs and networked data"
(at CIMAT, March 2015).
"Locally-biased and semi-supervised eigenvectors"
(at MMDS 2014, June 2014).
"Implementing Randomized Matrix Algorithms in Parallel and Distributed Environments"
(at the SF Bay Area DMSIG Meeting, June 2, 2014).
"BIG Biomedicine and the Foundations of BIG Data Analysis"
(at the Big Data in BioMedicine Conference at Stanford University, May 2014).
"Randomized Matrix Algorithms and Large-scale Scientific Data Analysis"
(at the 2014 ASE Conference, May 31, 2014).
"Eigenvector localization, implicit regularization, and algorithmic antidifferentiation for large-scale graphs and networked data"
(or try also here for another version)
(at ICERM and IMA, May and April 2014, respectively).
"Randomized matrix algorithms and large-scale scientific data analysis"
(at the University of Michigan, April 2014).
"Recent Results in Randomized Numerical Linear Algebra"
(or click here)
(at the NIPS 2013 Workshops, December 2013).
"Implementing Randomized Matrix Algorithms in Parallel and Distributed Environments"
(at the Simons Big Data Workshop II, October 2013).
"Input-sparsity Time Algorithms for Embeddings and Regression Problems"
(at the Simons Big Data Workshop I, September 2013).
"Past, Present and Future of Randomized Numerical Linear Algebra:
Part I (PD) and
Part II" (MM)
(at the Simons Big Data Bootcamp, September 2013).
"Randomized Regression in Parallel and Distributed Environments"
(at the GraphLab Workshop, July 2013).
"Sensors, networks, and massive data"
or click
here
(at Kavli FoS, November 2012).
"Implementing Randomized Matrix Algorithms in Parallel and Distributed Environments"
(at MMDS 2012, July 2012).
"Extracting insight from large networks: implications of small-scale and large-scale structure"
(at the University of Maryland, April 2012).
"Fast Approximation of Matrix Coherence and Statistical Leverage"
(at the NIPS Workshops, December 2011).
"Approximate computation and implicit regularization in large-scale data analysis"
or click
here
(at the Workshop on Beyond Worst-Case Analysis, September 2011).
"Linear Algebra and Machine Learning of Large Informatics Graphs"
(at the NIPS Workshops, December 2010).
"Geometric Tools for Identifying Structure in Large Social and Information Networks"
(90 minute version, tutorial at SAMSI Opening workshop, August 2010).
"Geometric Tools for Graph Mining of Large Social and Information Networks"
(3 hour version, tutorial at KDD 2010, July 2010).
"Community Structure in Large Social and Information Networks"
(at the Newton Institute in Cambridge, June 2010).
"Algorithmic and Statistical Perspectives on Large-Scale Data Analysis"
(at the SF Bay Area DMSIG Meeting, February 2010).
"Community Structure in Large Social and Information Networks,"
in avi
or mpg (note: slow to download),
(at the 2009 IIT Kanpur Processing Massive Data Sets Workshop, December 2009).
"Statistical Leverage and Improved Matrix Algorithms"
(at the 2009 ICML Workshops, June 2009).
"Sampling Algorithms and Coresets for Lp Regression and Applications,"
in mpg (note: slow to download),
(at the 2006 IIT Kanpur Data Streams Workshop, December 2006).
"CUR Matrix Decompositions for Improved Data Analysis"
(at Johns Hopkins University, March 2006).
"Fast Monte Carlo Algorithms for Matrix Operations and Massive Data Set Analysis"
(at the 2005 IPAM Summer School, July 2005).
My old web page
at Yale has additional (somewhat outdated) information;
and my less old web page
at Stanford has additional (somewhat less outdated) information;
but you should click on them (or Google search for them) if you are interested in
extremely large homes.
