This document discusses various methods for calculating Wasserstein distance between probability distributions, including:
- Sliced Wasserstein distance, which projects distributions onto lower-dimensional spaces to enable efficient 1D optimal transport calculations.
- Max-sliced Wasserstein distance, which replaces averaging over many random projections with a search for the single most discriminative projection direction.
- Generalized sliced Wasserstein distance, which replaces the linear projections of the standard Radon transform with more flexible nonlinear projection functions via a generalized Radon transform.
- Augmented sliced Wasserstein distance, which applies a learned transformation to the samples before projecting, allowing more expressive comparisons between distributions.
These sliced/generalized Wasserstein distances have been used as loss functions for generative models with promising results; a minimal sketch of the basic slicing computation is given below.
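As a rough illustration of the slicing idea, the sketch below (a minimal NumPy version, not any specific paper's implementation) estimates the sliced Wasserstein-p distance between two equal-size point clouds by averaging 1D optimal transport costs over random projection directions; all function and variable names here are illustrative.

```python
import numpy as np

def sliced_wasserstein(X, Y, n_projections=100, p=2, seed=None):
    """Monte Carlo estimate of the sliced Wasserstein-p distance between two
    empirical distributions X and Y, each of shape [n_samples, dim].
    Assumes X and Y contain the same number of samples, so 1D optimal
    transport reduces to matching sorted projections."""
    rng = np.random.default_rng(seed)
    dim = X.shape[1]
    # Draw random projection directions uniformly from the unit sphere.
    directions = rng.normal(size=(n_projections, dim))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    # Project both point clouds onto each direction: shape [n_samples, n_projections].
    X_proj = X @ directions.T
    Y_proj = Y @ directions.T
    # In 1D, optimal transport between equal-size empirical measures is found
    # by sorting; the cost is the mean p-th power distance between sorted values.
    X_sorted = np.sort(X_proj, axis=0)
    Y_sorted = np.sort(Y_proj, axis=0)
    costs = np.mean(np.abs(X_sorted - Y_sorted) ** p, axis=0)
    return np.mean(costs) ** (1.0 / p)

# Example: two 2D Gaussians with shifted means.
rng = np.random.default_rng(0)
X = rng.normal(loc=0.0, size=(500, 2))
Y = rng.normal(loc=1.0, size=(500, 2))
print(sliced_wasserstein(X, Y, n_projections=200))
```

The max-sliced variant would replace the average over random directions with an optimization over a single direction, and the generalized/augmented variants would replace the linear projection `X @ directions.T` with a nonlinear or learned map before sorting.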
This document summarizes a research paper on scaling laws for neural language models. Some key findings of the paper include:
- Language model performance depends strongly on model scale and only weakly on model shape. With enough compute and data, performance improves smoothly as a power law in model size, dataset size, and compute (a toy illustration of this functional form follows the list below).
- Overfitting follows a universal pattern, with the penalty depending predictably on the ratio of model size to dataset size.
- Large models are more sample efficient, reaching the same performance with fewer optimization steps and fewer data points.
- The paper motivated subsequent work by OpenAI on applying scaling laws to other domains like computer vision and developing increasingly large language models like GPT-3.
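To make the power-law claim concrete, here is a small sketch of a loss-versus-parameters curve of the form L(N) = (N_c / N)^alpha. The constants below only illustrate the general shape of such a fit and should not be read as the paper's exact values.

```python
def power_law_loss(n_params, n_c=8.8e13, alpha=0.076):
    """Illustrative power-law loss curve L(N) = (N_c / N)**alpha.
    The constants are placeholders of the same general form as fits reported
    in the scaling-laws literature, used here only to show the trend."""
    return (n_c / n_params) ** alpha

# Loss falls smoothly (and slowly) as the parameter count grows by orders of magnitude.
for n in [1e6, 1e8, 1e10, 1e12]:
    print(f"N = {n:.0e}  ->  L(N) ~ {power_law_loss(n):.3f}")
```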
The document discusses deep kernel learning, which combines deep learning with Gaussian processes (GPs). It briefly reviews the GP predictive equations and marginal likelihood, noting their computational cost. A GP models a dataset of input vectors and target values by placing a joint Gaussian distribution over the targets, specified by a mean function and a covariance kernel; the predictive distribution at test points is again Gaussian. The goal of deep kernel learning is to leverage recent work on efficiently representing kernel functions to produce scalable deep kernels that outperform both standalone deep networks and standard GPs on a range of datasets.
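As a rough sketch of the construction (a minimal NumPy example, not the paper's architecture or its scalable kernel approximations), a deep kernel composes a base kernel with a learned feature map, and GP prediction then uses the standard Gaussian conditional formulas on those features. The two-layer MLP and its random weights below are purely illustrative.

```python
import numpy as np

def feature_map(X, W1, b1, W2, b2):
    """Toy 'deep' feature extractor g(x; w): a two-layer tanh MLP.
    In deep kernel learning these weights would be learned jointly with the
    kernel hyperparameters by maximizing the GP marginal likelihood."""
    H = np.tanh(X @ W1 + b1)
    return np.tanh(H @ W2 + b2)

def rbf_kernel(A, B, lengthscale=1.0):
    """Base RBF kernel evaluated on (deep) feature representations."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

def gp_predict(X_train, y_train, X_test, params, noise=1e-2):
    """Standard GP predictive mean and variance, with the kernel applied to
    deep features g(x) instead of the raw inputs."""
    Z_train = feature_map(X_train, *params)
    Z_test = feature_map(X_test, *params)
    K = rbf_kernel(Z_train, Z_train) + noise * np.eye(len(X_train))
    K_s = rbf_kernel(Z_test, Z_train)
    K_ss = rbf_kernel(Z_test, Z_test)
    K_inv_y = np.linalg.solve(K, y_train)
    mean = K_s @ K_inv_y
    cov = K_ss - K_s @ np.linalg.solve(K, K_s.T)
    return mean, np.diag(cov)

# Tiny example with random (untrained) feature-map weights.
rng = np.random.default_rng(0)
X_train = rng.uniform(-3, 3, size=(20, 1))
y_train = np.sin(X_train).ravel() + 0.05 * rng.normal(size=20)
params = (rng.normal(size=(1, 8)), np.zeros(8),
          rng.normal(size=(8, 2)), np.zeros(2))
mean, var = gp_predict(X_train, y_train, np.linspace(-3, 3, 5)[:, None], params)
print(mean, var)
```

The scalability question the document alludes to arises because the exact solve against K costs O(n^3) in the number of training points, which is what the referenced work on efficient kernel representations aims to reduce.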