Omics - Hunter

UMAP (Uniform Manifold Approximation and Projection)

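Any discussion of UMAP has to mention t-SNE. UMAP is a dimensionality reduction algorithm similar to t-SNE.

In brief, UMAP has two main advantages:

  • It performs nonlinear dimensionality reduction and is fast (much faster than t-SNE, especially on high-dimensional data)
  • It supports more variables / higher dimensions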
[Figure: UMAP's performance on various datasets]

Below we show how to use UMAP from R:

Step 1. Install the UMAP package (uwot)

install.packages("uwot")

Step 2. An example

library(uwot)

# See function man page for help
?umap

# Non-numeric columns are ignored, so in a lot of cases you can pass a data
# frame directly to umap
iris_umap <- umap(iris, n_neighbors = 50, learning_rate = 0.5, init = "random")

# Load mnist from somewhere, e.g.
# devtools::install_github("jlmelville/snedata")
# mnist <- snedata::download_mnist()
mnist_umap <- umap(mnist, n_neighbors = 15, min_dist = 0.001, verbose = TRUE)

# For high dimensional datasets (> 100-1000 columns) using PCA to reduce 
# dimensionality is highly recommended to avoid the nearest neighbor search 
# taking a long time. Keeping only 50 dimensions can speed up calculations 
# without affecting the visualization much
mnist_umap <- umap(mnist, pca = 50)

# Use a specific number of threads
mnist_umap <- umap(mnist, n_neighbors = 15, min_dist = 0.001, verbose = TRUE, n_threads = 8)

# Use a different metric
mnist_umap_cosine <- umap(mnist, n_neighbors = 15, metric = "cosine", min_dist = 0.001, verbose = TRUE, n_threads = 8)

# If you are only interested in visualization, `fast_sgd = TRUE` gives a much faster optimization
mnist_umap_fast_sgd <- umap(mnist, n_neighbors = 15, metric = "cosine", min_dist = 0.001, verbose = TRUE, fast_sgd = TRUE)

# Supervised dimension reduction
mnist_umap_s <- umap(mnist, n_neighbors = 15, min_dist = 0.001, verbose = TRUE, n_threads = 8, 
                     y = mnist$Label, target_weight = 0.5)

# Add new points to an existing embedding
mnist_train <- head(mnist, 60000)
mnist_test <- tail(mnist, 10000)

# You must set ret_model = TRUE to return extra data we need
# coordinates are in mnist_train_umap$embedding
mnist_train_umap <- umap(mnist_train, verbose = TRUE, ret_model = TRUE)
mnist_test_umap <- umap_transform(mnist_test, mnist_train_umap, verbose = TRUE)

# Save the nearest neighbor data
mnist_nn <- umap(mnist, ret_nn = TRUE)
# coordinates are now in mnist_nn$embedding

# Re-use the nearest neighbor data and save a lot of time
mnist_nn_spca <- umap(mnist, nn_method = mnist_nn$nn, init = "spca")

# No problem to have ret_nn = TRUE and ret_model = TRUE at the same time

# Calculate Petal and Sepal neighbors separately (uses intersection of the resulting sets):
iris_umap <- umap(iris, metric = list("euclidean" = c("Sepal.Length", "Sepal.Width"),
                                      "euclidean" = c("Petal.Length", "Petal.Width")))
# Can also use individual factor columns
iris_umap <- umap(iris, metric = list("euclidean" = c("Sepal.Length", "Sepal.Width"),
                                      "euclidean" = c("Petal.Length", "Petal.Width"),
                                      "categorical" = "Species"))

Step 3. Plotting

In this example, UMAP separates the classes better than t-SNE: the groups sit further apart.
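The plotting step can be sketched with base R graphics. This is a minimal illustration, not the code behind the original figure: it reuses the iris call from Step 2, and the fixed seed, colors and axis labels are assumptions chosen here.

```r
library(uwot)

set.seed(42)  # fix R's RNG so the layout is repeatable
# Non-numeric columns (Species) are ignored by umap()
iris_umap <- umap(iris, n_neighbors = 50, learning_rate = 0.5, init = "random")

# iris_umap is a matrix with one row per flower and one column per dimension
plot(iris_umap, col = as.integer(iris$Species), pch = 19,
     xlab = "UMAP 1", ylab = "UMAP 2", main = "UMAP embedding of iris")
legend("topright", legend = levels(iris$Species),
       col = seq_along(levels(iris$Species)), pch = 19)
```

Coloring by the known species makes it easy to judge by eye how well the groups are separated.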

The umap function

umap(X, n_neighbors = 15, n_components = 2, metric = "euclidean",
  n_epochs = NULL, learning_rate = 1, scale = FALSE,
  init = "spectral", init_sdev = NULL, spread = 1, min_dist = 0.01,
  set_op_mix_ratio = 1, local_connectivity = 1, bandwidth = 1,
  repulsion_strength = 1, negative_sample_rate = 5, a = NULL,
  b = NULL, nn_method = NULL, n_trees = 50, search_k = 2 *
  n_neighbors * n_trees, approx_pow = FALSE, y = NULL,
  target_n_neighbors = n_neighbors, target_metric = "euclidean",
  target_weight = 0.5, pca = NULL, pca_center = TRUE,
  pcg_rand = TRUE, fast_sgd = FALSE, ret_model = FALSE,
  ret_nn = FALSE, n_threads = max(1,
  RcppParallel::defaultNumThreads()/2), n_sgd_threads = 0,
  grain_size = 1, tmpdir = tempdir(), verbose = getOption("verbose",
  TRUE))

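n_neighbors: the number of neighboring points considered for each point, usually set between 2 and 100. Smaller values emphasize local structure; larger values emphasize global structure.

n_components: the number of dimensions of the embedding. The default is 2, and values between 2 and 100 are recommended.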
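As a small sketch of how these two arguments are used in uwot, the calls below embed iris with a small and a large neighborhood and with two and three output dimensions. The specific values (10, 100, 3) and the seed are arbitrary choices for illustration.

```r
library(uwot)

set.seed(42)
# Small neighborhood, 2-D output: emphasizes fine local structure
iris_local <- umap(iris, n_neighbors = 10, n_components = 2)
# Large neighborhood, 3-D output: emphasizes broader structure
iris_global <- umap(iris, n_neighbors = 100, n_components = 3)

# The result has one row per observation and n_components columns
dim(iris_local)
dim(iris_global)
```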

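metric: the distance measure. There are many to choose from, and the right one depends on your data. The following list comes from the Python implementation (reference 4): euclidean, manhattan, chebyshev, minkowski, canberra, braycurtis, mahalanobis, wminkowski, seuclidean, cosine, correlation, haversine, hamming, jaccard, dice, russelrao, kulsinski, rogerstanimoto, sokalmichener, sokalsneath, yule. The uwot examples above use "euclidean", "cosine" and "categorical"; see ?umap for the metrics your installed version supports.

n_epochs: the number of training iterations. When left at the default of NULL, 200 is used for large datasets and 500 for small ones.

input: the type of input data, in implementations that have this setting (it is not part of the uwot signature above). With "data" the input is treated as a data matrix; with "dist" it is treated as a distance matrix.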
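The uwot signature above has no `input` argument. As far as I know, uwot instead accepts a `dist` object directly as `X`; treat that as an assumption and confirm it in ?umap for your version. A minimal sketch:

```r
library(uwot)

set.seed(1)
# Pairwise Euclidean distances between the four numeric iris columns
d <- dist(iris[, 1:4])

# Pass the distance object in place of the data frame
iris_from_dist <- umap(d, n_neighbors = 15)
dim(iris_from_dist)
```

This is useful when only distances are available, or when the distance you need is not one of the built-in metrics.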

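init: how the embedding is initialized. Options include "spectral" (the default in the signature above), "random", or a user-supplied matrix of starting coordinates.

min_dist: controls how tightly points may be packed in the embedding; the smaller the value, the more tightly points cluster. The Python implementation defaults to 0.1, while the uwot signature above defaults to 0.01.

set_op_mix_ratio: controls how the local fuzzy sets of neighbors are combined, with values from 0 to 1. 0 means a pure (fuzzy) intersection, 1 means a pure (fuzzy) union, and values in between blend the two in proportion.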
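The blending can be illustrated on a toy matrix in base R. The sketch below follows the symmetrization formula used in the Python reference implementation (union as A + t(A) - A * t(A), intersection as A * t(A)); the matrix values and the helper name `symmetrize` are made up for this example.

```r
# Toy directed membership strengths: A[i, j] is how strongly point i
# counts point j as a neighbor (values invented for illustration)
A <- matrix(c(0.0, 1.0, 0.2,
              0.6, 0.0, 0.0,
              0.0, 0.9, 0.0), nrow = 3, byrow = TRUE)

# Hypothetical helper: blend fuzzy union and fuzzy intersection
symmetrize <- function(A, set_op_mix_ratio = 1) {
  inter <- A * t(A)          # fuzzy intersection (element-wise product)
  uni <- A + t(A) - inter    # fuzzy union (probabilistic sum)
  set_op_mix_ratio * uni + (1 - set_op_mix_ratio) * inter
}

symmetrize(A, 1)    # pure union: an edge survives if either side has it
symmetrize(A, 0)    # pure intersection: an edge survives only if both sides have it
symmetrize(A, 0.5)  # halfway between the two
```

With the intersection, the one-sided link between points 1 and 3 disappears, which is why low values of set_op_mix_ratio give a sparser graph.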

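local_connectivity: the number of nearest neighbors assumed to be connected at the local level; the default is 1. Larger values make the manifold more locally connected, and the value should not exceed the local intrinsic dimension of the manifold.

bandwidth: the effective bandwidth of the kernel used when constructing the fuzzy sets; the default is 1.

alpha: the learning rate. It corresponds to learning_rate in the Python implementation and in the uwot signature above.

gamma: a learning-rate parameter for the layout optimization. Like alpha, it is not part of the uwot signature above.

negative_sample_rate: the number of negative samples drawn for each positive sample during optimization; the default is 5. Larger values apply more repulsion and increase the optimization cost.

spread: the effective scale of the embedded points. It is used together with min_dist to determine how clustered the embedding is.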
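To make the link between spread, min_dist and the a and b arguments in the signature concrete, here is a base R sketch of the curve fit used by UMAP implementations when a and b are left NULL: a smooth curve 1 / (1 + a * x^(2b)) is fitted to a target that is flat up to min_dist and then decays exponentially at scale spread. The helper name `fit_ab` and the grid of 300 points are assumptions of this sketch, not uwot's exported API.

```r
# Hypothetical helper: derive a and b from spread and min_dist
fit_ab <- function(spread = 1, min_dist = 0.01) {
  x <- seq(0, 3 * spread, length.out = 300)
  # Target: membership 1 within min_dist, exponential decay beyond it
  y <- ifelse(x <= min_dist, 1, exp(-(x - min_dist) / spread))
  # Nonlinear least squares fit of the smooth approximation
  fit <- nls(y ~ 1 / (1 + a * x^(2 * b)), start = list(a = 1, b = 1))
  coef(fit)
}

ab <- fit_ab(spread = 1, min_dist = 0.01)
ab

# Compare the target and the fitted curve
x <- seq(0, 3, length.out = 300)
plot(x, ifelse(x <= 0.01, 1, exp(-(x - 0.01))), type = "l",
     xlab = "distance in embedding", ylab = "membership strength")
lines(x, 1 / (1 + ab[["a"]] * x^(2 * ab[["b"]])), lty = 2)
```

A smaller min_dist makes the curve drop off sooner, so points can sit closer together; a larger spread stretches the whole curve and spaces the clusters out.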

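random_state: the random seed, used to make the model reproducible. In the Python implementation, if it is not set the seed is drawn from np.random, so every run gives a different result.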
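The uwot signature above has no random_state argument. A common way to get repeatable results in R is to call set.seed() before umap(); the sketch below assumes that uwot draws its random numbers from R's generator and that the single-threaded optimizer (n_sgd_threads = 0, the default shown above) is in use.

```r
library(uwot)

# Same seed before each call
set.seed(42)
emb1 <- umap(iris, n_neighbors = 15)

set.seed(42)
emb2 <- umap(iris, n_neighbors = 15)

# Under the assumptions above the two embeddings should match
all.equal(emb1, emb2)
```

If the two runs differ on your system, check whether multi-threaded optimization is enabled, since that typically trades reproducibility for speed.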

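transform_seed: the random seed used for transform operations. The default is 42.

verbose: controls whether progress messages are logged; turn it off to avoid excessive log output.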

References:

1.https://cran.r-project.org/web/packages/umap/index.html

2.https://github.com/tkonopka/umap

3.https://arxiv.org/abs/1802.03426

4.https://github.com/lmcinnes/umap

5.https://blog.csdn.net/qq_36810544/article/details/81094469

6.https://github.com/jlmelville/uwot


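Author: 陈浩


Copyright: this article is copyrighted by its author


Disclaimer: some images used in this article come from the internet or from the references. In case of infringement, please contact the author at chenhao__@__evvail.com (remove the underscores before sending) to have them removed


Reposting: unless stated otherwise, all content on this site is the author's original work; reposts must cite this article with a link


Link to this article: https://evvail.com/2019/08/18/78.html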
