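In practice, you often need to compute the center of several latitude/longitude points. There are three common ways to do this, each suited to a different need.
Geographic midpoint
Finding the geographic midpoint is straightforward: convert each latitude/longitude pair to x, y, z coordinates, find the center of those points in the 3D coordinate system, and convert that center back to a longitude and latitude.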
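    from math import cos, sin, atan2, sqrt, radians, degrees


    def get_centroid(cluster):
        """Geographic midpoint of a sequence of (lat, lon) pairs in degrees."""
        x = y = z = 0.0
        coord_num = len(cluster)
        for coord in cluster:
            lat, lon = radians(coord[0]), radians(coord[1])
            # Convert each point to a unit vector on the sphere and accumulate.
            x += cos(lat) * cos(lon)
            y += cos(lat) * sin(lon)
            z += sin(lat)
        x /= coord_num
        y /= coord_num
        z /= coord_num
        # Note the output order: [lon, lat], unlike the (lat, lon) input.
        return [degrees(atan2(y, x)), degrees(atan2(z, sqrt(x * x + y * y)))]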
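Written as formulas, the conversion above looks roughly as follows. The symbols are chosen here for illustration and do not come from the original code: φ is latitude and λ is longitude, both in radians, and n is the number of points.

```latex
\begin{aligned}
x_i &= \cos\varphi_i \cos\lambda_i, \quad
y_i = \cos\varphi_i \sin\lambda_i, \quad
z_i = \sin\varphi_i \\
\bar{x} &= \frac{1}{n}\sum_{i=1}^{n} x_i, \quad
\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i, \quad
\bar{z} = \frac{1}{n}\sum_{i=1}^{n} z_i \\
\lambda_c &= \operatorname{atan2}(\bar{y}, \bar{x}), \quad
\varphi_c = \operatorname{atan2}\!\left(\bar{z}, \sqrt{\bar{x}^2 + \bar{y}^2}\right)
\end{aligned}
```

The last line gives the midpoint's longitude and latitude in radians; converting them to degrees yields the values returned by the function.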
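Average latitude and longitude
The average latitude/longitude treats the coordinates as if they were planar and simply averages the latitudes and the longitudes. Note: this is only a rough estimate and is suitable only for points within 400 km of each other.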
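    def get_geo_mid(data):
        """Arithmetic mean of a sequence of (lat, lon) pairs in degrees."""
        x = y = 0.0
        coord_num = len(data)
        for coord in data:
            lat = coord[0]
            lon = coord[1]
            x += lat
            y += lon
        x /= coord_num
        y /= coord_num
        # Return the averages (x, y), not the lat/lon of the last point visited.
        return x, y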
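One case where the plain average can mislead, even for nearby points, is a pair of points on either side of the ±180° meridian. The sketch below compares the two approaches on such a pair; `mean_latlon` and `sphere_midpoint` are hypothetical helper names introduced for this illustration, and both return (lat, lon).

```python
from math import atan2, cos, degrees, radians, sin, sqrt


def mean_latlon(points):
    """Plain arithmetic mean of (lat, lon) pairs in degrees."""
    n = len(points)
    return sum(p[0] for p in points) / n, sum(p[1] for p in points) / n


def sphere_midpoint(points):
    """Midpoint via averaged unit vectors; returns (lat, lon) in degrees."""
    x = y = z = 0.0
    for lat_deg, lon_deg in points:
        lat, lon = radians(lat_deg), radians(lon_deg)
        x += cos(lat) * cos(lon)
        y += cos(lat) * sin(lon)
        z += sin(lat)
    n = len(points)
    x, y, z = x / n, y / n, z / n
    return degrees(atan2(z, sqrt(x * x + y * y))), degrees(atan2(y, x))


# Two points 2 degrees of longitude apart, straddling the antimeridian.
points = [(10.0, 179.0), (10.0, -179.0)]
flat = mean_latlon(points)           # longitude averages to 0
spherical = sphere_midpoint(points)  # longitude lands at +/-180
print(flat, spherical)
```

The averaged longitude ends up on the opposite side of the globe, while the 3D midpoint stays between the two inputs.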
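Minimum-distance point
The minimum-distance point is the one among the given points whose distance to all the other points is smallest; it is often used in routing-related scenarios. A fairly simple approach is to run K-Means with K set to 1. Note that the KMeans bundled with Scikit Learn uses Euclidean distance by default and does not support a custom distance. The workaround is to implement it yourself: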
    from math import radians, sin, cos, asin, sqrt

    import numpy as np
    import pandas as pd


    def haversine(latlon1, latlon2):
        """Distance in meters between two (lat, lon) NumPy arrays given in degrees."""
        # .any() rather than .all(): the points differ if either coordinate differs.
        if (latlon1 - latlon2).any():
            lat1, lon1 = latlon1
            lat2, lon2 = latlon2
            lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
            dlon = lon2 - lon1
            dlat = lat2 - lat1
            a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
            c = 2 * asin(sqrt(a))
            r = 6370996.81  # Earth radius, in meters
            distance = c * r
        else:
            distance = 0
        return distance


    # KMeans implementation - start
    def haversine_distance_matrix(X, Y=None):
        """Haversine distance matrix between the rows of X and Y ([lon, lat] columns)."""
        if Y is None:
            Y = X
        # Swap the columns to (lat, lon), the order haversine() expects.
        return np.apply_along_axis(
            lambda a, b: np.apply_along_axis(haversine, 1, b, a),
            1, X[:, [1, 0]], Y[:, [1, 0]])


    def initialize_centroids(points, k):
        """Return k centroids picked at random from the initial points."""
        centroids = points.copy()
        np.random.shuffle(centroids)
        return centroids[:k]


    def move_centroids(points, closest, centroids):
        """Return the new centroids: the mean of the points assigned to each one."""
        new_centroids = [points[closest == k].mean(axis=0)
                         for k in range(centroids.shape[0])]
        for i, c in enumerate(new_centroids):
            # A centroid with no assigned points yields NaN; keep its old position.
            if np.isnan(c).any():
                new_centroids[i] = centroids[i]
        return np.array(new_centroids)


    def closest_centroid_haversine(points, centroids):
        """Return, for each point, the index of its nearest centroid."""
        distances = haversine_distance_matrix(centroids, points)
        return np.argmin(distances, axis=0)


    def clustering_by_kmeams(df, n_clusters=1, max_iter=300):
        """
        Entry point of the KMeans clustering algorithm.
        :param df: DataFrame with 'lon' and 'lat' columns
        :return: DataFrame of centroids with 'lon' and 'lat' columns
        """
        # DataFrame.as_matrix() was removed in pandas 1.0; to_numpy() replaces it.
        X = df[['lon', 'lat']].to_numpy()
        centroids = initialize_centroids(X, n_clusters)
        old_centroids = centroids
        i = 0
        while i < max_iter:
            i += 1
            # print("Iteration #{0:d}".format(i))
            closest = closest_centroid_haversine(X, centroids)
            centroids = move_centroids(X, closest, centroids)
            done = np.all(np.isclose(old_centroids, centroids))
            if done:
                break
            old_centroids = centroids
        cdf = pd.DataFrame(centroids, columns=['lon', 'lat'])
        # k_means_labels = closest_centroid_haversine(X, centroids)
        # df['assigned_points'] = k_means_labels + 1
        return cdf
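The definition at the start of this section (the given point whose distance to all the others is smallest) can also be computed directly, without K-Means. The sketch below is a brute-force alternative, not the K-Means approach above; this kind of point is often called a medoid. The names `haversine_m` and `min_distance_point` and the 6,371 km mean radius are assumptions made for this illustration.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6371000.0  # mean Earth radius in meters (assumed value)


def haversine_m(p, q):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (p[0], p[1], q[0], q[1]))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    # min() guards against rounding pushing sqrt(a) slightly above 1.
    return 2 * EARTH_RADIUS_M * asin(min(1.0, sqrt(a)))


def min_distance_point(points):
    """Return (index, point) of the input point with the smallest summed distance."""
    totals = [sum(haversine_m(p, q) for q in points) for p in points]
    best = min(range(len(points)), key=totals.__getitem__)
    return best, points[best]


# Four corners of a 2-degree box plus its center; the center should win.
points = [(30.0, 120.0), (30.0, 122.0), (32.0, 120.0), (32.0, 122.0), (31.0, 121.0)]
index, point = min_distance_point(points)
print(index, point)
```

This compares every pair of points, so the cost grows quadratically with the number of points; for large inputs a different strategy may be needed.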
A question: if K-Means is run with a single cluster, isn't the center simply the arithmetic mean? What is the point of the third method?