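Yu-Chin Juan, the author of FFM, open-sourced the C++ implementation, libffm, on GitHub. Since my day-to-day data processing is done in Python, I wanted a Python version of FFM. There are many such projects on GitHub, for example this one: A Python wrapper for LibFFM.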
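Installing libffm on Windows + Anaconda
Installing the libffm-python package
On Windows, the project is installed as follows:
- Download the project and unpack it.
- Install the mingw32 environment: conda install mingw32
- Add the mingw32 path to the PATH environment variable: C:\RBuildTools\3.5\mingw_32\bin
- Edit Python's compiler settings in D:\ProgramData\Anaconda3\Lib\distutils\distutils.cfg (create the file if it does not exist) and add the following:
  [build]
  compiler=mingw32
- In the project directory, run: python setup.py install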
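The distutils.cfg step is easy to get wrong by hand, so here is a minimal sketch that writes the setting with Python's standard configparser module. The helper name write_distutils_cfg is my own, and the path in the usage note is simply the one from my machine; treat both as assumptions.

```python
import configparser
from pathlib import Path


def write_distutils_cfg(cfg_path, compiler="mingw32"):
    """Create or update distutils.cfg so that [build] compiler=<compiler>."""
    cfg_path = Path(cfg_path)
    parser = configparser.ConfigParser()
    if cfg_path.exists():
        # Keep any settings that are already in the file
        parser.read(cfg_path, encoding="utf-8")
    if not parser.has_section("build"):
        parser.add_section("build")
    parser.set("build", "compiler", compiler)
    # Make sure the target folder exists before writing
    cfg_path.parent.mkdir(parents=True, exist_ok=True)
    with cfg_path.open("w", encoding="utf-8") as fh:
        parser.write(fh)
    return cfg_path
```

For the setup above, the call would be write_distutils_cfg(r"D:\ProgramData\Anaconda3\Lib\distutils\distutils.cfg"). Existing sections in the file are preserved; only the compiler key under [build] is overwritten.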
However, importing the package then fails with the following error (WinError 87, "The parameter is incorrect"):
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-2-244abf364e9b> in <module>
----> 1 import ffm

D:\ProgramData\Anaconda3\lib\site-packages\ffm-7e8621d-py3.6-win-amd64.egg\ffm\__init__.py in <module>
----> 1 from .ffm import FFMData, FFM, read_model

D:\ProgramData\Anaconda3\lib\site-packages\ffm-7e8621d-py3.6-win-amd64.egg\ffm\ffm.py in <module>
     70 FFM_Problem_ptr = ctypes.POINTER(FFM_Problem)
     71
---> 72 _lib = ctypes.cdll.LoadLibrary(get_lib_path())
     73
     74 _lib.ffm_convert_data.restype = FFM_Problem

D:\ProgramData\Anaconda3\lib\ctypes\__init__.py in LoadLibrary(self, name)
    424
    425     def LoadLibrary(self, name):
--> 426         return self._dlltype(name)
    427
    428 cdll = LibraryLoader(CDLL)

D:\ProgramData\Anaconda3\lib\ctypes\__init__.py in __init__(self, name, mode, handle, use_errno, use_last_error)
    346
    347         if handle is None:
--> 348             self._handle = _dlopen(self._name, mode)
    349         else:
    350             self._handle = handle

OSError: [WinError 87] The parameter is incorrect.
The root cause is that the Windows installation never compiled the libffm.so shared library, so the install had in fact failed.
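A quick way to confirm this diagnosis is to list the compiled binaries inside the installed package folder. The sketch below is only a rough check: the helper name find_compiled_libs and the set of suffixes are my own choices, and it assumes the wrapper keeps its shared library directly in the ffm package directory.

```python
from pathlib import Path

# Suffixes a compiled extension or shared library usually has on each platform
BINARY_SUFFIXES = {".so", ".pyd", ".dll", ".dylib"}


def find_compiled_libs(package_dir):
    """Return the names of compiled binaries found directly inside package_dir."""
    package_dir = Path(package_dir)
    return sorted(
        p.name
        for p in package_dir.iterdir()
        if p.is_file() and p.suffix.lower() in BINARY_SUFFIXES
    )
```

Pointing it at the ffm folder inside the installed egg (the one shown in the traceback) and getting an empty list back means the build step produced no library for ctypes to load.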
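Compiling libffm on Windows
Since the Python package did not work, I decided to compile the C++ code directly. According to the project README, v1.21 is the latest libffm version that supports Windows: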
Building Windows Binaries
=========================

The Windows part is maintained by different maintainer, so it may not always support the latest version.

The latest version it supports is: v1.21

To build them via command-line tools of Visual C++, use the following steps:

1. Open a DOS command box (or Developer Command Prompt for Visual Studio) and go to LIBFFM directory. If environment variables of VC++ have not been set, type

"C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\bin\amd64\vcvars64.bat"

You may have to modify the above command according which version of VC++ or where it is installed.

2. Type

nmake -f Makefile.win clean all
Following the steps above, the first error I hit was that nmake could not be found:
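nmake : The term 'nmake' is not recognized as the name of a cmdlet, function, script file, or operable program. Check the spelling of the name, or if a path was included, verify that the path is correct and try again.
At line:1 char:1
+ nmake -f Makefile.win clean all
+ ~~~~~
    + CategoryInfo          : ObjectNotFound: (nmake:String) [], CommandNotFoundException
    + FullyQualifiedErrorId : CommandNotFoundException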
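The first fix was to add the directory containing nmake to the PATH environment variable. Running the build again still failed, though; this time the compiler could not find the included header files: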
PS E:\Download\libffm-121> nmake -f Makefile.win clean all

Microsoft (R) Program Maintenance Utility Version 14.00.24210.0
Copyright (C) Microsoft Corporation. All rights reserved.

erase /Q *.obj *.exe windows\.
rd windows
mkdir windows
cl.exe /nologo /O2 /EHsc /D "_CRT_SECURE_NO_DEPRECATE" /D "USEOMP" /D "USESSE" /openmp -c ffm.cpp
ffm.cpp
ffm.cpp(21): warning C4068: unknown pragma
ffm.cpp(22): fatal error C1034: algorithm: no include path set
NMAKE : fatal error U1077: '"C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\bin\cl.exe"' : return code '0x2'
Stop.
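A web search showed that setting up the VC++ environment variables is trickier than it looks: PATH, LIB, and INCLUDE all have to be set. The main reason is that VS2015 introduced the Universal CRT (ucrt), so the Windows 10 SDK has to be referenced as well, and uuid.lib has to come from the Windows 8.x SDK, which makes the configuration rather tedious.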
- PATH C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\bin;C:\Program Files (x86)\Microsoft Visual Studio 14.0\Common7\IDE
- LIB C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\lib;C:\Program Files (x86)\Windows Kits\10\Lib\10.0.10240.0\ucrt\x86;C:\Program Files (x86)\Windows Kits\8.1\Lib\winv6.3\um\x86
- INCLUDE C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\include;C:\Program Files (x86)\Windows Kits\10\Include\10.0.10240.0\ucrt
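If you would rather not change the system-wide variables, the same three values can be passed to a single build process instead. The following is a rough sketch of that idea, not something from the libffm documentation: build_msvc_env and run_nmake are hypothetical helper names, and the directory lists you pass in should be the ones from the list above, adjusted to your installation.

```python
import os
import shutil
import subprocess


def build_msvc_env(base_env, path_dirs, lib_dirs, include_dirs):
    """Return a copy of base_env with PATH extended and LIB/INCLUDE set."""
    env = dict(base_env)
    existing_path = env.get("PATH", "")
    # New PATH entries go first so the intended cl.exe and nmake are found
    parts = list(path_dirs) + ([existing_path] if existing_path else [])
    env["PATH"] = os.pathsep.join(parts)
    env["LIB"] = os.pathsep.join(lib_dirs)
    env["INCLUDE"] = os.pathsep.join(include_dirs)
    return env


def run_nmake(env, source_dir):
    """Run 'nmake -f Makefile.win clean all' in source_dir with the given env."""
    # Resolve nmake against the new PATH explicitly; the child's PATH is not
    # necessarily the one used to look up the executable itself
    nmake = shutil.which("nmake", path=env["PATH"])
    if nmake is None:
        raise FileNotFoundError("nmake not found on the configured PATH")
    subprocess.run(
        [nmake, "-f", "Makefile.win", "clean", "all"],
        cwd=source_dir,
        env=env,
        check=True,  # raise CalledProcessError if the build fails
    )
```

A call would look like run_nmake(build_msvc_env(os.environ, path_dirs, lib_dirs, include_dirs), r"E:\Download\libffm-121"), where the three lists hold the PATH, LIB, and INCLUDE directories. The original environment is left untouched because the function works on a copy.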
Adjust the paths to match your own installation. After that, running the build again compiles successfully, with only a few warnings:
PS E:\Download\libffm-121> nmake -f Makefile.win clean all

Microsoft (R) Program Maintenance Utility Version 14.00.24210.0
Copyright (C) Microsoft Corporation. All rights reserved.

erase /Q *.obj *.exe windows\.
rd windows
mkdir windows
cl.exe /nologo /O2 /EHsc /D "_CRT_SECURE_NO_DEPRECATE" /D "USEOMP" /D "USESSE" /openmp -c ffm.cpp
ffm.cpp
ffm.cpp(21): warning C4068: unknown pragma
cl.exe /nologo /O2 /EHsc /D "_CRT_SECURE_NO_DEPRECATE" /D "USEOMP" /D "USESSE" /openmp -c timer.cpp
timer.cpp
cl.exe /nologo /O2 /EHsc /D "_CRT_SECURE_NO_DEPRECATE" /D "USEOMP" /D "USESSE" /openmp ffm-train.cpp ffm.obj timer.obj -Fewindows\ffm-train.exe
ffm-train.cpp
ffm-train.cpp(1): warning C4068: unknown pragma
cl.exe /nologo /O2 /EHsc /D "_CRT_SECURE_NO_DEPRECATE" /D "USEOMP" /D "USESSE" /openmp ffm-predict.cpp ffm.obj timer.obj -Fewindows\ffm-predict.exe
ffm-predict.cpp
When the build finishes, a new windows folder is created under the source directory, containing two exe files:
- ffm-predict.exe
- ffm-train.exe
Using ffm-train.exe and ffm-predict.exe
The simpler approach is to call them directly from the command line, as described in the project documentation:
Command Line Usage
==================

- `ffm-train'

  usage: ffm-train [options] training_set_file [model_file]

  options:
  -l <lambda>: set regularization parameter (default 0.00002)
  -k <factor>: set number of latent factors (default 4)
  -t <iteration>: set number of iterations (default 15)
  -r <eta>: set learning rate (default 0.2)
  -s <nr_threads>: set number of threads (default 1)
  -p <path>: set path to the validation set
  --quiet: quiet model (no output)
  --no-norm: disable instance-wise normalization
  --auto-stop: stop at the iteration that achieves the best validation loss (must be used with -p)

  By default we do instance-wise normalization. That is, we normalize the 2-norm of each instance to 1. You can use `--no-norm' to disable this function.

  A binary file `training_set_file.bin' will be generated to store the data in binary format.

  Because FFM usually need early stopping for better test performance, we provide an option `--auto-stop' to stop at the iteration that achieves the best validation loss. Note that you need to provide a validation set with `-p' when you use this option.

- `ffm-predict'

  usage: ffm-predict test_file model_file output_file
Alternatively, the executables can be driven from Python by invoking the command line:
import os
import subprocess

# Switch to the folder that holds the compiled executables
os.chdir(r'E:\Download\libffm-121\windows')
print(os.getcwd())

# Optional smoke test: launch each executable once (os.startfile is Windows-only)
os.startfile("ffm-train.exe")
os.startfile("ffm-predict.exe")

# Train a model with the default parameters
cmd = 'ffm-train bigdata.tr.txt model'
subprocess.call(cmd, shell=True)

# Use bigdata.te.txt as the validation set
cmd = 'ffm-train -p bigdata.te.txt bigdata.tr.txt model'
subprocess.call(cmd, shell=True)

# 5-fold cross-validation
# (-v is not listed in the v1.21 usage text quoted above; it may only exist in other libffm versions)
cmd = 'ffm-train -v 5 bigdata.tr.txt'
subprocess.call(cmd, shell=True)

# Train with --quiet to suppress the training output
cmd = 'ffm-train --quiet bigdata.tr.txt'
subprocess.call(cmd, shell=True)

# Predict
cmd = 'ffm-predict bigdata.te.txt model output.txt'
subprocess.call(cmd, shell=True)

# Disk-based training
# (--no-rand and --on-disk are likewise not listed in the v1.21 usage text quoted above)
cmd = 'ffm-train --no-rand --on-disk bigdata.tr.txt'
subprocess.call(cmd, shell=True)

# Use --auto-stop to stop training once the best validation loss is reached
cmd = 'ffm-train -p bigdata.te.txt -t 100 --auto-stop bigdata.tr.txt'
subprocess.call(cmd, shell=True)
The training files used in the sample code are available at: https://github.com/keyunluo/python-ffm/tree/master/example/libffm-format
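Typing out command strings by hand makes it easy to slip in a wrong dash or to forget that --auto-stop needs -p. As a small sketch of a safer pattern, the helper below assembles an argument list using only options that appear in the usage text quoted earlier, and a second helper runs it without a shell and raises on a non-zero exit code. Both function names are hypothetical and not part of libffm.

```python
import subprocess


def build_train_command(train_file, model_file=None, validation_file=None,
                        iterations=None, threads=None, auto_stop=False,
                        quiet=False, exe="ffm-train"):
    """Build an ffm-train argument list from options in its usage text."""
    args = [exe]
    if validation_file is not None:
        args += ["-p", validation_file]
    if iterations is not None:
        args += ["-t", str(iterations)]
    if threads is not None:
        args += ["-s", str(threads)]
    if auto_stop:
        # The usage text says --auto-stop must be used together with -p
        if validation_file is None:
            raise ValueError("--auto-stop requires a validation set (-p)")
        args.append("--auto-stop")
    if quiet:
        args.append("--quiet")
    # Positional arguments come last: training file, then optional model file
    args.append(train_file)
    if model_file is not None:
        args.append(model_file)
    return args


def run_command(args, cwd=None):
    """Run the command without a shell; raise CalledProcessError on failure."""
    return subprocess.run(args, cwd=cwd, check=True)
```

For example, build_train_command("bigdata.tr.txt", "model", validation_file="bigdata.te.txt", iterations=100, auto_stop=True) gives the list form of the last training command above, which can then be handed to run_command together with the folder containing the executables.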
Calling the executables this way is cumbersome. I found another open-source project that wraps them further: https://github.com/gatapia/py_ml_utils. The wrapper code is:
from __future__ import print_function, absolute_import
import os, sys, subprocess, shlex, tempfile, time, sklearn.base, math
import numpy as np
import pandas as pd
from pandas_extensions import *
from ExeEstimator import *

class LibFFMClassifier(ExeEstimator, sklearn.base.ClassifierMixin):
    '''
    options:
    -l <lambda>: set regularization parameter (default 0)
    -k <factor>: set number of latent factors (default 4)
    -t <iteration>: set number of iterations (default 15)
    -r <eta>: set learning rate (default 0.1)
    -s <nr_threads>: set number of threads (default 1)
    -p <path>: set path to the validation set
    --quiet: quiet model (no output)
    --norm: do instance-wise normalization
    --no-rand: disable random update

    `--norm' helps you to do instance-wise normalization. When it is enabled,
    you can simply assign `1' to `value' in the data.
    '''
    def __init__(self, columns, lambda_v=0, factor=4, iteration=15, eta=0.1,
                 nr_threads=1, quiet=False, normalize=None, no_rand=None):
        ExeEstimator.__init__(self)

        self.columns = columns.tolist() if hasattr(columns, 'tolist') else columns
        self.lambda_v = lambda_v
        self.factor = factor
        self.iteration = iteration
        self.eta = eta
        self.nr_threads = nr_threads
        self.quiet = quiet
        self.normalize = normalize
        self.no_rand = no_rand

    def fit(self, X, y=None):
        # X is either the path of a libffm-format file or data to be written to one
        if type(X) is str:
            train_file = X
        else:
            if not hasattr(X, 'values'):
                X = pd.DataFrame(X, columns=self.columns)
            train_file = self.save_reusable('_libffm_train', 'to_libffm', X, y)

        # self._model_file = self.save_tmp_file(X, '_libffm_model', True)
        self._model_file = self.tmpfile('_libffm_model')

        # Build the ffm-train command line from the estimator's hyperparameters
        command = 'utils/lib/ffm-train.exe' + ' -l ' + repr(self.lambda_v) + \
            ' -k ' + repr(self.factor) + ' -t ' + repr(self.iteration) + \
            ' -r ' + repr(self.eta) + ' -s ' + repr(self.nr_threads)
        if self.quiet: command += ' --quiet'
        # --norm and --no-rand follow the option list in the docstring above,
        # which differs from the v1.21 usage text (that one has --no-norm)
        if self.normalize: command += ' --norm'
        if self.no_rand: command += ' --no-rand'
        command += ' ' + train_file
        command += ' ' + self._model_file
        running_process = self.make_subprocess(command)
        self.close_process(running_process)
        return self

    def predict(self, X):
        if type(X) is str:
            test_file = X
        else:
            if not hasattr(X, 'values'):
                X = pd.DataFrame(X, columns=self.columns)
            test_file = self.save_reusable('_libffm_test', 'to_libffm', X)

        output_file = self.tmpfile('_libffm_predictions')

        command = 'utils/lib/ffm-predict.exe ' + test_file + ' ' + self._model_file + ' ' + output_file
        running_process = self.make_subprocess(command)
        self.close_process(running_process)
        preds = list(self.read_predictions(output_file))
        return preds

    def predict_proba(self, X):
        # Map each raw prediction to a probability with the logistic function;
        # a list comprehension keeps this working on Python 3, where map() is lazy
        predictions = np.asarray([1 / (1 + math.exp(-p)) for p in self.predict(X)])
        return np.vstack([1 - predictions, predictions]).T
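In summary, using libffm on Windows is quite painful, both to compile and to call. If your environment allows it, use Linux instead.
Installing libffm on Linux + Anaconda
Installing the libffm-python package under Anaconda on Linux also ran into a problem. The full error output was: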
➜ libffm-python git:(master) python setup.py install
/home/qw/anaconda3/lib/python3.7/site-packages/setuptools/dist.py:481: UserWarning: The version specified ('7e8621d') is an invalid version, this may not work as expected with newer versions of setuptools, pip, and PyPI. Please see PEP 440 for more details.
  "details." % self.metadata.version
running install
running bdist_egg
running egg_info
creating ffm.egg-info
writing ffm.egg-info/PKG-INFO
writing dependency_links to ffm.egg-info/dependency_links.txt
writing requirements to ffm.egg-info/requires.txt
writing top-level names to ffm.egg-info/top_level.txt
writing manifest file 'ffm.egg-info/SOURCES.txt'
reading manifest file 'ffm.egg-info/SOURCES.txt'
writing manifest file 'ffm.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
creating build
creating build/lib.linux-x86_64-3.7
creating build/lib.linux-x86_64-3.7/ffm
copying ffm/__init__.py -> build/lib.linux-x86_64-3.7/ffm
copying ffm/ffm.py -> build/lib.linux-x86_64-3.7/ffm
running build_ext
building 'ffm.libffm' extension
creating build/temp.linux-x86_64-3.7
gcc -pthread -B /home/qw/anaconda3/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I. -I/home/qw/anaconda3/include/python3.7m -c ffm.cpp -o build/temp.linux-x86_64-3.7/ffm.o -Wall -O3 -std=c++0x -march=native -DUSESSE -DUSEOMP
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
ffm.cpp:578: warning: ignoring #pragma omp parallel [-Wunknown-pragmas]
  578 | #pragma omp parallel for schedule(static) reduction(+: loss)
      |
ffm.cpp:726: warning: ignoring #pragma omp parallel [-Wunknown-pragmas]
  726 | #pragma omp parallel for schedule(static) reduction(+: loss)
      |
gcc -pthread -B /home/qw/anaconda3/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I. -I/home/qw/anaconda3/include/python3.7m -c timer.cpp -o build/temp.linux-x86_64-3.7/timer.o -Wall -O3 -std=c++0x -march=native -DUSESSE -DUSEOMP
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
g++ -pthread -shared -B /home/qw/anaconda3/compiler_compat -L/home/qw/anaconda3/lib -Wl,-rpath=/home/qw/anaconda3/lib -Wl,--no-as-needed -Wl,--sysroot=/ build/temp.linux-x86_64-3.7/ffm.o build/temp.linux-x86_64-3.7/timer.o -o build/lib.linux-x86_64-3.7/ffm/libffm.cpython-37m-x86_64-linux-gnu.so -fopenmp
/home/qw/anaconda3/compiler_compat/ld: build/temp.linux-x86_64-3.7/ffm.o: unable to initialize decompress status for section .debug_info
/home/qw/anaconda3/compiler_compat/ld: build/temp.linux-x86_64-3.7/ffm.o: unable to initialize decompress status for section .debug_info
/home/qw/anaconda3/compiler_compat/ld: build/temp.linux-x86_64-3.7/ffm.o: unable to initialize decompress status for section .debug_info
/home/qw/anaconda3/compiler_compat/ld: build/temp.linux-x86_64-3.7/ffm.o: unable to initialize decompress status for section .debug_info
build/temp.linux-x86_64-3.7/ffm.o: file not recognized: file format not recognized
collect2: error: ld returned 1 exit status
error: command 'g++' failed with exit status 1
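At first I suspected a problem in the libffm code itself, so I replaced it with the latest upstream version, but the error persisted. Checking the code again showed nothing wrong with it, and it compiled fine outside the Anaconda environment. On closer inspection, the culprit was Anaconda: it ships its own linker, ld, in the ~/anaconda3/compiler_compat directory. The fix is simple: rename the ld file in ~/anaconda3/compiler_compat, then run the installation again.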
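Renaming ld permanently may affect other builds in the same Anaconda environment, so one option is to rename it only for the duration of the install and put it back afterwards. Here is a minimal sketch of that idea; the context manager temporarily_renamed and the .bak suffix are my own choices, and the path in the usage note assumes the default ~/anaconda3 location.

```python
import contextlib
import os


@contextlib.contextmanager
def temporarily_renamed(path, suffix=".bak"):
    """Rename path to path+suffix inside the with-block, then restore it."""
    hidden = path + suffix
    os.rename(path, hidden)
    try:
        yield hidden
    finally:
        # Always restore the original name, even if the build fails
        os.rename(hidden, path)
```

Usage would look like this, run from the libffm-python directory: open a block with temporarily_renamed(os.path.expanduser("~/anaconda3/compiler_compat/ld")) and call subprocess.run([sys.executable, "setup.py", "install"], check=True) inside it. While the block runs, g++ falls back to the system linker; once it exits, Anaconda's ld is back in place.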