
Using FFM/libffm on Windows

钱魏Way

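Yu-Chin Juan, the author of FFM, has open-sourced the C++ implementation, libffm, on GitHub. Since my day-to-day data processing is done in Python, I hoped to find a Python version of FFM. There are many related projects on GitHub, for example: A Python wrapper for LibFFM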

The project is installed on Windows as follows:

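  • Download the project and unpack it.
  • Install the mingw32 environment: conda install mingw32
  • Add the mingw32 path to the PATH environment variable: C:\RBuildTools\3.5\mingw_32\bin
  • Edit Python's compiler settings in D:\ProgramData\Anaconda3\Lib\distutils\ cfg (create the file if it does not exist) and add: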
[build]
compiler=mingw32
  • In the project directory, run: python setup.py install
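
The compiler setting in the steps above is a small INI file, so it can also be written from Python. The sketch below uses the standard `configparser` module; the function name `write_build_cfg` and the idea of passing the target path in as an argument are my own assumptions, not part of the project.

```python
import configparser
from pathlib import Path


def write_build_cfg(cfg_path, compiler="mingw32"):
    """Create or update an INI file so that its [build] section selects `compiler`.

    Hypothetical helper for illustration; `cfg_path` is supplied by the caller.
    """
    cfg_path = Path(cfg_path)
    parser = configparser.ConfigParser()
    if cfg_path.exists():
        # Keep any settings that are already in the file.
        parser.read(cfg_path, encoding="utf-8")
    if not parser.has_section("build"):
        parser.add_section("build")
    parser.set("build", "compiler", compiler)
    with cfg_path.open("w", encoding="utf-8") as fh:
        parser.write(fh)
    return cfg_path
```

`configparser` writes the entry as `compiler = mingw32`, with spaces around the equals sign, which is the same INI syntax as the two lines shown in the list.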

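However, importing the package fails with the error below. The main cause is that installing on Windows does not build the libffm.so file; other similar projects behave the same way.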

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-2-244abf364e9b> in <module>
----> 1 import ffm

D:\ProgramData\Anaconda3\lib\site-packages\ffm-7e8621d-py3.6-win-amd64.egg\ffm\__init__.py in <module>
----> 1 from .ffm import FFMData, FFM, read_model

D:\ProgramData\Anaconda3\lib\site-packages\ffm-7e8621d-py3.6-win-amd64.egg\ffm\ffm.py in <module>
     70 FFM_Problem_ptr = ctypes.POINTER(FFM_Problem)
     71 
---> 72 _lib = ctypes.cdll.LoadLibrary(get_lib_path())
     73 
     74 _lib.ffm_convert_data.restype = FFM_Problem

D:\ProgramData\Anaconda3\lib\ctypes\__init__.py in LoadLibrary(self, name)
    424 
    425     def LoadLibrary(self, name):
--> 426         return self._dlltype(name)
    427 
    428 cdll = LibraryLoader(CDLL)

D:\ProgramData\Anaconda3\lib\ctypes\__init__.py in __init__(self, name, mode, handle, use_errno, use_last_error)
    346 
    347         if handle is None:
--> 348             self._handle = _dlopen(self._name, mode)
    349         else:
    350             self._handle = handle

OSError: [WinError 87] 参数错误。
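
Since the error comes from `ctypes` being handed a library that was never built, it can help to look for the compiled file before importing the package. The following is a minimal sketch; `find_shared_libs` is a hypothetical helper, and the file patterns it looks for (names starting with `libffm` and ending in `.so`, `.dll` or `.pyd`) are an assumption about how such a wrapper names its binary.

```python
from pathlib import Path


def find_shared_libs(package_dir, stem="libffm", suffixes=(".so", ".dll", ".pyd")):
    """Return every file below `package_dir` whose name starts with `stem`
    and ends with one of `suffixes`, sorted by path.

    Hypothetical helper; the naming pattern is an assumption.
    """
    root = Path(package_dir)
    if not root.is_dir():
        return []
    return sorted(
        p for p in root.rglob(stem + "*")
        if p.is_file() and p.name.endswith(tuple(suffixes))
    )
```

An empty result for the installed `ffm` package directory would be consistent with the failure above: there is nothing for `ctypes.cdll.LoadLibrary` to load.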

Compiling libffm on Windows

Since the Python package did not work, I decided to compile the C++ code directly. According to the project README, the Windows build supports libffm only up to v1.21:

Building Windows Binaries
=========================

The Windows part is maintained by different maintainer, so it may not always support the latest version.

The latest version it supports is: v1.21

To build them via command-line tools of Visual C++, use the following steps:

1. Open a DOS command box (or Developer Command Prompt for Visual Studio) and go to LIBFFM directory. If environment
variables of VC++ have not been set, type

"C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\bin\amd64\vcvars64.bat"

You may have to modify the above command according which version of VC++ or
where it is installed.

2. Type

nmake -f Makefile.win clean all
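
The two README steps have to run in the same command session, because the variables set by `vcvars64.bat` only exist in the session that ran it. As a rough sketch, the two steps can be chained into a single `cmd.exe` command line from Python. The helper names `build_nmake_command` and `run_build` are hypothetical, and the script path passed in is whatever applies to your installation.

```python
import subprocess


def build_nmake_command(vcvars_bat, makefile="Makefile.win", targets=("clean", "all")):
    """Return one cmd.exe command line that first runs the Visual C++
    environment script and then, only if that succeeded, runs nmake.

    Hypothetical helper for illustration.
    """
    return 'call "{}" && nmake -f {} {}'.format(vcvars_bat, makefile, " ".join(targets))


def run_build(source_dir, vcvars_bat):
    """Run the chained command inside `source_dir` (Windows only)."""
    cmd = build_nmake_command(vcvars_bat)
    # shell=True hands the string to cmd.exe, so `call` and `&&` are interpreted there.
    return subprocess.call(cmd, shell=True, cwd=source_dir)
```

Running both steps in one command line keeps the compiler environment and the `nmake` call in the same process, which is the point of step 1 in the README.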

I followed the steps above, and the first error was that "nmake" could not be found:

nmake : 无法将“nmake”项识别为 cmdlet、函数、脚本文件或可运行程序的名称。请检查名称的拼写,如果包括路径,请确保路径正
确,然后再试一次。
所在位置 行:1 字符: 1
+ nmake -f Makefile.win clean all
+ ~~~~~
    + CategoryInfo          : ObjectNotFound: (nmake:String) [], CommandNotFoundException
    + FullyQualifiedErrorId : CommandNotFoundException

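The first fix I could think of was to add the directory containing "nmake" to the PATH environment variable. After doing that the build still failed, this time because the included header files could not be found: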

PS E:\Download\libffm-121> nmake -f Makefile.win clean all

Microsoft (R) Program Maintenance Utility Version 14.00.24210.0
Copyright (C) Microsoft Corporation.  All rights reserved.

        erase /Q *.obj *.exe windows\.
        rd windows
        mkdir windows
        cl.exe /nologo /O2 /EHsc /D "_CRT_SECURE_NO_DEPRECATE" /D "USEOMP" /D "USESSE" /openmp -c ffm.cpp
ffm.cpp
ffm.cpp(21): warning C4068: unknown pragma
ffm.cpp(22): fatal error C1034: algorithm: no include path set
NMAKE : fatal error U1077: '"C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\bin\cl.exe"' : return code '0x2'
Stop.

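After some searching, I found that setting up the VC++ environment variables is fairly involved: PATH, LIB and INCLUDE all need to be set. The main reason is that VS2015 introduced the Universal CRT (ucrt), which requires the Windows 10 SDK as well; in addition, uuid.lib is found in the Windows 8.x SDK, so the configuration is rather tedious.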

  • PATH  C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\bin;C:\Program Files (x86)\Microsoft Visual Studio 14.0\Common7\IDE
  • LIB  C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\lib;C:\Program Files (x86)\Windows Kits\10\Lib\10.0.10240.0\ucrt\x86;C:\Program Files (x86)\Windows Kits\8.1\Lib\winv6.3\um\x86
  • INCLUDE  C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\include;C:\Program Files (x86)\Windows Kits\10\Include\10.0.10240.0\ucrt
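
Before running nmake again, the three variables can be sanity-checked from Python. The sketch below is a minimal example: it reports which build tools cannot be found on PATH and which directories listed in LIB and INCLUDE do not exist. The function name `check_build_env` and its return shape are assumptions made for illustration.

```python
import os
import shutil


def check_build_env(env=None, tools=("nmake", "cl")):
    """Report build tools that cannot be found on PATH and directories
    listed in LIB / INCLUDE that do not exist on disk.

    Hypothetical helper; returns (missing_tools, missing_dirs).
    """
    env = os.environ if env is None else env
    path = env.get("PATH", "")
    missing_tools = [t for t in tools if shutil.which(t, path=path) is None]
    missing_dirs = {}
    for var in ("LIB", "INCLUDE"):
        # Entries are separated by os.pathsep (";" on Windows).
        dirs = [d for d in env.get(var, "").split(os.pathsep) if d]
        missing_dirs[var] = [d for d in dirs if not os.path.isdir(d)]
    return missing_tools, missing_dirs
```

Calling `check_build_env()` with no arguments inspects the current process environment; an empty list and two empty lists in the result mean that every tool and directory was found.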

Adjust the paths to match your own installation. After that, running the command again compiles successfully, with only a few warnings:

PS E:\Download\libffm-121> nmake -f Makefile.win clean all

Microsoft (R) Program Maintenance Utility Version 14.00.24210.0
Copyright (C) Microsoft Corporation.  All rights reserved.

        erase /Q *.obj *.exe windows\.
        rd windows
        mkdir windows
        cl.exe /nologo /O2 /EHsc /D "_CRT_SECURE_NO_DEPRECATE" /D "USEOMP" /D "USESSE" /openmp -c ffm.cpp
ffm.cpp
ffm.cpp(21): warning C4068: unknown pragma
        cl.exe /nologo /O2 /EHsc /D "_CRT_SECURE_NO_DEPRECATE" /D "USEOMP" /D "USESSE" /openmp -c timer.cpp
timer.cpp
        cl.exe /nologo /O2 /EHsc /D "_CRT_SECURE_NO_DEPRECATE" /D "USEOMP" /D "USESSE" /openmp ffm-train.cpp ffm.obj timer.obj -Fewindows\ffm-train.exe
ffm-train.cpp
ffm-train.cpp(1): warning C4068: unknown pragma
        cl.exe /nologo /O2 /EHsc /D "_CRT_SECURE_NO_DEPRECATE" /D "USEOMP" /D "USESSE" /openmp ffm-predict.cpp ffm.obj timer.obj -Fewindows\ffm-predict.exe
ffm-predict.cpp

When compilation finishes, a new folder named windows is created under the source folder, containing two exe files:

  • ffm-predict.exe
  • ffm-train.exe

Using ffm-train.exe and ffm-predict.exe

The simpler approach is to call them directly from the command line, as described in the project documentation:

Command Line Usage
==================

-   `ffm-train'

    usage: ffm-train [options] training_set_file [model_file]

    options:
    -l <lambda>: set regularization parameter (default 0.00002)
    -k <factor>: set number of latent factors (default 4)
    -t <iteration>: set number of iterations (default 15)
    -r <eta>: set learning rate (default 0.2)
    -s <nr_threads>: set number of threads (default 1)
    -p <path>: set path to the validation set
    --quiet: quiet model (no output)
    --no-norm: disable instance-wise normalization
    --auto-stop: stop at the iteration that achieves the best validation loss (must be used with -p)

    By default we do instance-wise normalization. That is, we normalize the 2-norm of each instance to 1. You can use
    `--no-norm' to disable this function.
    
    A binary file `training_set_file.bin' will be generated to store the data in binary format.

    Because FFM usually need early stopping for better test performance, we provide an option `--auto-stop' to stop at
    the iteration that achieves the best validation loss. Note that you need to provide a validation set with `-p' when
    you use this option.


-   `ffm-predict'

    usage: ffm-predict test_file model_file output_file

They can also be used from Python by invoking the command line:

import os
import subprocess

# Work in the folder that holds the compiled ffm-train.exe and ffm-predict.exe
os.chdir(r'E:\Download\libffm-121\windows')
print(os.getcwd())

# Train a model with the default parameters
cmd = 'ffm-train bigdata.tr.txt model'
subprocess.call(cmd, shell=True)

# Use bigdata.te.txt as the validation data
cmd = 'ffm-train -p bigdata.te.txt bigdata.tr.txt model'
subprocess.call(cmd, shell=True)

# 5-fold cross validation
# (-v is not listed in the v1.21 usage text above, so this build may reject it)
cmd = 'ffm-train -v 5 bigdata.tr.txt'
subprocess.call(cmd, shell=True)

# Train with --quiet so that no training information is printed
cmd = 'ffm-train --quiet bigdata.tr.txt'
subprocess.call(cmd, shell=True)

# Predict
cmd = 'ffm-predict bigdata.te.txt model output.txt'
subprocess.call(cmd, shell=True)

# Disk-based training
# (--no-rand and --on-disk are not listed in the v1.21 usage text above either)
cmd = 'ffm-train --no-rand --on-disk bigdata.tr.txt'
subprocess.call(cmd, shell=True)

# Use --auto-stop to stop training once the best validation loss is reached
cmd = 'ffm-train -p bigdata.te.txt -t 100 --auto-stop bigdata.tr.txt'
subprocess.call(cmd, shell=True)
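
The command strings above can also be assembled as an argument list, which avoids typing flags by hand. The sketch below covers only the options listed in the ffm-train usage text quoted earlier; the function name `build_train_args` and its keyword names are hypothetical.

```python
def build_train_args(train_file, model_file=None, *, lam=None, k=None,
                     iterations=None, eta=None, threads=None, validation=None,
                     quiet=False, no_norm=False, auto_stop=False, exe="ffm-train"):
    """Assemble an ffm-train argument list from the documented options.

    Hypothetical helper; options left as None fall back to ffm-train's defaults.
    """
    if auto_stop and validation is None:
        # The usage text says --auto-stop must be used with -p.
        raise ValueError("--auto-stop must be used together with -p")
    args = [exe]
    for flag, value in (("-l", lam), ("-k", k), ("-t", iterations),
                        ("-r", eta), ("-s", threads), ("-p", validation)):
        if value is not None:
            args += [flag, str(value)]
    if quiet:
        args.append("--quiet")
    if no_norm:
        args.append("--no-norm")
    if auto_stop:
        args.append("--auto-stop")
    args.append(train_file)
    if model_file is not None:
        args.append(model_file)
    return args
```

For example, `build_train_args("bigdata.tr.txt", "model", validation="bigdata.te.txt", iterations=100, auto_stop=True)` yields a list that can be handed to `subprocess.call`, so no shell quoting is involved.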

The training files used in the sample code are available at: https://github.com/keyunluo/python-ffm/tree/master/example/libffm-format
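
The files in that folder use the libffm text format, in which each line holds a label followed by `field:feature:value` triples separated by spaces. A minimal sketch of producing such lines from Python is shown here; the helper names `to_libffm_line` and `write_libffm` are made up for illustration.

```python
def to_libffm_line(label, triples):
    """Format one sample as '<label> <field>:<feature>:<value> ...'.

    `triples` is an iterable of (field_index, feature_index, value).
    Hypothetical helper; values are written with up to 6 significant digits.
    """
    parts = [str(label)]
    for field, feature, value in triples:
        parts.append("{}:{}:{:g}".format(int(field), int(feature), float(value)))
    return " ".join(parts)


def write_libffm(path, samples):
    """Write an iterable of (label, triples) pairs, one sample per line."""
    with open(path, "w", encoding="ascii", newline="\n") as fh:
        for label, triples in samples:
            fh.write(to_libffm_line(label, triples) + "\n")
```

A file written this way can be passed to ffm-train as `training_set_file`.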

Calling the tools this way is quite cumbersome. I found another open-source project that wraps them further: https://github.com/gatapia/py_ml_utils. The wrapper code is:

from __future__ import print_function, absolute_import

import os, sys, subprocess, shlex, tempfile, time, sklearn.base, math
import numpy as np
import pandas as pd
from pandas_extensions import * 
from ExeEstimator import *

class LibFFMClassifier(ExeEstimator, sklearn.base.ClassifierMixin):
  '''
  options:
  -l <lambda>: set regularization parameter (default 0)
  -k <factor>: set number of latent factors (default 4)
  -t <iteration>: set number of iterations (default 15)
  -r <eta>: set learning rate (default 0.1)
  -s <nr_threads>: set number of threads (default 1)
  -p <path>: set path to the validation set
  --quiet: quiet model (no output)
  --norm: do instance-wise normalization
  --no-rand: disable random update
  `--norm' helps you to do instance-wise normalization. When it is enabled,
  you can simply assign `1' to `value' in the data.
  '''
  def __init__(self, columns, lambda_v=0, factor=4, iteration=15, eta=0.1, 
    nr_threads=1, quiet=False, normalize=None, no_rand=None):
    ExeEstimator.__init__(self)
    
    self.columns = columns.tolist() if hasattr(columns, 'tolist') else columns
    self.lambda_v = lambda_v
    self.factor = factor
    self.iteration = iteration
    self.eta = eta
    self.nr_threads = nr_threads
    self.quiet = quiet
    self.normalize = normalize
    self.no_rand = no_rand

  def fit(self, X, y=None):
    if type(X) is str: train_file = X
    else: 
      if not hasattr(X, 'values'): X = pd.DataFrame(X, columns=self.columns)
      train_file = self.save_reusable('_libffm_train', 'to_libffm', X, y)
      
    # self._model_file = self.save_tmp_file(X, '_libffm_model', True)
    self._model_file = self.tmpfile('_libffm_model')

    command = 'utils/lib/ffm-train.exe' + ' -l ' + repr(self.lambda_v) + \
      ' -k ' + repr(self.factor) + ' -t ' + repr(self.iteration) + \
      ' -r ' + repr(self.eta) + ' -s ' + repr(self.nr_threads)
    if self.quiet: command += ' --quiet'
    if self.normalize: command += ' --norm'
    if self.no_rand: command += ' --no-rand'  
    command += ' ' + train_file
    command += ' ' + self._model_file
    running_process = self.make_subprocess(command)
    self.close_process(running_process)
    return self

  def predict(self, X):  
    if type(X) is str: test_file = X
    else: 
      if not hasattr(X, 'values'): X = pd.DataFrame(X, columns=self.columns)
      test_file = self.save_reusable('_libffm_test', 'to_libffm', X)

    output_file = self.tmpfile('_libffm_predictions')

    command = 'utils/lib/ffm-predict.exe ' + test_file + ' ' + self._model_file + ' ' + output_file
    running_process = self.make_subprocess(command)
    self.close_process(running_process)
    preds = list(self.read_predictions(output_file))
    return preds

  def predict_proba(self, X):    
    # Build a list first: on Python 3, map() returns an iterator that np.asarray cannot expand
    predictions = np.asarray([1 / (1 + math.exp(-p)) for p in self.predict(X)])
    return np.vstack([1 - predictions, predictions]).T
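
The class above relies on helpers from `ExeEstimator` (such as `read_predictions`) that are not shown here. As a rough idea of what reading the result involves, the sketch below parses an output file under the assumption that ffm-predict writes one floating-point prediction per line; `read_prediction_file` is a hypothetical name, not the project's own function.

```python
def read_prediction_file(path):
    """Read one float per non-empty line from `path`.

    Hypothetical helper; assumes one prediction per line in the output file.
    """
    preds = []
    with open(path, "r", encoding="ascii") as fh:
        for line_no, line in enumerate(fh, start=1):
            text = line.strip()
            if not text:
                continue  # skip blank lines
            try:
                preds.append(float(text))
            except ValueError:
                raise ValueError("line {}: not a number: {!r}".format(line_no, text))
    return preds
```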

In summary, using libffm on Windows is difficult, both to compile and to call. If your environment allows it, Linux is the better choice.
