Python’s multiprocessing.Pool is used to parallelize the computation of all kernels as well as the model selection.
The Fast Computation of Shortest Path Kernel (FCSP) method [8] is implemented in the random walk kernel, the shortest path kernel, and the structural shortest path kernel, where FCSP is applied to both vertex and edge kernels.
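The core idea of FCSP is to evaluate the vertex (and, in this library, also the edge) kernels between all pairs once, cache the values, and look them up for every pair of shortest paths instead of recomputing them. A minimal sketch of this idea, assuming a user-supplied vertex kernel k_v and placeholder symbolic labels (not the library’s implementation):

import numpy as np

def vertex_kernel_table(labels1, labels2, k_v):
    # Evaluate the vertex kernel once per vertex pair and cache the results;
    # path comparisons can then reuse these cached values (the FCSP idea).
    table = np.empty((len(labels1), len(labels2)))
    for i, l1 in enumerate(labels1):
        for j, l2 in enumerate(labels2):
            table[i, j] = k_v(l1, l2)
    return table

# Hypothetical usage with a Kronecker-delta vertex kernel on symbolic labels.
kv_table = vertex_kernel_table(['C', 'O', 'C'], ['C', 'N'],
                               lambda a, b: 1.0 if a == b else 0.0)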
The trie data structure [9] is employed in the path kernel up to length h to store paths in graphs.
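For illustration, a minimal trie over label sequences can be sketched as follows (a sketch of the data structure only, not the library’s implementation; '#count' is a sentinel key assumed not to collide with real labels):

class Trie:

    def __init__(self):
        self.root = {}

    def insert(self, path):
        # Walk down the trie, creating children as needed, and count the path.
        node = self.root
        for label in path:
            node = node.setdefault(label, {})
        node['#count'] = node.get('#count', 0) + 1

    def count(self, path):
        # Return how many times the exact path has been inserted.
        node = self.root
        for label in path:
            if label not in node:
                return 0
            node = node[label]
        return node.get('#count', 0)

trie = Trie()
trie.insert(['C', 'single', 'O'])
trie.insert(['C', 'single', 'O'])
print(trie.count(['C', 'single', 'O']))  # 2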
This library uses the multiprocessing.Pool.imap_unordered function for parallelization, which may not run correctly on Windows. For now, Windows users may need to comment out the parallel code and uncomment the serial code below it. We will consider adding a parameter to switch between serial and parallel computation as needed.
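The parallelization pattern itself looks roughly like the following sketch (illustrative only, with a toy kernel in place of a real graph kernel; the pool size and graphs are arbitrary):

from itertools import combinations_with_replacement
from multiprocessing import Pool

import networkx as nx
import numpy as np

def toy_kernel(g1, g2):
    # Placeholder kernel, just cheap enough to illustrate the pattern.
    return float(min(g1.number_of_nodes(), g2.number_of_nodes()))

def _entry(args):
    # Worker: compute one entry of the Gram matrix.
    i, j, g1, g2 = args
    return i, j, toy_kernel(g1, g2)

if __name__ == '__main__':  # guard required when processes are spawned (e.g. on Windows)
    graphs = [nx.path_graph(n) for n in (2, 3, 4)]
    n = len(graphs)
    gram = np.zeros((n, n))
    tasks = ((i, j, graphs[i], graphs[j])
             for i, j in combinations_with_replacement(range(n), 2))
    with Pool(processes=2) as pool:
        for i, j, value in pool.imap_unordered(_entry, tasks):
            gram[i, j] = gram[j, i] = value
    print(gram)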
Some modules (such as NumPy, SciPy, and scikit-learn) use `OpenBLAS <https://www.openblas.net/>`_ to perform parallel computation by default, which conflicts with other parallelization modules such as multiprocessing.Pool and greatly increases computing time. Setting its thread count to 1 forces OpenBLAS to use a single thread/CPU, thus avoiding the conflict. For now, this has to be done manually. Under Linux, type this command in the terminal before running the code:
$ export OPENBLAS_NUM_THREADS=1
Or add export OPENBLAS_NUM_THREADS=1 at the end of your ~/.bashrc file, then run
$ source ~/.bashrc
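Alternatively (an option, not the library’s documented procedure), the variable can be set from inside a Python script, as long as it is set before NumPy/SciPy are imported for the first time:

import os
os.environ['OPENBLAS_NUM_THREADS'] = '1'  # must be set before the first numpy/scipy import

import numpy as np  # OpenBLAS now runs single-threaded in this process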
If you have used graphkit-learn in your publication, please cite the following paper:
@article{JIA2021,
  title = "graphkit-learn: A Python Library for Graph Kernels Based on Linear Patterns",
  journal = "Pattern Recognition Letters",
  year = "2021",
  issn = "0167-8655",
  doi = "https://doi.org/10.1016/j.patrec.2021.01.003",
  url = "http://www.sciencedirect.com/science/article/pii/S0167865521000131",
  author = "Linlin Jia and Benoit Gaüzère and Paul Honeine",
  keywords = "Graph Kernels, Linear Patterns, Python Implementation",
  abstract = "This paper presents graphkit-learn, the first Python library for efficient computation of graph kernels based on linear patterns, able to address various types of graphs. Graph kernels based on linear patterns are thoroughly implemented, each with specific computing methods, as well as two well-known graph kernels based on non-linear patterns for comparative analysis. Since computational complexity is an Achilles’ heel of graph kernels, we provide several strategies to address this critical issue, including parallelization, the trie data structure, and the FCSP method that we extend to other kernels and edge comparison. All proposed strategies save orders of magnitudes of computing time and memory usage. Moreover, all the graph kernels can be simply computed with a single Python statement, thus are appealing to researchers and practitioners. For the convenience of use, an advanced model selection procedure is provided for both regression and classification problems. Experiments on synthesized datasets and 11 real-world benchmark datasets show the relevance of the proposed library."
}
This research was supported by CSC (China Scholarship Council) and the French national research agency (ANR) under the grant APi (ANR-18-CE23-0014). The authors would like to thank the CRIANN (Le Centre Régional Informatique et d’Applications Numériques de Normandie) for providing computational resources.
[1] Thomas Gärtner, Peter Flach, and Stefan Wrobel. On graph kernels: Hardness results and efficient alternatives. Learning Theory and Kernel Machines, pages 129–143, 2003.
[2] H. Kashima, K. Tsuda, and A. Inokuchi. Marginalized kernels between labeled graphs. In Proceedings of the 20th International Conference on Machine Learning, Washington, DC, United States, 2003.
[3] Vishwanathan, S.V.N., Schraudolph, N.N., Kondor, R., Borgwardt, K.M., 2010. Graph kernels. Journal of Machine Learning Research 11, 1201–1242.
[4] K. M. Borgwardt and H.-P. Kriegel. Shortest-path kernels on graphs. In Proceedings of the International Conference on Data Mining, pages 74-81, 2005.
[5] Liva Ralaivola, Sanjay J Swamidass, Hiroto Saigo, and Pierre Baldi. Graph kernels for chemical informatics. Neural networks, 18(8):1093–1110, 2005.
[6] Suard F, Rakotomamonjy A, Bensrhair A. Kernel on Bag of Paths For Measuring Similarity of Shapes. In ESANN 2007 (pp. 355–360).
[7] Mahé, P., Ueda, N., Akutsu, T., Perret, J.L., Vert, J.P., 2004. Extensions of marginalized graph kernels, in: Proc. the twenty-first international conference on Machine learning, ACM. p. 70.
[8] Lifan Xu, Wei Wang, M Alvarez, John Cavazos, and Dongping Zhang. Parallelization of shortest path graph kernels on multi-core CPUs and GPUs. Proceedings of the Programmability Issues for Heterogeneous Multicores (MultiProg), Vienna, Austria, 2014.
[9] Edward Fredkin. Trie memory. Communications of the ACM, 3(9):490–499, 1960.
[10] Gaüzere, B., Brun, L., Villemin, D., 2012. Two new graphs kernels in chemoinformatics. Pattern Recognition Letters 33, 2038–2047.
[11] Shervashidze, N., Schweitzer, P., Leeuwen, E.J.v., Mehlhorn, K., Borgwardt, K.M., 2011. Weisfeiler-lehman graph kernels. Journal of Machine Learning Research 12, 2539–2561.
[1] Thomas Gärtner, Peter Flach, and Stefan Wrobel. On graph kernels:
Hardness results and efficient alternatives. Learning Theory and Kernel
Machines, pages 129–143, 2003.
List of graphs between which the kernels are computed.
G1, G2 : NetworkX graphs
Two graphs between which the kernel is computed.
node_label : string
Node attribute used as symbolic label. The default node label is ‘atom’.
edge_label : string
Edge attribute used as symbolic label. The default edge label is ‘bond_type’.
weight : integer
Weight coefficient for walks of different lengths; it corresponds to beta in the ‘exp’ method and to gamma in the ‘geo’ method.
compute_method : string
Method used to compute the walk kernel. The following choices are available:
‘exp’: method based on the exponential series applied to the direct product graph, as shown in reference [1]. The time complexity is O(n^6) for graphs with n vertices.
‘geo’: method based on the geometric series applied to the direct product graph, as shown in reference [1]. The time complexity is O(n^6) for graphs with n vertices.
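For orientation, with \(W_\times\) the adjacency matrix of the direct product graph of \(G_1\) and \(G_2\), the two variants correspond, up to implementation details such as normalization, to the series of reference [1]:

\[
k_{\mathrm{exp}}(G_1, G_2) = \sum_{i,j} \Big[ \sum_{l \ge 0} \frac{\beta^l}{l!} W_\times^l \Big]_{ij} = \sum_{i,j} \big[ e^{\beta W_\times} \big]_{ij},
\qquad
k_{\mathrm{geo}}(G_1, G_2) = \sum_{i,j} \Big[ \sum_{l \ge 0} \gamma^l W_\times^l \Big]_{ij} = \sum_{i,j} \big[ (I - \gamma W_\times)^{-1} \big]_{ij}.
\]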
List of walks retrieved, where for unlabeled graphs, each walk is represented by a list of nodes; while for labeled graphs, each walk is represented by a string consisting of the labels of the nodes and edges on that walk.
[1] H. Kashima, K. Tsuda, and A. Inokuchi. Marginalized kernels between
labeled graphs. In Proceedings of the 20th International Conference on
Machine Learning, Washington, DC, United States, 2003.
[2] Pierre Mahé, Nobuhisa Ueda, Tatsuya Akutsu, Jean-Luc Perret, and
Jean-Philippe Vert. Extensions of marginalized graph kernels. In
Proceedings of the twenty-first international conference on Machine
learning, page 70. ACM, 2004.
[1] S Vichy N Vishwanathan, Nicol N Schraudolph, Risi Kondor, and Karsten M Borgwardt. Graph kernels. Journal of Machine Learning Research, 11(Apr):1201–1242, 2010.
List of graphs between which the kernels are computed.
G1, G2 : NetworkX graphs
Two graphs between which the kernel is computed.
compute_method : string
Method used to compute the kernel. The following choices are available:
‘sylvester’: Sylvester equation method.
‘conjugate’: conjugate gradient method.
‘fp’: fixed-point iterations.
‘spectral’: spectral decomposition.
All four are strategies for evaluating the same random walk series; a formula sketch is given after this parameter list.
weight : float
A constant weight set for random walks of length h.
p : None
Initial probability distribution on the unlabeled direct product graph of the two graphs. It is set to be uniform over all vertices in the direct product graph.
q : None
Stopping probability distribution on the unlabeled direct product graph of the two graphs. It is set to be uniform over all vertices in the direct product graph.
edge_weight : float
Edge attribute name corresponding to the edge weight.
node_kernels : dict
A dictionary of kernel functions for nodes, including 3 items: ‘symb’ for symbolic node labels, ‘nsymb’ for non-symbolic node labels, and ‘mix’ for both kinds of labels. The first 2 functions take two node labels as parameters, and the ‘mix’ function takes 4 parameters, a symbolic and a non-symbolic label for each of the two nodes. Each label is in the form of a 2-D array (n_samples, n_features). Each function returns a number as the kernel value. Ignored when nodes are unlabeled. This argument is designated for the conjugate gradient method and fixed-point iterations.
edge_kernels : dict
A dictionary of kernel functions for edges, including 3 items: ‘symb’ for symbolic edge labels, ‘nsymb’ for non-symbolic edge labels, and ‘mix’ for both kinds of labels. The first 2 functions take two edge labels as parameters, and the ‘mix’ function takes 4 parameters, a symbolic and a non-symbolic label for each of the two edges. Each label is in the form of a 2-D array (n_samples, n_features). Each function returns a number as the kernel value. Ignored when edges are unlabeled. This argument is designated for the conjugate gradient method and fixed-point iterations.
node_label : string
Node attribute used as label. The default node label is atom. This argument is designated for the conjugate gradient method and fixed-point iterations.
edge_label : string
Edge attribute used as label. The default edge label is bond_type. This argument is designated for the conjugate gradient method and fixed-point iterations.
sub_kernel : string
Method used to compute the walk kernel. The following choices are available:
‘exp’: method based on the exponential series.
‘geo’: method based on the geometric series.
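For orientation, the four compute methods are different strategies for evaluating the same generalized random walk kernel of reference [1] (a sketch that omits labels and implementation details), with \(W_\times\) the weight matrix of the direct product graph, \(p_\times\) and \(q_\times\) the starting and stopping distributions, and \(\lambda\) the constant weight:

\[
k(G_1, G_2) = \sum_{l \ge 0} \lambda^l \, q_\times^\top W_\times^l \, p_\times = q_\times^\top (I - \lambda W_\times)^{-1} p_\times .
\]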
List of graphs between which the kernels are computed.
G1, G2 : NetworkX graphs
Two graphs between which the kernel is computed.
node_label : string
Node attribute used as label. The default node label is atom.
edge_weight : string
Edge attribute name corresponding to the edge weight.
node_kernels : dict
A dictionary of kernel functions for nodes, including 3 items: ‘symb’ for symbolic node labels, ‘nsymb’ for non-symbolic node labels, and ‘mix’ for both kinds of labels. The first 2 functions take two node labels as parameters, and the ‘mix’ function takes 4 parameters, a symbolic and a non-symbolic label for each of the two nodes. Each label is in the form of a 2-D array (n_samples, n_features). Each function returns a number as the kernel value. Ignored when nodes are unlabeled.
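As an illustration of this structure, such a dictionary could be assembled as follows (a sketch only: kron_delta and gaussian are hypothetical helpers, and the argument order assumed for ‘mix’ should be checked against the library):

import numpy as np

def kron_delta(x, y):
    # Kronecker delta on symbolic labels.
    return 1.0 if x == y else 0.0

def gaussian(x, y, gamma=1.0):
    # Gaussian (RBF) kernel on non-symbolic label vectors.
    x = np.asarray(x, dtype=float).ravel()
    y = np.asarray(y, dtype=float).ravel()
    return float(np.exp(-gamma * np.sum((x - y) ** 2)))

def mixed(x_symb, y_symb, x_nsymb, y_nsymb):
    # Product of the symbolic and non-symbolic parts (argument order is an assumption).
    return kron_delta(x_symb, y_symb) * gaussian(x_nsymb, y_nsymb)

node_kernels = {'symb': kron_delta, 'nsymb': gaussian, 'mix': mixed}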
List of graphs between which the kernels are computed.
G1, G2 : NetworkX graphs
Two graphs between which the kernel is computed.
node_label : string
Node attribute used as label. The default node label is atom.
edge_weight : string
Edge attribute name corresponding to the edge weight. Applied for the computation of the shortest paths.
edge_label : string
Edge attribute used as label. The default edge label is bond_type.
node_kernels : dict
A dictionary of kernel functions for nodes, including 3 items: ‘symb’ for symbolic node labels, ‘nsymb’ for non-symbolic node labels, and ‘mix’ for both kinds of labels. The first 2 functions take two node labels as parameters, and the ‘mix’ function takes 4 parameters, a symbolic and a non-symbolic label for each of the two nodes. Each label is in the form of a 2-D array (n_samples, n_features). Each function returns a number as the kernel value. Ignored when nodes are unlabeled.
edge_kernels : dict
A dictionary of kernel functions for edges, including 3 items: ‘symb’ for symbolic edge labels, ‘nsymb’ for non-symbolic edge labels, and ‘mix’ for both kinds of labels. The first 2 functions take two edge labels as parameters, and the ‘mix’ function takes 4 parameters, a symbolic and a non-symbolic label for each of the two edges. Each label is in the form of a 2-D array (n_samples, n_features). Each function returns a number as the kernel value. Ignored when edges are unlabeled.
compute_method : string
Computation method used to store the shortest paths and compute the graph kernel. The following choices are available:
For unlabeled graphs, canonkey is a dictionary that records the number of occurrences of every tree pattern. For labeled graphs, canonkey_l is a dictionary that records the number of occurrences of every treelet.
List of graphs between which the kernels are computed.
G1, G2 : NetworkX graphs
Two graphs between which the kernel is computed.
sub_kernel : function
The sub-kernel between 2 real-number vectors. Each vector counts the numbers of isomorphic treelets in a graph (an example sub-kernel is sketched after the references below).
node_label : string
Node attribute used as label. The default node label is atom.
edge_label : string
Edge attribute used as label. The default edge label is bond_type.
parallel : string/None
Which parallelization method is applied to compute the kernel. The following choices are available:
‘imap_unordered’: use Python’s multiprocessing.Pool.imap_unordered method.
None: no parallelization is applied.
n_jobs : int
Number of jobs for parallelization. The default is to use all computational cores. This argument is only valid when a parallelization method is applied.
[1] Liva Ralaivola, Sanjay J Swamidass, Hiroto Saigo, and Pierre
Baldi. Graph kernels for chemical informatics. Neural networks,
18(8):1093–1110, 2005.
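For example, a Gaussian sub-kernel between two treelet-count vectors could be written as follows (an illustrative sketch; the bandwidth gamma is arbitrary):

import numpy as np

def gaussian_sub_kernel(x, y, gamma=1.0):
    # Sub-kernel between two real vectors of treelet counts.
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.exp(-gamma * np.sum((x - y) ** 2)))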
List of paths retrieved, where for unlabeled graphs, each path is represented by a list of nodes; while for labeled graphs, each path is represented by a list of strings consisting of the labels of the nodes and/or edges on that path.
List of graphs between which the kernels are computed.
G1, G2 : NetworkX graphs
Two graphs between which the kernel is computed.
node_label : string
Node attribute used as label. The default node label is atom.
edge_label : string
Edge attribute used as label. The default edge label is bond_type.
depth : integer
Depth of the search, i.e., the longest length of paths.
k_func : function
A kernel function applied using different notions of fingerprint similarity, defining the type of feature map and the normalization method applied to the graph kernel. The following choices are available:
‘MinMax’: use the MinMax kernel and a counting feature map.
‘tanimoto’: use the Tanimoto kernel and a binary feature map.
None: no sub-kernel is used; the kernel is computed directly.
(The ‘MinMax’ and ‘tanimoto’ sub-kernels are sketched after this parameter block.)
compute_method : string
Computation method used to store paths and compute the graph kernel. The following choices are available:
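The two sub-kernels referred to above are standard fingerprint similarities from the chemoinformatics literature (Ralaivola et al., 2005) and can be sketched as follows (illustrative implementations on count/binary vectors, not the library’s code):

import numpy as np

def minmax_kernel(x, y):
    # Min-max kernel on two count (fingerprint) vectors.
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.minimum(x, y).sum() / np.maximum(x, y).sum())

def tanimoto_kernel(x, y):
    # Tanimoto kernel on two binary fingerprint vectors.
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    dot = float(x @ y)
    return dot / (float(x @ x) + float(y @ y) - dot)

print(minmax_kernel([2, 0, 1], [1, 1, 1]))    # 0.5
print(tanimoto_kernel([1, 0, 1], [1, 1, 1]))  # 0.666...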
List of graphs between which the kernels are computed.
G1, G2 : NetworkX graphs
Two graphs between which the kernel is computed.
node_label : string
Node attribute used as label. The default node label is atom.
edge_label : string
Edge attribute used as label. The default edge label is bond_type.
height : int
Subtree height.
base_kernel : string
Base kernel used in each iteration of the WL kernel. Only the default ‘subtree’ kernel can be applied for now (the underlying relabeling step is sketched after this parameter block).
parallel : None
Which parallelization method is applied to compute the kernel. No parallelization can be applied for now.
n_jobs : int
Number of jobs for parallelization. The default is to use all computational cores. This argument is only valid when a parallelization method is applied and can be ignored for now.
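The ‘subtree’ base kernel relies on the Weisfeiler-Lehman relabeling step, which can be sketched as follows (an illustration of the idea, not the library’s implementation; compressing labels with Python’s hash is an arbitrary choice):

import networkx as nx

def wl_relabel(G, labels):
    # One WL iteration: each node's new label compresses its old label together
    # with the sorted multiset of its neighbours' labels.
    new_labels = {}
    for v in G.nodes():
        neigh = sorted(labels[u] for u in G.neighbors(v))
        new_labels[v] = hash((labels[v], tuple(neigh)))
    return new_labels

G = nx.cycle_graph(4)
labels = {v: 'a' for v in G.nodes()}
print(wl_relabel(G, labels))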
A MAT file contains a struct array containing graphs and a column vector lx containing a class label for each graph. Check the README in the downloadable file for the detailed structure.
[1] H. Kashima, K. Tsuda, and A. Inokuchi. Marginalized kernels between
labeled graphs. In Proceedings of the 20th International Conference on
Machine Learning, Washington, DC, United States, 2003.
Perform model selection, fitting, and testing for precomputed kernels using nested CV. Print out necessary data during the process and finally the results.
Kernel function used as the estimator. This function needs to return a Gram matrix.
param_grid_precomputed : dictionary
Dictionary with names (string) of parameters used to calculate Gram matrices as keys and lists of parameter settings to try as values. This enables searching over any sequence of parameter settings. Params with length 1 will be omitted. (Example grids are sketched after this parameter block.)
param_grid : dictionary
Dictionary with names (string) of parameters used as penalties as keys and lists of parameter settings to try as values. This enables searching over any sequence of parameter settings. Params with length 1 will be omitted.
model_type : string
Type of the problem; can be ‘regression’ or ‘classification’.
NUM_TRIALS : integer
Number of random trials of the outer CV loop. The default is 30.
datafile_y : string
Path of the file storing the y data. This parameter is optional depending on the given dataset file.
extra_params : dict
Extra parameters for loading the dataset. See the function gklearn.utils.graphfiles.loadDataset for details.
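For illustration, the two grids could look like the following (a sketch only; the parameter names ‘depth’, ‘k_func’, ‘C’ and ‘alpha’ are examples to be matched with the chosen kernel and learning machine):

import numpy as np

# Parameters used when precomputing the Gram matrices (graph kernel parameters).
param_grid_precomputed = {'depth': np.arange(1, 10),
                          'k_func': ['MinMax', 'tanimoto']}

# Penalty parameter of the learning machine trained on the Gram matrices,
# e.g. the SVM penalty C for classification:
param_grid = {'C': np.logspace(-10, 10, num=41)}
# or, for regression with kernel ridge, an 'alpha' grid instead:
# param_grid = {'alpha': np.logspace(-10, 10, num=41)}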
The kernels between pairs of vertices in these two graphs are computed.
node_kernels : dict
A dictionary of kernel functions for nodes, including 3 items: ‘symb’ for symbolic node labels, ‘nsymb’ for non-symbolic node labels, and ‘mix’ for both kinds of labels. The first 2 functions take two node labels as parameters, and the ‘mix’ function takes 4 parameters, a symbolic and a non-symbolic label for each of the two nodes. Each label is in the form of a 2-D array (n_samples, n_features). Each function returns a number as the kernel value. Ignored when nodes are unlabeled. This argument is designated for the conjugate gradient method and fixed-point iterations.
node_labels : list, optional
The list of the name strings of the node labels. The default is [].
node_attrs : list, optional
The list of the name strings of the node attributes. The default is [].
Lifan Xu, Wei Wang, M Alvarez, John Cavazos, and Dongping Zhang. Parallelization of shortest path graph kernels on multi-core CPUs and GPUs. Proceedings of the Programmability Issues for Heterogeneous Multicores (MultiProg), Vienna, Austria, 2014.
This method differs from networkx.tensor_product in that it only adds node pairs and edge pairs of G1 and G2 that have the same labels to the direct product graph.
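A label-preserving direct (tensor) product can be sketched as follows (illustrative only, not the library’s implementation; undirected NetworkX graphs are assumed):

import networkx as nx

def labeled_direct_product(G1, G2, node_label, edge_label):
    # Nodes: pairs of vertices with equal node labels.
    P = nx.Graph()
    for u1, d1 in G1.nodes(data=True):
        for u2, d2 in G2.nodes(data=True):
            if d1.get(node_label) == d2.get(node_label):
                P.add_node((u1, u2))
    # Edges: pairs of edges with equal edge labels whose endpoints survived above.
    for u1, v1, e1 in G1.edges(data=True):
        for u2, v2, e2 in G2.edges(data=True):
            if e1.get(edge_label) != e2.get(edge_label):
                continue
            for a, b in (((u1, u2), (v1, v2)), ((u1, v2), (v1, u2))):
                if P.has_node(a) and P.has_node(b):
                    P.add_edge(a, b)
    return P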
For an input graph G, its corresponding shortest-paths graph S contains the same set of nodes as G, and there is an edge between every pair of nodes in S that are connected by a walk in G. Every edge in S is labeled by the shortest distance between its two endpoints.
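Using NetworkX, this transformation can be sketched as follows (illustrative only, not the library’s implementation; the edge attribute name ‘cost’ is arbitrary):

import networkx as nx

def shortest_paths_graph(G, edge_weight=None):
    # S keeps the nodes of G and links every pair of connected nodes,
    # labeling the edge with their shortest distance.
    if edge_weight is None:
        dist = dict(nx.all_pairs_shortest_path_length(G))  # unweighted distances
    else:
        dist = dict(nx.all_pairs_dijkstra_path_length(G, weight=edge_weight))
    S = nx.Graph()
    S.add_nodes_from(G.nodes(data=True))
    for u, lengths in dist.items():
        for v, d in lengths.items():
            if u != v:
                S.add_edge(u, v, cost=d)
    return S

print(shortest_paths_graph(nx.path_graph(4)).edges(data=True))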
The list of basic types in Python. The default is None, which means
the default basic types are used. The default basic types include
int, float, complex, str, bool, NoneType, list,
tuple, dict, set, frozenset, range, slice.
deep : bool, optional
Whether to check the object recursively when obj is iterable.
The default is False.
To exhibit the effectiveness and practicality of the graphkit-learn library, we tested it on several benchmark datasets. See (Kersting et al., 2016) for details on these datasets.
A two-layer nested cross-validation (CV) is applied to select and evaluate models: the outer CV randomly splits the dataset into 10 folds, with 9 folds used as the validation set; the inner CV then randomly splits the validation set into 10 folds, with 9 folds used as the training set. The whole procedure is repeated 30 times, and the average performance is computed over these trials. Possible parameters of each graph kernel are also tuned during this procedure.
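For orientation, the protocol resembles the following scikit-learn sketch (an illustration with a synthetic Gram matrix and a classification task only; the library’s own model selection routine covers regression as well and differs in details):

import numpy as np
from sklearn.model_selection import GridSearchCV, ShuffleSplit, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
gram = X @ X.T                              # stands in for a graph-kernel Gram matrix
y = rng.integers(0, 2, size=60)             # synthetic class labels

inner_cv = ShuffleSplit(n_splits=10, test_size=0.1, random_state=0)
outer_cv = ShuffleSplit(n_splits=10, test_size=0.1, random_state=1)

clf = GridSearchCV(SVC(kernel='precomputed'),            # inner CV tunes the penalty C
                   param_grid={'C': np.logspace(-4, 4, 9)},
                   cv=inner_cv)
scores = cross_val_score(clf, gram, y, cv=outer_cv)      # outer CV estimates performance
print(scores.mean())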
The machine used to execute the experiments is a cluster with 28 CPU cores of Intel(R) Xeon(R) E5-2680 v4 @ 2.40GHz, 252GB memory, and 64-bit operating system CentOS Linux release 7.3.1611. All results were run with Python 3.5.2.
The figure below shows the prediction performance achieved by the graph kernels implemented in the graphkit-learn library, in terms of regression error (upper table) and classification rate (lower table). Red indicates the worst results and dark green the best ones. Gray cells with the “inf” marker indicate that the computation of the graph kernel on that dataset was omitted because it required far more computational resources than the other kernels.
The figure below displays the computational time needed to compute the Gram matrix of each graph kernel (in \(\log_{10}\) of seconds) on each dataset. The color legend has the same meaning as in the figure above.