HypoDisc: Discovering Novel and Potentially Interesting Substructures in Knowledge Graphs
Hypodisc aims to discover novel and potentially interesting substructures in multimodal heterogeneous knowledge bases, encoded as RDF knowledge graphs. Scholars can then use these substructures as a starting point to form new research hypotheses or to support existing ones, or to simply gain more insight into the knowledge, information, and data that is contained in their collections.
Multimodal Hypotheses
Hypodisc supports multimodal information of various types, which, in accordance with the RDF data model, are stored as string literals with an accompanying datatype or language tag. At present, Hypodisc understands numerical and temporal datatypes, as well as natural language and other strings. Literals with language tags are also treated as strings. To discover meaningful patterns in these data, a cluster-based approach is applied to all elements of the same datatype with a certain context.
The full list of supported datatypes can be found below.
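To give a rough feel for what clustering literal values of one datatype can look like, here is a minimal sketch. It is not Hypodisc's actual clustering code: the function name `gap_clusters`, the `max_gap` threshold, and the sample values are all hypothetical and chosen only for illustration.

```python
from typing import List


def gap_clusters(values: List[float], max_gap: float) -> List[List[float]]:
    """Group values into clusters, starting a new cluster whenever the
    distance to the previous (sorted) value exceeds max_gap.

    Illustrative only; the real tool may use a different method.
    """
    clusters: List[List[float]] = []
    for value in sorted(values):
        if clusters and value - clusters[-1][-1] <= max_gap:
            clusters[-1].append(value)   # close enough: extend current cluster
        else:
            clusters.append([value])     # gap too large: open a new cluster
    return clusters


if __name__ == "__main__":
    # Hypothetical numerical literals that share the same context
    years = [1601.0, 1605.0, 1610.0, 1702.0, 1708.0, 1850.0]
    for cluster in gap_clusters(years, max_gap=20.0):
        # Each cluster can be summarised as a value range plus its size
        print(min(cluster), max(cluster), len(cluster))
```

Under these assumptions, each resulting cluster could be summarised as a value range, which is the kind of generalisation that makes individual literal values usable as part of a pattern.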
Installation
To install this tool you will need the git version control system and a recent Python setup that includes pip.
Clone this repository (and the necessary submodules) using git:
git clone --recurse-submodules https://gitlab.com/wxwilcke/hypodisc.git
Change directory to the root of the tool:
cd hypodisc/
Install the prerequisites (numpy, sklearn, flask, and pyRDF) and the tool itself using pip:
pip install .
Hypodisc is now installed and ready to use.
Usage
Hypodisc accepts the following parameters:
usage: hypodisc [-h] -d DEPTH -s MIN_SUPPORT [-o OUTPUT] [--exclude EXCLUDE] [--max_size MAX_SIZE] [--max_width MAX_WIDTH] [--mode {A,T,AT,TA}] [--textual_support | --no-textual_support] [--numerical_support | --no-numerical_support] [--temporal_support | --no-temporal_support] [--namespace NAMESPACE] [--parallel] [--p_explore P_EXPLORE] [--p_extend P_EXTEND] [--dry_run] [--seed SEED] [--strategy {BFS,DFS}] [--verbose] [--version] input [input ...]
positional arguments:
input One or more knowledge graphs in (gzipped) NTriple or NQuad serialization format.
options:
-h, --help show this help message and exit
-d DEPTH, --depth DEPTH
Depths to explore. Takes a range 'from:to', or a shorthand ':to' or 'to' if all depths up to that point are to be considered.
-s MIN_SUPPORT, --min_support MIN_SUPPORT
Minimal pattern support.
-o OUTPUT, --output OUTPUT
Path to write output to.
--exclude EXCLUDE Exclude one or more predicates from being considered as building block for pattern.
--max_size MAX_SIZE Maximum context size
--max_width MAX_WIDTH
Maximum width of shell
--mode {A,T,AT,TA} A[box], T[box], or both as candidates to be included in the pattern
--textual_support, --no-textual_support
Cluster on textual literals
--numerical_support, --no-numerical_support
Cluster on numerical literals
--temporal_support, --no-temporal_support
Cluster on temporal literals
--namespace NAMESPACE
Add a custom prefix:namespace pair to be used in the output. This parameter can be used more than once to provide multiple mappings. Must be provided as 'prefix:namespace', eg 'ex:http://example.org/'.
--parallel Speed up the computation by distributing the search across multiple CPU cores
--p_explore P_EXPLORE
Probability of exploring candidate endpoint.
--p_extend P_EXTEND Probability of extending at candidate endpoint.
--dry_run Dry run without saving results.
--seed SEED Set the seed for the random number generator.
--strategy {BFS,DFS} Perform breadth-first (BFS) or depth-first search (DFS). BFS has the anytime property; DFS uses less memory.
--verbose, -v Print debug messages and warnings
--version show program's version number and exit
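The formats of the `--depth` and `--namespace` values described above can be sketched as follows. This is an illustration of the documented syntax, not the tool's own parsing code: the helper names `parse_depth` and `parse_namespace` are hypothetical, and the lower bound of 0 for the shorthand forms is an assumption.

```python
from typing import Tuple


def parse_depth(spec: str) -> Tuple[int, int]:
    """Parse 'from:to', ':to', or 'to' into a (start, end) pair.

    Assumption: the shorthand forms start at depth 0.
    """
    head, _, tail = spec.rpartition(":")
    start = int(head) if head else 0
    end = int(tail)
    if start > end:
        raise ValueError(f"invalid depth range: {spec!r}")
    return start, end


def parse_namespace(pair: str) -> Tuple[str, str]:
    """Split 'prefix:namespace' on the first colon only, because the
    namespace IRI itself contains colons."""
    prefix, sep, namespace = pair.partition(":")
    if not sep or not prefix or not namespace:
        raise ValueError(f"expected 'prefix:namespace', got {pair!r}")
    return prefix, namespace


if __name__ == "__main__":
    print(parse_depth("1:3"))
    print(parse_depth(":3"))
    print(parse_depth("3"))
    print(parse_namespace("ex:http://example.org/"))
```

Splitting the namespace pair on the first colon only is the important detail here: a plain split on every colon would break the IRI apart.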
Walkthrough
For any RDF knowledge graph in N-Triples or N-Quads format, run the tool using:
python hypodisc/run.py --depth <DEPTH> --min_support <SUPPORT> <KNOWLEDGE_GRAPH> [<KNOWLEDGE_GRAPH> ...]
See the test/ directory for several examples.
Pattern Browser
A simple facet browser can be used to interactively inspect the discovered patterns:
python hypodisc/browse.py <PATTERN_GRAPH>
The full range of options is listed below.
usage: browser [-h] [--base_ns BASE_NS] [--pagesize PAGESIZE] [--port PORT] [--suppress_browser] [--verbose] input
positional arguments:
input An N-Triple file containing queries with hypotheses
options:
-h, --help show this help message and exit
--base_ns BASE_NS Provide a custom base namespace
--pagesize PAGESIZE Number of queries to show per page.
--port PORT Change the default port
--suppress_browser Suppress the call to the browser
--verbose, -v Print debug messages and warnings
Supported datatypes
The following datatypes are supported by Hypodisc:
Numbers:
- xsd:decimal
- xsd:double
- xsd:float
- xsd:integer
- xsd:long
- xsd:int
- xsd:short
- xsd:byte
- xsd:nonNegativeInteger
- xsd:nonPositiveInteger
- xsd:negativeInteger
- xsd:positiveInteger
- xsd:unsignedLong
- xsd:unsignedInt
- xsd:unsignedShort
- xsd:unsignedByte
Time/date:
- xsd:date
- xsd:dateTime
- xsd:dateTimeStamp
- xsd:gYear
- xsd:gYearMonth
- xsd:gMonthDay
- xsd:gMonth
- xsd:gDay
Strings:
- xsd:string
- xsd:normalizedString
- xsd:token
- xsd:language
- xsd:Name
- xsd:NCName
- xsd:ENTITY
- xsd:ID
- xsd:IDREF
- xsd:NMTOKEN
- xsd:anyURI
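The grouping above can be expressed as a simple lookup from datatype to modality. The sketch below is only an illustration of that grouping and is not taken from the Hypodisc code base: the function name `modality` and the labels "numerical", "temporal", and "textual" are hypothetical. Following the RDF 1.1 data model, a literal without an explicit datatype is treated as an xsd:string, and language-tagged literals are treated as strings, as described earlier in this document.

```python
from typing import Optional

XSD = "http://www.w3.org/2001/XMLSchema#"

NUMERICAL = {
    "decimal", "double", "float", "integer", "long", "int", "short", "byte",
    "nonNegativeInteger", "nonPositiveInteger", "negativeInteger",
    "positiveInteger", "unsignedLong", "unsignedInt", "unsignedShort",
    "unsignedByte",
}
TEMPORAL = {
    "date", "dateTime", "dateTimeStamp", "gYear", "gYearMonth",
    "gMonthDay", "gMonth", "gDay",
}
TEXTUAL = {
    "string", "normalizedString", "token", "language", "Name", "NCName",
    "ENTITY", "ID", "IDREF", "NMTOKEN", "anyURI",
}


def modality(datatype: Optional[str] = None,
             language: Optional[str] = None) -> Optional[str]:
    """Return the modality for a literal, or None if it is unsupported."""
    if language is not None or datatype is None:
        return "textual"          # language-tagged or plain literal
    if not datatype.startswith(XSD):
        return None               # not an XSD datatype
    local = datatype[len(XSD):]
    if local in NUMERICAL:
        return "numerical"
    if local in TEMPORAL:
        return "temporal"
    if local in TEXTUAL:
        return "textual"
    return None


if __name__ == "__main__":
    print(modality(XSD + "gYear"))
    print(modality(XSD + "unsignedByte"))
    print(modality(language="nl"))
    print(modality(XSD + "boolean"))
```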
Acknowledgements
The development of this tool is funded by the CLARIAH project.