HypoDisc: Discovering Novel and Potentially Interesting Substructures in Knowledge Graphs
Hypodisc aims to discover novel and potentially interesting substructures in multimodal heterogeneous knowledge bases, encoded as RDF knowledge graphs. Scholars can then use these substructures as a starting point to form new research hypotheses or to support existing ones, or to simply gain more insight into the knowledge, information, and data that is contained in their collections.
Multimodal Hypotheses
Hypodisc supports multimodal information of various types, which, in accordance to the RDF data model, are stored as string literals with an accompanying datatype or language tag. At present, Hypodisc understands numerical and temporal datatypes, as well as natural language and other strings. Literals with language tags are also treated as strings. To discover meaningful patterns in these data, a cluster-based approach is applied to all elements of the same datatype with a certain context.
The full list of supported datatypes can be found below.
Installation
To install this tool you will need the git version control system and a recent Python setup which include pip
.
Clone this repository using
git
:git clone https://gitlab.com/wxwilcke/hypodisc.git
Change directory to the root of the tool:
cd hypodisc/
Install the prerequisites (numpy, sklearn, and pyRDF) and the tool itself using pip:
pip install .
Hypodisc is now installed and ready to use.
Usage
Hypodisc accepts the follow parameters:
hypodisc [-h] -d DEPTH -s MIN_SUPPORT [-o OUTPUT] [--max_size MAX_SIZE] [--max_width MAX_WIDTH] [--mode {A,T,AT,TA}] [--multimodal] [--namespace NAMESPACE] [--p_explore P_EXPLORE] [--p_extend P_EXTEND] [--dry_run] [--seed SEED] [--verbose] [--version] input [input ...]
positional arguments:
input One or more knowledge graphs in (gzipped) NTriple or NQuad serialization format.
options:
-h, --help
show this help message and exit
-d DEPTH, --depth DEPTH
Depths to explore. Takes a range 'from:to', or a shorthand ':to' or 'to' if all depths up to that point are to be considered.
-s MIN_SUPPORT, --min_support MIN_SUPPORT
Minimal pattern support.
-o OUTPUT, --output OUTPUT
Path to write output to (defaults to current directory)
--max_size MAX_SIZE
Maximum context size (defaults to inf.)
--max_width MAX_WIDTH
Maximum width of shell (defaults to inf.)
--mode {A,T,AT}
A[box], T[box], or both as candidates to be included in the pattern (defaults to AT)
--multimodal
Enable multimodal support (defaults to True)
--namespace NAMESPACE
Add a custom prefix:namespace pair to be used in the output. This parameter can be used more than once to provide multiple mappings. Must be provided as 'prefix:namespace', eg 'ex:http://example.org/'.
--p_explore P_EXPLORE
Probability of exploring candidate endpoint (defaults to 1.0)
--p_extend P_EXTEND
Probability of extending at candidate endpoint (defaults to 1.0)
--dry_run
Dry run without saving results (defaults to False)
--seed SEED
Set the seed for the random number generator (default to current time)
--verbose, -v
Print debug messages and warnings
--version
show program's version number and exit
Walkthrough
For any RDF knowledge graph in N-Triple or N-Quad format, run the tool using:
python hypodisc/run.py --depth <DEPTH> --min_support <SUPPORT> [<KNOWLEDGE_GRAPH>, ...]
See the test/
directory for an example.
Supported datatypes
The following datatypes are supported by Hypodisc:
Numbers:
- xsd:decimal
- xsd:double
- xsd:float
- xsd:integer
- xsd:long
- xsd:int
- xsd:short
- xsd:byte
- xsd:nonNegativeInteger
- xsd:nonPositiveInteger
- xsd:negativeInteger
- xsd:positiveInteger
- xsd:unsignedLong
- xsd:unsignedInt
- xsd:unsignedShort
- xsd:unsignedByte
Time/date:
- xsd:date
- xsd:dateTime
- xsd:dateTimeStamp
- xsd:gYear
- xsd:gYearMonth
- xsd:gMonthDay
- xsd:gMonth
- xsd:gDay
Strings:
- xsd:string
- xsd:normalizedString
- xsd:token
- xsd:language
- xsd:Name
- xsd:NCName
- xsd:ENTITY
- xsd:ID
- xsd:IDREF
- xsd:NMTOKEN
- xsd:anyURI
Contents:
Indices and tables
Acknowledgements
The development of this tool is funded by the CLARIAH project.