NPRG036 - Data Formats - Homework 2 assignment

What

In this homework, you will create data samples and queries using graph data models and query languages.

Fix the conceptual model from the previous homework based on the tutor's notes.
For representation of data in RDF, you will need to cover the conceptual model with classes and properties, either taken from existing vocabularies or newly defined using RDF Schema. For the newly defined classes and properties, use the prefix declaration @prefix ex: <http://example.org/vocabulary/> . and attach a human-readable label to each of them. For new properties, specify the domain and range.
[Figure: example of a conceptual model]
[Figure: example of a covered model]
Represent the data in RDF using RDF Turtle and validate it. Each RDF resource will have a type.
Load the RDF data into a triplestore such as Apache Jena Fuseki or OpenLink Virtuoso Open-Source (or both). A triplestore quick-start slide deck is available. OpenLink Virtuoso also lets you browse the data in its Faceted search plugin.
Using SPARQL, develop a few meaningful executable queries on top of your data, including a comment explaining what each query does.
For representation of data in LPG, create a visualization of a data sample like the one in the lecture. Use edge properties. Each node and relationship will have a label representing its type.
Represent the data in LPG using an executable Cypher script (see the Movies example in tutorials). The data should match the data in RDF as to quantity and meaning. It might differ in the usage of edge properties, which are not present in RDF.
Load the LPG data into Neo4j.
Using Cypher, develop a few meaningful queries on top of your data, including a comment explaining what each query does.
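The vocabulary step in the list above (new classes and properties under the ex: prefix, each with a label, properties with a domain and range) can be sketched as a small Python helper that renders the corresponding Turtle. This is only an illustration: the names ex:Course, ex:Teacher and ex:taughtBy are hypothetical placeholders, and only the ex: prefix comes from the assignment text.

```python
# Prefix declarations; ex: is the prefix required by the assignment.
PREFIXES = """\
@prefix ex: <http://example.org/vocabulary/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
"""


def declare_class(name, label):
    """Render a Turtle declaration of a new class with an English label."""
    return f'ex:{name} a rdfs:Class ;\n    rdfs:label "{label}"@en .\n'


def declare_property(name, label, domain, range_):
    """Render a Turtle declaration of a new property with label, domain and range."""
    return (
        f"ex:{name} a rdf:Property ;\n"
        f'    rdfs:label "{label}"@en ;\n'
        f"    rdfs:domain {domain} ;\n"
        f"    rdfs:range {range_} .\n"
    )


# Hypothetical vocabulary terms, used here only for illustration.
vocabulary = "\n".join([
    PREFIXES,
    declare_class("Course", "Course"),
    declare_class("Teacher", "Teacher"),
    declare_property("taughtBy", "taught by", "ex:Course", "ex:Teacher"),
])
print(vocabulary)
```

The printed text can be pasted into a Turtle validator to check the syntax before adding instance data.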

Quantitative requirements

At least 3 instances of each class. In the case of inheritance hierarchies, one instance of each child class is enough.
Every attribute used at least once.
At least 3 instances of each association.
At least 4 non-trivial SPARQL queries.
At least 4 non-trivial Cypher queries.

How

Replace the HW1 file with a fixed one in the HW1 column.
To the HW2 column, upload a zipped file named NPRG036-<groupID>-HW2.zip, e.g. NPRG036-T1G4-HW2.zip.
The zip file will contain a folder named 2, containing:
1. Folder rdf containing:
  1. Folder model containing:
    1. File covered.svg with the conceptual model covered by RDF vocabularies
  2. Folder data containing:
    1. File data.ttl with data in valid RDF Turtle
  3. Folder queries containing:
    1. Files query-<number>.sparql, such as query-1.sparql with an executable query and comment
2. Folder lpg containing:
  1. Folder model containing:
    1. File example.svg with the example of the LPG structure
  2. Folder data containing:
    1. File data.cypher with data in Cypher script, loadable to Neo4j
  3. Folder queries containing:
    1. Files query-<number>.cypher, such as query-1.cypher with an executable query and comment
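Before uploading, the layout above can be checked mechanically. The following is a minimal sketch, assuming the top-level folder is literally named 2 and that the minimum of 4 query files per language follows from the quantitative requirements; the function name layout_problems is hypothetical.

```python
import re

# Fixed files required by the folder structure described above.
REQUIRED_FILES = [
    "2/rdf/model/covered.svg",
    "2/rdf/data/data.ttl",
    "2/lpg/model/example.svg",
    "2/lpg/data/data.cypher",
]

# Query files follow the query-<number>.<extension> naming scheme.
QUERY_PATTERNS = {
    "SPARQL": re.compile(r"2/rdf/queries/query-[0-9]+\.sparql"),
    "Cypher": re.compile(r"2/lpg/queries/query-[0-9]+\.cypher"),
}

# Assumption: one file per query, at least 4 queries per language.
MIN_QUERIES = 4


def layout_problems(names):
    """Return a list of problems found in a listing of archive member names."""
    names = set(names)
    problems = [f"missing file: {path}" for path in REQUIRED_FILES if path not in names]
    for language, pattern in QUERY_PATTERNS.items():
        count = sum(1 for name in names if pattern.fullmatch(name))
        if count < MIN_QUERIES:
            problems.append(
                f"only {count} {language} query files, expected at least {MIN_QUERIES}"
            )
    return problems
```

The listing can be taken from the archive itself, e.g. layout_problems(zipfile.ZipFile("NPRG036-T1G4-HW2.zip").namelist()) after importing zipfile; an empty result means no problem was detected by this particular check.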

Frequently Asked Questions (FAQ)

What is a trivial query? Listing entities of a certain type, optionally filtered by a certain value, or counting entities of a certain type, optionally filtered by a certain value.

Common errors

Using classes in place of predicates: In general, classes cannot be used as RDF predicates. This means that in the covered UML class diagram, you should not use classes as association names. This error is usually easy to spot, because class names typically start with an upper-case letter after the prefix, e.g. foaf:Person.
Using predicates in place of classes: In general, predicates cannot be used as classes. This means that in the covered UML class diagram, you should not use predicates as UML class names. This error is usually easy to spot, because predicate names typically start with a lower-case letter after the prefix, e.g. foaf:name.
Missing language tags or data types in RDF: Literals containing natural-language text need to have the language identified using a language tag, e.g. @en or @cs. If a literal does not have a language tag, it needs to have a data type, typically from the xsd: namespace, e.g. xsd:dateTime, see XML Schema 1.1 Data types.
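The rule above, that every literal carries either a language tag or a datatype, can be sketched as a small Python helper that renders Turtle literals and refuses anything else. The function name turtle_literal and the sample values are hypothetical.

```python
def turtle_literal(text, lang=None, datatype=None):
    """Render a Turtle literal with exactly one of a language tag or a datatype."""
    if (lang is None) == (datatype is None):
        raise ValueError("give exactly one of lang or datatype")
    # Escape characters that may not appear raw inside a short quoted string.
    escaped = (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    if lang is not None:
        return f'"{escaped}"@{lang}'
    return f'"{escaped}"^^{datatype}'


print(turtle_literal("Data Formats", lang="en"))
print(turtle_literal("2024-03-15T10:00:00", datatype="xsd:dateTime"))
```

The datatype is passed as a prefixed name, so the xsd: prefix must be declared in the Turtle file that uses the result.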
Mistaking prefixed and absolute IRIs with weird schemes: In RDF Turtle, my:predicate is a prefixed IRI, which is expanded using the declared my: prefix. In contrast, <my:predicate> is an absolute IRI with the scheme my: instead of http: or https:. This is typically a mistake.
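The difference can be made visible with a short Python sketch that expands both forms and asks the standard library for the resulting scheme. The prefix mapping for my: is a hypothetical example, and the sketch ignores relative IRIs and base resolution.

```python
from urllib.parse import urlsplit

# Hypothetical prefix declaration: @prefix my: <http://example.org/vocabulary/> .
PREFIXES = {"my": "http://example.org/vocabulary/"}


def resolve(term):
    """Return the IRI that a Turtle term denotes (prefixed name or <...> form)."""
    if term.startswith("<") and term.endswith(">"):
        # Angle brackets hold the IRI itself; prefixes are not applied.
        return term[1:-1]
    prefix, _, local = term.partition(":")
    return PREFIXES[prefix] + local


for term in ("my:predicate", "<my:predicate>"):
    iri = resolve(term)
    print(f"{term} -> {iri} (scheme: {urlsplit(iri).scheme})")
```

The first term resolves to an http: IRI, while the second keeps my as its scheme, which is almost never what was intended.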
Wrong date syntax in LPG: In LPG, dates use the same syntax as xsd:date in RDF, i.e. YYYY-MM-DD.
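A quick way to catch wrongly formatted dates before they reach the Cypher script is a small check like the one below. It is a sketch that only accepts the plain YYYY-MM-DD form named above (xsd:date itself also allows an optional time zone, which is ignored here); the function name is hypothetical.

```python
import re
from datetime import date

# Exactly four, two and two ASCII digits separated by hyphens.
ISO_DATE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")


def is_plain_iso_date(value):
    """Return True if value is a valid calendar date written as YYYY-MM-DD."""
    if not ISO_DATE.fullmatch(value):
        return False
    try:
        date.fromisoformat(value)  # rejects impossible dates such as 2024-02-30
    except ValueError:
        return False
    return True


for candidate in ("2024-03-15", "15.3.2024", "2024-3-15", "2024-02-30"):
    print(candidate, is_plain_iso_date(candidate))
```

A value that passes this check can be used in Cypher as date('2024-03-15').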
Mistakes in RDF Turtle prefixes: When creating an RDFS vocabulary, you define rdfs:Class and rdf:Property instances. Note that the two terms use different prefixes: it is rdfs:Class, but rdf:Property.
Multiple rdfs:domain or rdfs:range: When you say that your property (an instance of rdf:Property) has multiple domains or ranges, you are saying that any RDF resource in the subject (domain) or object (range) position of a triple with this predicate is an instance of all the classes specified as rdfs:domain or rdfs:range. This is often wrong, because it is usually intended to mean that the domain or range includes any one of several classes, which is not what it says.
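The effect of multiple domains can be illustrated with a minimal Python sketch of the RDFS entailment rule for rdfs:domain (rule rdfs2 in RDF 1.1 Semantics): from (p rdfs:domain C) and (s p o) it follows that (s rdf:type C). The triples use hypothetical names, and the sketch covers only this one rule, not full RDFS reasoning.

```python
RDF_TYPE = "rdf:type"
RDFS_DOMAIN = "rdfs:domain"


def infer_subject_types(triples):
    """Apply the rdfs:domain rule: (p rdfs:domain C), (s p o) => (s rdf:type C)."""
    domains = {}
    for s, p, o in triples:
        if p == RDFS_DOMAIN:
            domains.setdefault(s, set()).add(o)
    inferred = set()
    for s, p, o in triples:
        for cls in domains.get(p, ()):
            inferred.add((s, RDF_TYPE, cls))
    return inferred


# Hypothetical vocabulary: ex:teaches is declared with two domains.
triples = [
    ("ex:teaches", RDFS_DOMAIN, "ex:Professor"),
    ("ex:teaches", RDFS_DOMAIN, "ex:Assistant"),
    ("ex:john", "ex:teaches", "ex:course1"),
]

for triple in sorted(infer_subject_types(triples)):
    print(triple)
```

A single ex:teaches triple makes ex:john both an ex:Professor and an ex:Assistant, which shows that the two domains combine as an intersection rather than as a choice.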
Using schema.org datatypes: Using Schema.org datatypes such as schema:Boolean is not recommended. With them, the literals are just strings as far as RDF is concerned, and they are interpreted as booleans only by a Schema.org-aware parser, which is often not desirable. RDF has built-in support for xsd: based datatypes, which should be preferred.
Not including human readable labels for instances or user-defined classes and predicates: Every RDF resource should have a human-readable label so that it can be shown to end users in potential applications. For instances, this is typically dcterms:title, foaf:name or skos:prefLabel. For classes and predicates, it is rdfs:label.
Using xsd:string for predicates with values being human readable text: Human-readable text, represented as a literal with a language tag, has the datatype rdf:langString, not xsd:string.
Using xsd:float for floating point numbers: In RDF, xsd:double is used rather than xsd:float. This is supported by the existing RDF Turtle and SPARQL syntax shortcuts for xsd:double.
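The syntax shortcuts mentioned above can be sketched in Python by classifying a numeric token the way the Turtle 1.1 grammar does (INTEGER, DECIMAL and DOUBLE productions). The function name is hypothetical, and the sketch covers only these three numeric shorthands.

```python
import re

EXPONENT = r"[eE][+-]?[0-9]+"

# Regular expressions mirroring the Turtle numeric literal productions.
PATTERNS = [
    ("xsd:integer", re.compile(r"[+-]?[0-9]+")),
    ("xsd:decimal", re.compile(r"[+-]?[0-9]*\.[0-9]+")),
    (
        "xsd:double",
        re.compile(
            rf"[+-]?(?:[0-9]+\.[0-9]*{EXPONENT}|\.[0-9]+{EXPONENT}|[0-9]+{EXPONENT})"
        ),
    ),
]


def shorthand_datatype(token):
    """Return the datatype a bare numeric Turtle token gets, or None if it is not one."""
    for datatype, pattern in PATTERNS:
        if pattern.fullmatch(token):
            return datatype
    return None


for token in ("42", "3.14", "3.14e0", "1E3"):
    print(token, shorthand_datatype(token))
```

None of the shorthands yields xsd:float, so a floating point value written with an exponent, such as 3.14e0, is an xsd:double without any explicit datatype.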
Insufficiently defined IRI patterns: Typical IRI patterns for instances (data) contain a hint of the type (based on the class name) and an identifier, e.g. https://data.mff.cuni.cz/people/John-Doe, https://data.mff.cuni.cz/rooms/S4, etc. It is necessary to first establish those patterns and then stick to them. Will it be /people, /person, /Person, John-Doe, john-doe, johnDoe …?
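One way to stick to a pattern is to mint every instance IRI through a single helper. The sketch below assumes one particular convention (plural lower-case collection name, identifier with its original capitalization and spaces replaced by hyphens), chosen only to reproduce the examples above; the function name mint_iri is hypothetical.

```python
from urllib.parse import quote

# Base taken from the example IRIs above.
BASE = "https://data.mff.cuni.cz/"


def mint_iri(collection, identifier):
    """Build an instance IRI as <BASE><collection>/<identifier-with-hyphens>."""
    slug = "-".join(identifier.split())  # collapse whitespace into single hyphens
    return f"{BASE}{collection}/{quote(slug, safe='')}"


print(mint_iri("people", "John Doe"))
print(mint_iri("rooms", "S4"))
```

Whatever convention is chosen, routing all IRIs through one function keeps /people and /person, or John-Doe and john-doe, from appearing side by side in the data.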
Telephone numbers and e-mails as literals: In RDF, telephone numbers and e-mail addresses have their own IRI schemes, e.g. tel:+420123456789 and mailto:[email protected]. Therefore, they are resources (rdfs:Resource), not literals.
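As a sketch of the point above, the following Python helpers turn a phone number and an e-mail address into Turtle IRI terms instead of quoted literals. The function names and the address jane.doe@example.org are hypothetical, and the sketch assumes an international phone number and an address without characters that would need percent-encoding.

```python
import re


def tel_iri(number):
    """Build a tel: IRI from a phone number in international format."""
    if not number.strip().startswith("+"):
        raise ValueError("expected international format starting with '+'")
    digits = re.sub(r"[^0-9]", "", number)  # drop spaces, dashes and the plus sign
    return f"tel:+{digits}"


def mailto_iri(address):
    """Build a mailto: IRI from an e-mail address."""
    return f"mailto:{address.strip()}"


def as_turtle_term(iri):
    """Wrap an IRI in angle brackets so that Turtle reads it as a resource."""
    return f"<{iri}>"


print(as_turtle_term(tel_iri("+420 123 456 789")))
print(as_turtle_term(mailto_iri("jane.doe@example.org")))
```

In the data, such terms go in the object position without quotes, for example with foaf:phone and foaf:mbox from the FOAF vocabulary.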
Not using LPG relation properties: One of the major differences between LPG and RDF is that LPG has relation properties. The assignment asks for their usage in LPG and Cypher. If no part of your conceptual model is suitable to be represented as relation properties, you can add something.
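To show what a relation property looks like in a Cypher script, the sketch below renders a CREATE clause for a relationship that carries properties. It is written in Python only to keep the example self-contained; the variables john and course, the TEACHES type and the property names are hypothetical, and only strings, numbers and booleans are handled.

```python
def cypher_value(value):
    """Render a Python value as a Cypher literal (strings, numbers, booleans only)."""
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return repr(value)
    escaped = str(value).replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"


def cypher_props(props):
    """Render a dict as a Cypher property map, e.g. {since: 2020}."""
    return "{" + ", ".join(f"{key}: {cypher_value(val)}" for key, val in props.items()) + "}"


def create_relationship(start_var, rel_type, props, end_var):
    """Render a CREATE clause for a relationship with properties between two bound nodes."""
    return f"CREATE ({start_var})-[:{rel_type} {cypher_props(props)}]->({end_var})"


print(create_relationship("john", "TEACHES", {"since": 2020, "role": "lecturer"}, "course"))
```

The rendered clause assumes that john and course were bound earlier in the same statement, for example by preceding CREATE or MATCH clauses.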