NSWI144 - Data on the Web - Semestral assignment specification
Briefly
- Identify source data
- Download/Extract/Scrape
- Triplify (convert to RDF according to LD principles - *****)
- Identify vocabularies to be reused, use them, maybe create your own
- Link (internally, externally)
- Store
- Query & Use – in a demo application
- Validate the data
In detail
Part 1 - identification and analysis of data sources
What
- At least 3 data sources regarding a related topic - they have to be linkable and sufficiently different
- e.g. 3 tables from 1 database = 1 data source
- e.g. 3 sheets in 1 excel file = 1 data source
- e.g. 3 HTML pages from one website = 1 data source
- 1 HTML page, 1 CSV file and 1 XML file, from different websites = 3 data sources
- The sources should not already be triplified (transformed to RDF); if an RDF version exists, at least do not look at how it was done
- UML class diagram-style diagram
How
- Folder source containing:
  - TXT file description.txt (UTF-8 encoded) with a description of the selected data - for each data source: format (XML/HTML/CSV/...) and a link to the data source
  - SVG file diagram.svg (UML class diagram style) showing concepts and relations in the chosen datasets. You can use any decent UML diagramming tool, e.g. diagrams.net, the commercial Lucidchart, etc.
Common mistakes
- Attributes should be primitive values (strings, dates, booleans) of the real-world entity represented by the class. When they are more complex, they typically represent another real-world entity and its primitive attributes.
- A common mistake was including an attribute of another real-world entity, which should be solved by introducing a new class and a relationship to it. Typically this is something which exists independently of the current entity and has a label, for instance the "country" attribute of a "customer". A country exists independently of the customer. The country name is an attribute of the "country" real-world entity, not of the "customer". "Customer lives in a country" is a relationship.
- The diagrams are supposed to show real-world entities, their relations and attributes. They are not supposed to show the columns of a CSV table - that is an artefact of the technical representation, not of the real world.
- The diagrams should show which attributes or which entities will be used for linking later.
Part 2 - transformation of source data into RDF
What
- Manually created RDF data samples of your datasets
  - showing how your datasets will look in RDF Turtle
  - at least two data instances from each class (entity type) in each dataset
  - define and adhere to IRI patterns
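A minimal sketch of such a hand-written sample, assuming hypothetical example.org IRI patterns and a made-up livesIn property (your own datasets, vocabularies and patterns will differ):

```turtle
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# IRI pattern for customers: https://example.org/resource/customer/{id}
<https://example.org/resource/customer/42> a foaf:Person ;
    foaf:name "Jan Novák" ;
    <https://example.org/vocabulary#livesIn> <https://example.org/resource/country/CZ> .

<https://example.org/resource/customer/43> a foaf:Person ;
    foaf:name "Anna Svobodová" .

# IRI pattern for countries: https://example.org/resource/country/{iso-code}
<https://example.org/resource/country/CZ> a <https://example.org/vocabulary#Country> ;
    rdfs:label "Czechia"@en .
```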
- Triplification script (TARQL, XSLT, ...) which can be run on the downloaded data - 1 for each data source
- Resulting RDF data has the correct format
  - it represents the source data
  - it uses correctly defined IRI patterns
  - it uses correct data types and language tags
  - entities from the source data are correctly separated and interlinked
- Transformation scripts use a language appropriate for the source data: XSLT (XML), XQuery (XML), R2RML (RDB), Tarql (CSV, Excel), Java+Jsoup (HTML), JSON-LD context for JSON, etc. Avoid coding transformations in a general-purpose programming language if the transformation can be done using a standard language.
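As an illustration of the Tarql approach, here is a minimal sketch assuming a hypothetical CSV with columns id, name and country (Tarql binds each CSV column header to a SPARQL variable of the same name):

```sparql
# Run with: tarql customers.sparql customers.csv
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

CONSTRUCT {
  ?customer a foaf:Person ;
            foaf:name ?name ;
            <https://example.org/vocabulary#livesIn> ?countryIri .
}
WHERE {
  # build IRIs from the CSV values according to the chosen IRI patterns
  BIND (IRI(CONCAT("https://example.org/resource/customer/", ?id)) AS ?customer)
  BIND (IRI(CONCAT("https://example.org/resource/country/", ?country)) AS ?countryIri)
}
```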
- In the first phase, correct vocabularies do not have to be used.
- For the second phase:
  - feedback from the first phase is incorporated
  - the data adheres to the "Link not Label" pattern
  - appropriate vocabularies have to be used (SKOS for codelists, RDF Data Cube for statistical data, etc.)
  - the resulting datasets and linksets are properly described with metadata according to the DCAT-AP 3.0 standard, using the required and recommended properties and codelists
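The "Link not Label" pattern means referring to an independently existing entity by IRI rather than by a plain string. A sketch with hypothetical example.org IRIs:

```turtle
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# "Label" anti-pattern - the country is only a string:
#   <https://example.org/resource/customer/42> ex:country "Czechia" .

# "Link not Label" - the country is a linked resource; the string is its label:
<https://example.org/resource/customer/42>
    <https://example.org/vocabulary#livesIn> <https://example.org/resource/country/CZ> .

<https://example.org/resource/country/CZ> rdfs:label "Czechia"@en .
```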
How
- Folder source containing:
  - TXT file description.txt, additionally containing the tools used for triplification of the datasets
  - Subfolder download containing the downloaded original data (possibly shortened)
- Folder triplification containing subfolders for each dataset. Each dataset subfolder contains:
  - Subfolder samples containing the manually created data samples
  - Subfolder triplifier containing the triplification scripts
  - Subfolder output containing the resulting RDF data (TIP: file extensions: Turtle: .ttl, N-Triples: .nt, RDF/XML: .rdf)
  - For phase 2, subfolder catalog containing a catalog with records for each output dataset according to DCAT-AP 3.0 and VoID, in Turtle syntax
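A minimal sketch of such a catalog record, with hypothetical example.org IRIs; check the DCAT-AP 3.0 specification itself for the full list of required and recommended properties and codelists:

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .

<https://example.org/catalog> a dcat:Catalog ;
    dct:title "Semestral project catalog"@en ;
    dct:description "Catalog of the triplified datasets."@en ;
    dcat:dataset <https://example.org/dataset/customers> .

<https://example.org/dataset/customers> a dcat:Dataset ;
    dct:title "Customers"@en ;
    dct:description "Customers triplified from the source CSV."@en ;
    dcat:distribution [ a dcat:Distribution ;
        dcat:accessURL <https://example.org/files/customers.ttl> ;
        dct:format <http://publications.europa.eu/resource/authority/file-type/RDF_TURTLE>
    ] .
```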
Common mistakes
- Using classes in place of predicates
  - Classes cannot, in general, be used as RDF predicates. This means that in the covered UML class diagram, you should not use classes as association names. Typically, this is an error which can be easily spotted: classes typically start with an upper-case letter after the prefix, e.g. foaf:Person.
- Using predicates in place of classes
  - Predicates cannot, in general, be used as classes. This means that in the covered UML class diagram, you should not use predicates as UML class names. Typically, this is an error which can be easily spotted: predicates typically start with a lower-case letter after the prefix, e.g. foaf:name.
- Missing language tags or data types in RDF
  - Texts in literals which are in a natural language need to have this language identified using a language tag, e.g. @en or @cs. If a literal does not have a language tag, it needs to have a data type, typically from the xsd: namespace, e.g. xsd:dateTime - see XML Schema 1.1 Data types.
- Using non-standard XSD types for numbers
  - RDF has expected XSD datatypes for numbers: xsd:integer for integers, xsd:double for floating point numbers and xsd:decimal for decimal numbers. Typical mistakes include xsd:float and xsd:int.
- Mistaking prefixed IRIs for absolute IRIs with weird schemes
  - In RDF Turtle, my:predicate is a prefixed IRI; <my:predicate> is, however, an absolute IRI with the my: scheme instead of http: or https:. This is typically a mistake.
- Using schema.org datatypes
  - Using schema.org datatypes such as schema:Boolean is not recommended. It means that in RDF, all literals are strings, and they are then interpreted as booleans only by a schema.org-aware parser, which is often not desirable. RDF has built-in support for xsd:-based datatypes, which should be used primarily.
- Telephone numbers and e-mails
  - Telephone numbers in RDF should be represented as resources identified by IRIs, e.g. <tel:+420123456789>, as should e-mail addresses, e.g. <mailto:[email protected]>.
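The points above can be combined into one Turtle fragment; the subject and property IRIs below are hypothetical placeholders:

```turtle
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<https://example.org/resource/customer/42>
    <https://example.org/vocabulary#note>       "Preferred customer"@en ;
    <https://example.org/vocabulary#registered> "2024-01-31T10:00:00"^^xsd:dateTime ;
    <https://example.org/vocabulary#orders>     12 ;      # abbreviates xsd:integer
    <https://example.org/vocabulary#discount>   0.15 ;    # abbreviates xsd:decimal
    <https://example.org/vocabulary#phone>      <tel:+420123456789> ;
    <https://example.org/vocabulary#email>      <mailto:info@example.org> .
```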
Part 3 - Link Discovery (Silk)
What
- Use Silk to discover links among your datasets - internal linking
- Use Silk to discover links between your datasets and foreign datasets already published on the Web - external linking. You can find those, for instance, in the LOD cloud, in the old DataHub, in linked.opendata.cz, in DBpedia, in the National Open Data Catalog, on the EU Vocabularies web, etc.
- The resulting links will be split into linksets, properly described by VoID metadata
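A minimal sketch of VoID metadata for a linkset, with hypothetical dataset IRIs and an illustrative triple count:

```turtle
@prefix void: <http://rdfs.org/ns/void#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .

<https://example.org/linkset/customers-dbpedia> a void:Linkset ;
    dct:title "Customers linked to DBpedia"@en ;
    void:subjectsTarget <https://example.org/dataset/customers> ;
    void:objectsTarget  <https://example.org/dataset/dbpedia> ;
    void:linkPredicate  owl:sameAs ;
    void:triples        128 .
```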
How
- Folder linking containing:
  - Subfolder internal containing:
    - a subfolder for each linking task between your datasets, named after the linked datasets, and in it:
      - the Silk-LSL script or an export of the linking task from Silk Workbench
      - the resulting linkset in N-Triples: linkset.nt
      - the VoID metadata file: void.ttl
  - Subfolder external containing:
    - a subfolder for each linking task between your dataset and a foreign dataset, named after the linked datasets, and in it:
      - the Silk-LSL script or an export of the linking task from Silk Workbench
      - the resulting linkset in N-Triples: linkset.nt
      - the VoID metadata file: void.ttl
Silk tips
- Watch the Activities tab and the console output of Silk - there may be useful error messages present
Part 4 - SPARQL querying
What
- Load your datasets, linksets and metadata into a triplestore of your choice
- Create non-trivial SPARQL queries demonstrating the added value of triplification and linking of your datasets, i.e. querying the data using the linksets
- The queries should contain at least 2x SELECT, 2x CONSTRUCT, 1x ASK and 1x DESCRIBE.
- The queries should contain (together, not each separately) the following constructs: FILTER, OPTIONAL, VALUES, BIND, IF.
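A sketch of one such query, assuming a hypothetical livesIn property and an owl:sameAs linkset produced in Part 3; it demonstrates OPTIONAL, FILTER, BIND and IF in one query:

```sparql
PREFIX owl:  <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# Customers with labels fetched via the linked foreign resource.
SELECT ?customer ?label ?status
WHERE {
  ?customer <https://example.org/vocabulary#livesIn> ?country .
  ?country  owl:sameAs ?foreignCountry .            # triple from the linkset
  OPTIONAL {
    ?foreignCountry rdfs:label ?label .
    FILTER (lang(?label) = "en")
  }
  BIND (IF(bound(?label), "linked", "unlinked") AS ?status)
}
LIMIT 100
```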
How
- Subfolder queries containing the individual SPARQL queries as *.sparql, UTF-8 encoded text files
Part 5 - simple web application, semantic annotations, validation
What
- Create a simple web application in Java, meaningfully using existing libraries for working with RDF (Apache Jena or Eclipse RDF4J)
- The application should issue at least 3 non-trivial SPARQL queries (queries from Part 4 are recommended). At least one of the queries needs to be a SPARQL CONSTRUCT, and the resulting page must not be a simple triple list.
- The web application does not need to use CSS, but it has to consist of a few interlinked web pages - typically, it is a list of entities and a page showing a detail of an entity and some statistics
- TIP: A solution which only sends an HTTP request in a correct format like this: https://linked.opendata.cz/sparql/?default-graph-uri=&query=select+*+where+%7B%3Fs+%3Fp+%3Fo%7D+LIMIT+100&format=text%2Fhtml and prints the result is insufficient. A reasonable usage of the RDF data model implemented by the libraries has to be shown.
- Use RDFa, Microformats or Microdata to semantically represent the shown data in HTML
- Write at least 10 SHACL shapes validating some of your datasets, including descriptions of what they do in comments. Each shape needs to have a message. The shapes must be working, i.e. when you break your dataset, the SHACL validator will report it. They need to include:
- a node shape and a property shape
- a node shape using a property shape
- value type constraint
- cardinality constraint
- value range constraint
- string-based constraint
- property pair constraint
- logical constraint
- shape-based constraint
- various shape severities
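A sketch covering several of the required constraint kinds at once (node shape using a property shape, value type, cardinality, string-based constraint, message, severity); the shape IRIs are hypothetical:

```turtle
@prefix sh:   <http://www.w3.org/ns/shacl#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

# Node shape: validates every foaf:Person via the property shape below.
<https://example.org/shapes#PersonShape> a sh:NodeShape ;
    sh:targetClass foaf:Person ;
    sh:property <https://example.org/shapes#NameShape> .

# Property shape: every person has exactly one non-empty string name.
<https://example.org/shapes#NameShape> a sh:PropertyShape ;
    sh:path foaf:name ;
    sh:datatype xsd:string ;   # value type constraint
    sh:minCount 1 ;            # cardinality constraint
    sh:maxCount 1 ;
    sh:minLength 1 ;           # string-based constraint
    sh:severity sh:Violation ;
    sh:message "Every person must have exactly one non-empty name."@en .
```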
How
- Folder app containing the source code of the project and the compiled WAR file
- Folder shacl containing, for each dataset for which you have SHACL shapes:
  - <turtle-file-name>.shacl - named after the corresponding Turtle file in triplification/output
Part 6 - presentation
What
- Prepare a presentation of your semestral project
- It should be 3-4 minutes long
- Show the data sources used, which tools were used for triplification, and which new queries you are now able to answer thanks to the interlinking of the chosen datasets.
- Show screenshots of your application
How
- Folder presentation containing a simple HTML page with the description, which will be used for the presentation