NSWI144 - Data on the Web - Semestral assignment specification
Briefly
- Identify source data
- Download/Extract/Scrape
- Triplify (convert to RDF according to LD principles - *****)
- Identify vocabularies to be reused, use them, maybe create your own
- Link (internally, externally)
- Store
- Query & Use – in a demo application
- Validate the data
In detail
Part 1 - identification and analysis of data sources
What
- At least 3 data sources regarding a related topic - they have to be linkable and sufficiently different
- e.g. 3 tables from 1 database = 1 data source
- e.g. 3 sheets in 1 excel file = 1 data source
- e.g. 3 HTML pages from one website = 1 data source
- 1 HTML page, 1 CSV file and 1 XML file, from different websites = 3 data sources
- The sources should not already be triplified (transformed to RDF); if an RDF version exists, at least do not look at how it was done
- UML class diagram-style diagram
How
- Folder source containing:
  - TXT file description.txt (UTF-8 encoded) with a description of the selected data - for each data source: format (XML/HTML/CSV/...) and a link to the data source
  - SVG file diagram.svg (UML class diagram style) showing concepts and relations in the chosen datasets. You can use any decent UML diagramming tool, e.g. diagrams.net, the commercial Lucidchart, etc.
Common mistakes
- Attributes should be primitive values (strings, dates, booleans) of the real-world entity represented by the class. When they are more complex, they typically represent another real-world entity and its primitive attributes.
- A common mistake was including an attribute of another real-world entity, which should be solved by introducing a new class and a relationship to it. Typically this is something which exists independently of the current entity and has a label, for instance the "country" attribute of a "customer". A country exists independently of the customer. The country name is an attribute of the "country" real-world entity, not of the "customer". "Customer lives in a country" is a relationship.
- The diagrams are supposed to show real-world entities, their relations and attributes. They are not supposed to show the columns of a CSV table - that is an artefact of the technical representation, not of the real world.
- The diagrams should show which attributes or which entities will be used for linking later.
Part 2 - transformation of source data into RDF
What
- Manually created RDF data samples of your datasets
  - showing how your datasets will look in RDF Turtle
  - at least two data instances from each class (entity type) in each dataset
  - define and adhere to IRI patterns
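A minimal sketch of such a hand-written sample, assuming hypothetical example.org IRI patterns and a made-up livesIn property (your own datasets, vocabularies and patterns will differ):

```turtle
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# IRI pattern for customers: https://example.org/resource/customer/{id}
<https://example.org/resource/customer/42> a foaf:Person ;
    foaf:name "Jan Novák" ;
    <https://example.org/vocabulary#livesIn> <https://example.org/resource/country/CZ> .

<https://example.org/resource/customer/43> a foaf:Person ;
    foaf:name "Anna Svobodová" .

# IRI pattern for countries: https://example.org/resource/country/{iso-code}
<https://example.org/resource/country/CZ> a <https://example.org/vocabulary#Country> ;
    rdfs:label "Czechia"@en .
```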
- Triplification script (TARQL, XSLT, ...) which can be run on the downloaded data - 1 for each data source
- Resulting RDF data has the correct format
  - it represents the source data
  - it uses correctly defined IRI patterns
  - it uses correct data types and language tags
  - entities from the source data are correctly separated and interlinked
- Transformation scripts use a language appropriate for the source data: XSLT (XML), XQuery (XML), R2RML (RDB), Tarql (CSV, Excel), Java+Jsoup (HTML), JSON-LD context for JSON, etc. Avoid coding transformations in a general-purpose programming language if the transformation can be done using a standard language.
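As an illustration of the Tarql approach, here is a minimal sketch assuming a hypothetical CSV with columns id, name and country (Tarql binds each CSV column header to a SPARQL variable of the same name):

```sparql
# Run with: tarql customers.sparql customers.csv
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

CONSTRUCT {
  ?customer a foaf:Person ;
            foaf:name ?name ;
            <https://example.org/vocabulary#livesIn> ?countryIri .
}
WHERE {
  # build IRIs from the CSV values according to the chosen IRI patterns
  BIND (IRI(CONCAT("https://example.org/resource/customer/", ?id)) AS ?customer)
  BIND (IRI(CONCAT("https://example.org/resource/country/", ?country)) AS ?countryIri)
}
```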
- In the first phase, correct vocabularies do not have to be used.
- For the second phase:
  - feedback from the first phase is incorporated
  - the data adheres to the "Link not Label" pattern
  - appropriate vocabularies have to be used (SKOS for codelists, RDF Data Cube for statistical data, etc.)
  - the resulting datasets and linksets are properly described with metadata according to the DCAT-AP 3.0 standard, using the required and recommended properties and codelists
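The "Link not Label" pattern means referring to an independently existing entity by IRI rather than by a plain string. A sketch with hypothetical example.org IRIs:

```turtle
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# "Label" anti-pattern - the country is only a string:
#   <https://example.org/resource/customer/42> ex:country "Czechia" .

# "Link not Label" - the country is a linked resource; the string is its label:
<https://example.org/resource/customer/42>
    <https://example.org/vocabulary#livesIn> <https://example.org/resource/country/CZ> .

<https://example.org/resource/country/CZ> rdfs:label "Czechia"@en .
```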
How
- Folder source containing:
  - TXT file description.txt, additionally containing the tools used for triplification of the datasets
  - Subfolder download containing the downloaded original data (possibly shortened)
- Folder triplification containing subfolders for each dataset. Each dataset subfolder contains:
  - Subfolder samples containing the manually created data samples
  - Subfolder triplifier containing the triplification scripts
  - Subfolder output containing the resulting RDF data (TIP: file extensions: Turtle: .ttl, N-Triples: .nt, RDF/XML: .rdf)
  - For phase 2, subfolder catalog containing a catalog with records for each output dataset according to DCAT-AP 3.0 and VoID, in Turtle syntax
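A minimal sketch of such a catalog record, with hypothetical example.org IRIs; check the DCAT-AP 3.0 specification itself for the full list of required and recommended properties and codelists:

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .

<https://example.org/catalog> a dcat:Catalog ;
    dct:title "Semestral project catalog"@en ;
    dct:description "Catalog of the triplified datasets."@en ;
    dcat:dataset <https://example.org/dataset/customers> .

<https://example.org/dataset/customers> a dcat:Dataset ;
    dct:title "Customers"@en ;
    dct:description "Customers triplified from the source CSV."@en ;
    dcat:distribution [ a dcat:Distribution ;
        dcat:accessURL <https://example.org/files/customers.ttl> ;
        dct:format <http://publications.europa.eu/resource/authority/file-type/RDF_TURTLE>
    ] .
```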
Common mistakes
- Using classes in place of predicates
  - Classes cannot, in general, be used as RDF predicates. This means that in the covered UML class diagram, you should not use classes as association names. Typically, this is an error which can be easily spotted: classes typically start with an upper-case letter after the prefix, e.g. foaf:Person.
- Using predicates in place of classes
  - Predicates cannot, in general, be used as classes. This means that in the covered UML class diagram, you should not use predicates as UML class names. Typically, this is an error which can be easily spotted: predicates typically start with a lower-case letter after the prefix, e.g. foaf:name.
- Missing language tags or data types in RDF
  - Texts in literals which are in a natural language need to have this language identified using a language tag, e.g. @en or @cs. If a literal does not have a language tag, it needs to have a data type, typically from the xsd: namespace, e.g. xsd:dateTime - see XML Schema 1.1 Data types.
- Using non-standard XSD types for numbers
  - RDF has expected XSD datatypes for numbers: xsd:integer for integers, xsd:double for floating point numbers and xsd:decimal for decimal numbers. Typical mistakes include xsd:float and xsd:int.
- Mistaking prefixed IRIs for absolute IRIs with weird schemes
  - In RDF Turtle, my:predicate is a prefixed IRI; <my:predicate> is, however, an absolute IRI with the my: scheme instead of http: or https:. This is typically a mistake.
- Using schema.org datatypes
  - Using schema.org datatypes such as schema:Boolean is not recommended. It means that in RDF, all literals are strings, and they are then interpreted as booleans only by a schema.org-aware parser, which is often not desirable. RDF has built-in support for xsd:-based datatypes, which should be used primarily.
- Telephone numbers and e-mails
  - Telephone numbers in RDF should be represented as resources identified by IRIs, e.g. <tel:+420123456789>, as should e-mail addresses, e.g. <mailto:[email protected]>.
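The points above can be combined into one Turtle fragment; the subject and property IRIs below are hypothetical placeholders:

```turtle
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<https://example.org/resource/customer/42>
    <https://example.org/vocabulary#note>       "Preferred customer"@en ;
    <https://example.org/vocabulary#registered> "2024-01-31T10:00:00"^^xsd:dateTime ;
    <https://example.org/vocabulary#orders>     12 ;      # abbreviates xsd:integer
    <https://example.org/vocabulary#discount>   0.15 ;    # abbreviates xsd:decimal
    <https://example.org/vocabulary#phone>      <tel:+420123456789> ;
    <https://example.org/vocabulary#email>      <mailto:info@example.org> .
```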
Part 3 - Link Discovery (Silk)
What
- Use Silk to discover links among your datasets - internal linking
- Use Silk to discover links between your datasets and foreign datasets already published on the Web - external linking. You can find those, for instance, in the LOD cloud, in the old DataHub, in linked.opendata.cz, in DBpedia, in the National Open Data Catalog, on the EU Vocabularies web, etc.
- The resulting links will be split into linksets, properly described by VoID metadata
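A minimal sketch of VoID metadata for a linkset, with hypothetical dataset IRIs and an illustrative triple count:

```turtle
@prefix void: <http://rdfs.org/ns/void#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .

<https://example.org/linkset/customers-dbpedia> a void:Linkset ;
    dct:title "Customers linked to DBpedia"@en ;
    void:subjectsTarget <https://example.org/dataset/customers> ;
    void:objectsTarget  <https://example.org/dataset/dbpedia> ;
    void:linkPredicate  owl:sameAs ;
    void:triples        128 .
```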
How
- Folder linking containing:
  - Subfolder internal containing:
    - a subfolder for each linking task between your datasets, named after the linked datasets, and in it:
      - the Silk-LSL script or an export of the linking task from Silk Workbench
      - the resulting linkset in N-Triples: linkset.nt
      - the VoID metadata file: void.ttl
  - Subfolder external containing:
    - a subfolder for each linking task between your dataset and a foreign dataset, named after the linked datasets, and in it:
      - the Silk-LSL script or an export of the linking task from Silk Workbench
      - the resulting linkset in N-Triples: linkset.nt
      - the VoID metadata file: void.ttl
Silk tips
- Watch the Activities tab and the console output of Silk - there may be useful error messages present
Part 4 - SPARQL querying
What
- Load your datasets, linksets and metadata into a triplestore of your choice
- Create non-trivial SPARQL queries demonstrating the added value of triplification and linking of your datasets, i.e. querying the data using the linksets
- The queries should contain at least 2x SELECT, 2x CONSTRUCT, 1x ASK and 1x DESCRIBE.
- The queries should contain (together, not each separately) the following constructs: FILTER, OPTIONAL, VALUES, BIND, IF.
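A sketch of one such query, assuming a hypothetical livesIn property and an owl:sameAs linkset produced in Part 3; it demonstrates OPTIONAL, FILTER, BIND and IF in one query:

```sparql
PREFIX owl:  <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# Customers with labels fetched via the linked foreign resource.
SELECT ?customer ?label ?status
WHERE {
  ?customer <https://example.org/vocabulary#livesIn> ?country .
  ?country  owl:sameAs ?foreignCountry .            # triple from the linkset
  OPTIONAL {
    ?foreignCountry rdfs:label ?label .
    FILTER (lang(?label) = "en")
  }
  BIND (IF(bound(?label), "linked", "unlinked") AS ?status)
}
LIMIT 100
```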
How
- Subfolder queries containing the individual SPARQL queries as *.sparql, UTF-8 encoded text files
Part 5 - simple web application, semantic annotations, validation
What
- Create a simple web application in Java, meaningfully using existing libraries for working with RDF (Apache Jena or Eclipse RDF4J)
- The application should issue at least 3 non-trivial SPARQL queries (queries from Part 4 are recommended). At least one of the queries needs to be a SPARQL CONSTRUCT, and the resulting page must not be a simple triple list.
- The web application does not need to use CSS, but it has to consist of a few interlinked web pages - typically, it is a list of entities and a page showing a detail of an entity and some statistics
- TIP: A solution which only sends an HTTP request in a correct format like this: https://linked.opendata.cz/sparql/?default-graph-uri=&query=select+*+where+%7B%3Fs+%3Fp+%3Fo%7D+LIMIT+100&format=text%2Fhtml and prints the result is insufficient. A reasonable usage of the RDF data model implemented by the libraries has to be shown.
- Use RDFa, Microformats or Microdata to semantically represent the shown data in HTML
- Write at least 10 SHACL shapes validating some of your datasets, including descriptions of what they do in comments. Each shape needs to have a message. The shapes must be working, i.e. when you break your dataset, the SHACL validator will report it. They need to include:
- a node shape and a property shape
- a node shape using a property shape
- value type constraint
- cardinality constraint
- value range constraint
- string-based constraint
- property pair constraint
- logical constraint
- shape-based constraint
- various shape severities
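A sketch covering several of the required constraint kinds at once (node shape using a property shape, value type, cardinality, string-based constraint, message, severity); the shape IRIs are hypothetical:

```turtle
@prefix sh:   <http://www.w3.org/ns/shacl#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

# Node shape: validates every foaf:Person via the property shape below.
<https://example.org/shapes#PersonShape> a sh:NodeShape ;
    sh:targetClass foaf:Person ;
    sh:property <https://example.org/shapes#NameShape> .

# Property shape: every person has exactly one non-empty string name.
<https://example.org/shapes#NameShape> a sh:PropertyShape ;
    sh:path foaf:name ;
    sh:datatype xsd:string ;   # value type constraint
    sh:minCount 1 ;            # cardinality constraint
    sh:maxCount 1 ;
    sh:minLength 1 ;           # string-based constraint
    sh:severity sh:Violation ;
    sh:message "Every person must have exactly one non-empty name."@en .
```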
How
- Folder app containing the source code of the project and the compiled WAR file
- Folder shacl containing, for each dataset for which you have SHACL shapes:
  - <turtle-file-name>.shacl - named after the corresponding Turtle file in triplification/output
Part 6 - presentation
What
- Prepare a presentation of your semestral project
- It should be 3-4 minutes long
- Show the data sources used, which tools were used for triplification, and which new queries you are now able to answer thanks to the interlinking of the chosen datasets.
- Show screenshots of your application
How
- Folder presentation containing a simple HTML page with the description, which will be used for the presentation