NSWI144 - Data on the Web - Semestral assignment specification

Briefly

  1. Identify source data
  2. Download/Extract/Scrape
  3. Triplify (convert to RDF according to the Linked Data principles - ***** i.e. 5-star data)
  4. Identify vocabularies to be reused, use them, maybe create your own
  5. Link (internally, externally)
  6. Store
  7. Query & Use – in a demo application

In detail

Part 1 - identification and analysis of data sources

What

  1. At least 3 data sources on a common topic - they have to be linkable and sufficiently different
  2. e.g. 3 tables from 1 database = 1 data source
  3. e.g. 3 sheets in 1 Excel file = 1 data source
  4. e.g. 3 HTML pages from one website = 1 data source
  5. e.g. 1 HTML page, 1 CSV file and 1 XML file, each from a different website = 3 data sources
  6. They should not be triplified (transformed to RDF) yet
  7. UML-style class diagram with an overview of entities and their properties, in which datasets they are located, and how they can be linked
  8. You can search the National Open Data Catalog, the British open data portal, or simply Google.

How

  1. Folder source containing:
    1. TXT file description.txt (UTF-8 encoded) with a description of the selected data - for each data source: its format (XML/HTML/CSV/...) and a link to the data source
    2. PNG file diagram.png (UML class diagram style) showing concepts and relations in the chosen datasets

Part 2 - transformation of source data into RDF

What

  1. Triplification script (TARQL, XSLT, ...) which can be run on the downloaded data - 1 for each data source
  2. Resulting RDF data has correct format
    • it represents the source data
    • it uses correctly defined IRI patterns
    • it uses correct data types and language tags
    • entities from source data are correctly separated and interlinked
    • adheres to the "Link not Label" pattern
  3. Transformation scripts must use a language appropriate for the source data: XSLT (XML), XQuery (XML), R2RML (RDB), Tarql (CSV, Excel), Java+Jsoup (HTML), JSON-LD context for JSON
  4. In the first phase, correct vocabularies do not have to be used.
  5. For the second phase
    • Feedback from the first phase incorporated
    • Appropriate vocabularies have to be used (SKOS for codelists, RDF Data Cube for statistical data, etc.)
    • The resulting datasets and linksets are properly described with metadata according to the DCAT-AP v2.0.1 standard, using required and recommended properties and codelists.
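
A minimal Tarql mapping might look like the sketch below. All names are hypothetical: it assumes a CSV with columns id, name and region_code, and an example.org vocabulary; adapt prefixes and column names to your own data. In Tarql, CSV columns are exposed as SPARQL variables, so a CONSTRUCT query doubles as the triplification script:

```sparql
# Hypothetical Tarql mapping - run as: tarql cities.sparql cities.csv > cities.ttl
PREFIX ex:  <https://example.org/ontology#>
PREFIX exr: <https://example.org/resource/city/>

CONSTRUCT {
  ?city a ex:City ;
        ex:name ?label ;
        ex:inRegion ?region .   # "Link not Label": an IRI, not a plain string
}
WHERE {
  # ?id, ?name, ?region_code are bound from the CSV columns of the same name
  BIND (IRI(CONCAT(STR(exr:), ?id)) AS ?city)                                  # IRI pattern
  BIND (STRLANG(?name, "cs") AS ?label)                                        # language tag
  BIND (IRI(CONCAT("https://example.org/resource/region/", ?region_code)) AS ?region)
}
```

Note how the region is emitted as a linkable IRI rather than a label, which is exactly what the "Link not Label" requirement above asks for.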

How

  1. Folder source containing:
    • TXT file description.txt additionally containing tools used for triplification of the datasets
  2. Folder download containing subfolders for each dataset. Each dataset subfolder contains:
    • Subfolder input containing the downloaded original data
    • Subfolder triplifier containing the triplification scripts
    • Subfolder output containing the resulting RDF data (TIP: File extensions: Turtle: .ttl, N-Triples: .nt, RDF/XML: .rdf)
  3. For phase 2, subfolder catalog containing a catalog with records for each output dataset according to DCAT-AP v2.0.1 and VoID in Turtle syntax
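
As an illustration, a catalog record could be shaped roughly like this Turtle sketch (all IRIs and titles are placeholders, not prescribed values; check the DCAT-AP v2.0.1 specification for the full list of required and recommended properties):

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .

<https://example.org/catalog> a dcat:Catalog ;
    dct:title "My project catalog"@en ;
    dct:description "Catalog of datasets produced in the semestral project."@en ;
    dct:publisher <https://example.org/me> ;
    dcat:dataset <https://example.org/dataset/cities> .

<https://example.org/dataset/cities> a dcat:Dataset ;
    dct:title "Cities"@en ;
    dct:description "Cities extracted and triplified from source X."@en ;
    dcat:distribution <https://example.org/dataset/cities/ttl> .

<https://example.org/dataset/cities/ttl> a dcat:Distribution ;
    dcat:accessURL <https://example.org/data/cities.ttl> ;
    # dct:format values come from the EU Vocabularies file-type codelist
    dct:format <http://publications.europa.eu/resource/authority/file-type/RDF_TURTLE> .
```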

Part 3 - Link Discovery (Silk)

What

  1. Use Silk to discover links among your datasets
  2. Use Silk to discover links between your datasets and foreign datasets already published on the Web. You can find those for instance in the LOD cloud, in the old DataHub, in linked.opendata.cz, in DBpedia, in the National Open Data Catalog, on the EU Vocabularies web, etc.
  3. The resulting links will be split into linksets, properly described by VoID metadata
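
A VoID description of one such linkset might be sketched as follows (the dataset IRIs and the triple count are illustrative placeholders; the foreign target should point at the VoID/dataset description of the dataset you linked to):

```turtle
@prefix void: <http://rdfs.org/ns/void#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .

<https://example.org/linkset/cities-dbpedia> a void:Linkset ;
    void:subjectsTarget <https://example.org/dataset/cities> ;    # your dataset
    void:objectsTarget  <https://example.org/dataset/dbpedia> ;   # foreign dataset (placeholder IRI)
    void:linkPredicate  owl:sameAs ;
    void:triples        1234 .   # illustrative count
```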

How

  1. Folder linking containing:
    1. Subfolder internal containing:
      1. Subfolder for each linking task between your datasets, named after the linked datasets, and in it:
        1. SILK-LSL script or export of the linking task from Silk Workbench
        2. The resulting linkset in N-Triples
        3. The VoID metadata file
    2. Subfolder external containing:
      1. Subfolder for each linking task between your dataset and a foreign dataset, named after the linked datasets, and in it:
        1. SILK-LSL script or export of the linking task from Silk Workbench
        2. The resulting linkset in N-Triples
        3. The VoID metadata file

Part 4 - SPARQL querying

What

  1. Load your datasets, linksets and metadata into a triplestore of your choice
  2. Create non-trivial SPARQL queries demonstrating the added value of triplification and linking of your datasets
  3. The queries should contain at least 2x SELECT, 2x CONSTRUCT, 1x ASK and 1x DESCRIBE.
  4. The queries should contain (together, not each separately) the following constructs: FILTER, OPTIONAL, VALUES, BIND, IF.
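
For example, a single SELECT over a hypothetical example.org vocabulary can cover several of the required constructs at once (OPTIONAL, FILTER, BIND and IF); the property names below are assumptions, not part of the assignment:

```sparql
PREFIX ex:   <https://example.org/ontology#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?city ?name ?population ?size
WHERE {
  ?city a ex:City ;
        rdfs:label ?name .
  OPTIONAL { ?city ex:population ?population }
  FILTER (LANG(?name) = "en")
  BIND (IF(COALESCE(?population, 0) > 100000, "large", "small") AS ?size)
}
ORDER BY DESC(?population)
```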

How

  1. Subfolder queries containing individual SPARQL queries, as *.sparql, UTF-8 encoded text files

Part 5 - simple web application working with an RDF library and creating semantic annotations

What

  1. Create a simple web application in Java, meaningfully using existing libraries for working with RDF (Apache Jena or Eclipse RDF4J)
  2. The application should issue at least 3 non-trivial SPARQL queries (queries from Part 4 are recommended). At least one of the queries needs to be a SPARQL CONSTRUCT, and the resulting page must not be a simple triple list.
  3. The web application does not need to use CSS, but it has to consist of a few interlinked web pages - typically a list of entities, a page showing the detail of an entity, and some statistics
  4. TIP: A solution which only sends an HTTP request in a correct format like this: https://linked.opendata.cz/sparql/?default-graph-uri=&query=select+*+where+%7B%3Fs+%3Fp+%3Fo%7D+LIMIT+100&format=text%2Fhtml and prints the result is insufficient. A reasonable usage of the RDF data model implemented by the libraries has to be shown.
  5. Use RDFa, Microformats or Microdata to semantically represent the shown data in HTML
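
As one possible shape for the last point, an entity-detail page could carry RDFa like the sketch below (the schema.org terms and the Prague data are illustrative assumptions, not required vocabulary):

```html
<!-- Illustrative RDFa markup for an entity detail page -->
<div vocab="https://schema.org/" typeof="City"
     resource="https://example.org/resource/city/1">
  <h1 property="name">Prague</h1>
  <p>Population:
    <span property="population" datatype="xsd:integer">1300000</span></p>
  <a property="sameAs" href="http://dbpedia.org/resource/Prague">DBpedia</a>
</div>
```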

How

  1. Folder app containing the source code of the project and the compiled WAR file

Part 6 - presentation

What

  1. Prepare a presentation of your semestral project
  2. It should be 3 minutes long
  3. Show the data sources used, the tools used for triplification, and which new queries you are now able to answer thanks to the interlinking of the chosen datasets.
  4. Show screenshots of your application

How

  1. Folder presentation containing a simple HTML page with the description, which will be used for the presentation