One of the most important issues to arise over the summer, as the students identified URLs relevant to Engineering education, was that of URL ‘conformity’. Our proposed system relies on comparing a list of URLs returned from Google Custom Search with those stored in Learning Registry documents.
Clearly this is straightforward enough for simple URLs such as:
The situation becomes far more complicated, however, when we consider URLs containing query strings, since exact matching becomes much more hit-and-miss. Let’s use YouTube as an example. The simplest form of a YouTube URL is:
v=vHkwq_2yY9E denotes the individual movie to be played. However, it is also common to find additional query string parameters in the URL, such as this one, which specifies that playback should be in HD:
or this one, which specifies that the video should start playing from 1 minute 30 seconds:
And yet another potential problem arises with short links, where
www.youtube.com is shortened to
It is clear, therefore, that exact URL matching alone might not be sufficient.
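One way to sidestep these variations is to compare the video identifier instead of the whole URL. The following is only a minimal sketch of that idea, not the matching we have settled on: the function names are our own, the extra parameters in the usage note (`hd=1`, `t=90s`) are illustrative stand-ins, and treating `youtu.be` as the short-link host is an assumption.

```python
from urllib.parse import urlparse, parse_qs

# Hosts treated as YouTube. The short-link host is an assumption.
LONG_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com"}
SHORT_HOSTS = {"youtu.be"}


def youtube_video_id(url):
    """Return the video ID from a YouTube URL, or None if none is found."""
    # Tolerate URLs written without a scheme, e.g. "www.youtube.com/watch?v=..."
    parts = urlparse(url if "://" in url else "http://" + url)
    host = (parts.hostname or "").lower()
    if host in LONG_HOSTS and parts.path == "/watch":
        # Any other query string parameters are simply ignored.
        ids = parse_qs(parts.query).get("v")
        return ids[0] if ids else None
    if host in SHORT_HOSTS:
        # Short links carry the ID in the path rather than in the query string.
        return parts.path.lstrip("/") or None
    return None


def same_video(url_a, url_b):
    """True when both URLs point at the same YouTube video ID."""
    id_a = youtube_video_id(url_a)
    return id_a is not None and id_a == youtube_video_id(url_b)
```

With this, `same_video("http://www.youtube.com/watch?v=vHkwq_2yY9E&hd=1", "http://youtu.be/vHkwq_2yY9E?t=90s")` treats the two forms as the same resource. The obvious cost is that every site needs its own rule, so this only helps for the handful of hosts that dominate the collected URLs.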
URL encoding and non-Latin web addresses
The situation is complicated even further by ‘urlencoding’, whereby certain characters such as space, plus (+) and & may appear in either encoded or unencoded form. The same applies to URLs containing non-Latin characters; UTF-8 web addresses were supported from May 2010. For example, one student identified the following encoded URL:
which decodes to form:
Before we proceed to publish documents into the Learning Registry, we want to be confident that the URL matching is as accurate and efficient as it can be. We are therefore taking some time to go through the many thousands of URLs collected by the summer students and decide on the optimum format for the URLs.
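As a rough illustration of the encoding problem, the sketch below compares two URLs after percent-decoding both, using Python's standard `urllib.parse.unquote`. The helper names and the `example.org` address are hypothetical; this is not the student's URL.

```python
from urllib.parse import unquote


def decoded_form(url):
    """Percent-decode a URL, interpreting the escaped bytes as UTF-8."""
    return unquote(url)


def same_after_decoding(url_a, url_b):
    """True when two URLs are identical once both are percent-decoded."""
    return decoded_form(url_a) == decoded_form(url_b)


# Hypothetical example: an encoded non-Latin path segment.
encoded = "http://example.org/wiki/%D0%9C%D0%BE%D1%81%D1%82"
print(decoded_form(encoded))
```

Two caveats apply. `unquote` leaves a literal `+` untouched (`unquote_plus` would turn it into a space), and decoding an escaped `&` inside a query string changes how that query string splits into parameters, so blanket decoding is a starting point for comparison and not a safe storage format.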
A useful meeting today with Nick Syrotiuk from the JLeRN team at MIMAS and also with our project partner Phil Barker from CETIS. The morning was spent reviewing the data collected by our summer students and discussing how to optimise the paradata statements that we will publish into our Learning Registry (LR) node.
In the afternoon, we were joined by John Gilbertson from our Computing Services Department, who has been responsible for installing and setting up the LR node here at Liverpool. Our attention turned towards exactly how we (and potentially others) will want to access these data from the LR. Although our primary means of accessing data will be through the harvest service via URLs, we will also potentially want to find all resources relating to a particular module at Liverpool. In other words, we want to ask the following two questions:
- “send us all the data you have relating to the resource at URL X”
- “send us all the data you have where resource_data->related->object->description = "ENGG109"”
The solution to the first is well-established. The second will require using the new extract service, and Nick kindly offered to help us work out how to use this.
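To make the two questions concrete, here is a minimal sketch that answers them on the client side, over documents that have already been harvested and parsed from JSON into Python dictionaries. It does not call the harvest or extract services. The helper names are our own, the nested path is copied from the second question above, and the use of a top-level `resource_locator` field to hold the resource URL is an assumption about the document layout.

```python
# Path taken from the second question: resource_data->related->object->description
MODULE_PATH = ("resource_data", "related", "object", "description")


def lookup(document, path):
    """Follow a sequence of keys into nested dicts; None if any key is missing."""
    node = document
    for key in path:
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def documents_for_url(documents, url):
    """Question 1: every document about the resource at the given URL.

    Assumes the URL is stored in a top-level "resource_locator" field.
    """
    return [d for d in documents if d.get("resource_locator") == url]


def documents_for_module(documents, module_code):
    """Question 2: every document whose related object names the module."""
    return [d for d in documents if lookup(d, MODULE_PATH) == module_code]
```

For example, `documents_for_module(docs, "ENGG109")` returns only the documents tagged with that module. Filtering like this means pulling every document across the network first, which is precisely the cost a server-side query through the extract service would avoid.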
Finally, we discussed how we might join the LR nodes at Liverpool and MIMAS using the LR distribute service. We agreed that, by way of a test, we would aim to set our two nodes up so that everything from the Liverpool node is distributed to MIMAS, while only Engineering-related documents travel in the other direction:
- Liverpool → MIMAS: all
- MIMAS → Liverpool: only documents relating to Engineering
Once we have successfully demonstrated this functionality, we might consider setting up the Liverpool node so that it takes every Engineering-related document from the National Science Digital Library on the main US Learning Registry node.
We have continued to build on last week’s work and have created the required PHP templates to publish the students’ data into the Learning Registry. We have constructed contextualised usage paradata statements for different types of actions and have manually published a couple of test documents to our University LR node. We have a meeting on Wednesday with external partners to optimise the templates, after which we should hopefully be in a position to bulk publish 25,000+ records into our LR node.
We’ve been busy working through the Learning Registry “Modeling Paradata and Assertions as Activities” and “Paradata Cookbook” documents, preparing statement templates from which we will publish the work that the summer students did in finding and indexing visual media relevant to their Engineering degree courses here at Liverpool.