URL Query Strings and Encoding

One of the most important issues to arise over the summer, as the students identified URLs relevant to Engineering education, was that of URL ‘conformity’. Our proposed system relies on comparing a list of URLs returned from Google Custom Search with those stored in Learning Registry documents.

Clearly this is straightforward enough for simple URLs such as:

  • www.domain.com/folder/image.png

Query Strings

The situation becomes far more complicated, however, when we consider URLs containing query strings, because finding an exact match becomes much more hit-and-miss. Let’s use YouTube as an example. The simplest form of a YouTube URL is:

  • www.youtube.com/watch?v=vHkwq_2yY9E

where v=vHkwq_2yY9E denotes the individual movie to be played. However, it is also common to find additional query string parameters in the URL, such as this one, which specifies that playback should be in HD:

  • http://www.youtube.com/watch?v=vHkwq_2yY9E&hd=1

or this one, which specifies that the video should start playing from 1 minute 30 seconds:

  • http://www.youtube.com/watch?v=vHkwq_2yY9E&t=1m30s

Short Links

Yet another potential problem arises with short links, where www.youtube.com is shortened to youtu.be and the video ID moves from the query string into the path:

  • http://youtu.be/vHkwq_2yY9E

It is clear, therefore, that exact URL matching alone might not be sufficient: the YouTube URLs above all point to the same video, yet no two of them are identical strings.
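One way to make these variants comparable is to reduce each URL to the video ID before matching. The following is only a minimal sketch in Python, not the approach the project has settled on. The function name youtube_video_id is our own invention for illustration, and it assumes that only the www.youtube.com/watch and youtu.be forms shown above need to be recognised.

```python
from urllib.parse import urlsplit, parse_qs


def youtube_video_id(url):
    """Return the video ID for a recognised YouTube URL, otherwise None.

    Hypothetical helper for illustration; it only knows the two URL
    shapes discussed in this post.
    """
    # Some collected URLs have no scheme; add one so the host is parsed.
    if "://" not in url:
        url = "http://" + url
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    if host == "youtu.be":
        # Short link: the ID is the first path segment.
        return parts.path.strip("/").split("/")[0] or None
    if host == "youtube.com" and parts.path == "/watch":
        # Full link: the ID is the v parameter; other parameters are ignored.
        values = parse_qs(parts.query).get("v")
        return values[0] if values else None
    return None


urls = [
    "www.youtube.com/watch?v=vHkwq_2yY9E",
    "http://www.youtube.com/watch?v=vHkwq_2yY9E&t=1m30s",
    "http://youtu.be/vHkwq_2yY9E",
]
# All three variants collapse to a single ID.
ids = {youtube_video_id(u) for u in urls}
print(ids)
```

Matching on the extracted ID deliberately throws away playback options such as the start time, which seems reasonable when the question is only whether two records describe the same video.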

URL encoding and non-Latin web addresses

The situation is complicated even further by ‘urlencoding’, whereby certain characters such as the space, + and & may appear in either encoded or unencoded form. The same applies to URLs containing non-Latin characters; UTF-8 web addresses have been supported since May 2010. For example, one student identified the following encoded URL:

  • http%3A%2F%2Fjdxy.suda.edu.cn%2Fszll%2Fswzz%2F%E9%99%88%E7%91%B6%E7%A7%91%E7%A0%94%E6%83%85%E5%86%B5%E7%AE%80%E8%A1%A8.swf

which decodes to:

  • http://jdxy.suda.edu.cn/szll/swzz/陈瑶科研情况简表.swf
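The decoding step can be reproduced with Python's standard library. This is a small sketch rather than a recommendation: comparison_key is a name we have made up, and decoding a whole URL is only suitable for comparing strings, since it can also turn an encoded & inside a query value into a literal one.

```python
from urllib.parse import unquote

encoded = (
    "http%3A%2F%2Fjdxy.suda.edu.cn%2Fszll%2Fswzz%2F"
    "%E9%99%88%E7%91%B6%E7%A7%91%E7%A0%94%E6%83%85"
    "%E5%86%B5%E7%AE%80%E8%A1%A8.swf"
)


def comparison_key(url):
    """Percent-decode a URL (as UTF-8) so that encoded and unencoded
    copies of the same address compare equal. Hypothetical helper."""
    return unquote(url)


decoded = comparison_key(encoded)
print(decoded)

# An already-decoded URL passes through unchanged, so both forms
# end up with the same key.
assert comparison_key(decoded) == decoded
```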

What next?

Before we proceed to publish documents into Learning Registry, we want to be confident that the URL matching is as accurate and efficient as it can possibly be. We are therefore taking some time to go through the many thousands of URLs collected by the summer students and decide on the optimum format for the URLs.

Meeting with JLeRN and CETIS

We had a useful meeting today with Nick Syrotiuk from the JLeRN team at MIMAS and with our project partner Phil Barker from CETIS. The morning was spent reviewing the data collected by our summer students and discussing how to optimise the paradata statements that we will publish into our Learning Registry (LR) node.

In the afternoon, we were joined by John Gilbertson from our Computing Services Department, who has been responsible for installing and setting up the LR node here at Liverpool. Our attention turned towards exactly how we (and potentially others) will want to access these data from the LR. Although our primary means of accessing data will be through the harvest service via URLs, we will also potentially want to find all resources relating to a particular module at Liverpool. In other words, we want to ask the following two questions:

  • “send us all the data you have relating to the resource at URL X”
  • “send us all the data you have where resource_data->related->object->description[1]="ENGG109"”

The solution to the first is well-established. The second will require using the new extract service, and Nick kindly offered to help us work out how to use this.
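To make the second question concrete, here is a rough Python sketch of the same test applied to documents that have already been retrieved. It does not use the extract service, whose interface we have yet to explore. The helper names and the sample documents are made up for illustration, and the nesting simply follows the path quoted above.

```python
def value_at(document, path):
    """Follow a sequence of dict keys and list indexes.

    Returns None as soon as any step is missing or has the wrong type.
    """
    current = document
    for step in path:
        try:
            current = current[step]
        except (KeyError, IndexError, TypeError):
            return None
    return current


# resource_data->related->object->description[1], as quoted in the text.
MODULE_PATH = ("resource_data", "related", "object", "description", 1)


def documents_for_module(documents, module_code):
    """Keep only the documents whose module field equals module_code."""
    return [d for d in documents if value_at(d, MODULE_PATH) == module_code]


# Made-up sample documents; real LR documents carry many more fields.
documents = [
    {"name": "a", "resource_data": {"related": {"object": {
        "description": ["Engineering", "ENGG109"]}}}},
    {"name": "b", "resource_data": {"related": {"object": {
        "description": ["Engineering", "ENGG110"]}}}},
    {"name": "c", "resource_data": {}},
]

matches = documents_for_module(documents, "ENGG109")
print([d["name"] for d in matches])
```

Filtering on the client like this would mean harvesting every document first, which is exactly why a server-side service is the more attractive route for this kind of query.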

Finally, we discussed how we might join the LR nodes at Liverpool and MIMAS using the LR distribute service. We agreed that, by way of a test, we would aim to set our two nodes up so that everything from the Liverpool node is distributed to MIMAS, while only Engineering-related documents come back in the other direction:

  • Liverpool → MIMAS: all
  • MIMAS → Liverpool: only documents relating to Engineering

Once we have successfully demonstrated this functionality, we might consider setting up the Liverpool node so that it takes every Engineering-related document from the National Science Digital Library on the main US Learning Registry node.

Learning Registry paradata statements II

We have continued to build on last week’s work and have created the required PHP templates to publish the students’ data into the Learning Registry. We have constructed contextualised usage paradata statements for different types of actions and have manually published a couple of test documents to our University LR node. We have a meeting on Wednesday with external partners to optimise the templates, after which we should hopefully be in a position to bulk publish 25,000+ records into our LR node.

Learning Registry paradata statements

We’ve been busy working through the Learning Registry “Modeling Paradata and Assertions as Activities” and “Paradata Cookbook” documents, preparing statement templates from which we will publish the work that the summer students did in finding and indexing visual media relevant to their Engineering degree courses here at Liverpool.