One of the most important issues to arise over the summer, as the students were identifying URLs relevant to Engineering education, was that of URL ‘conformity’. Our proposed system relies on comparing a list of URLs returned from Google Custom Search with those stored in Learning Registry documents.
Clearly this is straightforward enough for simple URLs such as:
www.domain.com/folder/image.png.
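As a baseline, the comparison described above can be sketched as a plain exact-match lookup. This is only an illustration: the function name and the idea that both sides arrive as simple lists of strings are assumptions, not a description of the actual system.

```python
def exact_matches(search_urls, registry_urls):
    """Return the search-result URLs that appear verbatim in the registry list.

    Both arguments are assumed to be plain lists of URL strings.
    """
    registry = set(registry_urls)  # set lookup keeps the comparison fast
    return [url for url in search_urls if url in registry]


found = exact_matches(
    ["www.domain.com/folder/image.png", "www.domain.com/folder/other.png"],
    ["www.domain.com/folder/image.png"],
)
```

Any difference at all between two strings, however small, counts as a non-match here, which is what makes the cases below problematic.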
Query Strings
The situation becomes far more complicated, however, when we consider URLs containing query strings, because the same resource can then be reached through many different URLs and exact matching becomes hit-and-miss. Let’s use YouTube as an example. The simplest form of a YouTube URL is:
www.youtube.com/watch?v=vHkwq_2yY9E
where v=vHkwq_2yY9E identifies the individual video to be played. However, it is also common to find additional query string parameters in the URL, such as this one, which specifies that playback should be in HD:
http://www.youtube.com/watch?v=vHkwq_2yY9E&hd=1
or this one, which specifies that the video should start playing from 1 minute 30 seconds:
http://www.youtube.com/watch?v=vHkwq_2yY9E&t=1m30s
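One possible way to make such URLs comparable is to discard every query string parameter except the ones that identify the resource. The sketch below uses Python's standard `urllib.parse` module; the function name `strip_extra_params` and the choice to keep only `v` are assumptions made for illustration, and a different site would need a different list of parameters to keep.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: for YouTube watch URLs, only "v" identifies the video.
KEEP_PARAMS = {"v"}


def strip_extra_params(url, keep=KEEP_PARAMS):
    """Rebuild the URL with only the query parameters named in `keep`."""
    parts = urlsplit(url)
    kept = [(key, value) for key, value in parse_qsl(parts.query) if key in keep]
    # The fragment is dropped as well, since it does not identify the resource.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


plain = strip_extra_params("http://www.youtube.com/watch?v=vHkwq_2yY9E")
timed = strip_extra_params("http://www.youtube.com/watch?v=vHkwq_2yY9E&t=1m30s")
```

With this approach both URLs reduce to the same string, so an exact comparison succeeds.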
Short Links
A further potential problem arises with short links, where www.youtube.com is shortened to youtu.be and the video identifier moves from the query string into the path:
http://youtu.be/vHkwq_2yY9E
It is therefore clear that exact URL matching alone might not be sufficient.
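Short links of this kind could be handled by rewriting them into the long form before comparison. The following is a minimal sketch that covers only the youtu.be pattern shown above; the function name is hypothetical, the `http` scheme is hard-coded to match the examples in this post, and any query string on the short link is discarded.

```python
from urllib.parse import urlsplit


def expand_youtube_short_link(url):
    """Rewrite http://youtu.be/<id> as http://www.youtube.com/watch?v=<id>.

    URLs on any other host are returned unchanged.
    """
    parts = urlsplit(url)
    if parts.netloc.lower() == "youtu.be":
        video_id = parts.path.lstrip("/")
        if video_id:
            # Assumption: the long form always uses http and www.youtube.com.
            return "http://www.youtube.com/watch?v=" + video_id
    return url


expanded = expand_youtube_short_link("http://youtu.be/vHkwq_2yY9E")
```

Other link shorteners do not embed the target in the short URL, so they would need a different approach, such as following the redirect.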
URL encoding and non-Latin web addresses
The situation is complicated even further by URL encoding (‘urlencoding’), whereby certain characters such as space, plus and & may appear in either encoded or unencoded form. The same applies to URLs containing non-Latin characters; UTF-8 web addresses have been supported since May 2010. For example, one student identified the following encoded URL:
http%3A%2F%2Fjdxy.suda.edu.cn%2Fszll%2Fswzz%2F%E9%99%88%E7%91%B6%E7%A7%91%E7%A0%94%E6%83%85%E5%86%B5%E7%AE%80%E8%A1%A8.swf
which decodes to:
http://jdxy.suda.edu.cn/szll/swzz/陈瑶科研情况简表.swf
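The decoding step can be reproduced with `urllib.parse.unquote` from the Python standard library, which treats percent-escapes as UTF-8 by default. The helper `same_after_decoding` is a hypothetical name, shown only to illustrate comparing URLs in decoded form.

```python
from urllib.parse import unquote

encoded = (
    "http%3A%2F%2Fjdxy.suda.edu.cn%2Fszll%2Fswzz%2F"
    "%E9%99%88%E7%91%B6%E7%A7%91%E7%A0%94%E6%83%85%E5%86%B5%E7%AE%80%E8%A1%A8.swf"
)

# unquote() decodes every percent-escape, interpreting the bytes as UTF-8.
decoded = unquote(encoded)


def same_after_decoding(url_a, url_b):
    """Compare two URLs after percent-decoding both of them."""
    return unquote(url_a) == unquote(url_b)
```

Decoding everything is a blunt instrument: an encoded `%2F` inside a path segment becomes indistinguishable from a real `/`, so two genuinely different URLs could end up comparing as equal.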
What next?
Before we proceed to publish documents into Learning Registry, we want to be confident that the URL matching is as accurate and efficient as it can possibly be. We are therefore taking some time to go through the many thousands of URLs collected by the summer students and decide on the best format in which to store them.