By Simon Davies
This article outlines a powerful new indexing platform for human rights, fundamental freedoms and civil society data. The Index will enable the assigning of unique and highly specific reference codes to billions of items of data ranging from audio and visual material to reports, articles, blogs, forums and research material. It will substantially increase the visibility and effectiveness of information published by rights groups, privacy technology developers and related fields across all languages. The concept is being developed by Code Red in consultation with a range of government, research, philanthropic and professional organisations.
In the tumultuous fight for fundamental freedoms and human rights, there are at least two absolute certainties. First, almost every piece of data that we publish online will disappear from public view in a matter of weeks or months, submerged in the white noise of search results or victim to fatal technical failure. Second, almost every piece of human rights data will be discovered by only the tiniest fraction of people who need to see it.
Imagine that you are, say, a small Internet rights group in Argentina running a legal action against government censorship. The chances are that, outside the Spanish-speaking world, few people will ever know about your work – and then usually only if the searcher has prior knowledge of your points of reference. Over time, the visibility of your data will decrease to near-zero.
The same problem presents itself when searching through important archives such as WikiLeaks. There is simply no useful mechanism to accurately sift through the million or more documents on that site and identify what you need to find.
We like to imagine that the Internet is some sort of vast and permanent library of our output. The reality is far different. Search engines provide a powerful conduit to the data, but discovery of that information depends on the searcher knowing the right keywords in the right languages. It also depends, of course, on the data actually remaining intact (hands up everyone who has inadvertently lost important online data).
This situation must change. Human rights is a critically important field, and with the emerging pressure of economic crises, political instability and security concerns it is becoming even more so. It’s a vast and complex arena, and finding ways to better communicate the data within it is a huge challenge.
When the new rights group Code Red began consulting in 2013 on what needed to be done to improve the fight for fundamental freedoms, it became clear at a very early stage that a method must be found to organise human rights information in a better way. Important case law in one part of the world needs to be known to advocates working in another language. Great advances in technological protection of privacy need to be discovered and understood by people outside the technology realm, and so on.
To crack this challenge, Code Red participated in over a hundred consultation meetings. It finally concluded that the only obvious way to resolve these problems is for publishers of human rights data to tag information (articles, videos, documents, blogs or whatever) with a numeric code system – an index – that can be easily found and uniquely displayed by search engines.
With this user-generated code, all tagged documents relating to a highly specific subject in whatever language can be identified in one search. This approach means that a searcher can reliably identify a precisely relevant document even if it is published in a different language. This benefit is presently impossible to achieve.
This means, for example, that any privacy-aware library in the world setting up a Tor exit relay could attach an index number (e.g. 985645341001) to its published material on the project. Googling for 985645341001 will thus uniquely yield every such project and resource in the world whose data has been tagged with that reference number, regardless of location or language.
Obviously the task of translating documents will still exist, but the key strength of such an index is that searchers will know that a foreign-language resource is precisely relevant to what they are looking for.
In the eight months since the invention of this concept, Code Red has met dozens of organisations from library associations and philanthropic trusts to human rights groups and the European Commission. The idea has withstood scrutiny and has evolved into an elegant and potentially very powerful tool. In this brief article I intend to quickly run through the concept at a high level.
Concept and development partners will be announced in the coming weeks, and a more detailed paper will be published on this site and on the Code Red site by the end of 2015.
The index logic
This idea is best imagined as a 12-digit open source index code system, based in part on the logic of the existing Dewey Decimal System (DDS) and in part on the ISBN (International Standard Book Number), but confined to the realm of human rights and the protection of freedoms (DDS is the dominant classification system used by libraries and ISBN is the global index reference number for published works).
As with the DDS, each digit in the reference code will relate to a class of information, then a division of that class followed by further finely grained sections. There is a language and geography locator, a “published medium” field, an activity type field and a digit that is reserved for revocation and visibility status. More on that interesting field later.
One advantage of such a bespoke index is that it can produce highly targeted searches. Presently, searches are conducted on a random and intuitive alphanumeric basis, producing overwhelming “white noise” in the results. The proposed system, by contrast, has the potential to uniquely identify the target information. The twelve-digit code gives, in theory, the potential to identify up to one trillion categories of data.
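As a rough illustration of how a twelve-digit code might be split into the fields named above, here is a minimal Python sketch. The field widths are an assumption made for this example only: the article fixes the total length at twelve digits, implies seven digits of subject classification and places the control field in the twelfth position, but the 2/1/1 split of the locator, medium and activity fields, the field names and the function name `parse_index_code` are all hypothetical.

```python
# Hypothetical field layout: names and widths are assumptions for illustration,
# not the published structure of the index.
FIELD_WIDTHS = [
    ("subject", 7),   # class, division and finer-grained sections
    ("locator", 2),   # language and geography locator
    ("medium", 1),    # "published medium" field
    ("activity", 1),  # activity type field
    ("control", 1),   # revocation and visibility status (twelfth digit)
]
CODE_LENGTH = 12

# Twelve decimal digits give 10**12 (one trillion) possible codes.
TOTAL_CODES = 10 ** CODE_LENGTH


def parse_index_code(code: str) -> dict[str, str]:
    """Split a twelve-digit index code into its (assumed) named fields."""
    digits = code.replace(" ", "")
    if len(digits) != CODE_LENGTH or not (digits.isascii() and digits.isdigit()):
        raise ValueError("an index code must contain exactly twelve digits")
    fields = {}
    position = 0
    for name, width in FIELD_WIDTHS:
        fields[name] = digits[position:position + width]
        position += width
    return fields


# The example number used earlier in the article.
fields = parse_index_code("985645341001")
```

Keeping the layout in a single table means the widths can be changed in one place if the final specification differs from this guess.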
The code is structured as follows:
[Figure: structure of the twelve-digit index code]
To provide a practical example of how the system can work, imagine that you are seeking all web-based data in any language on the development of collaborative open source circumvention technologies designed to protect VoIP (e.g. Skype) communications from interception by third parties. In the current search environment such data would be almost impossible to discover with any ease or specificity – and it would be extremely difficult to find documentation in other languages. However, the relevant documents could be tagged with a specific index number.
It’s important at this point to reiterate that the creation of index codes is distributed, in that anyone publishing a resource can generate and assign a reference number for their work. This will be aided by an app or site that simplifies the process. It should be possible for any author or publisher to generate the code in around two to three minutes. If there are several focus points, the publisher can assign multiple index codes.
The logic is consistent. For example:
- The first field of the index relates to the branch of human rights and the second field identifies the subsets of those broad arenas. The first field thus identifies which of the nine core human rights spheres the data relates to. These include areas such as economic rights, cultural rights, fundamental freedoms and so on.
- The above example relates to privacy, which is a subset of “fundamental freedoms”. So, fundamental freedoms may be represented as number 3 in that first field. Privacy then comes up as, say, number 4 in the second field (privacy being one of the eight core fundamental freedoms). Therefore, all privacy related data of whatever type begins with “3 4”.
- In the case of this particular data (or the search for this data), the topic is communications surveillance. Surveillance is a subset of privacy and is represented in the third field. Communications surveillance is a subset of the general theme of surveillance, and is identified in the fourth field of the index.
- If surveillance is represented as number seven in the third field, and communications surveillance is listed as number seven in the fourth field, then every piece of data dealing with the broad topic of communications surveillance will begin with “3 4 7 7”. All documents relating to technological protection against communications surveillance could be tagged “3 4 7 7 5” and so on.
Thus, communications surveillance data represents around 1/10,000 of the index spectrum in those first four digits. By the time a further three subset digits are added, the topic of the above published work would represent around 1/10,000,000 of the spectrum of the index.
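The arithmetic behind these fractions, and the way a prefix such as “3 4 7 7” selects a whole branch of the index, can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and it assumes every digit position can take all ten values, which is why the article's figures are described as approximate.

```python
from fractions import Fraction


def share_of_index(prefix: str) -> Fraction:
    """Fraction of all possible codes that begin with the given digit prefix."""
    digits = prefix.replace(" ", "")
    return Fraction(1, 10 ** len(digits))


def is_within(code: str, prefix: str) -> bool:
    """True if the code falls under the topic identified by the prefix."""
    return code.replace(" ", "").startswith(prefix.replace(" ", ""))


# Four digits narrow the index to 1/10,000; seven digits to 1/10,000,000.
four_digit_share = share_of_index("3 4 7 7")
seven_digit_share = share_of_index("3 4 7 7 5 1 2")

# A made-up twelve-digit code under "3 4 7 7 5", and one from another branch.
in_branch = is_within("347751200001", "3 4 7 7")
out_of_branch = is_within("341000000001", "3 4 7 7")
```

Because the hierarchy lives in the leading digits, a search for a broad topic is simply a search for a shorter prefix.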
Until now, we’ve focused on the theoretical potential to tag and identify a human rights resource in a highly specific manner. We have discussed how this method can be used across languages to uniquely identify such resources in a way that is far more accurate than the present conventional search process. However, it is important to find ways to ensure that the indexed documents are not only discoverable in a faster way, but also that the system is future-proofed against possible shifts in the search industry.
First, I should explain how the index code is created and embedded. The current thinking is to set up a web application that will help publishers (that is, anyone publishing data online or even offline) to generate the right code for their document. Once the code is generated, that number – together with the associated metadata about the document – is stored in a central index. This will permit instant discovery and much faster and higher-integrity searching.
The application will serve as a front-end to that index, allowing users to add documents and to search through the index. Publishers are then free to attach the relevant index number to their document, enabling discovery by search engines.
An API is used internally by the Index’s web application, but it also comes into play once a search engine such as Startpage, Ixquick or Google is given access to the index. The engines need a way to query the index programmatically so that they can display the results in their own layouts on their own websites. This is the main purpose of the API. Both our own web application and the search engines will use the same API, which will carry out the actual searching and index maintenance.
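To make the division of labour more concrete, the following Python sketch models the central index as a small in-memory store with the two operations described above: registering a code with its metadata, and searching by code. It is a toy stand-in under stated assumptions, not the planned implementation: the class and method names, the metadata fields and the example.org URLs are all hypothetical, and a real service would sit behind a network API with persistent storage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Metadata stored alongside an index code (field names are hypothetical)."""
    code: str      # twelve-digit index code
    url: str       # where the tagged document lives
    title: str
    language: str  # e.g. "es" or "en"


class CentralIndex:
    """In-memory stand-in for the central index and its API."""

    def __init__(self) -> None:
        self._records: list[Record] = []

    def register(self, record: Record) -> None:
        """Add a document's code and metadata to the index."""
        code = record.code
        if len(code) != 12 or not (code.isascii() and code.isdigit()):
            raise ValueError("index codes must be exactly twelve digits")
        self._records.append(record)

    def search(self, prefix: str) -> list[Record]:
        """Return every record whose code starts with the given digits."""
        return [r for r in self._records if r.code.startswith(prefix)]


index = CentralIndex()
index.register(Record("985645341001", "https://example.org/es/tor", "Nodo de salida Tor", "es"))
index.register(Record("985645341001", "https://example.org/en/tor", "Tor exit relay", "en"))
index.register(Record("347751200001", "https://example.org/voip", "VoIP circumvention", "en"))

# One query on the full code returns the matching documents in every language.
matches = index.search("985645341001")
languages = sorted(r.language for r in matches)
```

In this sketch the web application and an external search engine would both call the same `register` and `search` operations, which mirrors the single shared API described above.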
The general mechanism looks like this:
[Figure: general mechanism of the index, web application, API and search engines]
The missing link presently is the means by which people searching for material can use the system in a simple way. Obviously the same app used for publishing can also be used by searchers to identify and search under a specific index number. It may, however, be that there is a possibility of a word-to-number translation, so that word strings in searches are converted and directly queried against the index. This challenge is presently being assessed.
Of course one glaring issue is the challenge of how to index material that already exists online (or offline). This is where the creation of a central index site is particularly useful. In situations where publishers have not embedded a code for their works, outside contributors can independently associate an item of information with a code and then upload this data into the central index. A means must be found to ensure that this process is not open to abuse.
This outline leaves many unresolved questions. How can this system allow for revocation of material? How can it indicate non-public or secret material? We are convinced that the addition of a “wayback” machine for human rights data (attached to the central index), together with a “control field” (the twelfth digit) in the code, has started to provide answers.
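One way to picture the control field is as a single digit that a search front-end reads before deciding whether to display a result. The Python sketch below is purely illustrative: the article does not define which digit values mean what, so the three statuses and their numbers are invented for the example, as are the names `ControlStatus` and `control_status`.

```python
from enum import Enum


class ControlStatus(Enum):
    """Hypothetical meanings for the twelfth digit; the real values are undefined."""
    PUBLIC = "1"
    NON_PUBLIC = "2"
    REVOKED = "9"


def control_status(code: str) -> ControlStatus:
    """Read the control field (twelfth digit) of an index code."""
    digits = code.replace(" ", "")
    if len(digits) != 12 or not (digits.isascii() and digits.isdigit()):
        raise ValueError("an index code must contain exactly twelve digits")
    # Raises ValueError if the digit is not one of the assumed values above.
    return ControlStatus(digits[11])


def is_displayable(code: str) -> bool:
    """A front-end might only show results whose control digit marks them public."""
    return control_status(code) is ControlStatus.PUBLIC


status = control_status("985645341001")
revoked_visible = is_displayable("985645341009")
```

Under this reading, revoking a document would change only the final digit, leaving the subject classification in the preceding digits intact.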
This, clearly, is merely a concept overview of the system. The article does not address dozens of obvious questions. These will follow in a detailed document.