Would you like to have 3847 Excel spreadsheets containing the majority of significant international aid spending information over the last 6 years, from over 450 organizations? Good – if you find it useful, please let me know in the comments.
If that wasn’t quite enough to reel you in, read on for more details.
IATI Data
IATI is the International Aid Transparency Initiative; its aim is to “[make] information about aid spending easier to access, use and understand.” They’ve worked hard for several years with most (all?) major international aid organizations to collect sets of spending data, coded against agreed standards.
They provide a registry of all data made available by contributing organizations. The dump at http://data.datamediate.com/iati_files/ is an automatically-created extract of all of their data, converted to Excel fules. There was previously no way to get that data in an Excel-friendly format.
Why is this interesting? Because this data is very interesting to a lot of people who might otherwise be quite motivated to work with it. This includes anyone working strategically with the third sector: consultants, organizational strategists, political researchers, parliamentarians, think tanks, etc. But – without any disrespect to IATI’s enormous efforts in accumulating this data – as currently distributed it’s been effectively unavailable to most of these people.
IATI make various cuts of the dataset available as CSV files and a few third parties have made tools available around segments of the data, like D-Portal. But if you just want to dig into the source data, the only way to get at it is as XML documents. To its credit, IATI has produced a very well-structured & comprehensively documented XML format. However, most potential users think in Excel, and hardly any will be comfortable dealing with XML.
XML?
There are good reasons for IATI to work mainly in XML. Complex data doesn’t fit naturally into the simple tabular format that Excel requires: XML allows for much richer data structures, not to mention a wealth of processing and data modelling tools. For a data management organization, XML is absolutely the right choice to manage source data, and with my data-developer hat on, I’m happy to have a clearly documented structured source to deal with.
But as a data consumer / explorer I don’t care. I’m not going to put the effort into building an XML-reading analysis tool until I’ve got a fairly good sense of what the data might show. For that, I need to poke around in Excel.
Creating the Excel files
A bit of technical explanation so you know where the Excel files come from.
OpenRefine (previously Google Refine) is “A free, open source, powerful tool for working with messy data”. Having worked across several organizations full of messy data, I’m already a huge fan of OpenRefine. It’s the only tool accessible to non-developers which can take a more or less arbitrary XML file and present it in a usable format. Since the IATI XML format is well-structured, OpenRefine can present very nice tabular output, as long as you’re not afraid of enormous tables. But it’s not a quick tool to work with when dealing with complex data. On my creaky old laptop it can take up to 10 minutes to read in and show me one IATI file.
OpenRefine has an API – although from my experimentation the documentation is rather out of date: I ended up having to reverse engineer the OpenRefine protocol for uploading and converting data. (If you want to try any of this yourself, you can use my forked version of the OpenRefine client library.) Once that’s done, reading in and processing an XML file takes less than a second.
Stringing that all together, you can scrape the IATI registry via their Solr-driven registry API, then feed each result automatically through OpenRefine, and export the result to a series of Excel files. A bit of postprocessing around presentation and tidying up errors gives the data dump presented above.
Caveats
- The data here is correct as downloaded from dates between 16th and 19th May 2016. If enough people are interested, I’ll set up a regular process to keep these Excel files complete and up to date.
- This is not all documents listed by IATI, only about 95% of them. Those omitted were inaccessible or malformed.
- To the best of my understanding all this data is licensed such that I’m allowed to redistribute in this form, but if you’re going to use it, make sure you abide by any licensing information provided.
Finally – please double check any data with the original IATI source data before drawing any inflammatory conclusions – I make no guarantees the data conversion process is error-free.