Converting public safety data to RDF
Introduction
The aim of this project is to visualize events related to public safety that are of interest to the people of Troy, NY. The conversion of the data assumes a Linux system with several applications, tools and interpreters, such as pdftotext, cat and Perl, to name a few. We created two subfolders to treat the information from the Troy Police Department and RPI Public Safety independently.
/(root)
|
|-RPI
|-TPD
Deliverables
This work produced several deliverables.
Demo
The demo is available at http://graves.cl/public_safety/
Code
The scripts used for converting the data are available at Image:Public safety scripts.tar.gz
Data
The data can be found (in RDF/XML format) at Image:Public safety data.rdf.gz
Step 1: Obtaining the data and parsing it into text
Depending on the source of the data, there are two situations: one for the RPI Public Safety data and another for the Troy Police Department data.
RPI public safety data
- Download the PDF files from the RPI Public Safety department.
- On Linux, convert each PDF to text with pdftotext:
pdftotext -layout file.pdf file.txt
The files are stored using the following convention: the RPI directory contains three subfolders, one for the PDFs, one for their text versions and one for the RDF representation. Each of them contains two subfolders: one for the monthly reports and one for the yearly reports. Note that the 2008 yearly report is compatible with the monthly format, so it is treated as another month.
/(RPI)
|-pdf
| |-month
| |-year
|
|-txt
| |-month
| |-year
|
|-rdf
  |-month
  |-year
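The layout and the pdftotext step above can be combined into a small helper. What follows is only a sketch: the function names make_layout and convert_pdfs are hypothetical and not part of the published scripts, and it assumes that pdftotext is installed and that the downloaded PDFs have already been placed under pdf/month and pdf/year. The output name keeps the ".pdf" part, so MON_09.pdf becomes MON_09.pdf.txt, which matches the file names assumed in Step 2.

```shell
#!/bin/sh
# Sketch only: build the pdf/txt/rdf layout and convert every PDF to text.
# $1 is the RPI directory in both functions.

make_layout() {
    # Create the three top-level folders, each with month/ and year/.
    for kind in pdf txt rdf; do
        mkdir -p "$1/$kind/month" "$1/$kind/year"
    done
}

convert_pdfs() {
    # Convert each PDF under $1/pdf/{month,year} into $1/txt/{month,year},
    # keeping ".pdf" in the name (MON_09.pdf -> MON_09.pdf.txt).
    for period in month year; do
        for pdf in "$1/pdf/$period"/*.pdf; do
            [ -e "$pdf" ] || continue   # the glob matched nothing
            pdftotext -layout "$pdf" "$1/txt/$period/$(basename "$pdf").txt"
        done
    done
}

# Example, run from the RPI directory:
#   make_layout . && convert_pdfs .
```

Running make_layout twice is harmless, because mkdir -p does not fail on directories that already exist.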
Troy PD
We obtained an Excel file with all the information from Troy PD and converted it into CSV using Excel, OpenOffice or a similar tool. Let's call the CSV file tpd.csv.
Step 2: Converting to RDF
To express our data in RDF, we decided that we needed a common ontology. This ontology should be flexible enough to support both data sources and to allow for potential future sources. The ontology can be found at Image:PS ontology.owl.gz.
Both conversions need two files. The LocationFile contains the geocoordinates of different places (such as "Winslow Building, RPI" or "15th street and Hoosick ave., Troy, NY 12180"); these coordinates were obtained using the Google Maps API. The CodeFile translates the categories of events from RPI and Troy PD into our new ontology.
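To illustrate how a conversion script might consult the LocationFile, here is a minimal lookup sketch. The real file format is not documented on this page, so the layout used here is an assumption: one place per line, with the place name, latitude and longitude separated by tabs. The function name lookup_location is also hypothetical.

```shell
#!/bin/sh
# Sketch only. Assumed LocationFile layout (not confirmed by the scripts):
#   <place name> TAB <latitude> TAB <longitude>

lookup_location() {
    # $1 = location file, $2 = place name.
    # Prints "latitude longitude" and returns 0 when the place is known,
    # prints nothing and returns 1 otherwise.
    awk -F '\t' -v name="$2" '
        $1 == name { print $2, $3; found = 1; exit }
        END { exit !found }
    ' "$1"
}

# Example:
#   lookup_location data/Locations "Winslow Building, RPI"
```

The non-zero exit status for unknown places lets a calling script stop at the first location it cannot resolve.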
It is important to note that even though the collection of geocoordinates was automated, several cases needed to be added by hand. To do this, we ran the scripts mentioned below: if they failed to find the geocoordinates for a specific location, the script stopped and indicated which location was missing. We then searched for that location on Google and RPI's website to obtain the proper geocoordinates, and ran the scripts again. The resulting geocoordinates, although not always 100% accurate, are very close to the actual positions (the error is usually less than 20 meters). I think that, for the purposes of this data collection, this is an acceptable error.
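The stop-and-fix cycle described above could be shortened by listing every missing location in one pass before running the conversion. The sketch below is not part of the published scripts. It assumes a tab-separated LocationFile with the place name in the first column, and it assumes that the place names mentioned in the reports have already been extracted, one per line; how to extract them is not covered here. The function name missing_locations is hypothetical.

```shell
#!/bin/sh
# Sketch only. Assumed LocationFile layout: place name in the first
# tab-separated column. Place names to check arrive on standard input.

missing_locations() {
    # $1 = location file.
    # Prints each input name that has no entry in the location file, once.
    awk -F '\t' '
        FILENAME == ARGV[1] { known[$1] = 1; next }  # location file: remember names
        !($0 in known) && !seen[$0]++                # stdin: print unknown names once
    ' "$1" -
}

# Example, with a hypothetical file of extracted place names:
#   missing_locations data/Locations < places.txt
```

Each name printed is a location whose geocoordinates still have to be found by hand and added to the LocationFile.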
RPI public safety data
Let's assume each monthly report is named MON_09.pdf.txt, where MON is the first three letters of a month.
for i in txt/month/*.txt
do
    cat "$i" | scripts/parse_monthly_report.pl data/Locations data/TotalIncidents 2>>log_conversion > "rdf/month/$(basename "$i").rdf"
done
The yearly reports are converted in a similar way:
for i in txt/year/*.txt
do
    cat "$i" | scripts/parse_yearly_report.pl data/Locations data/TotalIncidents 2>>log_conversion > "rdf/year/$(basename "$i").ntriples"
done
Troy PD
After converting to CSV, apply a Perl script called parse_troy_csv.pl, which takes two inputs: the LocationFile and the CodeFile, for geocoordinates and categories respectively.
cat tpd.csv | parse_troy_csv.pl LocationFile CodeFile > tpd.ntriples
Step 3: Consolidating and storing all the data
First, we need to consolidate the RPI data into a single file.
RPI public safety data
From the /rpi directory, we run:
cat rdf/month/* rdf/year/* >> rdf/rpi.ntriples
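Because the command above appends with >>, running it a second time would leave every triple in rdf/rpi.ntriples twice. N-Triples stores one triple per line, so repeated lines can be dropped with standard tools. The following is a minimal sketch; the function name dedupe_ntriples and the output file name in the example are hypothetical.

```shell
#!/bin/sh
# Sketch only: remove blank lines and repeated triples from an N-Triples file.

dedupe_ntriples() {
    # $1 = N-Triples file. Writes the de-duplicated triples to stdout.
    # Identical lines are identical triples, so sort -u is enough.
    # LC_ALL=C makes sort compare bytes, independent of the locale.
    grep -v '^[[:space:]]*$' "$1" | LC_ALL=C sort -u
}

# Example:
#   dedupe_ntriples rdf/rpi.ntriples > rdf/rpi.unique.ntriples
```

Sorting changes the order of the triples, which does not matter for N-Triples because the format carries no ordering information.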
Conversion of ontology to ntriples
We developed the ontology using the Protégé editor, which cannot save in N-Triples format. So we first need to convert the ontology to N-Triples using the Raptor RDF Parser Library:
rapper -i rdfxml ontology.owl > ontology.ntriples
Merging all the data in a single file
Finally, we want to have all the data in one file. Assuming that the ontology file is in the root directory, the Troy PD data is in the /tpd directory and the RPI public safety data is in the /rpi directory, we run:
cat ontology.ntriples tpd/tpd.ntriples rpi/rdf/rpi.ntriples > data.ntriples
We may also want to convert the data to RDF/XML format:
rapper -i ntriples -o rdfxml-abbrev data.ntriples \
  -f 'xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"' \
  -f 'xmlns="http://data-gov.tw.rpi.edu/public_safety/ontology.owl#"' \
  -f 'xmlns:map="http://map.rpi.edu/index.php/Special:URIResolver/Property-3A/"' \
  -f 'xmlns:owl="http://www.w3.org/2002/07/owl#"' \
  -f 'xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"' > data.rdf
Storing the data in a triplestore
Finally, the data can be loaded into a triplestore using the LOAD command of SPARUL and SPARQL+.
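As an illustration, a LOAD request names the URL of the data file and, optionally, the graph that should receive it. The sketch below only builds the request text in the SPARUL and SPARQL+ form, LOAD <source> INTO <graph>. The function name build_load_query and the URLs in the example are hypothetical, and how the request is sent depends on the triplestore, so that part is left out. Note that SPARQL 1.1 Update writes the same operation as LOAD <source> INTO GRAPH <graph>.

```shell
#!/bin/sh
# Sketch only: build a SPARUL / SPARQL+ LOAD request as text.

build_load_query() {
    # $1 = URL from which the triplestore can fetch the data file.
    # $2 = URI of the graph that should receive the triples.
    printf 'LOAD <%s> INTO <%s>\n' "$1" "$2"
}

# Example with placeholder URLs:
#   build_load_query 'http://example.org/data.rdf' 'http://example.org/graph/public_safety'
```

The data file has to be reachable by the triplestore at the given URL, since LOAD makes the store fetch the document itself.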
Stewardship
I think this workflow and its annotations allow people to understand what was done and how it was done. It indicates the sources of the information and the decisions made, describes the procedures and the code used, and shows the final data product (before it is loaded into a triplestore).

