Converting public safety data to RDF

Introduction

The aim of this project is to visualize events related to public safety that are of interest to the people of Troy, NY. To convert the data, we assume a Linux system with several applications, tools, and interpreters, such as pdftotext, cat, and Perl. We created two subfolders to handle the information from the Troy Police Dept. and RPI Public Safety independently.

 /(root)
 |
 |-RPI
 |-TPD


Deliverables

This work produced several deliverables:

Demo

The demo is available at http://graves.cl/public_safety/

Code

The scripts used for converting the data are available at Image:Public safety scripts.tar.gz

Data

The data can be found (in RDF/XML format) at Image:Public safety data.rdf.gz

Step 1: Obtaining the data and parsing it into text

Depending on the source of the data, there are two cases: one for the RPI Public Safety data and another for the Troy Police Department data.

RPI public safety data

  • In Linux, use pdftotext:
 pdftotext -layout file.pdf file.txt

The files are stored using the following convention: the RPI directory contains three subfolders, for the PDFs, their text versions, and the RDF representation. Each of these contains two subfolders: one for the monthly reports and one for the yearly reports. Note that the 2008 yearly report is compatible with the monthly format, so it is treated as just another monthly report.


 /(RPI)
 |-pdf
 |  |-month
 |  |-year
 |
 |-txt
 |  |-month
 |  |-year
 |
 |-rdf
    |-month
    |-year
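
The layout and the PDF-to-text step above can be scripted. The following is a minimal sketch, not part of the original scripts: the helper names make_rpi_tree, txt_path_for, and convert_all are hypothetical, and the output naming (MON_09.pdf becomes MON_09.pdf.txt) is an assumption taken from the file names that Step 2 expects.

```shell
# Sketch only: make_rpi_tree, txt_path_for and convert_all are hypothetical
# helper names, not part of the original scripts.

# Create the pdf/txt/rdf subfolders, each with month/ and year/ inside.
make_rpi_tree() {
  base=$1
  for kind in pdf txt rdf; do
    for period in month year; do
      mkdir -p "$base/$kind/$period"
    done
  done
}

# Map pdf/month/JAN_09.pdf to txt/month/JAN_09.pdf.txt (assumed naming).
txt_path_for() {
  period=$(basename "$(dirname "$1")")
  printf 'txt/%s/%s.txt\n' "$period" "$(basename "$1")"
}

# Convert every PDF under pdf/month and pdf/year; run from the RPI directory.
convert_all() {
  for pdf in pdf/month/*.pdf pdf/year/*.pdf; do
    [ -e "$pdf" ] || continue   # glob matched nothing
    pdftotext -layout "$pdf" "$(txt_path_for "$pdf")"
  done
}
```

For example, running make_rpi_tree RPI from the root directory, copying the reports into RPI/pdf/month and RPI/pdf/year, and then calling convert_all from inside RPI would fill the txt subfolders.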

Troy PD

We obtained an Excel file with all the information from Troy PD and converted it to CSV using Excel, OpenOffice, or a similar tool. Let's call the CSV file tpd.csv.

Step 2: Converting to RDF

To express our data in RDF, we decided that we needed a common ontology. This ontology should be flexible enough to support both data sources and to accommodate potential future sources. The ontology can be found at Image:PS ontology.owl.gz.


Both conversions need two files. The LocationFile contains the geocoordinates of different places (such as "Winslow Building, RPI" or "15th street and Hoosick ave., Troy, NY 12180"); these coordinates were obtained using the Google Maps API. The CodeFile translates the event categories from RPI and Troy PD into our new ontology.

It is important to note that even though the collection of geocoordinates was automated, several locations had to be added by hand. To do this, we ran the scripts mentioned below. If a script failed to find the geocoordinates for a specific location, it stopped and indicated which location was missing. We then searched for that location on Google and RPI's website to obtain the proper geocoordinates, and ran the scripts again. The resulting geocoordinates, although not always 100% accurate, are very close to the actual positions (the error is usually less than 20 meters). I think this is an acceptable error for the purposes of this data collection.

RPI public safety data

Let's assume each monthly report is named MON_09.pdf.txt, where MON is the first three letters of a month.


 for i in txt/month/*.txt
 do
   # Feed each report to the parser on stdin; errors are appended to log_conversion
   scripts/parse_monthly_report.pl data/Locations data/TotalIncidents < "$i" 2>>log_conversion > rdf/month/"$(basename "$i")".rdf
 done

The yearly reports are converted in a similar way:

 for i in txt/year/*.txt
 do
   # Same pattern as above, using the yearly parser
   scripts/parse_yearly_report.pl data/Locations data/TotalIncidents < "$i" 2>>log_conversion > rdf/year/"$(basename "$i")".ntriples
 done
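
The convert, fix, and rerun cycle described earlier for missing geocoordinates can be made explicit with a small wrapper. This is only a sketch: convert_reports is a hypothetical function, and it assumes that the parser scripts exit with a non-zero status when they stop on a missing location, which the original scripts may or may not do.

```shell
# Sketch only: convert_reports is a hypothetical wrapper, and it ASSUMES the
# parser exits with a non-zero status when a location is missing.
convert_reports() {
  converter=$1   # e.g. scripts/parse_monthly_report.pl
  in_dir=$2      # e.g. txt/month
  out_dir=$3     # e.g. rdf/month
  ext=$4         # e.g. rdf or ntriples
  for txt in "$in_dir"/*.txt; do
    [ -e "$txt" ] || continue   # glob matched nothing
    out="$out_dir/$(basename "$txt").$ext"
    if ! "$converter" data/Locations data/TotalIncidents < "$txt" 2>>log_conversion > "$out"; then
      echo "conversion failed for $txt; see log_conversion" >&2
      rm -f "$out"              # do not leave a partial output behind
      return 1
    fi
  done
}
```

Under those assumptions, convert_reports scripts/parse_monthly_report.pl txt/month rdf/month rdf stops at the first report that fails, so the missing location can be added to the location file before running it again.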

Troy PD

After converting to CSV, we apply a Perl script called parse_troy_csv.pl, which takes two inputs: the LocationFile and the CodeFile, for geocoordinates and categories respectively.

 cat tpd.csv | parse_troy_csv.pl LocationFile CodeFile > tpd.ntriples

Step 3: Consolidating and storing all the data

First, we need to consolidate the RPI data into a single file.

RPI public safety data

From the /rpi directory, we run

 cat rdf/month/* rdf/year/* > rdf/rpi.ntriples

Conversion of ontology to ntriples

We developed the ontology using the Protege editor, which does not allow saving in N-Triples format. So we first need to convert the ontology to N-Triples using the Raptor RDF Parser Library:

 rapper -i rdfxml ontology.owl > ontology.ntriples

Merging all the data in a single file

Next, we want to have all the data in one file. Assuming that the ontology file is in the root directory, the Troy PD data is in the /tpd directory, and the RPI public safety data is in the /rpi directory, we run

 cat ontology.ntriples tpd/tpd.ntriples rpi/rdf/rpi.ntriples > data.ntriples
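
Because N-Triples stores one triple per line, the merge can be checked with a simple count. The sketch below is not part of the original workflow: merge_ntriples and check_ntriples are hypothetical helper names, and the second one only runs if rapper is installed.

```shell
# Sketch only: merge_ntriples and check_ntriples are hypothetical helper names.

# Concatenate the given N-Triples files into $1 and print the number of
# non-blank, non-comment lines (one triple per line in N-Triples).
merge_ntriples() {
  out=$1
  shift
  cat "$@" > "$out"
  grep -c -v -e '^[[:space:]]*$' -e '^[[:space:]]*#' "$out"
}

# If rapper is installed, let it parse the file and report its own triple count.
check_ntriples() {
  if command -v rapper >/dev/null 2>&1; then
    rapper -i ntriples -c "$1"
  else
    echo "rapper not found; skipping syntax check" >&2
  fi
}
```

For example, merge_ntriples data.ntriples ontology.ntriples tpd/tpd.ntriples rpi/rdf/rpi.ntriples followed by check_ntriples data.ntriples should report matching counts if every line parsed as a triple.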

Optionally, we can convert the data to RDF/XML format:

 rapper -i ntriples -o rdfxml-abbrev data.ntriples \
     -f 'xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"' \
     -f 'xmlns="http://data-gov.tw.rpi.edu/public_safety/ontology.owl#"' \
     -f 'xmlns:map="http://map.rpi.edu/index.php/Special:URIResolver/Property-3A/"' \
     -f 'xmlns:owl="http://www.w3.org/2002/07/owl#"' \
     -f 'xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"'  > data.rdf


Storing the data in a triplestore

Finally, the data can be loaded into a triplestore using the LOAD command of SPARUL or SPARQL+.
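
As an illustration of what such a LOAD request looks like, the sketch below only builds the query text. The helper name load_query and both URLs are placeholders, not the locations used in this project, and how the query is submitted depends on the triplestore.

```shell
# Sketch only: load_query is a hypothetical helper; the URLs are placeholders.
# Prints a SPARUL / SPARQL+ style LOAD statement for a document URL ($1)
# and a target graph URI ($2).
load_query() {
  printf 'LOAD <%s> INTO <%s>\n' "$1" "$2"
}

load_query "http://example.org/data.rdf" "http://example.org/graph/public_safety"
```

Note that the later SPARQL 1.1 Update standard writes the target as INTO GRAPH <uri>, so the exact form depends on the store in use.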

Stewardship

I think this workflow and its annotations allow people to understand what was done and how it was done. It indicates the sources of the information, the decisions made, and the procedures and code used, and it shows the final data product (before it is loaded into a triple store).