<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
    <title>HubLog</title>
    <link rel="alternate" type="text/html" href="http://hublog.hubmed.org/" />
    <link rel="self" type="application/atom+xml" href="http://hublog.hubmed.org/atom.xml" />
    <id>tag:hublog.hubmed.org,2009://2</id>
    <updated>2009-06-18T13:36:17Z</updated>
    <subtitle>DROP ALL DATABASES;</subtitle>
    <author><name>Alf Eaton</name></author>
    
    <entry>
      <title>Entities in Scientific News Stories</title>
      <link rel="alternate" type="text/html" href="http://hublog.hubmed.org/archives/001867.html" />
      <id>tag:hublog.hubmed.org,2009://2.1867</id>
      <published>2009-06-18T13:36:17Z</published>
      <updated>2009-06-18T13:36:17Z</updated>
      <summary>I ran the text of Guardian articles categorised as &apos;science&apos; (full text), New York Times articles categorised as &apos;science and technology&apos; (short sections) and Nature News articles (full text) through OpenCalais to see what entities it identified. Here are the results....</summary>
      
      
      <content type="html" xml:lang="en" xml:base="http://hublog.hubmed.org/">
          <![CDATA[<p>I ran the text of <a href="http://www.guardian.co.uk/science">Guardian articles categorised as 'science'</a> (full text), <a href="http://topics.nytimes.com/topics/news/science/topics/science_and_technology/">New York Times articles categorised as 'science and technology'</a> (short sections) and <a href="http://www.nature.com/news/">Nature News articles</a> (full text) through <a href="http://www.opencalais.com/">OpenCalais</a> to see what entities it identified.</p>

<p><a href="http://alf.hubmed.org/opencalais/science/">Here are the results</a>.</p>]]>
          
      </content>
    </entry>
    
    <entry>
      <title>onabus.com</title>
      <link rel="alternate" type="text/html" href="http://hublog.hubmed.org/archives/001866.html" />
      <id>tag:hublog.hubmed.org,2009://2.1866</id>
      <published>2009-06-16T11:56:09Z</published>
      <updated>2009-06-16T11:56:19Z</updated>
      <summary>I&apos;ve reopened onabus.com, mostly because the excellent London Bus iPhone app now exists and has everything except maps of the routes (presumably just waiting for MapKit). Also London now has StreetView images, so maximising the info window to see the stop actually works. The mobile version sort-of works, but it&apos;s a bit slow and unfinished....</summary>
      
      
      <content type="html" xml:lang="en" xml:base="http://hublog.hubmed.org/">
          <![CDATA[<p>I've reopened <a href="http://onabus.com">onabus.com</a>, mostly because the excellent <a href="http://mbarclay.net/?page_id=193">London Bus iPhone app</a> now exists and has everything except maps of the routes (presumably just waiting for MapKit).</p>

<p>Also London now has StreetView images, so maximising the info window to see the stop actually works.</p>

<p><a href="http://onabus.com/m/">The mobile version</a> sort-of works, but it's a bit slow and unfinished.</p>]]>
          
      </content>
    </entry>
    
    <entry>
      <title>Annotation of Scientific Articles</title>
      <link rel="alternate" type="text/html" href="http://hublog.hubmed.org/archives/001865.html" />
      <id>tag:hublog.hubmed.org,2009://2.1865</id>
      <published>2009-06-14T16:40:23Z</published>
      <updated>2009-06-14T16:48:48Z</updated>
      <summary>I made a web-based interface for curating the results of entity extraction from scientific papers. It converts XML files to text, passes the text through machine annotators, lets curators add/delete/modify the annotations, then splices the annotations back into the original XML file. I can&apos;t show it publicly yet, but here&apos;s a screenshot. Google Wave will be ideal for this, as at the moment only one person can edit at a time....</summary>
      
      <category term="annotation" label="annotation" scheme="http://www.sixapart.com/ns/types#tag" />
      
      <content type="html" xml:lang="en" xml:base="http://hublog.hubmed.org/">
          <![CDATA[<p>I made a web-based interface for curating the results of entity extraction from scientific papers.</p>

<p>It converts XML files to text, passes the text through machine annotators, lets curators add/delete/modify the annotations, then splices the annotations back into the original XML file.</p>
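
<p>As a rough sketch of the splicing step only (not the code behind the screenshot), the following assumes the annotators return non-overlapping character offsets into the text of an element that has no child elements of its own. The <tt>entity</tt> element, its <tt>type</tt> attribute and the sample sentence are hypothetical names chosen for illustration.</p>

```python
import xml.etree.ElementTree as ET


def splice_annotations(element, annotations):
    """Wrap annotated character spans of element.text in child elements.

    annotations is a list of (start, end, label) offsets into element.text;
    spans must not overlap. The "entity" element and its "type" attribute
    are hypothetical names chosen for this sketch.
    """
    text = element.text or ""
    spans = sorted(annotations)
    # Text before the first annotation stays as the element's own text.
    element.text = text[: spans[0][0]] if spans else text
    for index, (start, end, label) in enumerate(spans):
        child = ET.SubElement(element, "entity", {"type": label})
        child.text = text[start:end]
        # Text up to the next annotation (or the end) becomes the child's tail.
        is_last = index == len(spans) - 1
        next_start = len(text) if is_last else spans[index + 1][0]
        child.tail = text[end:next_start]
    return element


# Hypothetical input: one paragraph and two machine annotations.
paragraph = ET.Element("p")
paragraph.text = "Aspirin inhibits COX-1."
splice_annotations(paragraph, [(0, 7, "drug"), (17, 22, "protein")])
spliced = ET.tostring(paragraph, encoding="unicode")
```

<p>A real document would need the same idea applied across nested elements, with offsets mapped back through the text extraction step.</p>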

<p>I can't show it publicly yet, but here's a screenshot.</p>

<p><a href="/files/2009-06-14-annotation.png"><img width="800px" src="/files/2009-06-14-annotation.png"></a></p>

<p>Google Wave will be ideal for this, as at the moment only one person can edit at a time.</p>]]>
          
      </content>
    </entry>
    
    <entry>
      <title>Now Playing in Songbird</title>
      <link rel="alternate" type="text/html" href="http://hublog.hubmed.org/archives/001864.html" />
      <id>tag:hublog.hubmed.org,2009://2.1864</id>
      <published>2009-06-14T15:55:59Z</published>
      <updated>2009-06-14T16:24:37Z</updated>
      <summary>I ported my Now Playing wall to use Songbird&apos;s Webpage API, instead of XMPP. It doesn&apos;t do as much as the old version did, because it&apos;s harder to query/control Songbird than Amarok, but it works pretty well nevertheless. Live version (load in Songbird while playing a track; you&apos;ll need to give it permission to access your library). Screenshot: An important part of a recommendation/exploration tool like this is knowing what&apos;s already in your local library (though admittedly Spotify is quickly making that irrelevant). I&apos;m integrating it with an Ampache/MySQL database, using Greasemonkey, but obviously that won&apos;t work on the web...</summary>
      
      <category term="music" label="music" scheme="http://www.sixapart.com/ns/types#tag" />
      <category term="nowplaying" label="now-playing" scheme="http://www.sixapart.com/ns/types#tag" />
      <category term="songbird" label="songbird" scheme="http://www.sixapart.com/ns/types#tag" />
      
      <content type="html" xml:lang="en" xml:base="http://hublog.hubmed.org/">
          <![CDATA[<p>I ported <a href="http://hublog.hubmed.org/archives/001629.html">my Now Playing wall</a> to use <a href="http://wiki.songbirdnest.com/Developer/Developer_Intro/Webpage_API">Songbird's Webpage API</a>, instead of XMPP. It doesn't do as much as the old version did, because it's harder to query/control Songbird than Amarok, but it works pretty well nevertheless.</p>

<p><a href="http://alf.hubmed.org/songbird-now-playing/">Live version</a> (load in Songbird while playing a track; you'll need to give it permission to access your library).</p>

<p>Screenshot:<br />
<a href="/files/2009-06-14-now-playing-songbird.png"><img width="800px" src="/files/2009-06-14-now-playing-songbird.png"/></a></p>

<p>An important part of a recommendation/exploration tool like this is knowing what's already in your local library (though admittedly Spotify is quickly making that irrelevant). I'm integrating it with an <a href="http://ampache.org/">Ampache</a>/MySQL database, using Greasemonkey, but obviously that won't work on the web for everyone.</p>]]>
          
      </content>
    </entry>
    
    <entry>
      <title>A Private Radio Archive</title>
      <link rel="alternate" type="text/html" href="http://hublog.hubmed.org/archives/001863.html" />
      <id>tag:hublog.hubmed.org,2009://2.1863</id>
      <published>2009-06-14T15:46:16Z</published>
      <updated>2009-06-14T16:59:42Z</updated>
      <summary>I made a rolling, three-day audio archive of Resonance FM, by recording the MP3 audio stream in 30-minute chunks then using the schedule (which is helpfully in a Google Calendar) to match the files to the programmes, tag the MP3 files and produce an index. I can&apos;t let anyone else use it yet, because of copyright, but here&apos;s a screenshot:...</summary>
      
      <category term="notube" label="notube" scheme="http://www.sixapart.com/ns/types#tag" />
      <category term="radio" label="radio" scheme="http://www.sixapart.com/ns/types#tag" />
      
      <content type="html" xml:lang="en" xml:base="http://hublog.hubmed.org/">
          <![CDATA[<p>I made a rolling, three-day audio archive of <a href="http://resonancefm.com/">Resonance FM</a>, by recording the MP3 audio stream in 30-minute chunks then using the <a href="http://resonancefm.com/schedule">schedule</a> (which is helpfully in a Google Calendar) to match the files to the programmes, tag the MP3 files and produce an index.</p>
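
<p>The matching step can be sketched roughly as below. This is an illustration under assumptions, not the archive's actual code: the schedule is a hard-coded list of hypothetical programmes rather than a Google Calendar feed, and a chunk is matched to every programme whose time range overlaps it.</p>

```python
from datetime import datetime, timedelta

CHUNK_LENGTH = timedelta(minutes=30)
ARCHIVE_LENGTH = timedelta(days=3)


def programmes_in_chunk(chunk_start, schedule):
    """Return titles of programmes that overlap the chunk starting at chunk_start."""
    chunk_end = chunk_start + CHUNK_LENGTH
    return [
        title
        for title, start, end in schedule
        if end > chunk_start and chunk_end > start
    ]


def has_expired(chunk_start, now):
    """True once a chunk has fallen out of the rolling three-day window."""
    return now - chunk_start > ARCHIVE_LENGTH


# Hypothetical schedule entries: (title, start, end).
schedule = [
    ("Morning Show", datetime(2009, 6, 14, 9, 0), datetime(2009, 6, 14, 10, 0)),
    ("News", datetime(2009, 6, 14, 10, 0), datetime(2009, 6, 14, 10, 15)),
    ("Late Show", datetime(2009, 6, 14, 10, 15), datetime(2009, 6, 14, 11, 0)),
]
titles = programmes_in_chunk(datetime(2009, 6, 14, 10, 0), schedule)
```

<p>The returned titles could then be written into the MP3 tags of the chunk and into the index, and <tt>has_expired</tt> used to decide which recordings to delete.</p>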

<p>I can't let anyone else use it yet, because of copyright, but here's a screenshot:</p>

<p><a href="/files/2009-06-14-resonance.png"><img width="800px" src="/files/2009-06-14-resonance.png"></a></p>]]>
          
      </content>
    </entry>
    
    <entry>
      <title>Dealing with election results data</title>
      <link rel="alternate" type="text/html" href="http://hublog.hubmed.org/archives/001862.html" />
      <id>tag:hublog.hubmed.org,2009://2.1862</id>
      <published>2009-06-11T13:20:53Z</published>
      <updated>2009-06-13T09:35:19Z</updated>
      <summary>The Guardian produced a set of data that they&apos;d collected for the results of the recent European elections, and published the data as a Google Spreadsheet. I cloned the spreadsheet and tidied it up (HTML version), then imported it into Google Fusion Tables. In Fusion Tables I created two separate views of the data - one showing just the number of votes for each party, and one showing the % of votes for each party. From the Visualize menu, anyone should now be able to visualise that data in different ways: currently the sortable Table, Scatter and Bar visualisations are...</summary>
      
      
      <content type="html" xml:lang="en" xml:base="http://hublog.hubmed.org/">
          <![CDATA[<p>The Guardian <a href="http://www.guardian.co.uk/news/datablog/table/2009/jun/09/european-elections-elections-2009">produced</a> a set of data that they'd collected for the results of the recent European elections, and published the data as <a href="http://spreadsheets.google.com/ccc?key=rmJhQNA_Mm0w4pNl0-2QSiw">a Google Spreadsheet</a>. </p>

<p>I <a href="http://spreadsheets.google.com/ccc?key=rn2wKXhx8qswQMex3YhViqA">cloned the spreadsheet and tidied it up</a> (<a href="http://spreadsheets.google.com/pub?key=rn2wKXhx8qswQMex3YhViqA">HTML version</a>), then <a href="http://tables.googlelabs.com/DataSource?dsrcid=11656/11656">imported it into Google Fusion Tables</a>.</p>

<p>In Fusion Tables I created two separate views of the data - one showing just <a href="http://tables.googlelabs.com/DataSource?dsrcid=11713/11713">the number of votes for each party</a>, and one showing <a href="http://tables.googlelabs.com/DataSource?dsrcid=11838/11838">the % of votes for each party</a>. </p>

<p>From the Visualize menu, anyone should now be able to visualise that data in different ways: currently the sortable Table, Scatter and Bar visualisations are the most interesting; the Intensity Map would be good, but doesn't yet have enough options to present this data well.</p>

<p><a href="http://www.flickr.com/photos/alf/3616814740/sizes/l/">Google's attempt to automatically geocode the location fields</a> is interesting; there needs to be an option to limit the scope of the geocoding, perhaps.</p>

<p><script src="http://www.gmodules.com/ig/ifr?url=http://www.google.com/ig/modules/bar-chart.xml&up__table_query_url=http://tables.googlelabs.com/gvizdata?tq=select+col0%252Ccol7%252Ccol9%252Ccol11%252Ccol13%252Ccol15%252Ccol17%252Ccol19%252Ccol21%252Ccol23%252Ccol25%252Ccol27%252Ccol31%252Ccol33%252Ccol53%252Ccol55+from+11713+where+col0%253D'HACKNEY'&up__table_query_refresh_interval=0&w=800&h=600&border=%23ffffff%7C3px%2C1px+solid+%23eeeeee&output=js"></script></p>]]>
          
      </content>
    </entry>
    
    <entry>
      <title>Adding Bing search results to Google</title>
      <link rel="alternate" type="text/html" href="http://hublog.hubmed.org/archives/001861.html" />
      <id>tag:hublog.hubmed.org,2009://2.1861</id>
      <published>2009-06-04T12:41:28Z</published>
      <updated>2009-06-04T14:36:16Z</updated>
      <summary>A Greasemonkey script (configurable to select which search engine&apos;s results to add) which adds a Google (OpenSocial) Gadget that uses Google&apos;s AJAX Feeds API to fetch an RSS (ideally OpenSearch) feed of search results as JSON and inserts them in a floating box on the side of Google search results pages. Inspired by DeeperWeb, which looks to be built entirely using Google&apos;s AJAX Feeds API and Custom Search Engines (plus a little bit of extra code to generate the tag clouds from the search results). Install the Greasemonkey script then try a Google search (note: you might have to search...</summary>
      
      
      <content type="html" xml:lang="en" xml:base="http://hublog.hubmed.org/">
          <![CDATA[<p>A <a href="http://alf.hubmed.org/gadgets/opensearch/add_opensearch_results_g.user.js">Greasemonkey script</a> (configurable to select which search engine's results to add) which adds a <a href="http://code.google.com/apis/gadgets/">Google (OpenSocial) Gadget</a> that uses Google's <a href="http://code.google.com/apis/ajaxfeeds/">AJAX Feeds API</a> to fetch an RSS (ideally <a href="http://www.opensearch.org/">OpenSearch</a>) feed of search results as JSON and inserts them in a floating box on the side of Google search results pages.

<p>Inspired by <a href="http://www.deeperweb.com/">DeeperWeb</a>, which looks to be built entirely using Google's AJAX Feeds API and <a href="http://www.google.com/coop/cse/">Custom Search Engines</a> (plus a little bit of extra code to generate the tag clouds from the search results).

<p>Install the Greasemonkey script then try a Google search (note: you might have to search from Firefox's search bar, as Google have started using AJAX-y searches from the front page that don't use query strings) - the extra results should show up in the top-right-hand corner.

<p>Note that it sends your search query to my server to fetch the results, so it's best to disable the Greasemonkey script once you've tried it out.]]>
          
      </content>
    </entry>
    
    <entry>
      <title>Extracting keyphrases from documents using MeSH terms and KEA</title>
      <link rel="alternate" type="text/html" href="http://hublog.hubmed.org/archives/001860.html" />
      <id>tag:hublog.hubmed.org,2009://2.1860</id>
      <published>2009-06-01T14:30:19Z</published>
      <updated>2009-06-01T17:56:51Z</updated>
      <summary>KEA extracts keyphrases from a set of documents. The README covers most of this. Create a folder called &apos;train&apos; and, for each document in the training set, create a file with extension &quot;.txt&quot; containing the text of the document and a file with extension &quot;.key&quot; containing the known MeSH terms for this document (one per line). Create a folder called &apos;test&apos; and, for each document in the test set, create a file with extension &quot;.txt&quot; containing the text of the document. Download and extract KEA. Fetch meshdata.rdf (a SKOS representation of the MeSH hierarchy) and put it in the VOCABULARIES...</summary>
      
      
      <content type="html" xml:lang="en" xml:base="http://hublog.hubmed.org/">
          <![CDATA[<a href="http://www.nzdl.org/Kea/">KEA</a> extracts keyphrases from a set of documents. The <a href="http://www.nzdl.org/Kea/Download/Kea-5.0-Readme.txt">README</a> covers most of this.

<ol>
<li>Create a folder called 'train' and, for each document in the training set, create a file with extension ".txt" containing the text of the document and a file with extension ".key" containing the known MeSH terms for this document (one per line).
<li>Create a folder called 'test' and, for each document in the test set, create a file with extension ".txt" containing the text of the document.
<li>Download and extract <a href="http://www.nzdl.org/Kea/download.html">KEA</a>. Fetch <a href="http://thesauri.cs.vu.nl/eswc06/mesh/rdf/meshdata.rdf">meshdata.rdf</a> (a SKOS representation of the MeSH hierarchy) and put it in the VOCABULARIES directory.
<li>From within the downloaded KEA folder, set up some environment variables:
<pre><code>export KEAHOME=`pwd`
export CLASSPATH=$CLASSPATH:$KEAHOME:$KEAHOME/lib/commons-logging.jar:$KEAHOME/lib/icu4j_3_4.jar:$KEAHOME/lib/iri.jar\
:$KEAHOME/lib/jena.jar:$KEAHOME/lib/snowball.jar:$KEAHOME/lib/weka.jar:$KEAHOME/lib/xercesImpl.jar:$KEAHOME/lib/kea-5.0.jar
</code></pre>
<li>Build the model:
<br><tt>java -Xmx512M kea.main.KEAModelBuilder -l /path/to/training/folder -m articles -v meshdata -f skos -t NoStemmer</tt>
<li>Run KEA against the test set, using the model built above:
<br><tt>java -Xmx512M kea.main.KEAKeyphraseExtractor -l /path/to/test/folder -m articles -v meshdata -f skos -t NoStemmer -n 10</tt>
<li>There should now be a set of ".key" files in the test folder, containing key phrases corresponding to each of the test documents.
</ol>
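
<p>The folder layout from the first two steps, and reading back the results from the last one, could be scripted along these lines. This is a minimal sketch: the function names and the shape of the <tt>documents</tt> dictionaries are assumptions made for illustration, and it assumes the ".key" files KEA writes hold one phrase per line, like the training files.</p>

```python
from pathlib import Path


def write_training_set(folder, documents):
    """Write one .txt and one .key file per document.

    documents maps a document id to (text, list of MeSH terms).
    """
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    for doc_id, (text, mesh_terms) in documents.items():
        (folder / f"{doc_id}.txt").write_text(text, encoding="utf-8")
        # One known MeSH term per line.
        key_text = "\n".join(mesh_terms) + "\n"
        (folder / f"{doc_id}.key").write_text(key_text, encoding="utf-8")


def write_test_set(folder, documents):
    """Write one .txt file per document; documents maps a document id to its text."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    for doc_id, text in documents.items():
        (folder / f"{doc_id}.txt").write_text(text, encoding="utf-8")


def read_keyphrases(folder):
    """Collect the phrases from every .key file in folder, keyed by document id."""
    results = {}
    for path in sorted(Path(folder).glob("*.key")):
        lines = path.read_text(encoding="utf-8").splitlines()
        results[path.stem] = [line.strip() for line in lines if line.strip()]
    return results
```

<p>For example, <tt>write_training_set("train", {"1001": ("Full text of the article", ["Humans", "Neoplasms"])})</tt> would create <tt>train/1001.txt</tt> and <tt>train/1001.key</tt>.</p>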

<p>In theory that should be enough, but I'm getting an error when KEA reads in the SKOS vocabulary. It seems to at least work with <tt>-v none</tt> for now, which doesn't use the vocabulary.]]>
          
      </content>
    </entry>
    
    <entry>
      <title>Scraping with YQL Execute</title>
      <link rel="alternate" type="text/html" href="http://hublog.hubmed.org/archives/001859.html" />
      <id>tag:hublog.hubmed.org,2009://2.1859</id>
      <published>2009-06-01T07:29:20Z</published>
      <updated>2009-06-01T07:37:46Z</updated>
      <summary>Another attempt at hosted, server-side scraping in Javascript, this time using YQL Execute, which is mostly based around E4X and XPath. It takes the idea from one of the Execute demo tables to use a CSS2XPath library to convert CSS selectors into XPath (the library handles most selectors well, though not the very newest, like nth-of-type). This allows the selectors to be written using CSS or XPath, which is enough for a lot of cases (but might still have to be expanded to allow regular expressions). Here&apos;s an example definition file, and its output....</summary>
      
      <category term="scraping" label="scraping" scheme="http://www.sixapart.com/ns/types#tag" />
      
      <content type="html" xml:lang="en" xml:base="http://hublog.hubmed.org/">
          <![CDATA[<p>Another attempt at hosted, server-side scraping in Javascript, this time using <a href="http://developer.yahoo.com/yql/guide/yql-execute-chapter.html">YQL Execute</a>, which is mostly based around E4X and XPath.</p>

<p>It takes the idea from one of the Execute demo tables to use a <a href="http://code.google.com/p/css2xpath/">CSS2XPath</a> library to convert CSS selectors into XPath (the library handles most selectors well, though not the very newest, like nth-of-type). This allows the selectors to be written using CSS or XPath, which is enough for a lot of cases (but might still have to be expanded to allow regular expressions).</p>
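
<p>To illustrate the kind of translation involved, here is a deliberately tiny converter. It is not the CSS2XPath library and is written in Python rather than the Javascript that YQL Execute runs; it only handles tag names, <tt>#id</tt>, <tt>.class</tt> and the descendant combinator, and the function name is made up for this sketch.</p>

```python
import re


def css_to_xpath(selector):
    """Convert a descendant chain of simple selectors (tag, #id, .class) to XPath."""
    steps = []
    for part in selector.split():
        match = re.fullmatch(r"([A-Za-z][\w-]*|\*)?((?:[.#][\w-]+)*)", part)
        if match is None:
            raise ValueError(f"unsupported selector: {part!r}")
        tag = match.group(1) or "*"
        predicates = []
        for kind, name in re.findall(r"([.#])([\w-]+)", match.group(2)):
            if kind == "#":
                predicates.append(f'[@id="{name}"]')
            else:
                # Match one class name among a space-separated list.
                predicates.append(
                    f'[contains(concat(" ", normalize-space(@class), " "), " {name} ")]'
                )
        steps.append(tag + "".join(predicates))
    if not steps:
        raise ValueError("empty selector")
    return "//" + "//".join(steps)


xpath = css_to_xpath("div#main a.photo")
```

<p>Anything beyond these simple selectors (child combinators, attribute selectors, pseudo-classes) raises a <tt>ValueError</tt> here, which is where a full library earns its keep.</p>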

<p>Here's <a href="http://alf.hubmed.org/datatables/defs/the-big-picture.js">an example definition file</a>, and its <a href="http://developer.yahoo.com/yql/console/?q=use%20%22http%3A%2F%2Falf.hubmed.org%2Fdatatables%2Fcss-select.xml%22%20as%20selector%3B%20select%20*%20from%20selector%20where%20defs%3D%22http%3A%2F%2Falf.hubmed.org%2Fdatatables%2Fdefs%2Fthe-big-picture.js%22">output</a>.</p>]]>
          
      </content>
    </entry>
    
    <entry>
      <title>Clustering documents with CLUTO</title>
      <link rel="alternate" type="text/html" href="http://hublog.hubmed.org/archives/001858.html" />
      <id>tag:hublog.hubmed.org,2009://2.1858</id>
      <published>2009-05-28T14:47:00Z</published>
      <updated>2009-05-28T14:48:47Z</updated>
      <summary>After getting a local copy of the metadata for around 360,000 articles, I wanted to use some clustering/topic modelling to divide them into categories for browsing. CLUTO was suggested, and it worked pretty well (though it&apos;s only working off the document titles so far, so there isn&apos;t much opportunity for semantic analysis - just word matching). Export all the documents to a file with one document per line. From a MySQL table of document titles, this is as simple as SELECT title FROM articles then export as CSV. Use doc2mat to convert the list of documents into a term matrix:...</summary>
      
      
      <content type="html" xml:lang="en" xml:base="http://hublog.hubmed.org/">
          <![CDATA[After getting <a href="http://www.nature.com/oai/">a local copy of the metadata for around 360,000 articles</a>, I wanted to use some clustering/topic modelling to divide them into categories for browsing. <a href="http://glaros.dtc.umn.edu/gkhome/cluto/cluto/overview">CLUTO</a> was suggested, and <a href="http://alf.hubmed.org/nature-browse/clusters.php">it worked pretty well</a> (though it's only working off the document titles so far, so there isn't much opportunity for semantic analysis - just word matching).

<ol>
<li>Export all the documents to a file with one document per line. From a MySQL table of document titles, this is as simple as <tt>SELECT title FROM articles</tt> then export as CSV.
<li>Use <tt>doc2mat</tt> to convert the list of documents into a term matrix: <tt>doc2mat -nostem titles.csv titles.mat</tt>
<li>Run <tt>vcluster</tt> on the matrix to produce 1000 clusters, and ask it to suggest distinctive features and summaries of each cluster:<br>
<tt>vcluster -showfeatures -nfeatures=20 -showsummaries=cliques titles.mat 1000 > clusters.txt</tt>
<li><a href="http://alf.hubmed.org/cluto/clusters.phps">Parse the clusters file</a> and generate SQL statements for inserting the clusters back into the original database.
<li>Copy the cluster features section of the vcluster output into a new file, and <a href="http://alf.hubmed.org/cluto/cluster-features.phps">parse it to extract the clusters and their features</a> (there is a libcluto available, and <a href="http://search.cpan.org/~ihara/Statistics-Cluto/">a Perl library</a>, but regular expression parsing was easy enough). <br>This generates a set of SQL statements for a "clusters" table.
<li>Run the SQL statements to import the clusters data: <tt>mysql -u USER -p DATABASE &lt; clusters.sql; mysql -u USER -p DATABASE &lt; cluster-features.sql</tt>
</ol>
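
<p>A rough sketch of getting cluster assignments back into a database is below. It makes several assumptions that should be checked against a real run: that the clustering file (which I'd expect to be named something like <tt>titles.mat.clustering.1000</tt>) holds one cluster number per line in the same order as the input titles, with -1 for unassigned documents, and that no lines were dropped along the way. It uses Python's built-in <tt>sqlite3</tt> as a stand-in for MySQL, and the table name and sample titles are hypothetical.</p>

```python
import sqlite3


def pair_titles_with_clusters(titles, cluster_ids):
    """Pair each title with its cluster id, skipping unassigned documents (-1)."""
    if len(titles) != len(cluster_ids):
        raise ValueError("titles and cluster ids must have the same length")
    return [(title, cid) for title, cid in zip(titles, cluster_ids) if cid >= 0]


def store_assignments(connection, rows):
    """Insert (title, cluster) rows using parameters rather than hand-built SQL strings."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS article_clusters (title TEXT, cluster INTEGER)"
    )
    connection.executemany(
        "INSERT INTO article_clusters (title, cluster) VALUES (?, ?)", rows
    )
    connection.commit()


# Hypothetical data standing in for titles.csv and the clustering file.
titles = ["Protein folding in yeast", "Yeast prion structure", "Dark matter halo survey"]
cluster_ids = [0, 0, 1]

connection = sqlite3.connect(":memory:")
store_assignments(connection, pair_titles_with_clusters(titles, cluster_ids))
counts = connection.execute(
    "SELECT cluster, COUNT(*) FROM article_clusters GROUP BY cluster ORDER BY cluster"
).fetchall()
```

<p>Passing the values as query parameters avoids having to escape quotes in titles, which is the fiddly part of generating SQL statements as text.</p>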

Problems with this method:
<ul>
<li>Each document only gets assigned to one cluster, whereas ideally they could be placed in multiple categories with scores for each category.
<li>There's no immediate way to add new documents to existing clusters, without using a separate tool.
<li>CLUTO isn't open source.
</ul>]]>
          
      </content>
    </entry>
    
</feed>
