Getting a local copy of MEDLINE


First of all you have to get a licence from the NLM, which allows your IP address to access the MEDLINE FTP server.

This PHP script fetches the most up-to-date MEDLINE XML files, then creates a folder (with some sub-folders) holding several million gzipped files, one per article, each named by its PMID.

Note: Edit the $base_dir variable to point to an existing directory before running the script.
Note: This script doesn't handle updates.

Update: the script now saves each article as raw XML. If you want to store a parsed version instead, use the commented-out json_encode() line rather than serialize(), as you can't serialize SimpleXML objects. I'm not sure what speed benefit json_decode() would give over simplexml_load_file(), though.
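One rough way to answer the speed question is to time both calls on the same record. The sketch below is only illustrative: the record is a made-up, heavily trimmed stand-in for a real citation, time_calls() is a hypothetical helper, and gzip decompression is left out, so the numbers will differ on real data.

```php
<?php
// Invented, heavily trimmed record; real MEDLINE citations are much larger.
$xml = '<MedlineCitation Status="MEDLINE"><PMID>12345</PMID>'
     . '<Article><ArticleTitle>An example title</ArticleTitle></Article>'
     . '</MedlineCitation>';

// The JSON form that the commented-out json_encode() line would store.
$json = json_encode(simplexml_load_string($xml));

// Hypothetical helper: run $fn $n times and return the elapsed seconds.
function time_calls(callable $fn, $n) {
  $start = microtime(true);
  for ($k = 0; $k < $n; $k++) {
    $fn();
  }
  return microtime(true) - $start;
}

$n = 10000;
$t_xml = time_calls(function () use ($xml) { simplexml_load_string($xml); }, $n);
$t_json = time_calls(function () use ($json) { json_decode($json); }, $n);

printf("simplexml_load_string: %.4fs\n", $t_xml);
printf("json_decode:           %.4fs\n", $t_json);
```

Running it against a handful of real records, read through compress.zlib://, would give a fairer comparison.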

// configuration
$base_dir = '/PATH/TO/medline'; // EDIT THIS to point to an existing, empty directory
$output_dir = "$base_dir/pmid";
$input_dir = "$base_dir/files";
$item_start = '<MedlineCitation '; // defines the start of each item
$id_regexp = '/<PMID[^>]*>(\d+)<\/PMID>/'; // unique identifier used for the filename (tolerates attributes such as Version on the PMID element)
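// fetch MEDLINE (needs NLM license)
if (!is_dir($input_dir)) {
  mkdir($input_dir); // the download directory must exist before the shell can cd into it
}
// chained with && so that nothing is mirrored into the wrong directory if cd fails
system("cd " . escapeshellarg($input_dir) . " && /usr/bin/lftp -e 'o ftp://ftp.nlm.nih.gov/nlmdata/.medleasebaseline/gz && mirror --verbose && quit' && /usr/bin/lftp -e 'o ftp://ftp.nlm.nih.gov/nlmdata/.medlease/gz && mirror --verbose && quit'");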
// create output directories
mkdir($output_dir);
foreach (range(0,200) as $i){
  mkdir("$output_dir/$i");
}
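// process all files ending in .xml.gz
$item_end = '</MedlineCitation>'; // defines the end of each item
foreach (glob("$input_dir/*.xml.gz") as $file){
  // not-particularly-strict stream parsing of large xml files
  $handle = gzopen($file, 'r');
  print "Processing $file\n";
  $output = null; // lines of the current item; null while outside an item
  $id = null;
  while (($line = gzgets($handle)) !== false) {
    if (strpos($line, $item_start) !== false) { // strpos() instead of ereg(), which was removed in PHP 7
      $output = array();
      $id = null;
    }
    if ($output === null) {
      continue; // skip the XML header and anything else between items
    }
    // keep only the first PMID: later ones can belong to other articles referenced in the record
    if ($id === null && preg_match($id_regexp, $line, $matches)){
      $id = $matches[1];
    }
    $output[] = $line;
    if (strpos($line, $item_end) !== false) {
      // the item is complete; saving here means the last item in each file is written too
      if ($id !== null){
        $i = (int) ceil($id/100000);
        if (!is_dir("$output_dir/$i")) {
          mkdir("$output_dir/$i", 0777, true); // PMIDs above 20,000,000 fall outside the pre-created folders
        }
        $output_file = "$output_dir/$i/$id.xml";
        // save the individual article data, gzipped
        //file_put_contents("compress.zlib://$output_file", json_encode(simplexml_load_string(implode('', $output))));
        file_put_contents("compress.zlib://$output_file", implode('', $output));
      }
      $output = null;
    }
  }
  gzclose($handle);
}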

Here's an example function for reading a file by PMID:


function read_medline_file($pmid){
  global $output_dir;
  $i = ceil($pmid/100000);
  $file = "$output_dir/$i/$pmid.xml";
  //return json_decode(file_get_contents("compress.zlib://$file"));
  return simplexml_load_file("compress.zlib://$file");
}
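
To check the folder layout and the compress.zlib:// round trip without downloading MEDLINE, something like the following sketch can be run on its own. The record, the PMID, the temporary directory and the medline_path() helper are all made up for illustration; only the bucketing rule (100,000 PMIDs per folder) is taken from the script.

```php
<?php
// Hypothetical helper: same bucketing as the script (100,000 PMIDs per folder).
function medline_path($output_dir, $pmid) {
  $i = (int) ceil($pmid / 100000);
  return "$output_dir/$i/$pmid.xml";
}

// Throwaway directory standing in for $output_dir.
$output_dir = sys_get_temp_dir() . '/medline_demo_' . getmypid();
$pmid = 12345678; // made-up PMID
$file = medline_path($output_dir, $pmid);
if (!is_dir(dirname($file))) {
  mkdir(dirname($file), 0777, true);
}

// Invented, heavily trimmed record, written gzipped as the script does.
$xml = '<MedlineCitation><PMID>' . $pmid . '</PMID>'
     . '<Article><ArticleTitle>An example title</ArticleTitle></Article>'
     . '</MedlineCitation>';
file_put_contents("compress.zlib://$file", $xml);

// Read it back the same way read_medline_file() does.
$article = simplexml_load_file("compress.zlib://$file");
print basename(dirname($file)) . "\n";          // folder number: 124
print $article->Article->ArticleTitle . "\n";   // An example title
```

The folder number is ceil(12345678 / 100000) = 124, so the file ends up at 124/12345678.xml under the demo directory.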