Getting a local copy of MEDLINE

First of all you have to get a licence from the NLM, which allows your IP address to access the MEDLINE FTP server.

This PHP script fetches the most up-to-date MEDLINE XML files, then creates a folder (and some sub-folders) full of several million gzipped XML files, each one containing an individual article accessible by PMID.

Note: Edit the $base_dir variable to point to an existing directory before running the script.
Note: This script doesn't handle updates.

Update: the script now stores the raw XML for each article. If you want to store a parsed version of the file instead, use json_encode() rather than serialize() (see the commented-out lines), as you can't serialize SimpleXML objects. Not sure what speed benefit json_decode() will give over simplexml_load_file() though.
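One way to answer the speed question is simply to time both calls. The following is a rough sketch, not a proper benchmark: the sample record and the time_runs() helper are made up for illustration, real citations are much larger, and it only measures in-memory parsing (no gzip or disk reads).

```php
<?php
// Hypothetical sample record; real MEDLINE citations are much larger.
$xml = '<MedlineCitation Status="MEDLINE"><PMID>12345</PMID>'
     . '<Article><ArticleTitle>An example title</ArticleTitle></Article>'
     . '</MedlineCitation>';

// The JSON form is what the commented-out json_encode() line would store.
$json = json_encode(simplexml_load_string($xml));

// Hypothetical helper: run $fn $runs times and return the elapsed seconds.
function time_runs($fn, $runs = 10000) {
  $start = microtime(true);
  for ($n = 0; $n < $runs; $n++) {
    $fn();
  }
  return microtime(true) - $start;
}

$xml_time = time_runs(function () use ($xml) { simplexml_load_string($xml); });
$json_time = time_runs(function () use ($json) { json_decode($json); });

printf("simplexml_load_string: %.4fs\n", $xml_time);
printf("json_decode:           %.4fs\n", $json_time);
```

The numbers will vary by machine and PHP version, so treat them as a hint rather than a verdict.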

// configuration
$base_dir = '/PATH/TO/medline'; // EDIT THIS to point to an existing, empty directory

$output_dir = "$base_dir/pmid";
$input_dir = "$base_dir/files";

$item_start = '<MedlineCitation '; // defines the start of each item
$id_regexp = '/<PMID[^>]*>(\d+)<\/PMID>/'; // unique identifier used for the filename ([^>]* allows attributes such as Version="1")

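// fetch MEDLINE (needs NLM license)
if (!is_dir($input_dir)) {
  mkdir($input_dir); // the download directory has to exist before we cd into it
}
$dir = escapeshellarg($input_dir); // quote the path for the shell
system("cd $dir || exit 1; /usr/bin/lftp -e 'o ftp://ftp.nlm.nih.gov/nlmdata/.medleasebaseline/gz && mirror --verbose && quit'; /usr/bin/lftp -e 'o ftp://ftp.nlm.nih.gov/nlmdata/.medlease/gz && mirror --verbose && quit'");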

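// create output directories: one per 100,000 PMIDs, so 0-200 covers PMIDs up to 20,000,000
if (!is_dir($output_dir)) {
  mkdir($output_dir);
}
foreach (range(0, 200) as $i){
  if (!is_dir("$output_dir/$i")) {
    mkdir("$output_dir/$i");
  }
}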

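// process all files ending in .xml.gz
foreach (glob("$input_dir/*.xml.gz") as $file){
  // not-particularly-strict stream parsing of large xml files
  $handle = gzopen($file, 'r');
  if ($handle === false) {
    print "Could not open $file\n";
    continue;
  }
  print "Processing $file\n";

  $output = null; // lines of the current article; null while outside an article
  $id = null;     // PMID of the current article

  while (($line = gzgets($handle)) !== false) {
    // start of a new article (ereg() was removed in PHP 7, so use strpos())
    if (strpos($line, $item_start) !== false) {
      $output = array();
      $id = null;
    }
    if ($output === null) {
      continue; // skip the file header and anything between articles
    }

    $output[] = $line;

    // keep only the first PMID; later PMID elements can refer to other articles
    if ($id === null && preg_match($id_regexp, $line, $matches)){
      $id = $matches[1];
    }

    // end of the article: write it out now, so the last one in each file isn't lost
    if (strpos($line, '</MedlineCitation>') !== false) {
      if ($id !== null) {
        $i = (int) ceil($id / 100000);
        if (!is_dir("$output_dir/$i")) {
          mkdir("$output_dir/$i", 0777, true); // for PMIDs beyond the pre-created folders
        }
        $output_file = "$output_dir/$i/$id.xml";
        // save the individual article data, gzipped
        //file_put_contents("compress.zlib://$output_file", json_encode(simplexml_load_string(implode('', $output))));
        file_put_contents("compress.zlib://$output_file", implode('', $output));
      }
      $output = null;
    }
  }
  gzclose($handle);
}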
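To make the folder layout concrete: each article goes into a sub-folder numbered ceil(PMID / 100000), so every sub-folder holds at most 100,000 files. Here is a small sketch of that mapping. The medline_path() helper is hypothetical (the script computes the path inline) and the PMIDs are arbitrary examples.

```php
<?php
// Hypothetical helper: maps a PMID to the path the script writes to.
function medline_path($output_dir, $pmid) {
  $i = (int) ceil($pmid / 100000); // same bucket rule as the script
  return "$output_dir/$i/$pmid.xml";
}

$dir = '/PATH/TO/medline/pmid';
print medline_path($dir, 1) . "\n";        // sub-folder 1
print medline_path($dir, 100000) . "\n";   // still sub-folder 1
print medline_path($dir, 100001) . "\n";   // sub-folder 2
print medline_path($dir, 18000000) . "\n"; // sub-folder 180
```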

Here's an example function for reading a file by PMID:


function read_medline_file($pmid){
  global $output_dir;
  $i = ceil($pmid/100000);
  $file = "$output_dir/$i/$pmid.xml";
  //return json_decode(file_get_contents("compress.zlib://$file"));
  return simplexml_load_file("compress.zlib://$file");
}
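If you want to try the gzip round trip without downloading MEDLINE first, here is a self-contained sketch. It writes one made-up record to a temporary directory using the same compress.zlib:// wrapper, then reads it back the way read_medline_file() does. The sample XML and the temp path are assumptions for illustration only.

```php
<?php
// Hypothetical sample record and temp directory, for illustration only.
$output_dir = sys_get_temp_dir() . '/medline_demo_' . getmypid();
$pmid = 12345;
$i = (int) ceil($pmid / 100000);
if (!is_dir("$output_dir/$i")) {
  mkdir("$output_dir/$i", 0777, true);
}

$xml = '<MedlineCitation><PMID>12345</PMID>'
     . '<Article><ArticleTitle>An example title</ArticleTitle></Article>'
     . '</MedlineCitation>';

// Write the record gzipped, as the main script does.
$file = "$output_dir/$i/$pmid.xml";
file_put_contents("compress.zlib://$file", $xml);

// Read it back through the same wrapper and pick out two fields.
$article = simplexml_load_file("compress.zlib://$file");
print $article->PMID . ': ' . $article->Article->ArticleTitle . "\n";
```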

Comments

Excellent!

I like!
And is there any way to store the PMID->Title information?

I'm interested in why you wanted a local copy of MEDLINE. What are you going to do with it?

Once you have a local copy of MEDLINE, the possibilities are endless!

(I'll probably write a follow-up post for iterating through all the individual files and adding fields to a database).

Leandro: parse the XML with SimpleXML, pick out the PMID and ArticleTitle fields, then save them to your database. For speed I'd recommend writing the fields to a file using fputcsv, then loading them into MySQL with LOAD DATA INFILE.
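A minimal sketch of that suggestion, assuming records shaped like the made-up samples below. In practice you would loop over the per-PMID files instead of the $records array; the CSV path is also just a placeholder.

```php
<?php
// Hypothetical sample records standing in for the per-PMID files.
$records = array(
  '<MedlineCitation><PMID>1</PMID><Article><ArticleTitle>First title</ArticleTitle></Article></MedlineCitation>',
  '<MedlineCitation><PMID>2</PMID><Article><ArticleTitle>Second, with a comma</ArticleTitle></Article></MedlineCitation>',
);

$csv_file = sys_get_temp_dir() . '/pmid_title.csv'; // hypothetical output path
$csv = fopen($csv_file, 'w');

foreach ($records as $xml) {
  $article = simplexml_load_string($xml);
  if ($article === false) {
    continue; // skip anything that doesn't parse
  }
  // Cast the SimpleXML nodes to strings before writing them out.
  $row = array((string) $article->PMID, (string) $article->Article->ArticleTitle);
  fputcsv($csv, $row, ',', '"', '\\');
}
fclose($csv);

print file_get_contents($csv_file);
```

Loading the file would then look something like LOAD DATA INFILE '/tmp/pmid_title.csv' INTO TABLE articles FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' LINES TERMINATED BY '\n' (pmid, title); where the table and column names are placeholders for your own schema.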
