First of all, you need a license from the NLM, which allows your IP address to access the MEDLINE FTP server.
This PHP script fetches all the most up-to-date MEDLINE XML files, then creates a folder (with sub-folders) containing several million gzipped XML files, one per article, each named after the article's PMID so it can be looked up directly.
Note: Edit the $base_dir variable to point to an existing directory before running the script.
Note: This script doesn't handle updates.
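Update: the script stores the raw XML for each article by default. If you want to store a parsed version instead, use json_encode() rather than serialize() (see the commented-out line in the script), as you can't serialize SimpleXML objects. I haven't measured whether json_decode() is actually any faster than simplexml_load_file() when reading the files back.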
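As a rough illustration of the JSON option, here is a minimal sketch that round-trips one record through json_encode() and json_decode(). The sample record is made up and far smaller than a real MEDLINE citation, so it is worth checking that the fields you need survive the conversion on real data before committing to this format.

```php
<?php
// Assumed sample record; real MEDLINE citations contain many more fields.
$xml = '<MedlineCitation Status="MEDLINE"><PMID>123</PMID>'
     . '<Article><ArticleTitle>Example title</ArticleTitle></Article>'
     . '</MedlineCitation>';

// SimpleXML objects can't be passed to serialize(), but json_encode() accepts them
$json = json_encode(simplexml_load_string($xml));

// json_decode() returns plain stdClass objects rather than SimpleXML ones
$decoded = json_decode($json);

print $decoded->PMID . "\n";
print $decoded->Article->ArticleTitle . "\n";
// attributes of the root element end up under the "@attributes" key
print $decoded->{'@attributes'}->Status . "\n";
```

Note that the decoded result is a plain object tree, so SimpleXML features such as xpath() are no longer available on it.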
// configuration
$base_dir = '/PATH/TO/medline'; // EDIT THIS to point to an existing, empty directory
$output_dir = "$base_dir/pmid";
$input_dir = "$base_dir/files";
$item_start = '<MedlineCitation '; // defines the start of each item
$id_regexp = '/<PMID>(\d+)<\/PMID>/'; // unique identifier used for the filename
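// fetch MEDLINE (needs NLM license)
if (!is_dir($input_dir)) {
    mkdir($input_dir); // the download directory has to exist before the shell can cd into it
}
$download_dir = escapeshellarg($input_dir);
// stop if the cd fails, so lftp never mirrors into the wrong directory
system("cd $download_dir || exit 1; /usr/bin/lftp -e 'o ftp://ftp.nlm.nih.gov/nlmdata/.medleasebaseline/gz && mirror --verbose && quit'; /usr/bin/lftp -e 'o ftp://ftp.nlm.nih.gov/nlmdata/.medlease/gz && mirror --verbose && quit'");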
// create output directories
mkdir($output_dir);
foreach (range(0,200) as $i){
    mkdir("$output_dir/$i");
}
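// process all files ending in .xml.gz
$item_end = '</MedlineCitation>'; // defines the end of each item (assumed to sit on its own line)
foreach (glob("$input_dir/*.xml.gz") as $file){
    // not-particularly-strict stream parsing of large xml files
    $handle = gzopen($file, 'r');
    print "Processing $file\n";
    $output = null; // null while we are outside an item
    $id = null;
    while (($line = gzgets($handle)) !== false) {
        // ereg() was removed in PHP 7; a plain substring test is all that's needed here
        if (strpos($line, $item_start) !== false) {
            $output = array();
            $id = null;
        }
        if ($output === null) {
            continue; // skip the XML header and anything else between items
        }
        // only take the first PMID in an item: later ones can refer to other articles
        if ($id === null && preg_match($id_regexp, $line, $matches)){
            $id = $matches[1];
        }
        $output[] = $line;
        // write the item as soon as it closes, so the last item in each file is saved too
        if (strpos($line, $item_end) !== false) {
            if ($id !== null){
                $i = ceil($id/100000);
                if (!is_dir("$output_dir/$i")) {
                    mkdir("$output_dir/$i", 0777, true); // covers PMIDs above 20,000,000
                }
                $output_file = "$output_dir/$i/$id.xml";
                // save the individual article data, gzipped
                //file_put_contents("compress.zlib://$output_file", json_encode(simplexml_load_string(implode('', $output))));
                file_put_contents("compress.zlib://$output_file", implode('', $output));
            }
            $output = null;
        }
    }
    gzclose($handle);
}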
Here's an example function for reading a file by PMID:
function read_medline_file($pmid){
    global $output_dir;
    $i = ceil($pmid/100000);
    $file = "$output_dir/$i/$pmid.xml";
    //return json_decode(file_get_contents("compress.zlib://$file"));
    return simplexml_load_file("compress.zlib://$file");
}
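Both the writer and the reader rely on the same rule for choosing a sub-folder: ceil(PMID / 100000). The sketch below pulls that rule out into a helper and does a small write/read round trip in a temporary directory. The names pmid_to_path() and write_article(), and the sample record, are hypothetical and only for illustration; they are not part of the script above.

```php
<?php
// Hypothetical helper: maps a PMID to its bucketed path using the
// same ceil($pmid/100000) rule as the script above.
function pmid_to_path(string $output_dir, int $pmid): string {
    $bucket = (int) ceil($pmid / 100000);
    return "$output_dir/$bucket/$pmid.xml";
}

// Hypothetical helper: writes one article, gzipped, creating the
// bucket directory on demand instead of relying on a fixed range.
function write_article(string $output_dir, int $pmid, string $xml): void {
    $path = pmid_to_path($output_dir, $pmid);
    $dir = dirname($path);
    if (!is_dir($dir)) {
        mkdir($dir, 0777, true);
    }
    file_put_contents("compress.zlib://$path", $xml);
}

// Round trip in a throwaway directory with a made-up record
$tmp = sys_get_temp_dir() . '/medline_demo_' . uniqid();
write_article($tmp, 20000001, '<MedlineCitation><PMID>20000001</PMID></MedlineCitation>');

$article = simplexml_load_file('compress.zlib://' . pmid_to_path($tmp, 20000001));
print $article->PMID . "\n";
```

Creating the bucket on demand means a PMID just above 20,000,000 (bucket 201 here) still gets a directory, which the fixed range of 0 to 200 would not provide.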