XML and Access: Working with Huge Files

Generalities

Working with large XML files is challenging for two main reasons:

  • The DOM API loads the whole document into memory, so RAM requirements are roughly 5 times the size of the file.
  • Few text editors are able to open huge XML files.

As a result, most applications crash or hang when loading such a file, so streaming techniques are the better choice in this situation. Three streaming techniques were tested for this article:

  • FSO (FileSystemObject)
  • ADO (ActiveX Data Objects)
  • SAX (Simple API for XML)

Each one has its strengths and weaknesses, and a developer will most likely need more than one of them for different tasks when working with very large XML files. All the examples in this article are based on the Medical Subject Headings (MeSH) database, property of the U.S. National Library of Medicine, available from http://www.nlm.nih.gov/mesh/

FSO

Read First Lines

The first step when dealing with a huge XML file is to get an idea of what the document looks like. This is a standard technique when handling large data, and other languages such as R have native functions for that purpose. In Access, the FileSystemObject (FSO) can be used to achieve the same result. The FirstLines function shows how: http://www.utteraccess.com/wiki/index.php/FirstLines

CODE

<?xml version="1.0"?>
<dataroot xmlns:od='urn:schemas-microsoft-com:officedata' xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance' xsi:noNamespaceSchemaLocation='C:\MyPath\MyXSDFile.xsd'>
<SupplementalRecordSet LanguageCode = "eng">
<SupplementalRecord SCRClass = "1">
 <SupplementalRecordUI>C000002</SupplementalRecordUI>
 <SupplementalRecordName>
  <String>bevonium</String>
 </SupplementalRecordName>
 <DateCreated>
  <Year>1971</Year>
  <Month>01</Month>
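
The linked FirstLines function is not reproduced here. As a rough illustration of the same idea outside Access, the following Python sketch reads only the first lines of a file and stops, so the rest of the document is never loaded. The function name first_lines, its parameters and the UTF-8 default are assumptions made for this example.

```python
from itertools import islice


def first_lines(path, count=10, encoding="utf-8"):
    """Return the first `count` lines of a text file.

    Hypothetical helper for illustration only. The file is read lazily,
    so only the requested lines are pulled from disk.
    """
    with open(path, "r", encoding=encoding, errors="replace") as handle:
        # islice stops iterating after `count` lines.
        return [line.rstrip("\r\n") for line in islice(handle, count)]
```

Calling first_lines(path, 11) on the MeSH file should give a listing comparable to the one above, assuming the file is encoded as UTF-8.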

Read Last Lines

It is also useful to read the last lines of the XML file. That can be done with the LastLines function: http://www.utteraccess.com/wiki/index.php/LastLines

CODE

      <Day>02</Day>
     </DateCreated>
     <ThesaurusIDlist>
      <ThesaurusID>NLM (2013)</ThesaurusID>
     </ThesaurusIDlist>
    </Term>
   </TermList>
  </Concept>
 </ConceptList>
</SupplementalRecord>
</SupplementalRecordSet>
</dataroot>
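
Reading the tail of a huge file efficiently means not scanning it from the beginning. The sketch below, in Python, seeks to the end of the file and reads backwards in chunks until enough line breaks have been collected. It is not the linked LastLines function; the name last_lines, the chunk size and the UTF-8 default are assumptions made for this example.

```python
import os


def last_lines(path, count=10, chunk_size=64 * 1024, encoding="utf-8"):
    """Return the last `count` lines of a file by reading backwards.

    Hypothetical helper for illustration only.
    """
    if count <= 0:
        return []
    with open(path, "rb") as handle:
        handle.seek(0, os.SEEK_END)
        position = handle.tell()
        data = b""
        # Prepend chunks from the end until more than `count` line breaks
        # are available, or the start of the file is reached.
        while position > 0 and data.count(b"\n") <= count:
            step = min(chunk_size, position)
            position -= step
            handle.seek(position)
            data = handle.read(step) + data
    # Any incomplete first line is dropped by the final slice.
    lines = data.decode(encoding, errors="replace").splitlines()
    return lines[-count:]
```

Only the final chunks of the file are read, so the running time depends on the number of lines requested rather than on the size of the document.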

Getting Sense of the Main Structure

Comparing the first and last lines of the document makes it evident that the SupplementalRecord element is the one that identifies each record in the XML. Therefore, any further manipulation of the document should preserve the structure of the SupplementalRecord node and its child nodes.

Number of Nodes

An important figure to know is the number of nodes of interest for further manipulation of the data. In this case, knowing how many SupplementalRecord nodes the document contains is particularly important. The CountNodes function can be used for that: http://www.utteraccess.com/wiki/index.php/CountNodes

CODE

Start counting:             10/27/2013 8:29:39 PM
End counting:               10/27/2013 8:31:52 PM
There are 219278 SupplementalRecord Nodes
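
One detail matters when counting by text matching: the file also contains SupplementalRecordSet, SupplementalRecordUI and SupplementalRecordName elements, so searching for the plain string "<SupplementalRecord" would overcount. The Python sketch below shows one way to count only exact opening tags while streaming the file line by line. It is not the linked CountNodes function; the name count_nodes is an assumption, and the sketch does not account for tags that appear inside comments or CDATA sections.

```python
import re


def count_nodes(path, tag, encoding="utf-8"):
    """Count opening tags named exactly `tag`, one line at a time.

    Hypothetical helper for illustration only.
    """
    # The lookahead requires whitespace, '>' or '/' right after the name,
    # so <SupplementalRecordSet> does not match <SupplementalRecord>.
    pattern = re.compile(r"<" + re.escape(tag) + r"(?=[\s>/])")
    total = 0
    with open(path, "r", encoding=encoding, errors="replace") as handle:
        for line in handle:
            total += len(pattern.findall(line))
    return total
```

A call such as count_nodes(path, "SupplementalRecord") keeps memory use flat regardless of file size, because only one line is held at a time.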

Splitting the XML

Knowing how many nodes of interest the document contains allows us to calculate the number of files and the number of nodes per file. To split a large XML file into several smaller documents, the SplitXml function can be used: http://www.utteraccess.com/wiki/index.php/SplitXml
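
As a rough illustration of the splitting step, the Python sketch below copies whole records into numbered part files and repeats the lines before the first record (the header) and the lines after the last record (the closing root tags) in every part, so each part stays well-formed. It is not the linked SplitXml function. The name split_xml, the part_0001.xml naming scheme and the UTF-8 default are assumptions, and the sketch relies on the layout seen in the listings: each record's opening and closing tag sits on its own line, records are not nested or self-closing, and nothing but whitespace appears between records.

```python
import os
import re


def split_xml(path, tag, records_per_file, out_dir, encoding="utf-8"):
    """Split `path` into parts of at most `records_per_file` <tag> records.

    Hypothetical helper for illustration only. Returns the part file paths.
    """
    if records_per_file < 1:
        raise ValueError("records_per_file must be at least 1")
    opens = re.compile(r"\s*<" + re.escape(tag) + r"(?=[\s>])")
    closes = re.compile(r"\s*</" + re.escape(tag) + r"\s*>")
    header, footer, parts = [], [], []
    out, in_record, in_part = None, False, 0
    try:
        with open(path, "r", encoding=encoding) as source:
            for line in source:
                if not in_record and opens.match(line):
                    in_record = True
                    if out is None:
                        # Start a new part and repeat the header in it.
                        number = len(parts) + 1
                        name = os.path.join(out_dir, "part_%04d.xml" % number)
                        parts.append(name)
                        out = open(name, "w", encoding=encoding)
                        out.writelines(header)
                if in_record:
                    out.write(line)
                    if closes.match(line):
                        in_record = False
                        in_part += 1
                        if in_part == records_per_file:
                            out.close()
                            out, in_part = None, 0
                elif parts:
                    footer.append(line)  # lines after a record has closed
                else:
                    header.append(line)  # lines before the first record
    finally:
        if out is not None:
            out.close()
    # The footer is only known after the full pass, so append it last.
    for name in parts:
        with open(name, "a", encoding=encoding) as part:
            part.writelines(footer)
    return parts
```

The number of parts is the record count divided by the records per file, rounded up. With the 219278 records counted above and, for example, 20000 records per file, that gives 11 parts, the last one holding 19278 records.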

  • To be continued
This page was last modified 02:42, 28 October 2013 by genoma111.