Currently I am developing a web application on a NoSQL database that queries hundreds of thousands of structured (XML) files. It is an analytical tool that returns search results based on XPath expressions. One assumption is that non-technical teams may use it, and they are not necessarily good at anything XML. To help them out, I hid all the “nastiness” behind HTML/jQuery.
For example, take this HTML:
<p>Click one of the pre-defined query templates to insert it into the field</p>
<input type="text" name="xpath" id="xpath-field" />
Query templates:<br />
<ul>
<li value="//remotelink" class="xpath-choice">Find all remotelink elements</li>
<li value="//lnb-prec:title" class="xpath-choice">Find first (main) precedent title</li>
<li value="//remotelink[@hrefclass='mailto']" class="xpath-choice">Find all "e-mail to" remotelinks</li>
</ul>
And the jQuery (it needs to run once the document has loaded):
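A minimal sketch of that handler, assuming the ids and classes from the HTML above (the `insertTemplate` helper name is mine, split out purely so the logic is easy to test):

```javascript
// Copy a chosen template's XPath into the query field.
function insertTemplate(xpath, field) {
  field.value = xpath;
  return field.value;
}

// Wire the templates up on document ready (assumes jQuery is loaded).
if (typeof jQuery !== 'undefined') {
  jQuery(function ($) {
    $('.xpath-choice').on('click', function () {
      insertTemplate($(this).attr('value'),
                     document.getElementById('xpath-field'));
    });
  });
}
```

Clicking "Find all remotelink elements" then puts `//remotelink` straight into the text field, so the user never has to type an XPath by hand.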
One of the problems of analysing structured (XML) data is the efficiency-versus-accuracy trade-off: do you want to be 100% accurate, or fast? This is where NoSQL databases can be helpful, if their indexes are used wisely.
I was given a problem where an application needed to be developed that would crawl through hundreds of thousands of XML files and bring back accurate XML results. This is a challenge in its own right, and the first questions that pop into anyone's mind are these:
what technology can I use?
how big, and how many, are the files in the data set?
how complex are the XPaths?
how efficiently can I strip away unnecessary data?
are false positives an issue?
If you have asked yourself some of these questions, the code below may help you out.
An efficient algorithm to crawl through XML using XPath
First, some theory and algorithmic ideas, along with how this can be implemented on MarkLogic. Let's create a hypothetical problem: we have hundreds of thousands of files, none bigger than 2 MB. All of our files are well-formed XML. We may or may not know the schemas of these structured files, or their namespaces. Let us also assume that we classify documents at a high level using collections. As for the underlying technology, we assume all our data lives in the same database with some of the built-in indexes enabled.
With these assumptions in place, we can start thinking about the best way to implement this. As is usually the case, the best search algorithms try to remove the biggest chunk of data first. This makes total sense: if you have classified your data well using collections, you can re-use MarkLogic's collection index, discarding the data we are not interested in quickly and efficiently.
It is also worth noting that we do not expand (open an XML file to run an XPath against it) until the last possible moment: until then, we only filter data using indexes. Let us take a look at the diagram below:
As you can see, it is quite simple: just two main phases, which are very different. On the left-hand side, we have an efficient way of filtering data using indexes. On the other side, we have a slow but accurate run of XPaths against every remaining document in the database.
So, let's start with the first step: this algorithm relies on the cts:uris() function, which intersects all document URIs that match a specific query. First comes the collection intersection. You may think of this as a Venn diagram – pretty simple, eh?
After that is where the interesting part starts. What we try to do is bring back only those document URIs whose documents contain the specific elements (in any order) used in the XPath. Why? Because a document that lacks an element from the XPath we are querying cannot produce a successful result anyway, so we can drop it from the candidate set early.
Last, but not least, we go through the files remaining in the cts:uris() list, expand each one, and run the XPath against it.
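Putting the two phases together, here is a minimal XQuery sketch of the idea; the collection name "precedents" and the element names are hypothetical stand-ins, and only the final step actually opens documents:

```xquery
(: Phase 1: indexes only – intersect the collection with documents
   that contain the element our XPath needs. No document is opened. :)
let $candidate-uris := cts:uris((), (),
  cts:and-query((
    cts:collection-query("precedents"),
    cts:element-query(xs:QName("remotelink"), cts:true-query())
  )))
(: Phase 2: expand only the candidates and run the accurate XPath. :)
for $uri in $candidate-uris
let $hits := fn:doc($uri)//remotelink[@hrefclass eq 'mailto']
where fn:exists($hits)
return $hits
```

The cts:and-query() is the Venn-diagram intersection described above; everything that survives it is a candidate worth the cost of expanding.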
By default, MarkLogic is installed in the /opt/MarkLogic directory, whereas the forest (data) is stored in /var/opt/MarkLogic. The latter was not what I intended for my configuration: data should have been stored on a different drive, so a symlink needed to be created.
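A sketch of the commands involved, assuming the larger drive is mounted at /mnt/data (a hypothetical path); stop the server first so the forests are not written to mid-move:

```
sudo service MarkLogic stop
sudo mv /var/opt/MarkLogic /mnt/data/MarkLogic
sudo ln -s /mnt/data/MarkLogic /var/opt/MarkLogic
sudo service MarkLogic start
```

MarkLogic keeps reading and writing /var/opt/MarkLogic as before, while the data physically lives on the other drive.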
Sometimes it is necessary to rewrite the URLs that we pass to our web server. This is sometimes called a "clean URL" and is more human-readable. Furthermore, some search engine crawlers prefer clean URLs over URLs with many different parameters.
In order to turn this feature on in Apache, enable the "mod_rewrite" module. Once this is done, run:
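On Debian/Ubuntu the module is enabled like this (other distributions load it via the Apache config instead):

```
sudo a2enmod rewrite
sudo service apache2 restart
```

A typical clean-URL rule in an .htaccess file then looks like `RewriteRule ^article/([0-9]+)$ article.php?id=$1 [L]` – the path and script name here are just illustrative.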
I have been playing around with some PHP and MySQL modules on my Apache2 test server. Unfortunately, I kept getting seemingly random errors until I looked at Apache's error.log file. This is what it showed:
[Mon Aug 19 15:12:15 2013] [error] [client 10.0.235.4] PHP Fatal error: Allowed memory size of 134217728 bytes exhausted (tried to allocate 78 bytes) in /path/to/your/system/system.install on line 1187
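Those 134217728 bytes are PHP's 128 MB memory_limit. A common fix, sketched here with an example value, is to raise the limit in php.ini and restart Apache:

```
; php.ini – 256M is an example value; size it to your application
memory_limit = 256M
```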