Incorrect XPath returned by Article API
  • I think the XPath returned by the Article API is incorrect.

    For example, try the URL http://www.nytimes.com/2011/07/12/world/europe/12yard.html?_r=1&hp
    The XPath returned is /HTML[1]/BODY[1]/DIV[1]/DIV[2]/DIV[3]/DIV[1]/DIV[1]/DIV[1]/DIV[1]/DIV[7]
    If you view the source at that URL, you will see that

    /HTML[1]/BODY[1]/DIV[1] corresponds to

    As you can see, that is an empty DIV, so any step that comes after /HTML[1]/BODY[1]/DIV[1] cannot match. As a result, when I pass the XPath returned by Diffbot to document.evaluate, I get null.

    My guess is that the XPath-generating code uses zero-based indexing instead of one-based indexing (XPath uses one-based indexing). Please let me know if I have misunderstood something.

    Thanks

  • Mike September 2011

    Thanks for your question, Palak. I suspect this comes from a discrepancy in how the DOM is parsed: either nytimes.com is returning a different document to the Diffbot fetchers than to your browser, or the same document is parsed differently by the two HTML parsers.
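
One way to narrow down where the two DOMs diverge is to walk the returned path one step at a time and report the first step that does not resolve. The following is a minimal Python sketch, not Diffbot's code: the helper `first_failing_step` and the `SAMPLE` markup are hypothetical, and it runs the standard library's XML parser on a small well-formed stand-in, whereas a real page such as the one above would need an HTML parser.

```python
import re
import xml.etree.ElementTree as ET

# One path step of the form NAME[n], e.g. DIV[2]. Other XPath syntax is not handled.
STEP = re.compile(r"^([A-Za-z][A-Za-z0-9]*)\[(\d+)\]$")


def first_failing_step(root, xpath):
    """Walk an absolute path of NAME[n] steps from the root element.

    Returns the shortest prefix of the path that matches nothing,
    or None if every step resolves. Indexes are 1-based, as in XPath.
    """
    steps = [s for s in xpath.split("/") if s]
    node = None
    for i, step in enumerate(steps):
        match = STEP.match(step)
        if match is None:
            raise ValueError(f"unsupported step: {step!r}")
        name, index = match.group(1).lower(), int(match.group(2))
        # The first step is matched against the root itself, later steps against children.
        candidates = [root] if node is None else list(node)
        same_name = [c for c in candidates if c.tag.lower() == name]
        if index < 1 or index > len(same_name):
            return "/" + "/".join(steps[: i + 1])
        node = same_name[index - 1]
    return None


# Hypothetical stand-in document: the first DIV under BODY is empty.
SAMPLE = """
<html>
  <body>
    <div id="empty"></div>
    <div id="shell">
      <div id="article">text</div>
    </div>
  </body>
</html>
"""

root = ET.fromstring(SAMPLE.strip())
print(first_failing_step(root, "/HTML[1]/BODY[1]/DIV[1]/DIV[1]"))
print(first_failing_step(root, "/HTML[1]/BODY[1]/DIV[2]/DIV[1]"))
```

Running the same walk against the markup the browser sees and the markup a server-side fetcher receives would show the step at which the two trees stop agreeing. An index of 0 is reported as failing immediately, since XPath positions start at 1.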
