< Code />

Use YQL to retrieve your Google local listing rank

by dotnetjunkie, Wednesday, February 27, 2013

One of our clients specializes in SEO for small businesses. A while back, we built them a dashboard application that pulls in Google Analytics data along with data from their call tracking center, so their clients can see website stats and phone calls in one place and correlate the two. The other day, they sent over a request to pull the rank of a customer's local listing. For example, if you run a bakery, you might search for "bakery Boise Idaho". If your business shows up in the local listings, the results tell you where you stand: first page, fourth place down, and so on.

Luckily, the URL for retrieving a local listing is pretty straightforward. For a search like the one above, the URL would look like this: https://www.google.com/maps?q=cakes+boise+id&ie=UTF-8
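As a rough sketch of how that URL could be assembled from user input, here is a small helper. The function name buildListingUrl and its three parameters are made up for illustration; only the URL shape comes from the example above.

```javascript
// Hypothetical helper: joins the search terms with "+" the way the
// example URL does, encoding each word so stray characters are safe.
function buildListingUrl(keywords, city, state) {
    var terms = [keywords, city, state]
        .join(' ')
        .trim()
        .split(/\s+/)
        .map(function (word) { return encodeURIComponent(word); })
        .join('+');

    return 'https://www.google.com/maps?q=' + terms + '&ie=UTF-8';
}

console.log(buildListingUrl('cakes', 'boise', 'id'));
```

Multi-word keywords such as "wedding invitations" are split on whitespace, so each word ends up as its own "+"-separated term.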

So they had their internal developer mock up a sample that issued a PHP curl_exec call with the URL, which retrieves the entire markup for the Google Maps page. The goal was to isolate the portion of the page that contains the listings and determine the client's rank. They sent over the example and asked if we could build something similar into the dashboard.

So after bouncing some ideas around, we thought this was a perfect scenario for Yahoo's YQL. Essentially, YQL is a query language for the web: you write SQL-like queries against web pages and services.

So using the YQL Console, we wrote a query that would retrieve the results from Google Maps. Pretty simple: all we have to do is issue a query to retrieve the content for the URL. Since I lived in Temecula, California, and there is a huge wedding industry out there, I used one of my friends as the test subject. She has an invitation business that appears in the local listings. Running the following query through the YQL Console shows what results I have to work with.


// Return everything
select * from html where url = 'https://www.google.com/maps?q=Invitations+temecula+ca&ie=UTF-8'

By issuing a select *, we get everything contained within the page, in either XML or JSON. YQL also lets you apply an XPath expression to the results. Luckily, each listing has the same format and looks like this.


<div id="panel_A_2" class="text vcard indent block">
    <div id="link_A_2" class="name lname"> ...
    <div> ...
    <div class="actbar-local-wrapper"> ...
</div>

So now we can expand our query to isolate the portion of the HTML that represents each listing.


select * 
from html 
where url = 'https://www.google.com/maps?q=invitations+temecula+ca&ie=UTF-8' and
xpath='//div[contains(@class,"text vcard indent block")]/div/div/a'

Issuing the query above, we get the following results. However, some extra links got pulled in as well. A listing may have reviews associated with it, and the reviews link matches the same XPath, so to clean up the results we need to filter those links out.


"results": {
   "a": [
    {
     "href": "https://www.google.com/local_url?dq=invitations+temecula+ca&q=http://www.createyourstruly.com/&ved=0CGIQ5AQ&sa=X&ei=8ysuUcb1IMidiQKovYC4Dw&s=ANYYN7lmGZzVsML5vT6e4m6ibvXNoxKPTQ",
     "target": "_blank",
     "span": "createyourstruly.com"
    },
    {
     "class": "pp-more-content-link",
     "href": "https://www.google.com/local_url?dq=invitations+temecula+ca&q=https://plus.google.com/116430657588690964284/about%3Fgl%3DUS%26hl%3Den-US&ved=0CGEQlQU&sa=X&ei=8ysuUcb1IMidiQKovYC4Dw&s=ANYYN7kMqu2NyEsZsy2BMkGQTZGncnAOtg",
     "target": "_blank",
     "span": "4 reviews"
    },
    {
     "href": "https://www.google.com/local_url?dq=invitations+temecula+ca&q=http://www.timessquarerepro.com/&ved=0CGsQ5AQ&sa=X&ei=8ysuUcb1IMidiQKovYC4Dw&s=ANYYN7nq_F2G-D5ClPVSrtj5ivMAzyCXzw",
     "target": "_blank",
     "span": "timessquarerepro.com"
    },
    {
     "class": "pp-more-content-link",
     "href": "https://www.google.com/local_url?dq=invitations+temecula+ca&q=https://plus.google.com/108703260778641449987/about%3Fgl%3DUS%26hl%3Den-US&ved=0CGoQlQU&sa=X&ei=8ysuUcb1IMidiQKovYC4Dw&s=ANYYN7k2319U-8e8U6YYq99zhjBO5rnTQg",
     "target": "_blank",
     "span": "3 reviews"
    },
    {
     "href": "https://www.google.com/local_url?dq=invitations+temecula+ca&q=http://www.thecopyzone.biz/&ved=0CHEQ5AQ&sa=X&ei=8ysuUcb1IMidiQKovYC4Dw&s=ANYYN7lVrhTHgc4o3OrxAQDINh4XA5TUeQ",
     "target": "_blank",
     "span": "thecopyzone.biz"
    },
    {
     "href": "https://www.google.com/local_url?dq=invitations+temecula+ca&q=http://www.designsbylenila.com/&ved=0CHgQ5AQ&sa=X&ei=8ysuUcb1IMidiQKovYC4Dw&s=ANYYN7nfZjrjy_HPkwNVfCElSLDfgcP8eA",
     "target": "_blank",
     "span": "designsbylenila.com"
    },
    {
     "class": "pp-more-content-link",
     "href": "https://www.google.com/local_url?dq=invitations+temecula+ca&q=https://plus.google.com/111990064689132427411/about%3Fgl%3DUS%26hl%3Den-US&ved=0CHcQlQU&sa=X&ei=8ysuUcb1IMidiQKovYC4Dw&s=ANYYN7kUEWLPIZPsn1wYKnstayS7lSlzMA",
     "target": "_blank",
     "span": "1 review"
    },
    {
     "href": "https://www.google.com/local_url?dq=invitations+temecula+ca&q=http://www.nomorepictureproblems.com/&ved=0CH8Q5AQ&sa=X&ei=8ysuUcb1IMidiQKovYC4Dw&s=ANYYN7mq6f_IBoOvY92EqgSqPzEWV2u6Dg",
     "target": "_blank",
     "span": "nomorepictureproblems.com"
    },
    {
     "href": "https://www.google.com/local_url?dq=invitations+temecula+ca&q=http://www.outreach.com/&ved=0CI0BEOQE&sa=X&ei=8ysuUcb1IMidiQKovYC4Dw&s=ANYYN7krr5bdB2cIdXFsJ_pXvs7i2_CumQ",
     "target": "_blank",
     "span": "outreach.com"
    },
    {
     "class": "pp-more-content-link",
     "href": "https://www.google.com/local_url?dq=invitations+temecula+ca&q=https://plus.google.com/109403675817977066018/about%3Fgl%3DUS%26hl%3Den-US&ved=0CIwBEJUF&sa=X&ei=8ysuUcb1IMidiQKovYC4Dw&s=ANYYN7nvh5uA8OvZnsYj3RzYMVl8jD_I3Q",
     "target": "_blank",
     "span": "3 reviews"
    },
    {
     "href": "https://www.google.com/local_url?dq=invitations+temecula+ca&q=http://creative-printing.us/&ved=0CJYBEOQE&sa=X&ei=8ysuUcb1IMidiQKovYC4Dw&s=ANYYN7lcWNMzx3x4KSKDkuuVOpjjlQfDqQ",
     "target": "_blank",
     "span": "creative-printing.us"
    },
    {
     "class": "pp-more-content-link",
     "href": "https://www.google.com/local_url?dq=invitations+temecula+ca&q=https://plus.google.com/102232367191471914842/about%3Fgl%3DUS%26hl%3Den-US&ved=0CJUBEJUF&sa=X&ei=8ysuUcb1IMidiQKovYC4Dw&s=ANYYN7neQ6NlhhwEiOkjlU10dusxHLgsXQ",
     "target": "_blank",
     "span": "2 reviews"
    },
    {
     "href": "https://www.google.com/local_url?dq=invitations+temecula+ca&q=http://www.twicetouchedtreasures.com/&ved=0CJ4BEOQE&sa=X&ei=8ysuUcb1IMidiQKovYC4Dw&s=ANYYN7kl8a2aLuYrFXOkt-le0xeSzs1sMQ",
     "target": "_blank",
     "span": "twicetouchedtreasures.com"
    },
    {
     "href": "https://www.google.com/local_url?dq=invitations+temecula+ca&q=http://www.print-kwik.com/&ved=0CKQBEOQE&sa=X&ei=8ysuUcb1IMidiQKovYC4Dw&s=ANYYN7ncjE10m5tDmTXqBoU-H6vATdWtTw",
     "target": "_blank",
     "span": "print-kwik.com"
    }
   ]
  }

This is great, but all we really need are the URLs so that we have something to compare against. Expanding on our query, we add a predicate to the XPath expression that excludes the reviews links, and then narrow our select to return only the span tags. Our final query looks like this.


select span 
from html 
where url = 'https://www.google.com/maps?q=invitations+temecula+ca&ie=UTF-8' and
xpath='//div[contains(@class,"text vcard indent block")]/div/div/a[not(contains(@class, "pp-more-content-link"))]'

Issuing this query in the YQL Console yields the results we are looking for.


"results": {
   "a": [
    {
     "span": "createyourstruly.com"
    },
    {
     "span": "timessquarerepro.com"
    },
    {
     "span": "thecopyzone.biz"
    },
    {
     "span": "designsbylenila.com"
    },
    {
     "span": "nomorepictureproblems.com"
    },
    {
     "span": "outreach.com"
    },
    {
     "span": "creative-printing.us"
    },
    {
     "span": "twicetouchedtreasures.com"
    },
    {
     "span": "print-kwik.com"
    }
   ]
  }
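Once that JSON reaches our code, pulling the domains out is a short loop. The sketch below assumes the full response wraps the results as query.results.a, which is the shape the request code later in this post reads. The helper name extractDomains is made up, and the single-object case is an assumption about how YQL serializes a lone match.

```javascript
// Hypothetical helper: returns the listing domains from a YQL JSON response.
function extractDomains(response) {
    var results = response && response.query && response.query.results;

    // YQL hands back null results when nothing matched.
    if (!results || !results.a) {
        return [];
    }

    // Assumption: a single match may arrive as an object instead of an array.
    var links = Array.isArray(results.a) ? results.a : [results.a];

    return links
        .map(function (link) { return link.span; })
        .filter(function (span) { return typeof span === 'string'; });
}

// Trimmed-down sample in the same shape as the console output.
var sample = {
    query: {
        results: {
            a: [
                { span: 'createyourstruly.com' },
                { span: 'timessquarerepro.com' }
            ]
        }
    }
};

console.log(extractDomains(sample));
```

Returning an empty array for a null result keeps the calling code free of null checks.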

So far so good: we have a query that retrieves the data we need. However, you may be asking a couple of questions.

  1. What if the listing isn't on the first page?
  2. OK, we have the data in the console, how do we get the JSON in our app?

For the first question: results come back in blocks of ten, identified by a start offset. Page 1 is 0, page 2 is 10, page 3 is 20, and so on. To request a later page, we append the following to the query string: &hl=en&sa=N&start=10, or &hl=en&sa=N&start=20, etc.
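The page-to-offset arithmetic is easy to get off by one, so here is a minimal sketch of it. The names startOffset and pageUrl are made up for illustration; the query-string parameters are the ones shown above.

```javascript
var RESULTS_PER_PAGE = 10;

// Page 1 starts at offset 0, page 2 at 10, page 3 at 20, and so on.
function startOffset(page) {
    return (page - 1) * RESULTS_PER_PAGE;
}

// Hypothetical helper: appends the paging parameters to a search URL
// that already has a query string.
function pageUrl(baseUrl, page) {
    return baseUrl + '&hl=en&sa=N&start=' + startOffset(page);
}

console.log(pageUrl('https://www.google.com/maps?q=invitations+temecula+ca&ie=UTF-8', 3));
```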

To get the JSON data into our app, we need to make a JSONP request. We can do that by borrowing a script from James Padolsey that extends jQuery's ajax function so it can make cross-domain requests. Here is what the code looks like.


jQuery.ajax = (function (_ajax) {

    var protocol = location.protocol,
        hostname = location.hostname,
        exRegex = RegExp(protocol + '//' + hostname),
        YQL = 'http' + (/^https/.test(protocol) ? 's' : '') + '://query.yahooapis.com/v1/public/yql?callback=?',
        xpath = "'//div[contains(@class,\"text vcard indent block\")]/div/div/a[not(contains(@class, \"pp-more-content-link\"))]'",
        query = 'select span from html where url="{URL}" and xpath=' + xpath;
    
    function isExternal(url) {
        return !exRegex.test(url) && /:\/\//.test(url);
    }

    return function (o) {

        var url = o.url;

        if (/get/i.test(o.type) && !/json/i.test(o.dataType) && isExternal(url)) {

            // Manipulate options so that JSONP-x request is made to YQL

            o.url = YQL;
            o.dataType = 'json';

            o.data = {
                q: query.replace(
                    '{URL}',
                    url + (o.data ?
                        (/\?/.test(url) ? '&' : '?') + jQuery.param(o.data)
                    : '')
                ),
                format: 'json'
            };

            // Since it's a JSONP request
            // complete === success
            if (!o.success && o.complete) {
                o.success = o.complete;
                delete o.complete;
            }

            o.success = (function (_success) {
                return function (data) {

                    if (_success) {
                        // Fake XHR callback.
                        _success.call(this, {
                            responseText: (data || '')
                        }, 'success');
                    }

                };
            })(o.success);

        }

        return _ajax.apply(this, arguments);

    };

})(jQuery.ajax);
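For reference, the request this ends up sending is just a GET against the public YQL endpoint, with the query in the q parameter and format=json. Here is a stripped-down sketch of building that URL by hand, without jQuery. The function name buildYqlUrl is made up; the endpoint, query and XPath are the ones used above.

```javascript
// Hypothetical helper: builds the YQL REST URL for a given listings page.
function buildYqlUrl(targetUrl) {
    var xpath = '//div[contains(@class,"text vcard indent block")]/div/div/a' +
                '[not(contains(@class, "pp-more-content-link"))]';

    var query = 'select span from html where url="' + targetUrl + '"' +
                " and xpath='" + xpath + "'";

    // The whole query is URL-encoded so its own "?" and "&" survive the trip.
    return 'https://query.yahooapis.com/v1/public/yql' +
           '?q=' + encodeURIComponent(query) +
           '&format=json';
}

console.log(buildYqlUrl('https://www.google.com/maps?q=invitations+temecula+ca&ie=UTF-8'));
```

Pasting the resulting URL into a browser is a quick way to check a query outside the YQL Console.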

With the extended ajax function in place, all we need to do is issue a request to pull in the data. The implementation is straightforward. Essentially, we create a form that represents the various parameters of the URL. First, a field for the domain name to match against the results, something like example.com (without the http://www.). Then a field for keywords. You can split them up if you'd like, but a single input box where you enter something like "keyword+city+state" works fine. Then just run a loop: start with the first page and look for the domain you are interested in, and if it isn't found, go on to the next page. After the fourth page, it isn't worth searching. If you're that far out, you'd better hire another company to help you rank better.


$.get('https://www.google.com/maps/?q=invitations+temecula+ca&ie=UTF8&hl=en&vps=1&sa=N&start=10', function (res) {
                
    if (res.responseText.query.results != null) {
        $.each(res.responseText.query.results.a, function (i, e) {
            console.log(e.span);
        });
    }
});
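The page-by-page loop described above could be sketched like this. It is not the code from our dashboard: findRank, normalizeDomain and the fetchPage callback are names made up for illustration, and the sample data is a stub so the sketch runs without a network call. In the real thing, fetchPage would wrap the $.get call above.

```javascript
var MAX_PAGES = 4; // past the fourth page it isn't worth searching

// Strips the scheme, a leading "www." and any path, so that
// "http://www.example.com/" and "example.com" compare equal.
function normalizeDomain(value) {
    return String(value)
        .toLowerCase()
        .replace(/^https?:\/\//, '')
        .replace(/^www\./, '')
        .replace(/\/.*$/, '');
}

// Calls done({ page, position }) for the first match, or done(null)
// if the domain is not found within MAX_PAGES pages.
function findRank(domain, fetchPage, done, page) {
    page = page || 1;

    if (page > MAX_PAGES) {
        return done(null);
    }

    fetchPage(page, function (domains) {
        var target = normalizeDomain(domain);

        for (var i = 0; i < domains.length; i++) {
            if (normalizeDomain(domains[i]) === target) {
                return done({ page: page, position: i + 1 });
            }
        }

        // Not on this page, so try the next one.
        findRank(domain, fetchPage, done, page + 1);
    });
}

// Stub standing in for the real request (made-up second page).
var samplePages = {
    1: ['createyourstruly.com', 'timessquarerepro.com', 'thecopyzone.biz'],
    2: ['example.com']
};

function fetchSamplePage(page, callback) {
    callback(samplePages[page] || []);
}

findRank('http://www.example.com/', fetchSamplePage, function (rank) {
    console.log(rank); // { page: 2, position: 1 }
});
```

Reporting page and position separately, rather than one overall number, sidesteps the fact that a page can hold fewer than ten listings.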

So, YQL is a pretty cool way to do some screen scraping, not to mention it is pretty fast. However, what I learned is that YQL caps you at 100 requests, and oddly enough, it doesn't tell you when you have hit the cap. Once you reach the limit, it just starts returning null, and you have to wait 24 hours from your last request before querying again. Kind of a bummer, because I was looking forward to using this solution. But my client has hundreds of customers who would be running lots of queries, and I couldn't take the chance of giving them software that may or may not work on a given day. So I ended up using the HTML Agility Pack and essentially doing the same thing, only processing it on our servers.

Anyways, I hope this was helpful to somebody. Enjoy.

Comments

John, Friday, September 13, 2013 4:07 PM. reply

Very helpful, thank you!
