Saturday, January 26, 2013

Does Google know who I am? (considering that I have already told him...)

Today I sent an email to give my opinion about a service and to ask the service provider to consider an improvement.

Just before sending it, I wondered whether the recipient, if interested in what I had written, could look up my email address and find the pages that best represent me, since it is the address I use for formal communication.

By pages that represent me I mean, in my case, things like my Facebook, Google+ and LinkedIn pages.

And since my address has the form "NAME.SURNAME@gmail.com", a typical convention if you are lucky enough to find it available when you create an account with a given provider, I expected the lookup to work properly.

So I ran a test: I searched for my official email address on Google and, to limit as much as possible the tracking information that my browser could send or remember, I used an instance of Firefox in Private Mode.

And the result turned out to be interesting:

Google identified me correctly... for the first 4 results:

  1. It finds one of my projects on GitHub
  2. It finds my national LinkedIn page
  3. It finds me on LinkedIn.com
  4. It finds my Google+ Page

But it gets the rest of the links on the first results page completely wrong:

From what I have seen of those links, yes, both my name and my surname, taken independently, are present in the results. But not only do I have nothing to do with those pages: my original query, my NAME.SURNAME@gmail.com address, is not there at all, and the pages do not even mention my namesakes.
The pages do not even include the NAME.SURNAME string, which I would expect to exist as the username chosen by any namesake of mine who has opened an account with a provider other than Gmail.

Instead, no. My guess is that the Google algorithm did not recognize my query as an email address and therefore did not search for exactly that string.
This behaviour is not completely surprising, since I can imagine that the "Did you mean?" functionality is based on some Soundex-like algorithm or possibly on other statistics and metrics, but the suggested pages do not contain any evident variation of my email address.

It seems to me that email addresses are searched just like any other query on Google and no particular optimization is applied to them. This is definitely surprising, considering the many optimizations or even easter eggs that we can find in the engine:

Try to search for "Apple stock", "1 eur in dollar" or pay attention to the suggested correction when trying to search for "recursion".

I am a software engineer but not an expert in search engines at all, so I do not know whether the problem I am describing is crazy complex or not, but from a user's point of view I do believe that a very common use case is not handled correctly by the search engine.

I know that Search Engine Optimization is a discipline on its own, but my use case is much simpler I think.

From a smart search engine I would expect that, if I search for an email address, the engine automatically tries to look for exactly the sequence of characters that I typed in the search bar.
If the system finds no results, I would like to receive suggestions for possible typos. I could also accept suggestions based on similar words, but still in the context of email addresses, not just in the body of other pages.

From a smarter search engine I would expect it to guess, from a TOKEN1.TOKEN2 pattern, that TOKEN1 could be my name and TOKEN2 my surname, to give priority to that interpretation, and possibly to reinforce it with statistics showing that TOKEN1 is indeed a common first name.
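
To make the two expectations above more concrete, here is a minimal Java sketch of the kind of pre-processing I have in mind. It is only an illustration and says nothing about how Google actually works: the class name, the simplified email pattern and the tiny COMMON_FIRST_NAMES set (a stand-in for real name statistics) are all assumptions of mine.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmailQueryHeuristic {

    // Deliberately simple pattern: local part, "@", domain with at least one dot.
    private static final Pattern EMAIL = Pattern.compile(
            "^([A-Za-z0-9._%+-]+)@([A-Za-z0-9-]+(?:\\.[A-Za-z0-9-]+)+)$");

    // Hypothetical stand-in for real statistics about common first names.
    private static final Set<String> COMMON_FIRST_NAMES =
            new HashSet<>(Arrays.asList("john", "maria", "paolo"));

    /** True when the whole query looks like a single email address. */
    public static boolean looksLikeEmail(String query) {
        return EMAIL.matcher(query.trim()).matches();
    }

    /**
     * Returns {firstName, surname} when the local part has the form
     * TOKEN1.TOKEN2 and TOKEN1 is a known first name; otherwise null.
     */
    public static String[] guessName(String query) {
        Matcher m = EMAIL.matcher(query.trim());
        if (!m.matches()) {
            return null;
        }
        String[] tokens = m.group(1).toLowerCase(Locale.ROOT).split("\\.");
        if (tokens.length != 2 || tokens[0].isEmpty() || tokens[1].isEmpty()) {
            return null;
        }
        return COMMON_FIRST_NAMES.contains(tokens[0]) ? tokens : null;
    }

    public static void main(String[] args) {
        String query = "john.smith@gmail.com";
        System.out.println(looksLikeEmail(query));             // exact-match search candidate
        System.out.println(Arrays.toString(guessName(query))); // name/surname guess
        System.out.println(looksLikeEmail("apple stock"));     // ordinary query
    }
}
```

A query that passes looksLikeEmail could be searched as an exact string first, while guessName could be used to rank pages about that person higher.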

I'll say it again: I really have no clue how doable this idea is, but I do believe it should not be much harder than what happens now, when part of my search results are correct and the others are completely unrelated to my search.

Other interesting considerations based only on my single test:

  • Google finds a page with my full email address on GitHub, because it was in a README text file that I uploaded there, but it does not suggest my profile page, which also shows my email address publicly.
  • Google+, which also shows my official email address publicly, is only fourth.
  • The ninth result, a YouTube page, finds a post by a namesake of mine.
  • When I searched Google with my email address enclosed in quotes, I got only 2 results back: the same GitHub page and a scam page.

Java - Handmade Classloader Isolation

In a recent project we had a typical library conflict problem.

One component that we could control wanted a specific version of an Apache Commons library, while another component was expecting a different one.

Due to external constraints we could not specify any class loading isolation at the Container level. It wasn't an option for us.

What we decided to do instead was to use the two different class definitions at the same time.

To obtain this we had to let one class be loaded by the current thread's class loader and load the second one manually; in this way the two classes can still have the same fully qualified name.

The only restriction of this approach is that we had to interact with the manually loaded class only via reflection: the current context uses a different class loader and therefore has a different definition of the class, so we would not be able to cast or assign an instance of the class loaded by one classloader to a variable defined in the context of the other.
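
The restriction is easier to see with a small self-contained experiment. The sketch below is not the project code: MyBean and RedefiningLoader are hypothetical names used only for illustration, and it assumes the class bytes can be read back as a resource from the application classloader. It loads the same class a second time through a separate loader and compares the two results.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class SameNameDifferentLoaderDemo {

    /** Hypothetical bean, standing in for the conflicting library class. */
    public static class MyBean {
        @Override
        public String toString() {
            return "v1";
        }
    }

    /** Defines the target class itself instead of delegating to its parent. */
    static class RedefiningLoader extends ClassLoader {
        private final String target;

        RedefiningLoader(ClassLoader parent, String target) {
            super(parent);
            this.target = target;
        }

        @Override
        protected Class<?> loadClass(String name, boolean resolve)
                throws ClassNotFoundException {
            if (!name.equals(target)) {
                return super.loadClass(name, resolve); // normal delegation
            }
            synchronized (getClassLoadingLock(name)) {
                Class<?> c = findLoadedClass(name);
                if (c == null) {
                    byte[] bytes = readClassBytes(name);
                    c = defineClass(name, bytes, 0, bytes.length);
                }
                if (resolve) {
                    resolveClass(c);
                }
                return c;
            }
        }

        // Reads the compiled class file through the parent loader.
        private byte[] readClassBytes(String name) throws ClassNotFoundException {
            String resource = name.replace('.', '/') + ".class";
            try (InputStream in = getParent().getResourceAsStream(resource)) {
                if (in == null) {
                    throw new ClassNotFoundException(name);
                }
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buf = new byte[4096];
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
                return out.toByteArray();
            } catch (IOException e) {
                throw new ClassNotFoundException(name, e);
            }
        }
    }

    /** Loads MyBean a second time, through a separate loader. */
    public static Class<?> reload() {
        try {
            ClassLoader parent = SameNameDifferentLoaderDemo.class.getClassLoader();
            String name = MyBean.class.getName();
            return new RedefiningLoader(parent, name).loadClass(name);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) throws Exception {
        Class<?> reloaded = reload();
        Object instance = reloaded.getDeclaredConstructor().newInstance();
        System.out.println(reloaded.getName().equals(MyBean.class.getName())); // same name
        System.out.println(reloaded == MyBean.class);    // but not the same Class object
        System.out.println(instance instanceof MyBean);  // so a cast would fail
        System.out.println(reloaded.getMethod("toString").invoke(instance)); // reflection works
    }
}
```

The two Class objects share a fully qualified name but are distinct types for the JVM, which is why a cast throws ClassCastException while a reflective call goes through.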

Our implementation is in effect a Classloader itself:

DirectoryBasedParentLastURLClassLoader extends ClassLoader

The characteristic of this Classloader is that we are passing it a file system folder path:

public DirectoryBasedParentLastURLClassLoader(String jarDir)

Our implementation scans the filesystem path for JAR files, turns them into URLs and passes them to a wrapped URLClassLoader instance that our custom classloader encapsulates:

public DirectoryBasedParentLastURLClassLoader(String jarDir) {
    super(Thread.currentThread().getContextClassLoader());

    // search for JAR files in the given directory
    FileFilter jarFilter = new FileFilter() {
        public boolean accept(File pathname) {
            return pathname.getName().endsWith(".jar");
        }
    };

    // create URL for each JAR file found
    File[] jarFiles = new File(jarDir).listFiles(jarFilter);
    URL[] urls;

    if (null != jarFiles) {
        urls = new URL[jarFiles.length];

        for (int i = 0; i < jarFiles.length; i++) {
            try {
                urls[i] = jarFiles[i].toURI().toURL();
            } catch (MalformedURLException e) {
                throw new RuntimeException(
                        "Could not get URL for JAR file: " + jarFiles[i], e);
            }
        }

    } else {
        // no JAR files found
        urls = new URL[0];
    }

    childClassLoader = new ChildURLClassLoader(urls, this.getParent());
}

With this setup we can override the behaviour of the main classloading functionality, giving priority to loading from our folder and falling back to the parent classloader only if we could not find the requested class there:

@Override
protected synchronized Class<?> loadClass(String name, boolean resolve)
        throws ClassNotFoundException {
    try {
        // first try to find a class inside the child classloader
        return childClassLoader.findClass(name);
    } catch (ClassNotFoundException e) {
        // didn't find it, try the parent
        return super.loadClass(name, resolve);
    }
}
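
For readers who only want the parent-last idea without the wrapped child loader, here is a compact variant. It is a sketch under my own assumptions, not the code of the project on GitHub: ParentLastSketch is a hypothetical name and it extends URLClassLoader directly. It also checks findLoadedClass first, so a class that was already defined by this loader is returned instead of being defined a second time.

```java
import java.net.URL;
import java.net.URLClassLoader;

public class ParentLastSketch extends URLClassLoader {

    public ParentLastSketch(URL[] urls, ClassLoader parent) {
        super(urls, parent);
    }

    @Override
    protected Class<?> loadClass(String name, boolean resolve)
            throws ClassNotFoundException {
        synchronized (getClassLoadingLock(name)) {
            // 1. reuse a class this loader has already defined
            Class<?> c = findLoadedClass(name);
            if (c == null) {
                try {
                    // 2. look in our own URLs first (parent-last)
                    c = findClass(name);
                } catch (ClassNotFoundException e) {
                    // 3. not found locally: normal parent-first delegation
                    c = super.loadClass(name, false);
                }
            }
            if (resolve) {
                resolveClass(c);
            }
            return c;
        }
    }

    public static void main(String[] args) throws Exception {
        // With no URLs nothing is found locally, so the request falls back to the parent.
        try (ParentLastSketch loader = new ParentLastSketch(
                new URL[0], ParentLastSketch.class.getClassLoader())) {
            System.out.println(loader.loadClass("java.lang.String") == String.class);
        }
    }
}
```

Core classes such as java.lang.String are never found in the local URLs, so they still come from the parent chain and stay shared between the two worlds.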

With our CustomClassloader in place we can use it in this way:

// instantiate our custom classloader
DirectoryBasedParentLastURLClassLoader classLoader =
        new DirectoryBasedParentLastURLClassLoader(ClassLoaderTest.JARS_DIR);
// manually load a specific class
Class<?> classManuallyLoaded = classLoader
        .loadClass("paolo.test.custom_classloader.support.MyBean");
// create an instance via reflection
Object myBeanInstanceFromReflection = classManuallyLoaded.newInstance();
// keep using the class via reflection
Method methodToString = classManuallyLoaded.getMethod("toString");
assertEquals("v1", methodToString.invoke(myBeanInstanceFromReflection));

The idea for this post and part of its code come from this interesting discussion on Stack Overflow.

A fully working Maven project is available on GitHub with a bunch of unit tests to verify the right behaviour.

Tuesday, January 15, 2013

Java: Rest-assured (or Rest-Very-Easy)


Recently I had to write some Java code to consume REST services over HTTP.

I decided to use the client libraries of RestEasy, the framework I use most of the time to expose REST services in Java, since it also implements the official JAX-RS specification.

I am very satisfied with the annotation-driven approach that the specification defines: it makes exposing REST services a very pleasant task.

Unfortunately, I cannot say that I like the client API as much.

If you are lucky enough to be able to build a proxy client based on the interface implemented by the service, well, that's not bad:

import org.jboss.resteasy.client.ProxyFactory;
...
// this initialization only needs to be done once per VM
RegisterBuiltin.register(ResteasyProviderFactory.getInstance());


MyRestServiceInterface client = ProxyFactory.create(MyRestServiceInterface.class, "http://localhost:8081");
client.myBusinessMethod("hello world");

Having a proxy client similar to a JAX-WS one is good, I do agree. But most of the time, when we are consuming a REST web service, we do not have a Java interface to import.
All those Twitter, Google or other public REST services available out there are just HTTP endpoints.

The way to go with RestEasy in these cases is to rely on the RestEasy Manual ClientRequest API:

ClientRequest request = new ClientRequest("http://localhost:8080/some/path");
request.header("custom-header", "value");

// We're posting XML and a JAXB object
request.body("application/xml", someJaxb);

// we're expecting a String back
ClientResponse<String> response = request.post(String.class);

if (response.getStatus() == 200) // OK!
{
   String str = response.getEntity();
}

That is, in my opinion, a very verbose way to fetch what is, most of the time, just a bunch of strings from the web. And it gets even worse if you need to include authentication information:

// Configure HttpClient to authenticate preemptively
// by prepopulating the authentication data cache.
 
// 1. Create AuthCache instance
AuthCache authCache = new BasicAuthCache();
 
// 2. Generate BASIC scheme object and add it to the local auth cache
BasicScheme basicAuth = new BasicScheme();
authCache.put("com.bluemonkeydiamond.sippycups", basicAuth);
 
// 3. Add AuthCache to the execution context
BasicHttpContext localContext = new BasicHttpContext();
localContext.setAttribute(ClientContext.AUTH_CACHE, authCache);
 
// 4. Create client executor and proxy
httpClient = new DefaultHttpClient();
ApacheHttpClient4Executor executor = new ApacheHttpClient4Executor(httpClient, localContext);
client = ProxyFactory.create(BookStoreService.class, url, executor);

I have found that Rest-assured provides a much nicer API for writing client invocations.
Officially, the aim of the project is to provide a testing and validation framework, and most of the tutorials out there cover those aspects, like Heiko Rupp's recent one: http://pilhuhn.blogspot.nl/2013/01/testing-rest-apis-with-rest-assured.html

I suggest, instead, using it as a development tool to experiment and write REST invocations very rapidly.

What is important to know about rest-assured:

  •  it implements a Domain Specific Language thanks to a fluent API
  •  it is a single Maven dependency
  •  it exposes an almost completely shared style for both XML and JSON response objects
  •  it relies on Apache HttpClient

So, I'll show you a bunch of real world use cases and I will leave you with some good links if you want to know more.

As with most DSLs in Java, it works better if you statically import the most important objects:
import static com.jayway.restassured.RestAssured.*;
import static com.jayway.restassured.matcher.RestAssuredMatchers.*;

Basic usage:
get("http://api.twitter.com/1/users/show.xml").asString();

That returns:

  Sorry, that page does not exist

Uh oh, some error. Yeah, we need to pass some parameter:
with()
    .parameter("screen_name", "resteasy")
.get("http://api.twitter.com/1/users/show.xml").asString();

That returns:

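<user>
  <id>27016395</id>
  <name>Resteasy</name>
  <screen_name>resteasy</screen_name>
  <location></location>
  <profile_image_url>http://a0.twimg.com/sticky/default_profile_images/default_profile_0_normal.png</profile_image_url>
  <profile_image_url_https>https://si0.twimg.com/sticky/default_profile_images/default_profile_0_normal.png</profile_image_url_https>
  <url></url>
  <description>jboss.org/resteasy

JBoss/Red Hat REST project</description>
  <protected>false</protected>
  <followers_count>244</followers_count>
  <profile_background_color>C0DEED</profile_background_color>
  <profile_text_color>333333</profile_text_color>
  <profile_link_color>0084B4</profile_link_color>
  <profile_sidebar_fill_color>DDEEF6</profile_sidebar_fill_color>
  <profile_sidebar_border_color>C0DEED</profile_sidebar_border_color>
  <friends_count>1</friends_count>
  <created_at>Fri Mar 27 14:39:52 +0000 2009</created_at>
  <favourites_count>0</favourites_count>
  <utc_offset></utc_offset>
  <time_zone></time_zone>
  <profile_background_image_url>http://a0.twimg.com/images/themes/theme1/bg.png</profile_background_image_url>
  <profile_background_image_url_https>https://si0.twimg.com/images/themes/theme1/bg.png</profile_background_image_url_https>
  <profile_background_tile>false</profile_background_tile>
  <profile_use_background_image>true</profile_use_background_image>
  <geo_enabled>false</geo_enabled>
  <verified>false</verified>
  <statuses_count>8</statuses_count>
  <lang>en</lang>
  <contributors_enabled>false</contributors_enabled>
  <is_translator>false</is_translator>
  <listed_count>21</listed_count>
  <default_profile>true</default_profile>
  <default_profile_image>true</default_profile_image>
...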

Much better! Now, let's say that we want only one element of this big XML string:
with()
    .parameter("screen_name", "resteasy")
.get("http://api.twitter.com/1/users/show.xml")
    .path("user.profile_image_url")

And here's our output:
http://a0.twimg.com/sticky/default_profile_images/default_profile_0_normal.png

What if it was a JSON response?
with()
    .parameter("screen_name", "resteasy")
.get("http://api.twitter.com/1/users/show.json")

And here's our output:
{"id":27016395,"id_str":"27016395","name":"Resteasy","screen_name":"resteasy","location":"","url":null,"description":"jboss.org\/resteasy\n\nJBoss\/Red Hat REST project","protected":false,"followers_count":244,"friends_count":1,"listed_count":21,"created_at":"Fri Mar 27 14:39:52 +0000 2009","favourites_count":0,"utc_offset":null,"time_zone":null,"geo_enabled":false,"verified":false,"statuses_count":8,"lang":"en","status":{"created_at":"Tue Mar 23 14:48:51 +0000 2010","id":10928528312,"id_str":"10928528312","text":"Doing free webinar tomorrow on REST, JAX-RS, RESTEasy, and REST-*.  Only 40 min, so its brief.  http:\/\/tinyurl.com\/yz6xwek","source":"web","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"geo":null,"coordinates":null,"place":null,"contributors":null,"retweet_count":0,"favorited":false,"retweeted":false},"contributors_enabled":false,"is_translator":false,"profile_background_color":"C0DEED","profile_background_image_url":"http:\/\/a0.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_image_url_https":"https:\/\/si0.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_tile":false,"profile_image_url":"http:\/\/a0.twimg.com\/sticky\/default_profile_images\/default_profile_0_normal.png","profile_image_url_https":"https:\/\/si0.twimg.com\/sticky\/default_profile_images\/default_profile_0_normal.png","profile_link_color":"0084B4","profile_sidebar_border_color":"C0DEED","profile_sidebar_fill_color":"DDEEF6","profile_text_color":"333333","profile_use_background_image":true,"default_profile":true,"default_profile_image":true,"following":null,"follow_request_sent":null,"notifications":null}

And the same interface understands JSON object navigation. Note that the navigation expression does not include "user", since that element is not present in the full JSON response:
with()
    .parameter("screen_name", "resteasy")
.get("http://api.twitter.com/1/users/show.json")
    .path("profile_image_url")

And here's our output:
http://a0.twimg.com/sticky/default_profile_images/default_profile_0_normal.png

Now an example of Path Parameters:
with()
    .parameter("key", "HomoSapiens")
.get("http://eol.org/api/search/{key}").asString()
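
If the {key} notation is new to you, the idea behind a path parameter is plain placeholder substitution inside the URL. The following Java sketch shows that general idea only; it is not how rest-assured implements it, PathTemplate is a hypothetical helper of mine, and it skips the URL encoding that a real client would apply to the values.

```java
import java.util.Collections;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PathTemplate {

    // Matches placeholders such as {key}
    private static final Pattern PLACEHOLDER = Pattern.compile("\\{([^/{}]+)\\}");

    /** Replaces every {name} in the template with the matching value. */
    public static String expand(String template, Map<String, String> values) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = values.get(m.group(1));
            if (value == null) {
                throw new IllegalArgumentException("No value for {" + m.group(1) + "}");
            }
            // quoteReplacement keeps "$" and "\" in the value literal
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> values = Collections.singletonMap("key", "HomoSapiens");
        // prints http://eol.org/api/search/HomoSapiens
        System.out.println(expand("http://eol.org/api/search/{key}", values));
    }
}
```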

Information about the HTTP response:
get("http://api.twitter.com/1/users/show.xml").statusCode();
get("http://api.twitter.com/1/users/show.xml").statusLine();

An example of Basic Authentication:
with()
  .auth().basic("paolo", "xxxx")
.get("http://localhost:8080/b/secured/hello")
  .statusLine()
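
For comparison, this is roughly what HTTP Basic authentication boils down to on the wire: an Authorization header carrying the Base64 encoding of "user:password". The sketch below uses only the JDK (java.util.Base64, so Java 8 or later); BasicAuthHeader is a hypothetical helper name, and I am not claiming this is the code path rest-assured follows internally.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class BasicAuthHeader {

    /** Builds the value of the Authorization header for Basic authentication. */
    public static String value(String user, String password) {
        String credentials = user + ":" + password;
        return "Basic " + Base64.getEncoder()
                .encodeToString(credentials.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // prints Authorization: Basic cGFvbG86eHh4eA==
        System.out.println("Authorization: " + value("paolo", "xxxx"));
    }
}
```

Keep in mind that Base64 is an encoding and not encryption, so Basic authentication should only travel over HTTPS.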

An example of Multipart Form Upload:
with()
    .multiPart("file", "test.txt", fileContent.getBytes())
.post("/upload")

Maven dependency:

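<dependency>
    <groupId>com.jayway.restassured</groupId>
    <artifactId>rest-assured</artifactId>
    <version>1.4</version>
    <scope>test</scope>
</dependency>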


And here is a Groovy snippet that shows the JAXB support. It can be pasted and executed directly in groovyConsole, since Grapes fetches the dependencies and adds them to the classpath automatically:
@Grapes([  
    @Grab("com.jayway.restassured:rest-assured:1.7.2")
])
import static   com.jayway.restassured.RestAssured.*
import static   com.jayway.restassured.matcher.RestAssuredMatchers.*
import  javax.xml.bind.annotation.*


@XmlRootElement(name = "user")
@XmlAccessorType( XmlAccessType.FIELD )
    class TwitterUser {
        String id;
        String name;
        String description;
        String location;

        @Override
        String toString() {
            return "Id: $id, Name: $name, Description: $description, Location: $location"
        }

    }

println with().parameter("screen_name", "resteasy").get("http://api.twitter.com/1/users/show.xml").as(TwitterUser.class)



This is just a brief list of the features of the library, just to give you an idea of how easy it is to work with it. For further examples I suggest you read the official pages here:

https://code.google.com/p/rest-assured/wiki/Usage

Or another good tutorial here with a sample application to play with:

http://www.hascode.com/2011/10/testing-restful-web-services-made-easy-using-the-rest-assured-framework