Because The Air Is Free -- Pablo Duboue's Blog


Google-programming Considered Harmful

Sun, 06 Dec 2020 00:57:43 -0800

Tags: technical, education


Perl has been my go-to language for scripts that need to be written fast and executed only once. I have become very effective at writing Perl one-liners that do things such as computing the most frequent URLs in Wikipedia:

bzcat enwiki-20180901-pages-articles-multistream.xml.bz2 | \
  perl -ne '@a=split(/https?\:\/\//,$_);@a=map{s/\/.*//;s/\|.*//;$_}@a;shift@a;print join("\n",@a)."\n" if@a'| \
  sort|uniq -c|sort -n|head

(Using a parallel implementation of bzip2, the whole run takes about five minutes; the most popular websites are FactFinder from census.gov, the BBC and naco.org, in case you were curious.)
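For readers who do not speak Perl, here is a rough Python sketch of the extraction and counting steps; the regular expression is my approximation of the one-liner's splitting logic, not a byte-for-byte port:

```python
import re
from collections import Counter

# Grab the host part of every http(s) URL, stopping at '/' or '|'
# like the Perl substitutions above (approximate, not exact).
HOST_RE = re.compile(r'https?://([^/|\s"<>\]]+)')

def top_hosts(lines, n=10):
    """Return the n most frequent URL hosts found in the given lines."""
    counts = Counter()
    for line in lines:
        counts.update(HOST_RE.findall(line))
    return counts.most_common(n)
```

Feeding the decompressed dump line by line to `top_hosts` plays the role of the `sort | uniq -c | sort -n | head` part of the pipeline, at the cost of keeping the counts in memory.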

But that wasn't always the case. The thing with Perl is that I remember very well when and how I learned it. It was 1999 and I had just started my PhD. The topic was information extraction from biomedical texts using statistical techniques. We were processing ten years of the journal of the European Molecular Biology Organization. Coming from an undergraduate thesis using Haskell to parse Spanish texts, I started doing the detagging in Haskell. That didn't work fast enough. My other choice was Java, but by then I was behind on the work; I needed something fast to code and show results. My advisor at the time, Dr. Hatzivassiloglou (who was actually my second advisor and not my thesis advisor; I think that deserves a post of its own at some point), was very keen on Perl, so I decided to learn the language of the camel.

To learn it, I got two books: Learn Perl in 24 Hours and Effective Perl Programming. One was the quick intro to get going while the other dealt with more obscure, advanced topics. With the basic book I was hacking Perl furiously by the end of the week, and with the advanced book I managed to write more complex integration scripts. By the end of my PhD I had put in the 10,000 hours and was very proficient in the language, a skill that remains to this day.

The path Book => Practice => Mastery has stayed with me ever since, for the few new languages I have picked up since then (Scala, Python, Elm, node.js).

These days I'm learning Go, and I'm trying to avoid the problem that has plagued my Python learning (I also started learning Python by reading a book, but things went south from there).

It has to do with Google-programming, the practice of doing web search upon web search to build the code. This way of programming works and gets the job done, but I find it ineffective. This post discusses my thoughts on the topic.

Why do I find it ineffective?

  • It involves a mindset change, going from "speaking programming in my head" to "speaking English in my head", which takes me out of "the flow".

  • The example code found online needs adaptation to the codebase I'm writing. This adaptation can be quite onerous, again taking my concentration away from what I'm coding.

  • Example code on the Web might have security and/or stability issues introduced by the very act of simplifying the code to fit an example.

  • Leaving the IDE for the Web browser exposes you to ambush by well-trained algorithms intent on stealing your attention. The advertisements you see will be very relevant to you and might derail your thought process for a long time.

  • This is a minor issue for most people, but it leaks information to the world and uses a lot of traffic and resources, both at Google and at target websites such as Stack Overflow.

  • Most importantly, the learning doesn't happen in the form of "programming need -> code structure" but instead in the form of "programming need -> Google query". This is not a minor issue. Google-programming makes you dependent on using Google over and over again. It makes programming off-line impossible, and off-line programming is desirable for a variety of reasons (long travel, living in areas with spotty internet, going off-the-grid to concentrate on a complex piece of code).

As such, for Golang, I'm trying something else. Again, I started with two books (Go in Action and The Go Programming Language; both are great books that I highly recommend). I then used Google-programming to bootstrap working examples (you can see one here), and now I resist the temptation to Google-program new code, instead searching my existing code base for working examples (that originated from Google-programming).

By the way, I'm not saying that searching in Google is bad for learning to program, or that there is no space within the programming craft for web searches. The consensus seems clear that you are not a bad programmer if you do. My points here might also be a generational issue; other people seem to be fine with Googling (though I wonder how that author would have fared having to program a second Netty server). There is a quote in there attributed to Albert Einstein: 'Never memorize something that you can look up.' I don't think it applies to what we are talking about here. Einstein used some heavy tensor mathematics in relativity; I don't think he would advocate against making such tools second nature. But as with everything in life, YMMV!

Modern human nature is artificial

Sun, 06 Dec 2020 00:57:43 -0800

Tags: personal


Almost a year ago, my son was born. He came about one month earlier than expected and almost crashed NeurIPS, which my very pregnant wife and I were attending together.

The months that followed have involved a lot of adaptation for the three of us. Annie went back to work in April, and I took care of the little boy from then until last week, when we got a spot in daycare. Parenting has been a lot of work, and the pandemic has crushed some of our earlier plans of enjoying more time outdoors with the baby (but we have done fine, and some people have had their whole lives crushed by the pandemic, so there is nothing we can really complain about).

Besides mentioning that I'm now a very old new dad, this blog post is about some cognitive dissonance brought on by raising the boy. In my mind, I had a dichotomy of "natural vs. artificial". We are "nature" and then, I thought, at some point we learn the artificial. But nature comes first, because we are animals. Hence the narrative of "returning to nature", of "going back to where we belong", put forth by people who love the outdoors (just in case we never talked about it: I grew up alone in a forest; I had enough outdoors for the rest of my life, so they are not my thing).

Seeing the boy grow and learn to make sense of the world, I realized that narrative is incorrect. His world is 100% artificial and in sync with our current technology. The technology of the generation he is born into is glued to the neural connections he makes as he grows. For him, a light switch is as much a part of the world as dew is to a spider. And given the pandemic and the fact that we live downtown, actual nature is odd to him. The first time he touched the bark of a tree he was truly puzzled. It was something new to him that felt... unnatural.

Thus, we can talk about "incorporating more nature into our lives" or "moving into nature", but the idea of "returning" only makes sense in an ancestral manner. And even then it is returning in a very limited sense (nobody is advocating living without an abode; even ancestral humans favored caves for a reason). But the point remains: the nature of modern humans is artificial. The connections generated in our brains at a very early age are there to make sense of carpets and furniture, of lighting fixtures and keyboards. That (seemingly obvious) fact had escaped me, and I wanted to share it with you all.

I will be sharing more parenting stories as they come my way, trying to respect his privacy as much as possible. Cheers.

Introducing P.A.P.E.R.

Sat, 21 Nov 2020 22:39:37 -0800

Tags: floss, academic


I got my first paper published in 1996, at a conference in Antofagasta, Chile (the bus trip there was gruelling; that might be worth another post). It was in modeling and simulation, joint work with Nicolas Bruno (we both ended up at Columbia University for our PhDs; that's yet another blog post). From there I went on to my undergraduate thesis on Spanish parsing using LFGs in Haskell. Later I continued working on modeling and simulation before starting the PhD. During the PhD I went through three advisers, two in my first year, working on word sense disambiguation before moving to natural language generation in the medical domain. My final years of PhD were dedicated to natural language generation for intelligence analysis. At IBM Research, I moved into question answering, initially in the Human Resources domain, with a detour into expert search before settling on the Watson Jeopardy! project. Each change of topic and domain involved extensive background research to get myself up to speed. After I left IBM and started consulting, it got even worse, so I won't bore you with the details. How to keep track of all that information in my head?

In 2012, I came to terms with the fact that I had to spend less time reading research and more time tracking what I read. Every time we read something, it is for a purpose. Without keeping extra metadata, it starts to become akin to not having read the papers at all. It is not that I have a particularly bad memory, but after a few hundred papers, the names of the authors and the titles start escaping me. Interestingly, I do remember under which circumstances I found a paper, or where (the place, the device, the printout) I read it. Therefore, I decided to use a tool to keep track of such metadata.

After surveying the available tools, I decided to write my own, which I did and have been using for many, many years. I credit having this tool with managing the extensive number of sources I had to go through when writing my book. I gave it the silly name P.A.P.E.R. (Pablo's Artifacts and Papers Environment plus Repository). This month I open sourced the tool at a lightning talk at Vancouver's Learn Data Science meetup. This post describes the tool, which is seeking beta testers, users and contributors (your help, basically).

I wrote a book!

Sun, 08 Nov 2020 20:09:07 -0800

Tags: academic, personal


I haven't been blogging much as I was busy writing The Art of Feature Engineering, a book that came out in the summer from Cambridge University Press. Here are some thoughts on the book-writing process itself, condensed and expanded from its introduction, plus some thoughts on earlier criticism it has encountered.

My interest in feature engineering started while working with David Gondek and the rest of the Jeopardy! team at IBM TJ Watson Research Center in the late 2000s. The process and ideas in the book draw heavily from that experience. The error analysis sessions chaired by David Ferrucci were grueling two-day affairs of looking at problem after problem and brainstorming ideas for how to address them. It was a very stressful time; hopefully this book will help you profit from that experience without having to endure it. Even though it has been years since we worked together, the book exists thanks to their efforts, which transcend the show itself.

After leaving IBM, during my years of consulting, I have seen countless professionals abandon promising paths due to a lack of feature engineering tools. I wrote the book for them.

Summary for Leo Breiman: The Two Cultures

Sat, 23 Mar 2019 08:13:20 -0700

Tags: academic


Vancouver has a great machine learning paper-reading meet-up aptly named Learn Data Science. Every two weeks, the group gets together and discusses a paper voted on at the previous meeting. There are no presentations; we just go person by person around the room discussing the paper. I find the format very positive, and I try to attend whenever the selected paper is aligned with my interests and I have time (lately I have not been able to attend much, as I have been writing a feature engineering book, but I'll blog about that another day). Last meet-up they picked a paper I had suggested, so this blog post is a short summary of the paper, to help people reading it.

In 2005, Prof. Leo Breiman (of Random Forests fame) passed away after a long battle with cancer. He was 77 years old. This information comes from his obituary, which also highlights a very varied life captured by the paper for the Learn DS meetup of March 27th. Four years before passing away, when he was well established and well regarded in the field, he wrote Statistical Modeling: The Two Cultures, a journal article for Statistical Science, Volume 16, Issue 3 (2001), 199-231. The paper captures the field at a pivotal moment in its history, as it was about that time that "Data Science" took off (for example, Columbia University started publishing The Journal of Data Science in 2003). The paper itself is written in a clear tone with an unambiguous message, but it was published with four letters and a rejoinder. The subtleties in the letters and rejoinder present the most interesting material, particularly as the letter writers included well-known statisticians such as D.R. Cox and Emanuel Parzen. I will summarize this content in turn.

The Value of Outliers

Tue, 03 Oct 2017 02:49:30 -0700

Tags: academic, philosophical


Moving coast-to-coast has taken most of my energy since the last post, but we're finally established in Vancouver, so I can get back to blogging more regularly. This post is about a piece of advice from the classic book How To Lie With Statistics, extended to groups and society in general. It continues ideas from two other blog posts (What Random Forests Tell Us About Democracy and Have Humans Evolved to be Inaccurate Decision Makers?) regarding statistics, decision making and politics.

Let's sample 10 (pseudo)random numbers from a normal distribution centered around 100:

       >>> import random
       >>> [int(random.normalvariate(100, 10)) for x in range(10)]
       [85, 99, 97, 78, 87, 93, 91, 112, 90, 91]

Now, if you look at these numbers, you'll be tempted to conclude that 112 is just... wrong. A measurement error. That it does not belong there. However, the average of the 10 numbers is 92.3, still far from the true mean (100) but within the sigma we used to generate the sample (10). If we were to drop 112, the average of the remaining numbers would go down to 90.1, making it a worse estimator than before.
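The arithmetic is easy to check; a quick sketch in Python, reusing the sample printed above:

```python
import statistics

sample = [85, 99, 97, 78, 87, 93, 91, 112, 90, 91]

# Mean of the full sample, and mean after dropping the "outlier" 112.
full_mean = statistics.mean(sample)
trimmed_mean = statistics.mean([x for x in sample if x != 112])

print(full_mean)               # 92.3
print(round(trimmed_mean, 1))  # 90.1
```

Dropping the outlier moves the estimate further from the true mean of 100, not closer.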

I believe the same happens in the realm of ideas. If each person has a piece of the truth, shutting down their contributions, irrespective of how far from the truth they might sound, will lead you farther away from the truth. A similar concept in the business world is Groupthink. This of course does not mean that outliers need to dominate, just that they should not be eliminated completely.

And if you haven't read the 1954 book by Darrell Huff, it is very short and makes for a great read. Its starting premise, that "democracy needs voters informed on basic statistical matters", is as relevant in our data-driven world as ever.

More Like This Queries on SQLite3 FTS

Mon, 12 Dec 2016 03:05:48 -0500

Tags: open source


SQLite is a great embedded SQL engine, and part of the Android platform. It has an extension, FTS ("Full Text Search"), that enables Boolean (that is, mostly unranked) search queries. For small collections of documents (like a blog), Boolean searches can be a viable temporary solution until a full solution (like Elasticsearch) can be deployed.

A common type of query supported by Elasticsearch is the More Like This (MLT) query, which lets you find documents similar to given ones. This type of query is also very useful for blogs, for example. If you're using SQLite FTS, you can approximate MLT by issuing an OR of all the terms in a document (in FTS, terms are lowercase; uppercase 'OR' and 'AND' are treated as logical Boolean operators). The only issue is obtaining the terms as assigned by FTS. To do so, it is necessary to access the FTS tokenizer by creating a virtual table, for example:

CREATE VIRTUAL TABLE tok1 USING fts3tokenize('simple');

Then, given a document, its terms can be extracted as follows (in PHP):

$tokens = $db->query("SELECT token FROM tok1 WHERE input='" . SQLite3::escapeString($all_text) ."';");

The query itself can be assembled by taking an OR over the set of unique terms:

$query = "";
$query_array = array();  // tracks terms already added to the query
while($row = $tokens->fetchArray()){
    $token = $row['token'];
    if(! isset($query_array[$token])){  // skip duplicate terms
        $query_array[$token] = 1;
        if(strlen($query) > 0) {
            $query .= " OR ";
        }
        $query .= $token;
    }
}

This query can then be run against the FTS table to provide makeshift MLT functionality.
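The same assembly step can be sketched in Python; the commented lines show how the tokens would come out of the `tok1` virtual table (the table and column names follow the PHP snippet above, and are assumptions on my part):

```python
def build_mlt_query(tokens):
    """OR together the unique tokens, preserving first-seen order,
    mirroring the PHP loop above."""
    seen = []
    for tok in tokens:
        if tok not in seen:
            seen.append(tok)
    return " OR ".join(seen)

# The tokens themselves would come from the fts3tokenize virtual table,
# e.g. with the sqlite3 module:
#   cur = db.execute("SELECT token FROM tok1 WHERE input = ?", (all_text,))
#   query = build_mlt_query(row[0] for row in cur)
```

Using a parameterized query (the `?` placeholder) sidesteps the manual escaping the PHP version needs.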

Some ideas for the A.I. XPRIZE

Wed, 23 Nov 2016 00:31:29 -0500

Tags: research


As I'm married to a current IBM employee, I'm disqualified from participating in the AI XPRIZE sponsored by IBM. So I'm putting my ideas in this blog post, in the hope they might inspire other people.

The XPRIZE follows the path of other multi-year challenges that have resulted in great accomplishments, such as commercial rockets. The AI challenge diverges from previous ones by being completely open ended: any major AI breakthrough can win the 3M USD prize.

What I'd like to see is a team tackling the improvement of scientific communications by leveraging recent advances in machine reading and taking them to the next level. I would like to see work on scientific metadata (possibly in the direction of the Semantic Web) that captures the main discoveries in a scientific paper. It should be feasible for a human to produce this metadata; the machine reading aspect is there just to bring enough value to the metadata during the transition to entice humans to self-annotate.

The case for this improvement lies in the number of researchers in many key fields who have physically no time to keep up to date with published results. A high-level summary, or the possibility to query "has anybody applied method X to problem Y", would be invaluable. Moreover, this type of setting allows for very constrained inference, simplifying scientific discovery of sometimes obvious, sometimes overlooked new findings.

I'm no stranger to this approach. My most cited paper came out of my contribution to a multidisciplinary project on automatic extraction and inference in the genomics domain (some form of automated inference was realized many years after I left the project).

This is further simplified by reporting standards in many scientific disciplines. Take, for example, the one from the American Psychological Association (thanks to Emily Sheepy for pointing me to that report). Such standards specify the types of contributions and the information to be expected in them, down to the headers of each section.

I believe all these pieces together have a chance of, if not winning, at least doing well in the competition. And irrespective of the competition, this technology deserves to exist and to help accelerate human discovery.

Regarding business aspects, it would be nice if the metadata format were open, with commercialization centered on extracting metadata from existing publications and on authoring tools. Extracting metadata and doing inference for profit is somewhat contrary to the goal of accelerating research, but that's speaking as a scientist, not a business person.

Let me summarize the concept with an example:

  1. Given an existing paper, for example Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network (Toutanova et al., 2003), produce metadata of this type:

    problem

    Part-of-Speech tagging (as a link to an ontology)

    solution

    Conditional Markov Model (or CMM with features x,y,z; all linked to an ontology)

    results

    97.24% accuracy over Penn Treebank WSJ (the metric and the corpus are also links)

  2. These entries can be further populated by the scientists upon publication, maybe with the help of an authoring tool.

  3. From this metadata, a system can answer "what is the best performance for POS and what technique does it use?" but also "POS and role labeling are similar problems (a fictional fact): both use similar techniques and both rank their performance similarly; however, the best performance in POS is obtained using skip decision lists (also a fictional fact), yet that technique has never been attempted on role labeling".
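To make the concept concrete, the metadata record from step 1 could be serialized as something like the following; the field names and the example.org ontology URIs are my own invention for illustration, not a proposed standard:

```python
import json

# Hypothetical machine-readable record for Toutanova et al. (2003);
# the example.org URIs stand in for real ontology links.
record = {
    "problem": {
        "label": "Part-of-Speech tagging",
        "ontology": "http://example.org/onto/pos-tagging",
    },
    "solution": {
        "label": "Conditional Markov Model",
        "features": ["x", "y", "z"],
        "ontology": "http://example.org/onto/cmm",
    },
    "results": {
        "metric": "accuracy",
        "value": 0.9724,
        "corpus": "Penn Treebank WSJ",
    },
}

print(json.dumps(record, indent=2))
```

A record like this is trivial for an authoring tool to emit and for a query system to index, which is the point of step 2.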

I wish the participants the best of luck, and I look forward to seeing great technology developed as a result of the challenge!
