20 March 2006

ETech 2006: ...Better Collaborative Filtering

The talk by Charles Armstrong, an ethnographer and CEO of Trampoline Systems, was quite interesting. He studied under sociologist Lord Young of Dartington and at the School for Social Entrepreneurs (SSE). While studying at SSE he spent a year on the Isles of Scilly (photos), where he helped people learn computer skills. It was there that he studied the flow of information (or news) within the community.

What was intriguing was how efficiently information valuable to a specific person or group was passed on to them as it came to the island (I can't remember which island he was on; I think the population was about 100). Usually a boat would come to the island daily delivering supplies, and information or news would be passed to the people helping unload it. For example, if the boat would not be coming tomorrow, the news would reach everyone who planned to take the boat to the mainland the next day well before the end of the day.

His talk really had me thinking about Watts and Newman's paper Identity and Search in Social Networks. I think Watts's description in his book Six Degrees: The Science of a Connected Age somewhat models what Armstrong observed.

His company, Trampoline Systems, provides "a technology that helps groups of people connect, collaborate and manage large quantities of information." From the presentation I concluded that the system would certainly provide an excellent means to do social network analysis of a company via its emails. I believe the system involves configuring your current servers to send copies of emails to a "database". From the Tos and Froms a network can be built, as in the photo on the left. (I'm not positive about the last three sentences.) Some type of content analysis of the emails is under development (maybe some type of automatic categorization) to facilitate the sharing of information. I didn't get how this part of the system will work. I talked to him later in the day about the system, but unfortunately we did not have time to go into any details.

To get a non-technical understanding of good versus bad social or project networks within a company, I would recommend The Hidden Power of Social Networks: Understanding How Work Really Gets Done in Organizations by Robert L. Cross, et al.

17 March 2006

Sony Bravia Commercial


Both photos are by sepiatone in his Sony Commercial flickr set.

I know this is old news but this is such a cool commercial. I so wish I could have been there in person.

15 March 2006

ETech 2006: Future of Interfaces Is Multi-Touch


Jeff Han's demo of his multi-touch interface was outstanding and pretty much stole the show for the day (March 7). The applications are limitless and could open up a whole new facet of user interfaces. Further, Han commented that the hardware technology was inexpensive. To appreciate the technology you really have to see it. There is a demo video on Han's web site and here is a video of his demo at ETech from YouTube. Take note in the demo videos how he uses hand gestures to change the angle of view of 3D images.

Photo courtesy of James Duncan Davidson/O'Reilly Media.

14 March 2006

ETech 2006 Session: Scaling Fast and Cheap

Introduction
Cal Henderson's all-day talk on "Scaling Fast and Cheap - How We Built Flickr" is packed with information about building a scalable enterprise web application using predominantly open source software, some custom software, and very little commercial software. Anyone from a beginner to an experienced web application developer would find the talk useful, especially since Cal was open to any questions. However, if you want a talk on the history of Flickr, this is not the talk for you. Currently Flickr runs on about 200 servers and has about 600 terabytes of storage; not including redundancy, it is about 200 terabytes. There are two data centers, one in Texas and one in Virginia. The development team is about 10 engineers, and operational support is now handled by Yahoo's operations team, but previously it was about 4 people.

Keep it simple
He started out his talk with a picture of the godfather of computer science, Donald Knuth, and his famous quote "premature optimization is the root of all evil." (Misattributed quote, see comments). The point he drove home was not to waste manpower optimizing the small stuff or the stuff that won't need optimizing. Almost all optimization will be in the database and the hardware configuration of your disk storage system.

In that vein he recommended buying commodity hardware and installing a standard version of Linux, with one exception: use the compiled binaries of MySQL from www.mysql.com and don't try to compile your own. Further, run MySQL on a 64-bit machine to get around the memory limits of 32-bit machines.

Don't lock yourself into a hardware platform if you can help it, especially if you are growing fast. If you do, check that the components your vendor uses (disk drives, network cards, etc.) stay compatible with your version of Linux. If you can't do that, then plan ahead for some time to develop for a possibly new version of Linux.

The main software components of Flickr are Linux (I think RedHat, Debian, SuSE), JVM, Smarty, MySQL, Apache2, PHP4. Consistency across your systems is key to ensuring ease of maintenance and ease of development.

Use Version Control for Everything
Use source version control (CVS or Subversion) for everything, and, as hard as it is, put useful comments in the version control system. I personally like to recall a quote from (I think) Damian Conway: "Document your code as if a homicidal maniac who knows where you live will be taking over development of your code." To help with this, use a simple CVS or Subversion client program, and let your developers use whichever one they like. Also, put everything into the version control system: application code, system configuration files (apache, php, etc.), documentation, etc. Set standards for naming files, database tables, functions, objects, etc. Don't worry about settling on the perfect one; the greatest benefit is in everyone using the same standard. Use a bug tracking system like FogBugz, Mantis, RT, or Bugzilla. Get disciplined and fix bugs before doing development on the next release. Fix the easy bugs first (low hanging fruit). Categorize your bugs: P1, the production site is down; P2, the bug causes the staging site to go down; P3, it does not bring the site down but really needs to be fixed to maximize the user experience; P4, it is a bug but no one will ever notice it.

Local, Development, Staging, Production
Application development occurs in four segments: local, development, staging, and production. Local is the developer's own machine, where the component of the system they are working on is installed. Development is the developing version of the site, the lowest level at which all the components are together. Staging is the almost-live version of the site, the in-house test site. Production is the live site and is only updated from the staging site via a very simple interface: one button click. Scripts are written to deploy the latest version of the staging code to the production site. Hence, it is easy to roll back to a previous version of the code on the production server from the staging server. This is important because you will never be able to fully test a web application. There will always be bugs that can only be revealed on the live site.
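
The talk did not show the actual deploy scripts, so the following is only a toy Python sketch of the idea as I understood it: production receives whatever staging is running, and keeps enough history to step back. The class name, method names, and revision labels are all made up for illustration.

```python
class Production:
    """Toy model of a production site that remembers what was deployed to it."""

    def __init__(self):
        self.history = []  # revisions deployed so far, oldest first

    @property
    def live(self):
        # The newest deployed revision is the one serving traffic.
        return self.history[-1] if self.history else None

    def deploy(self, staging_revision: str) -> None:
        # The "one button": production only ever receives what staging is running.
        self.history.append(staging_revision)

    def rollback(self) -> None:
        # Drop the newest revision so the previous one becomes live again.
        if len(self.history) < 2:
            raise RuntimeError("no previous revision to roll back to")
        self.history.pop()


prod = Production()
prod.deploy("r101")   # hypothetical revision labels
prod.deploy("r102")   # a bug shows up only on the live site...
prod.rollback()       # ...so step back to the previous revision
```

After the rollback, `prod.live` is the earlier revision again. A real script would also have to move files and restart services, which this sketch leaves out.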

Unicode
Flickr supports Unicode, and it is easy to support this in a web application. The hard part is data integrity: what do we do with invalid Unicode? First, set up a data integrity policy for the site. Flickr filters the data (comments, titles, etc.) before it is stored to ensure it is valid Unicode.
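
Flickr's own filter (written in PHP) was not shown, so here is just a minimal Python sketch of one possible policy, assuming the input is supposed to be UTF-8: check whether the bytes are well formed, and replace anything invalid before storing. The function names are my own.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw is well-formed UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def clean_utf8(raw: bytes) -> str:
    """Decode raw as UTF-8, replacing each invalid sequence with U+FFFD."""
    return raw.decode("utf-8", errors="replace")


good = "caf\u00e9".encode("utf-8")  # properly encoded title
bad = b"caf\xe9"                    # Latin-1 byte, not valid UTF-8

cleaned = clean_utf8(bad)           # the stray byte becomes U+FFFD
```

Replacing bad bytes is only one choice; a site could just as well reject the input outright. The point is to pick one policy and apply it before anything reaches the database.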

HTML and SQL Spoofing
Displaying user-entered HTML on your site is a really stupid idea from a security point of view since there are so many spoofing attacks. However, if you are going to do this, use an open source library like lib_filter to clean up the input first. SQL injection attacks are another problem. Use just-in-time escaping of the SQL input and never grant a database user more permissions than necessary.
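
To make the two problems concrete, here is a small Python sketch using only the standard library. It is not what Flickr does: lib_filter is a PHP library that allows a whitelist of tags, whereas this simply escapes all markup, and it uses bound query parameters (with SQLite standing in for MySQL) rather than hand escaping. The table and function names are made up.

```python
import html
import sqlite3


def add_comment(conn: sqlite3.Connection, photo_id: int, text: str) -> None:
    # Values are bound at execute time; user text is never spliced into the SQL string.
    conn.execute(
        "INSERT INTO comments (photo_id, body) VALUES (?, ?)", (photo_id, text)
    )


def render_comment(text: str) -> str:
    # Escape at output time so user-entered markup is displayed, not interpreted.
    return "<p>" + html.escape(text) + "</p>"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (photo_id INTEGER, body TEXT)")

hostile = "nice'); DROP TABLE comments; --<script>alert(1)</script>"
add_comment(conn, 1, hostile)

stored = conn.execute(
    "SELECT body FROM comments WHERE photo_id = ?", (1,)
).fetchone()[0]
rendered = render_comment(stored)
```

The hostile string is stored exactly as typed and the table survives; it is only made harmless at the moment it is written into a page.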

Email
Even with all the RFCs (561, 822, 1521) that define email standards, a lot of email still does not adhere to them, especially email from mobile phone providers. Flickr uses PEAR's Mail_mimeDecode, iconv, and custom code to parse email. When you send a photo to your Flickr account from a cell phone, a lot of code goes into getting the title, comment, and photo out of the email. Some providers seem to take advantage of this feature to advertise, either by replacing the subject line of the email with an advertisement or by including icons at the bottom of the email.
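
For a feel of what the easy case looks like, here is a rough Python sketch using the standard library's email package (not the PEAR code Flickr uses). It pulls a title, comment, and image out of a well-formed message; the sample message, addresses, and function name are invented, and none of the real-world mess Cal described (broken headers, odd charsets, provider ads) is handled.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def extract_photo_post(raw: bytes):
    """Return (title, comment, image_bytes); missing parts come back as '' or None."""
    msg = message_from_bytes(raw, policy=policy.default)
    title = (msg["Subject"] or "").strip()
    comment, image = "", None
    for part in msg.walk():
        if part.get_content_type() == "text/plain" and not comment:
            comment = part.get_content().strip()
        elif part.get_content_maintype() == "image" and image is None:
            image = part.get_payload(decode=True)  # undo base64 transfer encoding
    return title, comment, image


# Build a sample message roughly the way a phone gateway might (contents made up).
sample = EmailMessage()
sample["Subject"] = "Sunset"
sample["From"] = "phone@example.com"
sample["To"] = "upload@example.com"
sample.set_content("Taken from the ferry")
sample.add_attachment(
    b"\xff\xd8\xff\xe0fake-jpeg", maintype="image", subtype="jpeg", filename="p.jpg"
)

title, comment, image = extract_photo_post(sample.as_bytes())
```

A provider that overwrites the subject or appends icon images would defeat this immediately, which is presumably where the custom code comes in.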

Bottlenecks
Bottlenecks almost always occur either because of swapping to disk caused by high memory usage or because of non-optimized database queries. While it is possible to have a CPU bottleneck if your web application requires some heavy crunching (image or video editing), it is very rare. Don't look for a bottleneck until it occurs. Remember, premature optimization is the root of all evil. Try to build some stats gathering into your site to help locate any bottlenecks, e.g., add a feature to optionally log the time of queries or other processes. MySQL indexes are a "bizarre black magic"; when it comes to optimizing them, consider hiring a MySQL expert.
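
As a sketch of the optional timing idea, assuming a Python code base (Flickr's is PHP), a small context manager can wrap any query or process and record how long it took. The switch, logger name, and list of timings are all hypothetical.

```python
import logging
import time
from contextlib import contextmanager

log = logging.getLogger("timing")
TIMING_ENABLED = True   # hypothetical switch; turn off to skip the bookkeeping
timings = []            # (label, seconds) pairs collected so far


@contextmanager
def timed(label: str):
    """Record the wall-clock time spent inside the with-block."""
    if not TIMING_ENABLED:
        yield
        return
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        timings.append((label, elapsed))
        log.info("%s took %.3f ms", label, elapsed * 1000)


with timed("fake query"):
    sum(range(100_000))  # stand-in for a database call
```

Wrapping the database layer once this way means the stats are already there when a bottleneck does show up, without anyone optimizing ahead of time.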

Site Monitor
When it comes to monitoring the site, Cal was very happy with Ganglia for gathering all kinds of trend stats (past hour, day, week, month, year) and Nagios for real-time health monitoring (is everything up and running as it should be).

Scalability
Cal defined scalability as horizontal scaling: buying more servers, not buying bigger servers. He centered mostly on how to get MySQL to scale. He described the different MySQL backends (MyISAM, BDB, InnoDB, and Heap), the pros and cons of each, and how best to use them. The most interesting part was their use of MySQL replication, which does not get you true horizontal scaling. It lets you handle a lot more reads, but every database still receives the same number of writes. Most web applications do far more reads than writes to a database. I won't get into the details of their master/slave database setup. One drawback is replication lag, which you can experience on Flickr with tags. If you add a tag to a photo and then search for photos with that tag a few seconds later, the photo may not be returned. The problem is that your search request was handled by one of the database slaves that had not been updated yet.
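
Replication lag is easier to see in a toy model. The Python sketch below uses plain dictionaries in place of databases, with writes going to a master and reads going to a randomly chosen replica; it has nothing to do with Flickr's real setup, and the class and key names are invented.

```python
import random


class ReplicatedStore:
    """Toy model: one master, several read-only replicas, replication done in batches."""

    def __init__(self, replica_count: int = 2, seed: int = 0):
        self.master = {}
        self.replicas = [{} for _ in range(replica_count)]
        self.pending = []               # writes not yet shipped to the replicas
        self.rng = random.Random(seed)

    def write(self, key, value):
        # Every write lands on the master and must later be replayed on every replica.
        self.master[key] = value
        self.pending.append((key, value))

    def read(self, key):
        # Reads are spread across replicas; that is where the extra read capacity comes from.
        return self.rng.choice(self.replicas).get(key)

    def replicate(self):
        # Replay the queued writes on each replica, then clear the queue.
        for key, value in self.pending:
            for replica in self.replicas:
                replica[key] = value
        self.pending.clear()


store = ReplicatedStore()
store.write("photo:1:tags", ["sunset"])
before = store.read("photo:1:tags")  # replica has not caught up yet
store.replicate()
after = store.read("photo:1:tags")   # now the tag is visible
```

Here `before` comes back empty and `after` has the tag, which is the missing-tag effect in miniature. Note also that adding replicas adds read capacity only: each one still has to apply every write.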

True scalability comes with MySQL 5 and Oracle RAC ($25K per processor). MySQL 5 is not ready for prime time, so the only options are to use Oracle or to write your own code to do this, which gets really complicated fast. Not recommended unless you really know what you are doing and have simple database SELECTs. Currently Flickr's architecture does not scale horizontally. They just keep stretching their current architecture.

Storage
I got the impression from the last part of Cal's talk that the image storage system is really not too complicated. He didn't get into the details of what hardware systems they are using, but it is grown by just adding more hardware (it scales horizontally). The location of the images on the system is stored in the database. I think they have written their own code or protocol to fetch the image from the storage system instead of using FTP or NFS, most likely for speed.

He ran out of time before he could talk about RSS support and the flickr API.

Building Scalable Web Sites by Cal Henderson will be available soon.