Archive for the ‘Miro Guide’ Category

Lessons Learned from other sites

Thursday, November 1st, 2007

I’ve been reading some of the architecture articles on High Scalability (YouTube, Twitter, Amazon, Flickr). The descriptions of hardware aren’t particularly useful to me (Miro Guide is one, about to be two, servers; not that big), but I think a lot of the tips and lessons learned may be applicable.

  • Keep it Simple: I was a victim of this earlier, trying to make a system that was too complicated.
  • Don’t make the database the central bottleneck: We’re running into this problem now, where sometimes a simple insert will time out because the database is being hit so much.
  • Denormalize: I haven’t done this, nor do I have plans to in the near future, but it’s at least something to think about.
  • Avoid complex joins: I haven’t done this either, but it may be becoming a problem. The next item may help with this.
  • Cache everything: We don’t do this very well, and it’s one of my goals. We just added a Squid server, so giving some hints to that about what’s changed will be good. Also, using memcached a bit more.
  • Make your website an open service by creating an API: After the 1.0 release, this is one of the big features I want to work on. I think it’ll make Miro Guide a lot more useful both inside and outside of Miro and allow some refactoring of the actual guide.
  • Measure, measure, measure: I started doing this a couple weeks ago. Now I get a daily e-mail saying what parts of the Guide are running slowly. It’s a good way to see both what needs work and how well the caching is working.
  • Abstraction: again, something that I’d like to do in conjunction with the API.

Go rate some stuff!

Tuesday, October 16th, 2007

We rolled out ratings a while ago, but we recently made a big push for them with a Top Rated page and a blog post. Now I’m throwing my hat in the ring.

Currently, there are about 950 ratings (100 of them are “Not Interested”). I’d like to start working on personalized recommendations, but I’d like to have some more data. A lot more data. I’d like about 2000 ratings so that I’ve got a good range for adding this new feature. If you all could rate some more channels, I’d really appreciate it. Also, you’ll appreciate it when the guide can tell you about new channels you’ll like. :)

Weekly Status Report: 9/26-10/2

Tuesday, October 2nd, 2007

This week I worked on fixing some display issues on Firefox and Safari. I also updated the donate bar when we broke $50k, and I finished up a couple of scripts which check un-approved users. Old users were grandparented in, but new users have to give a valid e-mail address and confirm it with a code.

The rest of my time has been spent on caching. There’s some new code going live in the Guide, hopefully later tonight, which should make things a little better. Then Ben and I will be discussing the best way to keep making the guide faster. I’ve been doing some logging, and I’m getting some good data. Currently the slowest pages overall are the 1st channel submit step, the channel details pages and the subscription-hit pages. The slowest pages to render are the front page, the languages and popular pages, and the channel submit pages. Searching is also slow, but on average it takes about one second less to render than the popular page.

Better, faster, smarter: Python yesterday, today … and tomorrow

Sunday, September 30th, 2007

I got this video in Miro the other day. It’s a good overview of what’s changed in Python since 2.2.

Caching: database vs. memcached?

Friday, September 28th, 2007

Ben and I have been discussing how to do the caching on Miro Guide. To summarize Matt’s analysis from earlier this month, the most expensive queries we have are COUNT() queries over 500k records. Even with the correct indices, that’s a little expensive when it’s happening periodically (a couple seconds with no load on the DB) and really expensive when it happens a lot (often over 20 seconds). What Matt put together when he was doing that research was a small script which does that calculation and populates a database table with it. Then the guide looks up the value in that table instead of performing the calculation.
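To make the summary-table approach concrete, here is a minimal sketch. It is not Matt’s script: it uses SQLite as a stand-in for the Guide’s database, and the table and column names (`subscriptions`, `channel_popularity`, `subscription_count`) and the function names are all hypothetical.

```python
import sqlite3


def refresh_popularity(conn):
    """Recompute per-channel counts into a small summary table.

    Intended to be run periodically (e.g. from cron), so page requests
    never have to run the expensive COUNT() themselves.
    """
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM channel_popularity")
        conn.execute(
            "INSERT INTO channel_popularity (channel_id, subscription_count) "
            "SELECT channel_id, COUNT(*) FROM subscriptions GROUP BY channel_id"
        )


def get_popularity(conn, channel_id):
    """Cheap primary-key lookup used at request time."""
    row = conn.execute(
        "SELECT subscription_count FROM channel_popularity WHERE channel_id = ?",
        (channel_id,),
    ).fetchone()
    return row[0] if row else 0


def demo():
    """Build a tiny in-memory database with hypothetical sample data."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE subscriptions (channel_id INTEGER, subscribed_at INTEGER)"
    )
    conn.execute(
        "CREATE TABLE channel_popularity ("
        "channel_id INTEGER PRIMARY KEY, subscription_count INTEGER)"
    )
    conn.executemany(
        "INSERT INTO subscriptions VALUES (?, ?)",
        [(1, 100), (1, 200), (2, 300)],
    )
    refresh_popularity(conn)
    return conn
```

The trade-off is that the stored counts are only as fresh as the last run of the refresh step.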
What I’ve done is bring that idea into the world of memcached. The popularity values are calculated if they’re not in the cache, and then they’re put there and retrieved. The keys are time-sensitive; they’re only good for 5 minutes for the last-24-hours popularity, and an hour for the last month’s popularity. It’s working well, and it’s making the Guide faster. When subscriptions are added, the cached value is also incremented, so the cached values stay very close to the actual database values without much effort.
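The memcached version of the idea might look roughly like the sketch below. It is an illustration, not the Guide’s code: `TinyCache` is a hypothetical in-memory stand-in for a memcached client so the example runs on its own, and the key format, `popularity`, `record_subscription`, and the `count_from_db` callback are assumed names.

```python
import time


class TinyCache:
    """Hypothetical in-memory stand-in for a memcached client."""

    def __init__(self, clock=time.time):
        self._data = {}
        self._clock = clock

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]  # expired: behave like a miss
            return None
        return value

    def set(self, key, value, ttl):
        self._data[key] = (value, self._clock() + ttl)

    def incr(self, key, delta=1):
        # Only bump keys that are currently cached; the expiry is unchanged.
        value = self.get(key)
        if value is None:
            return None
        expires_at = self._data[key][1]
        self._data[key] = (value + delta, expires_at)
        return value + delta


# Lifetimes from the post: 5 minutes for the last 24 hours, 1 hour for the month.
TTLS = {"day": 5 * 60, "month": 60 * 60}


def popularity(cache, channel_id, period, count_from_db):
    """Return the cached count, computing and storing it on a miss."""
    key = "popularity:%s:%d" % (period, channel_id)
    value = cache.get(key)
    if value is None:
        value = count_from_db(channel_id, period)  # the expensive COUNT()
        cache.set(key, value, TTLS[period])
    return value


def record_subscription(cache, channel_id):
    """Keep cached counts close to the database when a subscription is added."""
    for period in TTLS:
        cache.incr("popularity:%s:%d" % (period, channel_id))
```

Because a missing key is simply skipped on increment, a count that was never cached is computed fresh from the database the next time it is requested.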
What we are wondering about is what people think about the different approaches, either calculating the values and putting them in a new database table and recalculating them, or putting them into the memory cache. What do you all think?

Caching: Popularity Redux

Friday, September 28th, 2007

I asked for the opinions of others, and I got them. The new caching code calculates the counts over the previous 24 hours and 31 days, respectively. Those counts are refreshed from the DB (to remove old data) every 5 minutes and every 60 minutes, respectively. It actually wasn’t a difficult change. The only thing that’s different is the cache key. Previously, the month and today counts were keyed on the date. Now, they’re keyed on the Unix timestamp integer-divided by the time interval, so the cache key changes periodically. I don’t have to remove the old data; it’ll simply be evicted from the cache eventually when the cache needs more memory.
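As a rough illustration of that key scheme (the key format and the names `INTERVALS` and `popularity_key` are assumptions, not the Guide’s actual code), the timestamp is integer-divided by the refresh interval so every request inside the same interval maps to the same key:

```python
import time

# Refresh intervals in seconds: 5 minutes for the 24-hour count,
# 60 minutes for the 31-day count.
INTERVALS = {"day": 5 * 60, "month": 60 * 60}


def popularity_key(channel_id, period, now=None):
    """Build a cache key that rolls over once per refresh interval."""
    if now is None:
        now = time.time()
    bucket = int(now) // INTERVALS[period]
    return "popularity:%s:%d:%d" % (period, channel_id, bucket)
```

Once the bucket number ticks over, lookups miss on the new key and the count is recalculated from the DB; entries under the old key are never read again and are left for the cache to evict.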

Caching: Popular Channels

Thursday, September 27th, 2007

We keep track of subscriptions to channels; it’s how we tell what channels are popular and what channels are similar. It’s also the largest table in our database, and accessing it has gotten increasingly expensive. To find out what channels are the most popular, we have to count through all those records. Today, I rolled out some caching to fix that.
Each channel has three subscription counts: one for all time, one for this month, and one for today. Each of these values is stored in the cache and retrieved when needed. If a value for a channel isn’t found, it’s calculated and placed into the cache. When someone subscribes to a channel, those values are incremented.
The change that’s controversial (at least between myself and BDK) is how the today and month values are calculated. They used to be calculated over the past 24 hours and 31 days respectively. But that means that the values need to be recalculated much more frequently. The new code calculates them from the start of today and the start of the month respectively. This means that the values will be 0 at the start of the day/month. BDK thinks that this is a big deal; I do not.
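The difference between the two approaches comes down to where the counting window starts. A small sketch, with hypothetical function names, of the two ways to pick the cutoff times:

```python
from datetime import datetime, timedelta


def rolling_cutoffs(now):
    """Old behaviour: windows slide with the clock (past 24 hours, past 31 days)."""
    return now - timedelta(hours=24), now - timedelta(days=31)


def calendar_cutoffs(now):
    """New behaviour: windows start at midnight today and on the 1st of the month."""
    start_of_day = now.replace(hour=0, minute=0, second=0, microsecond=0)
    start_of_month = start_of_day.replace(day=1)
    return start_of_day, start_of_month
```

With rolling windows the cutoff moves every second, so a cached count starts going stale immediately. With calendar windows the cutoff is fixed until midnight, so a cached count only ever needs incrementing, at the cost of dropping back to 0 when the day or month begins.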
One idea I had was that if the top-n results (say, 10) are all 0, return the values for the previous time period. This will keep the display from ever being 0, but keep the efficiency.
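That fallback idea could be sketched as follows. This is only an illustration of the proposal, with a hypothetical `top_channels` function that takes plain dicts mapping channel id to count for the current and previous periods:

```python
def top_channels(current_counts, previous_counts, n=10):
    """Return the top-n (channel_id, count) pairs for the current period.

    If every one of the top-n counts is 0 (for example just after midnight),
    fall back to the previous period's top-n instead.
    """

    def top(counts):
        # Highest count first; ties broken by channel id for a stable order.
        return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:n]

    current_top = top(current_counts)
    if any(count > 0 for _, count in current_top):
        return current_top
    return top(previous_counts)
```

As soon as a single channel picks up a subscription in the new period, the display switches back to the current counts.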
Update: The old miss ratio was roughly 80%. Now with this new caching, it’s only 6.5%.