Tag Archives: scalability

AiMED Stat Uses Riak to Increase Their Competitive Advantage

March 31, 2014

AiMED Stat is a startup working to facilitate better medical information capture, analysis, and reporting through web and mobile technologies. They provide clinicians with easy-to-use tools and provide researchers with direct access to information captured in real time from the front lines of medicine. They recently worked with the audiology clinic at the University of Western Ontario (UWO), using Riak to help the university collect and search its research data.

In general, innovation in health research databases has been stagnant, with many companies simply opting for a legacy relational system like MySQL or PostgreSQL. AiMED Stat, however, recognized the limitations of these systems: researchers have to decide their schemas at the start of a study, yet a few months in they often need to update data or collect it in a different way. That meant updating entire tables through very costly data migrations. As AiMED Stat set out to manage and present research data in a better way, it simply wasn’t feasible for their two-person team to manage a costly data migration every time there was a data update. So they began to look at more flexible, NoSQL databases as a replacement.

They first looked at MongoDB, but soon learned that MongoDB wouldn’t be able to handle their high write volumes without losing data. In clinical research, data loss is never acceptable as it can skew results. They then looked at Cassandra; however, for a small team, they found Cassandra to be too complex to operate efficiently. Finally, they evaluated Riak. They were immediately drawn to Riak’s flexible data model, schemaless design, and ability to scale out quickly. In 2011, they brought Riak into production as the backend of their research data application.

“We set out to create an application that stores and queries data in a way researchers understand,” said Kartik Thakore, Co-Founder at AiMED Stat. “By using Riak to power our application, it gives us a sizable competitive advantage (relative to other electronic audiograms). Its flexibility allows us to store data exactly as needed, its ease-of-scale eliminates the chunk of our budget previously dedicated to data migration, and its high availability ensures we never have to worry about losing data. Riak is a breath of fresh air – it does exactly what we need it to do.”

Their Riak application enables rich HTML5 forms for data collection, using a method that increases compliance and data integrity at the point of capture. From data collection, demographic identifiers are used as the key in Riak and values are stored as JSON. Riak post- and pre-commit hooks are used to further validate the data. Additionally, Riak Search, Secondary Indexes, and MapReduce are all used to allow researchers to store and search data (via a D3.js enabled application) using an Audiogram shown below:

Audiogram

(Audiogram shows Frequency vs. Decibel and uses the ANSI Symbol Legend)

This Audiogram allows researchers to easily search within the graph to find and compare patients that match certain audiological profiles. The quicker researchers can find patients for their study, the quicker they can get funding, making this queryability imperative.
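To make the data model above concrete, here is a minimal sketch (not from the case study) using the official Riak Python client: a demographic identifier as the key, the measurement stored as JSON, and an integer secondary index that lets researchers pull back every record in a decibel range. The bucket, key, and field names are hypothetical, and the node is assumed to be local with default ports.

```python
import riak

# Connect to a local node over protocol buffers (default pb_port assumed)
client = riak.RiakClient(protocol='pbc', nodes=[{'host': '127.0.0.1', 'pb_port': 8087}])
bucket = client.bucket('audiograms')                 # hypothetical bucket name

# The key is a demographic identifier; the value is stored as JSON
record = bucket.new('patient-1042-left', data={
    'frequency_hz': 1000,
    'threshold_db': 35,
    'transducer': 'air',
})
record.add_index('threshold_db_int', 35)             # integer secondary index
record.store()

# Later: find every record whose threshold falls within a decibel range
for key in bucket.get_index('threshold_db_int', 30, 50):
    print(bucket.get(key).data)
```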

AiMED Stat is currently running five nodes in production and looking to scale out as they grow. “For us, the importance is not on big data but on never losing data,” continued Kartik. “With Riak, we can rest assured that all our data is archived and accessible, regardless of scale or write volume.”

Basho

Relational to Riak – Cost of Scale

November 14, 2013

This series of blog posts will discuss how Riak differs from traditional relational databases. For more information about any of the points discussed, download our technical overview, “From Relational to Riak.” The previous post in the series was Relational to Riak – High Availability.


Riak is designed for scalability, which truly separates it from relational systems. As described in the previous post, relational databases run best on a single server. If the dataset grows beyond the capacity of this single machine, it can become prohibitively expensive (or even impossible) to simply upgrade to a bigger machine. In such a scenario, the only option may be to add more machines and divide the dataset across them using a technique called sharding.

Sharding divides data into logical parts (such as alphabetical, by customer, or by geographic region) that can be distributed across multiple machines – often manually. If data continues to grow, this process may need to be repeated at great expense.

Sharding is not only difficult; it will also typically lead to hot spots – meaning certain machines are responsible for storing and serving a disproportionately high amount of both data and requests. Hot spots can cause unpredictable latency and degraded performance.
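To see why manual sharding invites hot spots, consider a toy example (ours for illustration, not from any deployment) that splits customer records alphabetically by surname. Because real-world name distributions are skewed, one shard ends up holding a disproportionate share of the data, and fixing that means redrawing the ranges and migrating data by hand.

```python
from collections import Counter

# Four shards, split by the first letter of the customer's surname
def shard_for(surname):
    first = surname[0].lower()
    if first <= 'f':
        return 'shard-a-f'
    if first <= 'm':
        return 'shard-g-m'
    if first <= 's':
        return 'shard-n-s'
    return 'shard-t-z'

surnames = ['Smith', 'Johnson', 'Williams', 'Brown', 'Jones',
            'Miller', 'Davis', 'Garcia', 'Wilson', 'Moore']
print(Counter(shard_for(s) for s in surnames))
# e.g. Counter({'shard-g-m': 5, ...}) -> one "hot" shard stores and serves half the records
```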

(And remember all the ways in which availability is a challenge? Combine sharding with a master/slave architecture for maximal expense and general unpleasantness.)

Instead of sharding, Riak evenly distributes data across a cluster using consistent hashing. In a Riak cluster, the data space is divided into partitions which are claimed by the servers. When new objects are written to the database, they are placed evenly around the ring and replicated three times (by default). This ensures that your data will always be available, even when nodes fail.
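For the curious, here is a rough sketch of the placement logic just described, assuming Riak's documented defaults (a 2^160 hash space, 64 partitions, and an N value of 3); the node names and the round-robin partition claims are simplifications for illustration only.

```python
import hashlib

RING_SIZE = 2 ** 160     # keys are hashed onto a 160-bit ring
NUM_PARTITIONS = 64      # default ring creation size
N_VAL = 3                # default number of replicas

def partition_for(bucket, key):
    """Hash the bucket/key pair and map it onto one of the ring's partitions."""
    h = int(hashlib.sha1('{0}/{1}'.format(bucket, key).encode()).hexdigest(), 16)
    return h // (RING_SIZE // NUM_PARTITIONS)

def preference_list(bucket, key, nodes):
    """The object's partition plus the next N_VAL - 1 partitions hold the replicas."""
    first = partition_for(bucket, key)
    partitions = [(first + i) % NUM_PARTITIONS for i in range(N_VAL)]
    return [(p, nodes[p % len(nodes)]) for p in partitions]

nodes = ['riak1', 'riak2', 'riak3', 'riak4', 'riak5']
print(preference_list('carts', 'user-42', nodes))
# three consecutive partitions, each owned by a different node in this toy model
```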

When nodes are added or removed, data is rebalanced automatically. New machines assume ownership of some of the partitions and existing machines hand off relevant partitions and associated data until data ownership is equal amongst nodes.

By eliminating the manual requirements of sharding and making hot spots highly unlikely, Riak makes it significantly easier for companies to scale, whether it’s just for a few months to handle peak loads or to support long-term growth strategies.

Basho

Riak in Production – Retail and eCommerce Stories

January 29, 2013

This is the first in a series of blog posts covering the benefits Riak offers to developers and operators of retail and eCommerce platforms. To learn more, join our “Retail on Riak” webcast on Friday, February 8th.

As retailers grow and have to store more and more data, traditional relational databases aren’t always the best option. Retailers want to scale easily, without the operational burden of manual sharding. Meanwhile, business requirements demand their data is always available for reads and writes. Riak is a highly available, low-latency distributed database that is ideal for retailers who need to serve product data quickly and maintain “always on” shopping experiences. Based on architectural principles from Amazon, Riak is designed for high availability and scale so retailers can always serve customers, even under failure conditions, and rapidly grow to meet peak loads.

Retailers of all sizes have chosen Riak to power parts of their business, including:

  • Best Buy: Best Buy is North America’s top specialty retailer of consumer electronics, personal computers, entertainment software, and appliances. Riak has been an integral part of the push to re-platform Best Buy’s eCommerce platform. For more info, check out Best Buy’s talk from our 2012 developer conference, RICON.
  • ideeli: ideeli is one of the fastest growing retailers with over 5 million members and more than 1,000 brand partners. They use Riak to serve HTML documents and user-specific products. ideeli chose Riak to power their event-based shopping experience due to Riak’s ability to serve users information at low latency and provide ease of use and scale to ideeli’s operations team. Check out the complete case study for more details.
  • Copious: Copious is a social commerce marketplace that makes it easy for people to buy and sell the things they love. They currently store all registered accounts in Riak as well as the tokens that make it possible for users to authenticate with Copious via their Facebook or Twitter accounts. They chose to use Riak for their social login functionality because of its operational simplicity, which allows them to easily scale up without sharding and provides the high availability required for a smooth user experience. For more details, check out the complete Copious story on our blog.

For more information about the benefits of Riak for retailers and the retailers already using it, register for our “Retail on Riak” webcast on February 8th!

Basho

Riak for Retail and eCommerce Platforms

January 22, 2013

Traditionally, most retailers have used relational databases to manage their platforms and eCommerce sites. However, with the rapid growth of data and business requirements for high availability and scale, more retailers are looking at non-relational solutions like Riak.

Riak is a masterless, distributed database that provides retailers with high read and write availability, fault-tolerance and the ability to grow with low operational cost. Architectural, operational and development benefits for retailers include:

  • “Always On” Shopping Experience: Based on architectural principles from Amazon, Riak is designed to favor data availability, even in the event of hardware failure or network partition. For retailers, failure to accept additions to a shopping cart, or serve product information quickly, has a direct and negative impact on revenue. Riak is architected to ensure the system can always accept writes and serve reads at low-latency.
  • Resilient Infrastructure: At scale, hardware malfunction, network partition, and other failure modes are inevitable. Riak provides a number of mechanisms to ensure that retail infrastructure is resilient to failure. Data is replicated automatically within the cluster so nodes can go down but the system still responds to requests. This ensures read and write availability, even in serious failure conditions.
  • Low-Latency Data Storage: Many retailers now operate online and mobile experiences with an API or data services platform. In order to provide a fast and available experience to end users, Riak is designed to serve predictable, low-latency requests as part of a service-oriented infrastructure and is accessible via HTTP API, protocol buffers, or Riak’s many client libraries.
  • Scale to Peak Loads with Low Operational Cost: During major holidays and other periods of peak load, retailers may have to significantly increase their database capacity quickly. When new nodes are added, Riak automatically distributes data evenly to naturally prevent hot spots in the database, and yields a near-linear increase in performance and throughput when capacity is added.
  • Global Data Locality and Redundancy: Riak Enterprise’s multi-site replication allows replication of data to multiple data centers, providing both a global data footprint and the ability to survive datacenter failure.

Top retailers using Riak include Best Buy and ideeli. Best Buy selected Riak as an integral part of its push to re-platform its eCommerce platform. For more information about how Best Buy is using Riak, check out this video.

ideeli uses Riak to serve HTML documents and user-specific products. ideeli chose Riak to provide its highly available, event-based shopping experience – Riak gives them the ability to serve user information at low latency and provides ease of use and scale to ideeli’s operations team. For more information on ideeli’s use of Riak check out the complete case study.

Common use cases for Riak in the retail/eCommerce space include shopping carts (due to Riak’s “always-on” capabilities), product catalogs (Riak is well suited for the storage of rapidly growing content that needs to be served at low-latency), API platforms (Riak’s flexible, schemaless design allows for rapid application development), and mobile applications (Riak is ideal for powering mobile experiences across platforms due to its low-latency, always-available small object storage capabilities).
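To ground the shopping-cart case, here is a minimal sketch (not from a customer deployment) that writes and reads a cart as JSON through Riak's HTTP API using Python's requests library; the bucket, key, and cart contents are hypothetical, and the node is assumed to listen on the default HTTP port 8098.

```python
import json
import requests

RIAK = 'http://127.0.0.1:8098'                       # default HTTP port
cart_url = RIAK + '/buckets/carts/keys/user-42'      # hypothetical bucket and key

cart = {'items': [{'sku': 'TV-55-LED', 'qty': 1},
                  {'sku': 'HDMI-CABLE', 'qty': 2}]}

# Write the cart; read/write quorums are left at the bucket defaults
resp = requests.put(cart_url, data=json.dumps(cart),
                    headers={'Content-Type': 'application/json'})
resp.raise_for_status()

# Read it back when the shopper returns
print(requests.get(cart_url).json())
```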

To help retailers evaluate and adopt Riak, we’ve published a technical overview: “Retail on Riak: A Technical Introduction.” We discuss more in-depth information on modeling applications for common use cases, switching from a relational architecture, querying, multi-site replication and more.

Basho

From Relational to Riak (Webcast)

January 02, 2013

New to Riak? Thinking about using Riak instead of a relational database? Join Basho chief architect Andy Gross and director of product management Shanley Kane for an intro this Thursday (11am PT/2pm ET). In about 30 minutes, we’ll cover the basics of:

  • Scalability benefits of Riak, including an examination of limitations around master/slave architectures and sharding, and what Riak does differently
  • A look at the operational aspects of Riak and where they differ from relational approaches
  • Riak’s data model and benefits for developers, as well as the tradeoffs and limitations of a key/value approach
  • Migration considerations, including where to start when migrating existing apps
  • Riak’s eventually consistent design
  • Multi-site replication options in Riak

Register for the webcast [here](http://info.basho.com/RelationalToRiakJan3.html).

[Shanley](http://twitter.com/shanley)

[Andy](https://twitter.com/argv0)

Soap people have to grow up quick: Two Weeks in the Life of a NoSQL Company

November 11, 2010

Things are moving incredibly fast in the NoSQL space. I am used to internet-fast — helping bring on 300 customers in a year at Akamai; going from adult bulletin boards and leased lines to hosting sites for twenty percent of the Fortune 500 at Digex (Verizon Business) in eighteen months. I have never seen a space explode like the NoSQL space.

Two weeks ago, Justin Sheehy stood on stage delivering a rousing and thoughtful presentation to the NoSQL East Conference that was less about Riak and more about a definition of first principles that underpinned Riak: what it REALLY means when you claim such terms as scalability (it doesn’t mean buying a bigger machine for your master DB) and fault-tolerance (it has to apply to writes and reads and is binary; you either always accept writes and serve reads or you don’t). The conference was a bit of a coming out party for Basho, which co-sponsored the event with Rackspace, Georgia Tech, and a host of other companies. We had been working on Riak for 18 months or so in relative quiet and it was nice to finally see what people thought, first hand.

There were equally interesting presentations about Pig and MongoDB and a host of other NoSQL entrants, all of which will make for engrossing viewing when they finally get posted. We were told this wasn’t quite as exciting as the NoSQL conference out West but none of us seemed to mind. Home Depot, Turner Broadcasting, Weather.com, and Comcast had all sent folks down to evaluate the technology for real, live problems and the enthusiasm in the auditorium spilled out into the Atlanta bars. Business cards were exchanged, calls set up, even a little business discussed. Clearly, NoSQL databases were maturing fast.

No sooner had we returned to Cambridge than news of Flybridge’s investment in 10Gen came out. Hooray! Someone was willing to bet $3.4 million on a company in the space. Chip Hazard, ever affable, wrote a nice blog post explaining the investment. According to him, every developer they talked to had downloaded some NoSQL database to test. Brilliant news. He said Flybridge invested in 10Gen because they liked the space and knew the team from their investment in Doubleclick, from whose loins the management team at 10Gen issued. No more felicitous reason exists for a group of persons to invest $3.4 million than that previous investments in the same team were handsomely rewarded. I would wish Chip and 10Gen the best if I had time.

Because contemporaneous with the news of Flybridge’s investment, and almost as if the world had decided NoSQL’s time had come, we began to field emails and calls from interested parties. Trials, quotes, lengthy discussions about features and uses of Riak — the week was a blur. Everyone was conducting a bakeoff: “I have a 4TB database and customers in three continents. I am evaluating Riak and two other document datastores. Tell me about your OLAP features.”

Heady times and, frankly, of somewhat dubious promise, if you ask me. Potential clients that materialize so quickly always seem to disappear just as fast. Really embracing a new technology requires trials, tests, new features, and time. Time most of all. These “bluebirds” would fly away in no time, if my experience held true.

Except, this time it didn’t happen. Contracts were exchanged. Pen nibs were sharpened. It is as if the entire world decided not to wait for everyone else to jump on the bandwagon and instead decided to go NoSQL. Even using this last week as the sole example, I think the reason is plain — people have real pain and suddenly the word is out that they no longer have to suffer.

Devs are constrained by what they can build, rich features notwithstanding. Ask the company that had to choose between Riak and a $100K in-memory appliance to scale. And Ops is getting slaughtered — the cost of scaling poorly (and by poorly I mean pagers going off during dinner, bulk updates taking hours and failing all the time, fragmented and unmanageable indices consuming dozens of machines) is beginning to look like the cost of antiquated technology. Good Ops people are not fools. They look for ways to make life easier. Make no mistake — all the Devs and Ops folks came with a set of tough questions and a list of new features. They also came with an understanding that companies that release open source software still have a business to run. They are willing to spend on a real company. In fact, having a business behind Riak ended up mattering as much as any features.

So, I suspect, we are at the proverbial “end of the beginning.” Smart people in the NoSQL movement have succeeded in building convincingly good software and then explaining the virtues convincingly (all but one of the presentations at NoSQL East demonstrated the virtues of the respective approaches). Now these people are connecting to smart people responsible for building and running web apps, people who are decidedly unwilling to sit around hoping for Oracle or IBM to solve their problems.

In the new phase — which we will cleverly call the “beginning of the middle” — great tech will matter even more than it does now. It won’t be about selling or marketing or any of that. If our numbers are any indication of a larger trend, more people will download and install NoSQL databases in the next month than the combined total of the three months previous. More people in a buying frame of mind will evaluate NoSQL technology not in terms of its coolness but in terms of its ability to solve their real, often expensive problems. The next phase will be rigorous in a way this phase was not. People have created several entirely new ways to store and distribute data. That was the easy part.

Just as much as great tech, the people behind it will matter. That means more calls between us and Dev teams. That means more feature requests considered and, possibly, judiciously, agreed to.

That also means lots of questions answered. People care about support. They care about whether you answer their emails in a timely fashion and are polite. People want to do business with NoSQL. They want to spend money to solve problems. They need to know they are spending it with responsible, responsive, dedicated people.

Earl tweets about it all the time and I happen to agree: any NoSQL success helps all NoSQL players. I also happen to feel that any failure hurts all NoSQL players. As NoSQL rapidly ages into its adolescence, it will either be awkward and painful or exciting and characterized by incredible growth.

When I was a kid on the Navy base in Alameda, my babysitter watched soaps all afternoon, leaving me mostly to my own devices. If I stopped in, I always got roped in to hearing her explain her favorite stories. Most of all she loved how ridiculous they were, though she would never admit this exactly. Instead, adopting an attitude of gleeful incredulity, she would point out this or that attractive young actor and tell me how just a year ago, she was a little baby. “Soap people have to grow up quick, I guess,” was her single (and to her, completely satisfactory) explanation. “If they don’t, they get written out of the story.”

Indeed.

Best,

Tony Falco

Berlin Buzzwords Day One Recap

June 8, 2010

This was originally posted by @rklophaus on his blog, rklophaus.com. BerlinBuzzwords has a stellar venue and talks describing cutting-edge developments in all things search, scalability, and storage. My recap of Day 1 is below.

Check back for a writeup on part 2.

Day 1 – June 7, 2010

The conference check-in was seamless, with much swag including messenger bags and notebooks. The venue itself–Kosmos Club–was amazing. Kosmos Club was the biggest movie house in East Germany before the fall of the wall, and has since turned into an event venue. Lots of metallics, varied textures, and swank chandeliers, with two of the biggest presentation rooms I’ve ever seen (they used to be movie theatres.)

Isabel Drost, Jan Lehnardt, and Simon Willnauer kept the opening remarks light and intimate. Overall, there were about 350 people in attendance.

Keynote: Grant Ingersoll from Lucid Imagination

Grant focused his talk around the words “Open”, “Scalable”, and “Intelligent”. He described how a number of things, such as big-data storage, search, and distributed computing, have become commodities, even though they required a staff of Ph.D.-level employees just a few years ago.

The main point of his talk was that openness and open source, plus scalability, have turned these things into commodities, and that the next big, interesting thing to work on is data intelligence.

  • Data (produced per year, I believe) has grown from 161 exabytes in 2006 to ~1000 exabytes in 2010.
  • 85% of data we produce is unstructured, where unstructured may mean that we just aren’t yet smart enough to parse the data.

There are multiple levels to intelligence:

  • Level 1 – Finding, organizing, discovering and associating data
  • Level 2 – Collecting and personalizing data
  • Level 3 – Mining data for sentiment and semantics
  • Level 4 – Learning from data, extracting ideas
  • Level 5 – Reasoning about data

At this point in the talk, he switched to a description of Apache Mahout, a machine-learning engine. Mahout can do things such as recommendations, collaborative filtering, Bayesian analysis, Random Forests, discovery, and pattern matching. At some point, he mentioned things like Restricted Boltzmann Machines, Stochastic Gradient Descent, and Support Vector Machines as upcoming features.

The takeaway is that using Mahout, you can build open, scalable, intelligent apps right now. In practical terms, this means things like auto-suggest, auto-complete, content clustering, clickstream analysis, etc.

NoSQL: The Definitive Guide – Mathias Meyer

Mathias Meyer (Peritor Solutions, @roidrage) gave a very balanced view of the current state of NoSQL. Mathias gave history its proper respect, saying that much of what we view as “new” in NoSQL is actually older ideas in a prettier package. He mentioned things like Berkeley DB (K/V store), Sybase (Column Store), Versant Object DB (Graph/Object Database), and Lotus Notes (Peer to Peer Document Database) as throwback examples.

Mathias said that relational databases tried to be a one-size-fits-all solution, and that NoSQL is about removing constraints to speed up performance and development.

Mathias himself is a fan of CouchDB, Redis, and Riak, and wisely avoided giving specific recommendations on what projects someone should use. Each datastore handles different use cases, so use what is right for your data.

Mathias briefly touched on Voldemort, Tokyo, Redis, S3, Scalaris, CouchDB, Riak, Mongo, BigTable, Cassandra, HBase, HyperTable, Core Data, Neo4J, and HyperGraphDB. (In order of mention.) He then gave a slightly more in-depth view of the replication and scaling models of both CouchDB and Dynamo.

One of the key things that NoSQL gets right, he said, is being constructed of open web technologies such as JSON, HTTP, links, and textual protocols.

What is hard for NoSQL now? Range queries, ad-hoc queries, and transactions, mainly because the NoSQL space is focused on scalability as a major goal.

One of the big points that I’m glad he brought up is, “As a developer, how do I know I’m not wasting my time on a NoSQL solution?” The key is that each of the different NoSQL projects was built to solve a real-world problem, so trust that somebody found it useful and needed it built.

His main point: NoSQL is not the Holy Grail. NoSQL should not be about replacing SQL. Instead, you need to be okay with having polyglot data storage. The data itself should dictate the datastore.

Making Software for Humans – CouchDB and the Usable Peer-To-Peer Web – Jan Lehnardt

Jan Lehnardt (@janl, Couch.io), after a brief introduction of himself and Couch, led off with the statement that 80% of all NoSQL projects do the same thing as flat files. It’s the differences in the last 20% that really differentiate the projects. Therefore, NoSQL is about choice to build better systems. Each NoSQL project starts with a main idea.

According to Jan, CouchDB’s main idea is being ready to scale up, in that each node functions independently, and also ready to scale down, in that Couch is a great candidate for running on mobile phones, routers, and other embedded devices, as a way of synchronizing user content.

CouchDB’s synchronization is its killer feature. With Couch, you can subscribe to events like data updates that send out HTTP-based notifications to other parts of your application. As an example, the Couch.io team has built a chat service based solely on writing Couch objects and receiving updates.
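The update notifications Jan described correspond to CouchDB's _changes feed. A minimal sketch, assuming a local CouchDB and a hypothetical chat database, that follows the continuous feed with Python's requests library:

```python
import requests

DB = 'http://127.0.0.1:5984/chat'        # hypothetical database

# CouchDB streams one JSON line per update on the continuous changes feed
with requests.get(DB + '/_changes',
                  params={'feed': 'continuous', 'include_docs': 'true'},
                  stream=True) as resp:
    for line in resp.iter_lines():
        if line:                          # heartbeats arrive as empty lines
            print(line.decode())          # e.g. push the new chat message to clients
```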

Jan wrapped up by touching on projects like Opera Unite as ahead-of-the-parade examples of what CouchDB is aiming to do, and is already doing, in projects like UbuntuOne–P2P data synchronization. He mentioned Facebook as an example of a centralized, closed web, Flickr as a centralized, but open service (since you can pull your data out), and Diaspora as the open web that follows the true vision of Tim Berners-Lee.

Riak from Small to Large – Rusty Klophaus

In my talk (Rusty Klophaus, Basho Technologies), I gave a brief description of how Riak differentiates itself in the NoSQL market. It was built first with the operations folks in mind, which makes sense given the Akamai background of the core developers, who understand and embrace the uncertainties of a distributed system.

I then described which features of Riak become important in single-node Riak clusters, three-node clusters, and ten-plus node clusters. Just like different features of your car are important going 50 m.p.h. vs. 100 m.p.h., different features of Riak are important at different cluster sizes.

In single node clusters, Riak provides a simple data model, with key/value access, a variety of client libraries (Python, PHP, Java, Javascript, Erlang), and configurable replication settings and backing datastores.

In small-to-medium-sized clusters, Riak provides a way to take advantage of hardware in parallel, with Javascript-based Map/Reduce, well-behaved HTTP (allowing easy placement of proxies and caches), and Google Protocol Buffers support.

And in large clusters, Riak provides an extremely easy operations story that can survive server outages and network partitions, and scale out by just running a few commands.

Finally, I ended with a 5-minute run-through of how to use the Python client API to read/write an object, run a linkwalking operation, and run a Javascript-based map/reduce operation.
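As a rough stand-in for that demo (the bucket name and the summed field are hypothetical, and this uses the HTTP /mapred endpoint rather than the Python client wrapper), here is what a Javascript-based map/reduce job submitted to Riak looks like; Riak.reduceSum is one of the bundled Javascript reduce functions.

```python
import json
import requests

RIAK = 'http://127.0.0.1:8098'

# Map over every object in a (hypothetical) bucket, emit one number per object,
# then sum them with a built-in Javascript reduce function
job = {
    'inputs': 'patients',
    'query': [
        {'map': {'language': 'javascript',
                 'source': 'function(v) { return [JSON.parse(v.values[0].data).visits]; }'}},
        {'reduce': {'language': 'javascript', 'name': 'Riak.reduceSum'}}
    ]
}

resp = requests.post(RIAK + '/mapred', data=json.dumps(job),
                     headers={'Content-Type': 'application/json'})
print(resp.json())
```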

Slides are available on Slideshare.

Realtime Search with Lucene – Michael Busch

Michael Busch from Twitter discussed some upcoming changes to Lucene that allow it to search on data that has not yet been committed to disk. From my understanding, when Lucene commits a change it creates a segment and possibly merges multiple segments together. At that point, a reader can access the newly created segment.

For real-time search, the process needed to be shortened. The first attempt simply involved syncing out the changes when a reader was created. This didn’t work well. The next step was to actually search on the uncommitted index in memory. This was a challenge for a few reasons: first, Lucene uses multiple threads to update the index, so synchronizing those threads to provide the correct read-isolation is a problem. Second, the index maintains a large number of long lived objects in memory, and this causes inefficient garbage collection that kills performance.

Michael described a number of fixes that have already been written or are on their way, mostly around making multiple single-threaded index writers and changing the way postings are stored in memory, from an unbounded number of objects to a finite number of arrays. The effects on performance were amazing, especially for small memory sizes. A JVM with ~200 MB of RAM allocated was something like ~80% more performant.

A Twitter prototype with simultaneous indexes and queries showed that query performance, with the new modifications, is almost completely independent from query load, which is impressive. That said, the Twitter index has 32-bit postings (24 for DocID, 8 for term position) so these results may not be the same for everyone.

ElasticSearch – Shay Banon

I unfortunately missed the first few minutes of Shay’s talk on ElasticSearch. ElasticSearch automates the partitioning, sharding, and replication of documents into Lucene indexes, and provides a unified interface for searches.

Shay described the JSON model that ElasticSearch uses for API access, which includes everything from queries and filters to creating new indexes.

The ElasticSearch distribution model works by posting a JSON document with the new index definition. ElasticSearch automatically balances the shards across available nodes, and it sounded like this is done in a node-aware way, so that replicas are stored on different nodes if possible. At index or query time, you can hit any node, and the node itself is responsible for directing the operation to the right place(s).
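A minimal sketch of that JSON-over-HTTP model using Python's requests library; the index name, shard counts, and document are hypothetical, and the exact URL shapes (such as the _doc type path) have shifted across ElasticSearch versions, so treat this as illustrative rather than version-exact.

```python
import requests

ES = 'http://127.0.0.1:9200'

# Create an index by posting its definition; ElasticSearch spreads the shards
# and their replicas across whatever nodes are currently in the cluster
requests.put(ES + '/products',
             json={'settings': {'number_of_shards': 5, 'number_of_replicas': 1}})

# Index a document (refresh so it is immediately searchable), then query it;
# any node can accept the request and route it to the shards involved
requests.put(ES + '/products/_doc/1?refresh=true',
             json={'name': 'blue suede shoes', 'price': 49})
result = requests.post(ES + '/products/_search',
                       json={'query': {'match': {'name': 'shoes'}}}).json()
print(len(result['hits']['hits']))
```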

ElasticSearch supports per-document consistency (in other words, no commit/flush support.) The most interesting thing, to me, is that ElasticSearch embraces the idea of transient storage. In other words it was written with the intent of running on something like EC2, where you can’t trust local storage, and writing to remote storage is expensive, both computationally and monetarily.

To get around this potential bottleneck, ElasticSearch still supports reliable (or “somewhat reliable”, if you want to split hairs) persistence by assuming that not all nodes containing a replica will fail at the same time. New documents are lazily logged to the backing store, and if the master node doing the logging dies, then one of its slaves will detect the death and finish the logging. When a new node starts up, it reads from the backing store.

The other interesting thing is that ElasticSearch is aware of different cloud providers (it sounded like EC2 and Rackspace Cloud) and can consult the cloud provider itself to get a list of potential nodes, allowing it to self-assemble.

Key differences between ElasticSearch and Solr? Distributed model, different API, no facet support yet.

Nutch as a Web Mining Platform – Andrzej Bialecki

Andrzej Bialecki (SIGRAM) described Nutch, a distributed web-crawler/search engine built on top of Lucene. It provides the standard framework of a distributed search engine that is extensible by plug-ins, and handles things such as URL filtering, normalizing, depth vs. breadth first crawling, etc.

The presentation was eye-opening just to see all of the things that make web-crawling difficult that are NOT about storing data and serving up queries. The Nutch team has spent a large amount of time going into things like what happens when you encounter auto-generated sites, buggy web-servers, link-spammers, or other tar-pits during a crawl.

Some common techniques for “bootstrapping” a web crawler are to start with some high quality seed sites, which may be well-known, authoritative resources, reference sites, or even the top-N results from an existing search engine like Google.

Once you have your search data, Andrzej described ways to mine it, such as using keyword, phrase, or anchor search, using facets to find latent topics, using top-N results to prioritize future crawling, mining incoming links, treating the web as a corpus of sample textual data, associating concepts, and uncovering gossip, opinions, and sentiments from the data.

Nutch is currently under a redesign, attempting to share more code with common crawler libraries and other projects. Part of this will be converting data storage to a well-defined layer, allowing for pluggable backends so that users can take advantage of native data tools for those backends.

Riak Search – Rusty Klophaus

In my second talk, I discussed Riak Search, a distributed indexing and full-text search engine built on (and complementary to) Riak.

Part one covered the main reason for building Riak search: clients have built applications that eventually need to find data by value, not just by key. This is difficult, if not impossible, in a key/value store.

Part two described the shape of the final solution we set out to create. The goal of Riak Search is to support the Lucene interface, with Lucene syntax support and Solr endpoints, but with the operations story of Riak. This means that Riak Search will scale easily by adding new machines, and will continue to run after machine failure.

Part three was an introduction to Inverted Indexing, which is the heart of all search systems, as well as the difference between Document-Partitioning and Term-Partitioning, which forms the ongoing battle in the distributed search field.
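As a quick refresher on the idea (a toy example, not from the talk), an inverted index maps each term to the set of documents that contain it, so a query becomes set operations over postings lists:

```python
from collections import defaultdict

docs = {
    'doc1': 'riak is a distributed key value store',
    'doc2': 'riak search adds a distributed inverted index',
    'doc3': 'lucene is an inverted index library',
}

# Build the inverted index: term -> set of document ids (the postings list)
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

# AND query: intersect the postings lists of both terms
print(index['distributed'] & index['inverted'])   # {'doc2'}
```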

The tradeoff is that Document-Partitioning generally has lower-latency queries but lower overall throughput, because each query requires a disk seek on every partition. For this reason, Riak Search uses Term-Partitioning, with some special optimizations using term-splitting, bloom filters, and result batching.

Slides available soon.

 

Talks I Wished I Had Attended

The conference schedule today had two tracks, Search and NoSQL, plus I presented two talks, so there were a number of talks I was not able to attend. I would have liked to see the talks below, and look forward to the conference video:

  • Finite-State Queries in Lucene – Robert Muir
  • Text and metadata extraction with Apache Tika – Jukka Zitting
  • MetaCarta GeoSearch Toolkit for Solr – James Goodwin
  • The return of the Hierarchical Model – Jukka Zitting
  • Five cool problems you can solve with Neo4J – Peter Neubauer

Basho Podcast Number 1 – Justin Sheehy and Tony Falco on Scaling out with Riak and Riak Search

December 11, 2009

Just out: Basho’s first podcast discussing Riak. Justin Sheehy and Tony Falco revisit the definition of scalability Justin first discussed at NoSQL East 2009, discuss EC2, Riak, and Riak’s map/reduce and soon-to-be-released distributed search and indexing. As a special bonus, at 3:24 in the podcast, listen for the sound of Kevin Smith’s SMS accepting the job at Basho. The mic did not pick up Justin’s grimace. Of course, he didn’t miss a beat. “I just did, Bob….”

Enjoy,

Mark Phillips




Justin Sheehy's Presentation at NoSQL East Now Available for Viewing

November 18, 2009

Thanks to the hard work of Brad Anderson and the rest of the NoSQL East organizers, Justin Sheehy’s presentation is now online and ready to be consumed. You can check it out here: http://bit.ly/2wDhWs

Justin spends a little time discussing Riak and then quickly moves on to a discussion of first principles.

Justin’s presentation stands on its own but it is worth pointing out: terms like “scalable” and “distributed” and “fault tolerant” are not marketing terms. Applied rigorously, the principles underlying them (a hat tip to folks like Brewer, Lewin/Leighton/Karger, et al.) lead to game-changing software.

Building truly decentralized systems requires discipline. Shortcuts for premature optimization ultimately lead to a dead end.

Enjoy,

Tony Falco