What the Heck Are You Actually Using NoSQL For

rw-book-cover

Metadata

Author: highscalability.com
Full Title: What the Heck Are You Actually Using NoSQL For?
URL: http://highscalability.com/blog/2010/12/6/what-the-heck-are-you-actually-using-nosql-for.html

Highlights

When something becomes so massive that it must become massively distributed, NoSQL is there, though not all NoSQL systems are targeting big. Bigness can be across many different dimensions, not just using a lot of disk space (View Highlight)
Massive write performance. This is probably the canonical usage based on Google’s influence. High volume. Facebook needs to store 135 billion messages a month. Twitter, for example, has the problem of storing 7 TB/data per day with the prospect of this requirement doubling multiple times per year. This is the data is too big to fit on one node problem. At 80 MB/s it takes a day to store 7TB so writes need to be distributed over a cluster, which implies key-value access, MapReduce, replication, fault tolerance, consistency issues, and all the rest. For faster writes in-memory systems can be used. (View Highlight)

New highlights added April 24, 2023 at 1:49 PM

When latency is important it’s hard to beat hashing on a key and reading the value directly from memory or in as little as one disk seek. Not every NoSQL product is about fast access, some are more about reliability, for example. but what people have wanted for a long time was a better memcached and many NoSQL systems offer that. (View Highlight)
NoSQL products support a whole range of new data types, and this is a major area of innovation in NoSQL. We have: column-oriented, graph, advanced data structures, document-oriented, and key-value. Complex objects can be easily stored without a lot of mapping. (View Highlight)
Lack of structure allows for much more flexibility (View Highlight)
Schemalessness makes it easier to deal with schema migrations without so much worrying. Schemas are in a sense dynamic, because they are imposed by the application at run-time, so different parts of an application can have a different view of the schema. (View Highlight)
This is very product specific, but many NoSQL vendors are trying to gain adoption by making it easy for developers to adopt them. They are spending a lot of effort on ease of use, minimal administration, and automated operations (View Highlight)
Not every product is delivering on this, but we are seeing a definite convergence on relatively easy to configure and manage high availability with automatic load balancing and cluster sizing. (View Highlight)
Generally available parallel computing. We are seeing MapReduce baked into products, which makes parallel computing something that will be a normal part of development in the future. (View Highlight)
While the relational model is intuitive for end users, like accountants, it’s not very intuitive for developers. Programmers grok keys, values, JSON, Javascript stored procedures, HTTP, and so on. NoSQL is for programmers. (View Highlight)
Use the right data model for the right problem. Different data models are used to solve different problems. Much effort has been put into, for example, wedging graph operations into a relational model, but it doesn’t work. (View Highlight)
NoSQL systems, because they have focussed on scale, tend to exploit partitions, tend not use heavy strict consistency protocols, and so are well positioned to operate in distributed scenarios. (View Highlight)
Tunable CAP tradeoffs. NoSQL systems are generally the only products with a “slider” for choosing where they want to land on the CAP spectrum. Relational databases pick strong consistency which means they can’t tolerate a partition failure. In the end this is a business decision and should be decided on a case by case basis. Does your app even care about consistency? Are a few drops OK? Does your app need strong or weak consistency? Is availability more important or is consistency? Will being down be more costly than being wrong? It’s nice to have products that give you a choice. (View Highlight)
More Specific Use Cases (View Highlight)
Managing large streams of non-transactional data: Apache logs, application logs, MySQL logs, clickstreams, etc. (View Highlight)
Syncing online and offline data. This is a niche CouchDB has targeted. (View Highlight)
Fast response times under all loads. (View Highlight)
Avoiding heavy joins for when the query load for complex joins become too large for a RDBMS. (View Highlight)
Soft real-time systems where low latency is critical. Games are one example. (View Highlight)
Load balance to accommodate data and usage concentrations and to help keep microprocessors busy. (View Highlight)
Hierarchical data like threaded discussions and parts explosion. (View Highlight)
Dynamic table creation. (View Highlight)
Two tier applications where low latency data is made available through a fast NoSQL interface, but the data itself can be calculated and updated by high latency Hadoop apps or other low priority apps. (View Highlight)
Sequential data reading. The right underlying data storage model needs to be selected. A B-tree may not be the best model for sequential reads. (View Highlight)
Slicing off part of service that may need better performance/scalability onto it’s own system. For example, user logins may need to be high performance and this feature could use a dedicated service to meet those goals. (View Highlight)
Caching. A high performance caching tier for web sites and other applications. (View Highlight)
Real-time page view counters. (View Highlight)
Document, catalog management and content management systems. These are facilitated by the ability to store complex documents has a whole rather than organized as relational tables. Similar logic applies to inventory, shopping carts, and other structured data types. (View Highlight)
Archiving. Storing a large continual stream of data that is still accessible on-line. Document-oriented databases with a flexible schema that can handle schema changes over time. (View Highlight)
Analytics. Use MapReduce, Hive, or Pig to perform analytical queries and scale-out systems that support high write loads. (View Highlight)
Working with heterogenous types of data, for example, different media types at a generic level. (View Highlight)
Federal law enforcement agencies tracking Americans in real-time using credit cards, loyalty cards and travel reservations. (View Highlight)
Fraud detection by comparing transactions to known patterns in real-time. (View Highlight)
Helping diagnose the typology of tumors by integrating the history of every patient. (View Highlight)
In-memory database for high update situations, like a web site that displays everyone’s “last active” time (for chat maybe). If users are performing some activity once every 30 sec, then you will be pretty much be at your limit with about 5000 simultaneous users. (View Highlight)
Computing the intersection of two massive sets, where a join would be too slow. (View Highlight)
• A timeline ala Twitter. (View Highlight)
Redis Use Cases Redis is unique in the repertoire as it is a data structure server (View Highlight)
Calculating whose friends are online using sets. (View Highlight)
Distributed lock manager for process coordination. (View Highlight)
Full text inverted index lookups. (View Highlight)
Tag clouds. (View Highlight)
Leaderboards. Sorted sets for maintaining high score tables. (View Highlight)
Circular log buffers. (View Highlight)
- Note: A circular log buffer is a data structure used in computer programming and system design for efficiently managing a fixed-size log or memory space. It is called “circular” because, when the buffer reaches its maximum capacity, it wraps around and starts overwriting the oldest entries with new data, maintaining a continuous flow of information without ever growing beyond its predefined size. Here’s a brief explanation of the concept: Fixed-size buffer: A circular log buffer has a predetermined size, typically defined by the number of entries or bytes it can hold. This ensures that the buffer does not consume an excessive amount of memory and helps in efficiently managing system resources. Circular overwrite: When the buffer is full and a new entry needs to be added, the oldest entry in the buffer is overwritten by the new data. This process continues in a circular manner, always replacing the oldest entry as new data comes in. Read and write pointers: To manage the circular log buffer, two pointers are used – the read pointer and the write pointer. The write pointer indicates the next position in the buffer where new data will be written, while the read pointer indicates the position of the next data to be read. Both pointers move forward through the buffer and wrap around to the beginning when they reach the end. Benefits: Circular log buffers are advantageous in situations where a continuous flow of data needs to be maintained, such as in logging systems, embedded devices, or real-time applications. They allow for efficient memory management, prevent data loss by continuously overwriting old data, and enable quick access to the most recent information. Challenges: One potential drawback of circular log buffers is that if the buffer size is too small, relevant data may be overwritten before it can be read or processed. Therefore, it is important to balance the size of the buffer with the rate at which data is produced and consumed in order to minimize the risk of losing important information.
Server for backed sessions. A random cookie value which is then associated with a larger chunk of serialized data on the server) are a very poor fit for relational databases. They are often created for every visitor, even those who stumble in from Google and then leave, never to return again. They then hang around for weeks taking up valuable database space. They are never queried by anything other than their primary key. (View Highlight)
- Note: The text you provided is not a code snippet, but rather a description of the issues related to using relational databases for storing session data in a server-side web application. Let’s break down the explanation based on the given text. Server for backed sessions: In the context of web applications, a “backed session” is a session where the data is stored on the server rather than the client (browser). When a user interacts with the web application, the session data is stored and maintained on the server to keep track of the user’s state. Random cookie value: Cookies are small pieces of data that a server sends to a user’s browser to store information about the user’s session. In this case, a random value is used as a unique identifier for the session. The browser then sends this cookie value back to the server with each request, allowing the server to associate the request with the correct session data. Serialized data: The session data stored on the server is often serialized, which means converting it into a format that can be easily stored and transmitted. This is done to save and retrieve the data efficiently. Poor fit for relational databases: Storing session data in a relational database can be inefficient due to several reasons: a. High volume of session data: Sessions are created for every visitor, even those who leave the website without any meaningful interaction. This leads to a large number of session records being created and stored in the database. b. Short lifespan: Session data typically has a short lifespan (e.g., a few hours or days), meaning the records become obsolete quickly and need to be deleted to free up database space. c. Limited querying: Session data is typically only queried using its primary key (the random cookie value), which doesn’t take advantage of the relational database’s strengths in organizing and querying data based on relationships. Due to these reasons, it is often more efficient to use alternative storage solutions for session data, such as in-memory data stores like Redis or Memcached. These solutions are designed to handle high volume, short-lived, and key-value based data, making them a better fit for storing session data in web applications.
Fast, atomically incremented counters are a great fit for offering real-time statistics. (View Highlight)
Transient data. Any transient data used by your application is also a good fit for Redis. CSRF tokens (to prove a POST submission came from a form you served up, and not a form on a malicious third party site, need to be stored for a short while, as does handshake data for various security protocols. (View Highlight)
Incredibly easy to set up and ridiculously fast (30,000 read or writes a second on a laptop with the default configuration) (View Highlight)
Share state between processes. Run a long running batch job in one Python interpreter (say loading a few million lines of CSV in to a Redis key/value lookup table) and run another interpreter to play with the data that’s already been collected, even as the first process is streaming data in. You can quit and restart my interpreters without losing any data. (View Highlight)
Redis semantics map closely to Python native data types, you don’t have to think for more than a few seconds about how to represent data. (View Highlight)
That’s a simple capped log implementation (similar to a MongoDB capped collection)—push items on to the tail of a ’log’ key and use ltrim to only retain the last X items. You could use this to keep track of what a system is doing right now without having to worry about storing ever increasing amounts of logging information. (View Highlight)
It’s common to use MySQL as the backend for storing and retrieving what are essentially key/value pairs. I’ve seen this over-and-over when someone needs to maintain a bit of state, session data, counters, small lists, and so on. When MySQL isn’t able to keep up with the volume, we often turn to memcached as a write-thru cache. But there’s a bit of a mis-match at work here. (View Highlight)
With sets, we can also keep track of ALL of the IDs that have been used for records in the system. (View Highlight)
Quickly pick a random item from a set. (View Highlight)
API limiting. This is a great fit for Redis as a rate limiting check needs to be made for every single API hit, which involves both reading and writing short-lived data. (View Highlight)
A/B testing is another perfect task for Redis - it involves tracking user behaviour in real-time, making writes for every navigation action a user takes, storing short-lived persistent state and picking random items. (View Highlight)
Implementing the inbox method with Redis is simple: each user gets a queue (a capped queue if you’re worried about memory running out) to work as their inbox and a set to keep track of the other users who are following them. Ashton Kutcher has over 5,000,000 followers on Twitter - at 100,000 writes a second it would take less than a minute to fan a message out to all of those inboxes. (View Highlight)
- Note: Certainly! Here’s a simple representation of the Redis-based inbox system with three users: Alice, Bob, and Carol. User queue and set: Alice: Queue (Inbox) → | Bob’s message | Carol’s message | Set (Followers) → { Bob, Carol } Bob: Queue (Inbox) → | Alice’s message | Carol’s message | Set (Followers) → { Alice, Carol } Carol: Queue (Inbox) → | Alice’s message | Bob’s message | Set (Followers) → { Alice, Bob } Each user has a Queue (Inbox) and a Set (Followers). The Queue holds messages from the users they follow, and the Set contains the users following them. When a user sends a message: Alice sends a message: “Hello, everyone!” Alice: Set (Followers) → { Bob, Carol } The system checks Alice’s Set (Followers) and sends the message to Bob and Carol’s Queues (Inboxes): Bob: Queue (Inbox) → | Alice’s message: “Hello, everyone!” | Carol’s message | Carol: Queue (Inbox) → | Alice’s message: “Hello, everyone!” | Bob’s message | In this example, when Alice sends a message, the system first checks her Set (Followers) and finds that Bob and Carol follow her. Then, it adds the message to their respective Queues (Inboxes). This diagram provides a simplified illustration of how the Redis-based inbox system works.
• Have workers periodically report their load average in to a sorted set. • Redistribute load. When you want to issue a job, grab the three least loaded workers from the sorted set and pick one of them at random (to avoid the thundering herd problem). (View Highlight)
VoltDB as a relational database is not traditionally thought of as in the NoSQL camp, but I feel based on their radical design perspective they are so far away from Oracle type systems that they are much more in the NoSQL tradition. (View Highlight)
Data integrity. Most of the NoSQL systems rely on applications to enforce data integrity where SQL uses a declarative approach. Relational databases are still the winner for data integrity. (View Highlight)
Data independence. Data outlasts applications. In NoSQL applications drive everything about the data. One argument for the relational model is as a repository of facts that can last for the entire lifetime of the enterprise, far past the expected life-time of any individual application. (View Highlight)
Ad-hoc queries. If you need to answer real-time questions about your data that you can’t predict in advance, relational databases are generally still the winner. (View Highlight)
Complex relationships. Some NoSQL systems support relationships, but a relational database is still the winner at relating. (View Highlight)

Ayush Poddar's Wiki

Explorer

What the Heck Are You Actually Using NoSQL For - Highlights

Metadata

Highlights

New highlights added April 24, 2023 at 1:49 PM

Graph View

Table of Contents

Backlinks