We’re talking databases, indexes, search engines, and why they’re basically microwaves in this episode while Joe wears a polo, Allen’s quick brown fox jumps over whatever, and Michael gives out fake URLs.
- Datadog.com/codingblocks – Sign up today for a free 14 day trial and get a free Datadog t-shirt after creating your first dashboard.
- Thank you to everyone that left us a review:
- iTunes – vr6apparatus, The All Ighty Ollar, BonerGs, MatiasWasAlreadyTaken, galusssss
- Stitcher – Vothis, JavaRuns100BillionDevices, SoulSurvivor, Twentyone
- The Cynical Developer (James Studdart) has Sean Martz on the show to talk about transitioning from a sys admin to a developer. (Cynical Developer Episode 79)
- In the Orlando area on June 21? Come see Joe’s talk on Building Search Driven Apps with Elasticsearch at the Orlando Backend Developer’s Meetup.
- Joe invents a new way to listen to podcasts … by topic.
Search Driven Apps
What’s the problem? Why do we need search driven apps?
- Search is a core tenant of modern computing
- Google, Shopping, Cortana, Spotlight, Contacts
- Can’t use the phone, listen to music, buy underwear or watch TV without searching
- In some cases, the number of items we’re searching are small, can get away with a simple LIKE % search
- But people aren’t very good at searching, and Google has spoiled us
- Auto-Complete helps with typos
- What if there’s too much data? Or the user doesn’t know what the thing is called?
- And what if there’s a lot of data?
- Auto-Complete helps with typos
Some interesting numbers:
- Google servers 40k searches … per second (Ardor Seo)
- Amazon sells over 500M products (ScrapeHero)
- Splunk indexes 100’s of TB per day
- 5B videos watched on YouTube every day (MerchDope)
Q: How are these services doing it?
A: Lots and lots of joins, and LIKE clauses?
Relational Databases …
- Don’t scale horizontally very well
- Are designed for ACID instead of BASE
- From Wikipedia:
- Atomicity – requires that each transaction be “all or nothing”:
- Consistency – ensures that any transaction will bring the database from one valid state to another.
- Isolation – ensures that the concurrent execution of transactions results in a system state that would be obtained if transactions were executed sequentially
- Durability – ensures that once a transaction has been committed, it will remain so, even in the event of power loss, crashes, or errors
- From Wikipedia:
- BASE – Basic Availability, Soft State, Eventual Consistency
- In regards to the CAP Theorem, Relational DB have an emphasis on Consistency
- CAP Theorem
- From Wikipedia:
- Consistency – every read receives the most recent write or an error
- Availability – every request receives a (non-error) response – without guarantee that it contains the most recent write
- Partition Tolerance – the system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes
- From Wikipedia:
- “Shredding” is inefficient (normalizing data for storage, and re-constructing it for retrieval over and over again)
- Writing dynamic ANSI SQL is the pits
DB vs Search Engine…why not just do it all in SQL?
- Why not just turn on full text search in SQL Server?
- Full-Text Search (Microsoft Developer Network)
- Elasticsearch Search Reference (Elastic Docs)
- Ability to search across all indexes, quickly and easily…
- Fast aggregations (facets)
- Ranged aggregations
- Migrations / Versions
- Simpler query requests
Ever written a SQL query that started out simple
- And ended up a monster?
- Start with a LIKE
- Add some dynamic JOINs
- In clauses lead to temp tables
- Paging leads to support for multi-page actions …
- CTE, Pivoting, OpenQuery
Writing dynamic SQL stinks.
How Inverted Index Searches solve the problem
So … why don’t we use a NoSQL implementation, denormalize the heck out of it, and then index the heck out of it, then map reduce the heck out of it – Well… you just described a basic search engine!
- NoSQL DB: Horizontal Scaling
- DeNormalize our data
- Index everything!
- Map/Reduce to search quickly
- Declarative language
A little bit about indexes …
- Forward index, and …
- Inverted index (Wikipedia)
- Let’s consider them both at the same time using a book analogy.
- A table of contents is a _forward index_. It tells you where to go in the book for a “document” (aka a chapter).
- The index you’ve seen in books is an _inverted index_. It tells you where to go in the book for all uses of the relevant words.
- Great answer on Stack Overflow.
But What’s a Reverse Index?
- Let’s consider a row in a database the “document” we want.
- A typical DB index would be a forward index. In other words, when we want row with ID = 123, we can seek right to it.
- Or put another way, a typical DB index acts like a table of contents for the “document” (i.e. row) we want
- Busy databases/indexes can suffer from index block contention, especially those indexes for monotonically increasing sequences. Like numerical based primary keys.
- Consider that the index for a table is stored across several “blocks”. Let’s hypothetically say each block contains 100 keys.
- Let’s also consider our table has a integer primary key that sequentially increases. So we have keys like 123, 124, 125, etc.
- So, 0-99 are in the first block, 100-199 are in the second block, 200-299 are in the third block, etc.
- If 100 write requests happen at the same time, it’s possible they could all queue up to write to the same block.
- Best case, two blocks given that scenario
- This is what a reverse index aims to solve.
- A reverse index simply reverses the key
- So for key = 123, it would go in the block as 321 (i.e. in our “300” block) and key = 124 would go in another block (in our “400” block) as 421, etc.
- This way reads/writes are distributed.
- In the case of 100 concurrent write requests, they would be spread across 10 different indexes given this scenario.
Back to Inverted Indexes
Take the sentence “The Quick Brown Fox Jumps Over the Lazy Dog”
- Store the entire document
- Throw away the “stop” words (like “the”)
- Break the sentence down into tokens for a hash table
- Each token contains an array of pointers to records
- When the user searches for “lazy fox” we can break that sentence into tokens, and return the records that are pointed to by our indexes
- Advanced rules apply: synonyms, antonyms, stemming, pluralization, scoring, etc
- Because of the storage mechanism, counting is easy too – we don’t even need to fetch the document, just count pointers
Inverted Index Search Engines
- Slow to Write, Fast to Read (Sort of…there’s near real time on many now)
- NoSQL before it was called NoSQL
- Scale Super Well
- Wicked Fast Reads
- Support Complex Filters
- Declarative syntax
Popular Search Engines
- Elasticsearch (and the Elastic Stack)
- Elasticsearch is open source (GitHub), as well as most of the other products in the Elastic Stack
- Elasticsearch BV (the company behind Elasticsearch) recently made the source code for X-Pack available (still requires a license to use) (Elastic blog announcement)
- X-Pack adds additional functionality to Elasticsearch including security, alerting, monitoring, reporting, graph analytics, dedicated APM UIs, and machine learning.
- Apache Lucene
- Apache Solr (built on Lucene)
- AWS CloudSearch
- AWS Elasticsearch Service
- Azure Search
Resources We Like
- The Imposter’s Handbook (bigmachine.io)
- 37 Mind Blowing YouTube Facts, Figures and Statistics – 2018 (MerchDope)
- How many Google searches per day on average in 2018? (Ardor Seo)
- How Many Products Does Amazon Sell? – January 2018 (ScrapeHero)
- ACID (Wikipedia)
- CAP Theorem (Wikipedia)
- Inverted Index (Wikipedia)
- What’s the difference between an inverted index and a plain old index? (Stack Overflow)
- Reverse Index (Wikipedia)
- Introduction To Reverse Key Indexes: Part 1 (Richard Foote’s Oracle Blog)
- How does database indexing work? (Stack Overflow)
- Database Index (Wikipedia)
- Elasticsearch from the bottom up (Blog, YouTube)
- Stack Overflow / Stack Exchange performance statistics (Stack Exchange)
Tip of the Week
- Pattern matching with is (Microsoft Docs)
- Use git cherry-pick -n to pick multiple commits into a single new commit. (git-cherry-pick Documentation)
- Looking for a Python fiddle? Try https://pyfiddle.io/.
- Google’s Python Class (Google for Education)
- Execute T-SQL from Visual Studio Code using the mssql extension
- Use git bisect to find the commit that broke your code. (git-bisect Documentation)
- Build, deploy, and manage your next web project with all of the features of an enterprise on Netlify.