NoSQL:

different from sql as it doesn't use a relational data model so is flexible
doesn't require database schema efficiency
All data in oe table so JOINs are faster
Horizontally scalable, meaning you can just use another shard instead of buying additional hardware

Types of noSQL

Key value store:
- Specifically built for high-performance requirements and most common.
- Uses key values with pointters to store data.
  - Document-Based Store:
- All data stored in one table
- Similar to key-value store but has encoding such as xml
  - Column-Based Store
- Stores information in columns instead of rows
- They are grouped into families
- Limitless column nesting data model
- Very fast for searches
  - Graph-Based Store
- Used for graphs (duh)
- Uses relationships and nodes, data is the information
- relationship is formed between the nodes

RDMBS

Is a relational database management system to interact with a relational database
Unlike a DBMS, Uses normalizatio, primary and foreign keys, integrity checks and ACID properties to ensure data is reliable.

The 3 V's of data science

Volume
- This is how much data is stored and is obviously key when talking about 'big' data
Velocity
- Velocity is the growth of the data, and the importance of it.
- It measures how fast the data is coming in.
Variety
- Things such as video, text and pdf are examples of non-traditional forms of data that need to be dealt with in big data

Polyglot persistance

Refers to the value in using multiple data storage technologies for different storage needs.
No one solution fits all
Different tech for different kinds of data

Lambda Architecture

Used to handle massive quantities of data by using both batch and stream-processing methods
3 layers:
Batch Layer
- Precomputes results with distributed processing system that can handle very large quantities of data.
- Perfect accuracy is required.
- Fixes any errors by recomputing based on the complete data set then updates the existing view, allowing it to reach perfect accuracy with reversible functions.
- Output is stored in read-only database with updates changing precomputed views.
- This is what Hadoop is and is the leading batch processor.
Speed layer
- This layer processes data streams in real time without completeness
- Sacrifices throughput as it aims to minimize latency by using real-time views into the data.
- The speed layer is used to fill the "gap" caused by the batch layers lag in providing views based on the most recent data.
- Not as accureate or complete as the ones eventually produced by the batch layer, but available far quicker.
Serving layer
- Output from 2 other layers are stored in the serving layer
- Responds to queries by returning pre-computed views from processed data.
- Cassandra is a dedicated store used in the serving layer.

Hadoop

written in Java
Uses large cluster of hardware to maintain big data.
Works on MapReduce algorithm made by Google.
Hadoop MapReduce is a programming model for processing big data sets with a parralel distrobuted algorithm.
One challenge to MapReduce is the sequential multi-step process it take sto run a job. This makes MapReduce jobs slower because each step requires a disk read and write.

Hadoop MapReduce Phases

Mapping Phase
- Two steps in this phase, splitting and mapping.
- Data is split into equal units called chunks (input splits) in the splitting step.
- Hadoop consists of a RecordReader that uses TextInputFormat to transform input splits into key-value pairs
- Key-value pairs are used as inputs in the mapping step.
- Only data format that a mapper can read or understand.
- Mapper processes the key-value pairs and produces an output of the same form (key-value pairs).
Shuffling phase:
- This phase removes the duplicate values and groups value.
- Different values with similar keys are grouped. the output of this phase will be keys and values, just like in the mapping phase.
Reducer phase:
- Outputs of the shuffling phase is used as the input.
- Processes this input more to reduce the intermediate values into smaller values.
- Gives a summary of the entire datasets

Hadoop Distrobuted File System(HDFS)

Fault-tolerant file system.
designed to be deployed on low-cost, commodity hardware.
Provides high throughput data access to application data.

Cassandra

Cluster

Cassandra DB is distrobuted over multiple machines that work together.
Each node is a replica to handle failures
nodes are in a cluster ring format

Keyspace

Replication factor is the number of machines the information is stored on
Replica placement strategy is simply the strategy to place replicas in the 'ring' shown above. Strategies include
- Simple strategy (rack aware strategy)
- old network topology strategy (rack aware strategy)
- network topology strategy (datacenter-shared strategy)
Column families, keyspace is a container for a list of one or more column families. A column family is a collection of rows with ordered columns.

Apache Spark

Spark was created to address the limitations present in MapReduce by doing processing in-memory avoiding steps and reuing data across parallel operations.
Only one step is needed with spark where the data is read into memory, operations are done, and the results are written back making it much faster.
Spark also reuses data by using in-memory cache to speed up machine learning algorithms that repeatedly call a function on the same dataset.
This all makes spark multiple times faster than MapReduce especially when doing machine learning and interactive analytics.

Resilient Distributed Dataset (RDD)

Every spark application has an abstraction of RDD, which is a collection of elements partitioned across nodes of the cluster that can be opereated on in parallel.
They are created by starting with a file in the Hadoop file system
RDDs automatically recover from node failures.

Erlang

Used for building concurrent software.
Pros of use:
- It is a simple language to use
- It scales very well as once the deployment and integration is set up, scaling erlange nodes and spawning new processess on different machines becomes easy.
- When a program fails it is able to re-span any failed process automatically making it a safe language to use for concurrency.
Cons:
- It has a convoluted setup process
- Is a dynamically typed language so it's possible for problems to not be found at compile time and instead will be found during running of the code.

Local clustering coefficient

The local clustering coefficient of a vertexx node is how close its neighbours are to being a complete graph

CAP Theorem

Also named Brewer's theorem states that any distributed data store can provide only two of the following three quarantees, but never all 3.

Three guarantees (only 2 at once):

Consistency - Every read recieves the most recent write or an error
Availability - Every request recieves a non error response, without the guarantee that it contains the most recent write.
Partition tolerance - The system continues to work despite messages being dropped or delayed by network nodes.

ACID vs BASE model of db functionality.

Acid details:

Base Stands for

Basically Available - Rather than enforcing immediate consistency, BASE-modelled NoSQL databases will ensure availability of data by spreading and replication it across the nodes of the database cluster.
Soft State - Due to the lack of immediate consistency data values may change over time, the BASE model breaks off with the idea that a database must enforce its own consistency, leaving that to individual developers.
Eventually Consistent - the fact that BASE does not enforce immediate consistency does not mean that it never achieves it. but until it does it's still possible to read data.

Base differences:

ACID-compliant databases are better when consistency, predicatbility, and reliability are required.
When growth is the highest priority, BASE is a good option as it's easier to scale it up and gives more flexibility.
BASE requires developers who are experienced and know how to deal with it's limitations of enforcing consistency on their own.

Scala Basics

Means Scalable Language as it's entire purpose is to scale easily.

it's a blend of oop and functional programming language with static typing.
Brings in the best of both paradigms
High level abstraction and conciseness of functional languages
Flexibility, encapsulation and modularity from object oriented languages
Compiles to JVM so can use java libraries and types
order of magnitude less lines than java
statically typed Functional components:
Operations map inputs to outputs instead of re-write so no side-effects
no i=i+1 in functional language
no global state, no loops, no destructive state
Significantly easier concurrency
- The lack of side effects means no race conditions or deadlocks
- parallel programs are easier to write
- Very easily scaleable
- Share-nothing approch is ideal for distributed systems
Many big data frameworks have binding for Scala so easier to write than in Java or Python

Basic Scala syntax

val - immutable variable
var - mutable variable
val \:\ = \
Data types Int, Byte, Short, Long, Char, Float, Double, Boolean, String

List info:

Lists are immutable and have a recursive structure
can 'add' to a list by doing x :: xs where x is the first element and xs is the current list.
Functions for list
- head - first element
- tail - everython but the first element
- isEmpty
- ::: - list concatination
- length
- init, last
- reverse
- drop, take, splitAt

Array info:

Similar to lists but they are mutable
Not recursive.
Defined similar to lists
val = new Array[\][\]
Can convert between the two with elements, toArray, toList

Input output in scala:

println(\)
Source.fromFile(\).getLines

Classes in Scala

Object created using new

class MyClass {}

val x = new MyClass