Questions for the course "Big data and data science".
What are the three V's in Big data?
Volume
Variety
Velocity
What does the term 'Volume' refer to in Big data?
The amount of data
The size of the data
What does the term 'Variety' refer to in Big data?
The different types of data
The different sources of data
What does the term 'Velocity' refer to in Big data?
The speed of data
The rate at which data is generated
What is Horizontal scaling?
Adding more machines to the system
Adding more nodes to the system
What is Vertical scaling?
Adding more power to the system
Adding more memory to the system
Adding more CPU to the system
Adding more storage to the system
What are some of the benefits of using vertical scaling?
Easier to manage
Less expensive to start with
What are some of the benefits of using horizontal scaling?
Easier to scale
Less expensive in the long run
Commodity hardware
What are some caveats of using horizontal scaling?
Customized software
Load balancing
Network latency
Data consistency
Data partitioning
Data distribution
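The data partitioning and data distribution caveats above can be illustrated with a minimal sketch. It assumes simple hash-modulo placement, which is only one of several possible strategies; the names `node_for_key` and `partition` are hypothetical and do not come from any particular system.

```python
import hashlib


def node_for_key(key: str, num_nodes: int) -> int:
    """Map a record key to a node index using a stable hash (assumed scheme)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes


def partition(keys, num_nodes):
    """Group keys by the node index that would store them."""
    nodes = {i: [] for i in range(num_nodes)}
    for key in keys:
        nodes[node_for_key(key, num_nodes)].append(key)
    return nodes


keys = [f"user-{i}" for i in range(1000)]
three_nodes = partition(keys, 3)

# Count how many keys change node when a fourth node is added.
moved = sum(1 for k in keys if node_for_key(k, 3) != node_for_key(k, 4))
```

With plain modulo placement, adding a node tends to relocate a large share of the keys, which is one reason horizontally scaled systems often need customized software for placement and rebalancing.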
What does ETL mean?
Extract, Transform, Load
What does ELT mean?
Extract, Load, Transform
What does EtLT mean?
Extract, transform, Load, Transform
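The difference between ETL and ELT is the order of the steps, which a small sketch can make concrete. It assumes an in-memory SQLite database standing in for the target system; the table names, the sample rows, and the helper functions `clean`, `etl`, and `elt` are all hypothetical.

```python
import sqlite3

# Hypothetical extracted rows: (name, amount as raw text).
raw_rows = [("alice", " 42 "), ("bob", "17"), ("carol", "n/a")]


def clean(rows):
    """Transform step: keep rows whose amount parses as a whole number."""
    out = []
    for name, amount in rows:
        amount = amount.strip()
        if amount.isdigit():
            out.append((name, int(amount)))
    return out


def etl(rows):
    """ETL: transform in the pipeline, then load only the cleaned rows."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE sales (name TEXT, amount INTEGER)")
    db.executemany("INSERT INTO sales VALUES (?, ?)", clean(rows))
    return db


def elt(rows):
    """ELT: load the raw rows first, then transform inside the target with SQL."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE raw_sales (name TEXT, amount TEXT)")
    db.executemany("INSERT INTO raw_sales VALUES (?, ?)", rows)
    db.execute(
        "CREATE TABLE sales AS "
        "SELECT name, CAST(TRIM(amount) AS INTEGER) AS amount "
        "FROM raw_sales "
        "WHERE TRIM(amount) <> '' AND TRIM(amount) NOT GLOB '*[^0-9]*'"
    )
    return db
```

Both functions end with the same `sales` table, but only the ELT variant keeps the untransformed rows in the target. An EtLT pipeline would apply a light transform (for example the whitespace trimming) before loading and leave the heavier work to the target.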
Why use Hadoop over GFS (Google File System)?
Open source
Publicly available (GFS is proprietary to Google)
What are some key characteristics of the Parquet file format?
Column-based
Column optimized search
Optimized for MapReduce processing
Schema stored at the end of the file
Quick querying of values in a column
Good at computing aggregates or averages
Good for nested data structures
Supports schema evolution but not schema changes
What are some key characteristics of the Avro file format?
Row-based
Supports schema changes and evolution
Optimized for record exchanges
Schema stored in human readable format at the beginning of the file
Good for sharing entire records between applications
Good for logging & auditing
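The column-based versus row-based distinction in the Parquet and Avro answers above can be sketched in memory. This is an illustration of the two layouts only, not the actual on-disk encoding of either format; the sample records and the helper `to_columns` are hypothetical.

```python
# Row layout: each record is kept together (the Avro-style view).
records = [
    {"id": 1, "city": "Oslo", "temp": 4.0},
    {"id": 2, "city": "Bergen", "temp": 7.5},
    {"id": 3, "city": "Oslo", "temp": 5.5},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


# Column layout: each column is kept together (the Parquet-style view).
columns = to_columns(records)

# An average only needs to read the "temp" column.
avg_temp = sum(columns["temp"]) / len(columns["temp"])

# Exchanging a whole record only needs one entry of the row layout.
second_record = records[1]
```

The aggregate touches a single list in the column layout, while handing a complete record to another application is a single lookup in the row layout.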
What are some key characteristics of Sqoop?
Tool for importing structured data (SQL, etc.) into Hadoop
Not event driven
No longer maintained (the project is archived)
What are some key characteristics of Flume?
Designed for importing logs into Hadoop
Importing unstructured data
Works well with streamed data sources
Fault tolerant
Linearly scalable
Event driven
Can be used to buffer incoming data
What are the main characteristics of Kafka?
Distributed streaming platform
Publish-subscribe messaging system
Fault tolerant
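The publish-subscribe idea in the answers above can be sketched with a tiny in-memory broker. `TinyBroker` is a hypothetical class written for illustration; it is not the Kafka client API and has no distribution or fault tolerance. It only assumes that each topic is an append-only log and that each subscriber tracks its own read offset.

```python
from collections import defaultdict


class TinyBroker:
    """Hypothetical in-memory broker: one append-only log per topic."""

    def __init__(self):
        self._log = defaultdict(list)     # topic -> list of messages
        self._offsets = defaultdict(int)  # (topic, subscriber) -> next offset

    def publish(self, topic, message):
        """Append a message to the end of the topic's log."""
        self._log[topic].append(message)

    def poll(self, topic, subscriber):
        """Return messages this subscriber has not read yet, then advance its offset."""
        start = self._offsets[(topic, subscriber)]
        messages = self._log[topic][start:]
        self._offsets[(topic, subscriber)] = len(self._log[topic])
        return messages


broker = TinyBroker()
broker.publish("clicks", "a")
broker.publish("clicks", "b")
first_batch = broker.poll("clicks", "analytics")   # both messages so far
broker.publish("clicks", "c")
second_batch = broker.poll("clicks", "analytics")  # only the new message
audit_batch = broker.poll("clicks", "audit")       # a new subscriber reads from the start
```

Because every subscriber keeps its own offset, publishers and subscribers stay decoupled: a subscriber that joins late can still read the whole log.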