Decisive Spring Recruitment (1): Interviewers Often Ask About Massive Data Processing

I personally think this is a weak spot for job seekers: students rarely get hands-on exposure to these problems, yet interviewers love to ask about them.

For massive data, the common methods are: hashing, bitmaps, Bloom filters, database optimization, inverted indexes, external sorting, Trie trees, heaps, bucket sort, and MapReduce.


Hash method

The main advantage of a hash table is O(1) lookup. Since it imposes essentially no storage order, access time does not grow as the data size increases; however, hash accesses can collide, so it is not necessarily suitable for massive data.
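A minimal separate-chaining hash table in Python illustrates both the O(1) average lookup and the collision handling just described (the class name and default bucket count are illustrative choices, not from the original text):

```python
# Sketch of a separate-chaining hash table: colliding keys share a bucket list.
class ChainedHashTable:
    def __init__(self, n_buckets=8):
        self.buckets = [[] for _ in range(n_buckets)]

    def put(self, key, value):
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)   # key already present: overwrite
                return
        bucket.append((key, value))        # collision: append to the chain

    def get(self, key, default=None):
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for k, v in bucket:                # scan only this bucket's chain
            if k == key:
                return v
        return default
```

Lookup stays O(1) on average only while the chains stay short, which is exactly why very large or skewed data can degrade a hash table.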


Bitmap method

As the name suggests, the bitmap method uses one bit to represent one value, which means we must know the data range in advance. For example, if the data range is 1~10, we allocate 10 bits of space; if the range is 1~10^3, we allocate one thousand bits.

The main applications of the bitmap method are sorting in O(N) and duplicate checking in O(N). Compared with O(N log N) comparison sorts, the bitmap method is very fast, but it essentially trades space for time, so if the data distribution is sparse, the price/performance ratio is undoubtedly very low.
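As a sketch of the idea, assuming non-negative integers with a known upper bound, a bitmap-based sort/dedup might look like this:

```python
# Bitmap sort sketch: one bit per possible value in a known range.
def bitmap_sort(values, max_value):
    """Return the distinct values in sorted order, in O(N + max_value)."""
    bitmap = bytearray((max_value >> 3) + 1)   # one bit per value 0..max_value
    for v in values:
        bitmap[v >> 3] |= 1 << (v & 7)         # set bit number v
    # Scanning the bitmap in index order yields the values already sorted.
    return [v for v in range(max_value + 1)
            if bitmap[v >> 3] & (1 << (v & 7))]
```

Note that duplicates are silently collapsed, which is why the same structure doubles as a duplicate checker.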

Bloom filter

It combines the two methods above: define an m-bit array and n different hash functions. When adding a piece of data, first apply the n hash functions to obtain n positions (the n positions are not necessarily distinct from each other), and set all n positions in the array to 1. When looking up data p, if the n hashed positions of p are not all 1, then p definitely does not exist.

To sum up, although this method saves a certain amount of space, it sacrifices accuracy. The guarantee that remains is one-sided: when a lookup reports "not present", that answer is certainly accurate.

Note: the error rate is minimized when the number of hash functions is n = (ln 2) * (m/N), which is approximately 0.7 * (m/N), where N is the number of elements inserted. CBF (Counting Bloom Filter) and SBF (Spectral Bloom Filter) are extensions of this scheme: CBF expands each bit of the array into a counter to support deletion, while SBF uses the minimum value of an element's counters to approximate its occurrence frequency.
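A minimal Bloom filter sketch is shown below. Deriving the n positions from salted SHA-256 digests is an assumption made here for brevity; real implementations use faster non-cryptographic hashes:

```python
import hashlib

class BloomFilter:
    """Sketch of an m-bit Bloom filter with n hash functions."""
    def __init__(self, m, n):
        self.m, self.n = m, n
        self.bits = bytearray((m + 7) // 8)

    def _positions(self, item):
        # Simulate n hash functions by salting one digest with the index i.
        for i in range(self.n):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)       # set all n positions to 1

    def might_contain(self, item):
        # False => definitely absent; True => present OR a false positive.
        return all(self.bits[p >> 3] & (1 << (p & 7))
                   for p in self._positions(item))
```

The one-sided guarantee from the text is visible in `might_contain`: a single zero bit is enough to prove absence.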

Database optimization method

Common optimization ideas

  • Data partitioning
  • Indexing
  • Caching mechanisms
  • Virtual storage
  • Batch processing
  • Temporary and intermediate tables
  • Optimized query statements
  • Views
  • Stored procedures
  • Sorted access instead of unsorted access
  • Sampled data for data mining

Inverted index

In real search-engine applications, it is often necessary to find records according to the values of certain keywords, so an index is built keyed on those words; this is the inverted index.

The inverted index has two forms

  1. A record-level inverted index contains, for each word, a list of the documents that cite it
  2. A word-level inverted index additionally records each word's positions within each document

The second form provides more functionality, but also requires more time and space.

Compared with the forward index, the inverted index is better suited to full-text queries: because it maps each word to the documents containing that word, a multi-keyword query can be answered with logical operations such as union and intersection on the posting lists before any records are accessed, which improves search speed.
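A tiny record-level inverted index makes the set-operation point concrete (the documents and words are made up for illustration):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to the set of document ids containing it (record-level)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[word].add(doc_id)
    return index

docs = {1: "big data hash", 2: "hash table lookup", 3: "big table"}
index = build_inverted_index(docs)

# A multi-keyword AND query is just an intersection of posting lists;
# no document record is touched until the candidate set is known.
result = index["big"] & index["table"]
```

With real posting lists kept sorted, the intersection can be done by merging rather than with hash sets, but the principle is the same.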

External sort

External sorting is defined relative to internal sorting: the records to be sorted reside on external storage, the file cannot be loaded into memory all at once, and memory and external storage must exchange data multiple times. External sorting is generally implemented with merge sort or similar methods.

Merge sort :

  • Divide the data into n segments and sort each segment internally
  • Merge the sorted segments pairwise, repeating until the whole file is ordered
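The two steps above can be sketched in Python. Here lists stand in for files on disk, and temporary files play the role of the sorted runs spilled to external storage:

```python
import heapq
import tempfile

def external_sort(values, chunk_size):
    """Sketch: sort chunks that 'fit in memory', spill each to a file,
    then k-way merge the sorted runs."""
    run_files = []
    for i in range(0, len(values), chunk_size):
        run = sorted(values[i:i + chunk_size])     # in-memory sort of one segment
        f = tempfile.TemporaryFile(mode="w+")
        f.writelines(f"{v}\n" for v in run)        # spill the sorted run to disk
        f.seek(0)
        run_files.append(f)
    runs = [(int(line) for line in f) for f in run_files]
    merged = list(heapq.merge(*runs))              # k-way merge of the sorted runs
    for f in run_files:
        f.close()
    return merged
```

`heapq.merge` streams the runs, so at any moment only one line per run needs to be in memory, which is the whole point of external sorting.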

Trie tree

It is also called a dictionary tree; it saves a great deal of space and time by exploiting common prefixes among strings. It is mainly used for counting and storing large numbers of strings. I will publish an algorithm study of the dictionary tree soon.

  • The root node contains no character; every other node contains exactly one character
  • The characters on the path from the root to a node, concatenated, form the string corresponding to that node (when that node is a tail node)
  • Each node maintains an integer: 0 (the initial value) means the node is not a tail node; otherwise it records how many strings end at this node
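The three properties above translate almost directly into code. This is a minimal counting Trie, matching the "integer per node" convention from the list:

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # character -> child TrieNode
        self.count = 0       # 0 = not a tail node; otherwise #strings ending here

def trie_insert(root, word):
    node = root
    for ch in word:          # walk/extend the path of characters
        node = node.children.setdefault(ch, TrieNode())
    node.count += 1          # mark (and count) the tail node

def trie_count(root, word):
    node = root
    for ch in word:
        node = node.children.get(ch)
        if node is None:     # path breaks: word was never inserted
            return 0
    return node.count        # 0 if this node is only a prefix, not a tail
```

Shared prefixes like "the"/"then" are stored once, which is where the space saving comes from.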


Heap

The heap is a tree structure, so it is only partially ordered internally, but a parent node is always no smaller than its children (max-heap, "large root heap") or no larger than its children (min-heap, "small root heap"). From this structure we can see that the heap excels at finding extreme values, and with a double heap we can even find the median.
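The double-heap median trick mentioned above can be sketched with `heapq` (Python only provides a min-heap, so the lower half is stored negated to simulate a max-heap):

```python
import heapq

class MedianFinder:
    """Two-heap median sketch: max-heap for the lower half,
    min-heap for the upper half."""
    def __init__(self):
        self.lo = []   # max-heap via negated values (lower half)
        self.hi = []   # min-heap (upper half)

    def add(self, x):
        heapq.heappush(self.lo, -x)
        # Move lo's maximum to hi, guaranteeing max(lo) <= min(hi).
        heapq.heappush(self.hi, -heapq.heappop(self.lo))
        if len(self.hi) > len(self.lo):                  # rebalance sizes
            heapq.heappush(self.lo, -heapq.heappop(self.hi))

    def median(self):
        if len(self.lo) > len(self.hi):                  # odd count
            return -self.lo[0]
        return (-self.lo[0] + self.hi[0]) / 2            # even count
```

Both heap tops sit at the middle of the data, so `median()` is O(1) and each `add` is O(log n).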

Bucket sorting

Bucket sorting is a classic application of the divide-and-conquer idea. According to the data range and the amount of data, the data is divided into several buckets, each bucket is sorted internally, and the buckets themselves are already ordered relative to one another.

In addition, it is generally used when the key distribution range is relatively small; otherwise the number of buckets n = (max - min)/len + 1 becomes too large and wastes resources.
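A direct sketch of the scheme, using the bucket-count formula from the text (`len` is the width of one bucket):

```python
def bucket_sort(values, bucket_len):
    """Divide-and-conquer sketch: split [min, max] into buckets
    of width bucket_len, sort inside each bucket, concatenate."""
    if not values:
        return []
    lo, hi = min(values), max(values)
    n_buckets = (hi - lo) // bucket_len + 1     # n = (max - min) / len + 1
    buckets = [[] for _ in range(n_buckets)]
    for v in values:
        buckets[(v - lo) // bucket_len].append(v)
    result = []
    for b in buckets:            # buckets are already ordered by value range
        result.extend(sorted(b)) # sort within each bucket
    return result
```

If the range is huge but the data is sparse, most buckets stay empty: exactly the wasted-resources case the text warns about.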

MapReduce method

MapReduce is a distributed programming model that simplifies parallel computing. Its main purpose is to let large-scale cluster systems work on large data sets in parallel. In this architecture, the MapReduce API provides the Map and Reduce processing, while the GFS distributed file system and the BigTable distributed database provide data access.

MapReduce is divided into map and reduce: map is a one-to-one transformation of the data, while reduce folds multiple data items into one, for example a sum or a product.
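A single-process word-count sketch shows the two phases (the sort between them stands in for the framework's shuffle step; function names are illustrative):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # map: transform each line into (word, 1) pairs
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # "shuffle": bring equal keys together, then reduce: sum per word
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

counts = dict(reduce_phase(map_phase(["big data", "big table"])))
```

In a real cluster, the map tasks and reduce tasks run on different machines and the shuffle moves pairs over the network, but the data flow is the same.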

Next are some common problem types

TopK problems

Examples: take the largest k numbers, the k most frequent numbers, hot search rankings, the most popular query terms, and so on...

Usually a good solution is [divide and conquer + Trie tree/hash + small top heap]: first split the data set into multiple small data sets by hashing, then use a Trie tree or a hash table to count the term frequencies within each small data set, then use a small top heap (min-heap) to find the k most frequent items in each data set, and finally merge the results.

  1. Local elimination: always maintain an array of size k
  2. Divide and conquer
  3. Hashing
  4. Small top heap (min-heap)

Of course, concrete problems require concrete analysis, so we further subdivide the problem by machine configuration.

Single machine single core large memory

This effectively means the data is not that large, and it can be processed directly in memory.

Single machine multi-core large memory

Use the hash method to divide the data into n parts, hand each part to a thread, and finally use one thread to merge the results. This method has an obvious bottleneck that affects efficiency: data skew. Each thread may process at a different speed, and the gains from the fast threads are swallowed by the slow ones. The solution is to partition further: each time a thread finishes its current small partition, it receives the next one.

Single machine single core small memory

Cut the data into small files that fit in memory, process each file in turn, and merge the results.


Multi-machine small memory

Distribute the data to multiple machines [hash + socket], and each machine then applies the single-machine, single-core, small-memory approach.

The TopK problem is very well suited to MapReduce. The user only needs to write a map function and two reduce functions, then submit the job to Hadoop (using ChainMapper and ChainReducer). The map is responsible for delivering data with the same hash value to the same reduce task; the first reduce counts the frequency of each word, and the second reduce computes the top k over the outputs of all first-stage reduce tasks.

Duplicate question

A problem arises with the plain bitmap method: a single bit has only two possible values, 0 or 1, so it can distinguish only two states, "not seen" and "seen".

So we choose to use two bits per value, which adds a third state: "not seen" (00), "seen once" (01), and "seen more than once" (10).
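A sketch of the 2-bit bitmap, assuming non-negative integers with a known upper bound (state encoding as above; since each value's two bits start at an even offset, they never straddle a byte):

```python
def find_duplicates(values, max_value):
    """2-bit-per-value sketch: 0 = unseen, 1 = seen once, 2 = seen again."""
    bitmap = bytearray((max_value * 2 + 7) // 8 + 1)
    dup = []
    for v in values:
        byte, shift = (2 * v) >> 3, (2 * v) & 7   # shift is always even
        state = (bitmap[byte] >> shift) & 3
        if state == 0:                            # first sighting: 00 -> 01
            bitmap[byte] = (bitmap[byte] & ~(3 << shift)) | (1 << shift)
        elif state == 1:                          # second sighting: 01 -> 10
            bitmap[byte] = (bitmap[byte] & ~(3 << shift)) | (2 << shift)
            dup.append(v)     # the moment we learn v is duplicated
        # state == 2: already known duplicate, nothing to update
    return dup
```

Each value is reported as a duplicate exactly once, on its second occurrence.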

Sorting problem

  1. Database sorting

  2. Divide and conquer

  3. Bitmap