[Data competition] Kaggle ARC Top1 program interpretation

[Data competition] Kaggle ARC Top1 program interpretation

Author: SW King ****

Kaggle: Interpretation of Abstraction and Reasoning Challenge Top1


Competition background

Can a computer learn complex, abstract tasks from a few examples?

The current machine learning technology requires a lot of data and is fragile, and they can only understand the patterns they have seen before. Using current methods, an algorithm can acquire new skills through the exposure of a large amount of data, but the cognitive ability that can be broadly extended to many tasks is still elusive. This makes it very challenging to create systems that can handle the variability and unpredictability of the real world, such as home robots or self-driving cars.

However, other methods, such as inductive programming, provide the potential for more humane abstraction and reasoning. The Abstraction and Reasoning Corpus (ARC) provides a benchmark for measuring the acquisition of artificial intelligence skills for unknown tasks. The limitation is that only a few demonstrations can learn complex tasks. Artificial intelligence provides an opportunity to quickly solve future problems. Kaggle abstraction and reasoning challenges invite you to try to bring the future into the present!

The competition was hosted by franois Chollet, the creator of the Keras neural network library. Chollet's paper on intelligence measurement provides the background and motivation behind the ARC benchmark.

In this game, you will create an AI that can solve inference tasks that have never been seen before. Each ARC task contains 3-5 pairs of training inputs and outputs, as well as a test input. You need to use the patterns learned from the training examples to predict the corresponding output.

If successful, you will help bring computers closer to human cognition, and you will open a door to new artificial intelligence applications!

Question evaluation

1. Evaluation Index\

The competition uses the top 3 error rate as the evaluation index. For each task in the test set, you can predict up to 3 outputs for each test input grid. Each task output has a ground truth. For a given task output, if any of the 3 predicted outputs contains ground truth, the error of the task is 0, otherwise it is 1. The final score is the average error of all tasks.

Mathematically, for each task, your algorithm can make up to three predictions, where the error of the task with respect to the ground truth is:

Among them, when it is 0, otherwise it is 1. The overall error score is:

2. The form of output prediction

The training, evaluation, and test input data are all in the same JSON format. Input and output are stored in 2d's python list. However, the output prediction must be flattened into a string, and the list lines are separated by |.

For example, the output [[1,2],[3,4]] should be reformatted to |12|34| as a prediction. Each task can have up to 3 output predictions, and they should be separated by spaces. See sample.csv for details.

The following python code converts 2d list pred into the correct format:

def flattener(pred):
    str_pred = str([row for row in pred])
    str_pred = str_pred.replace(', ''')
    str_pred = str_pred.replace('[[''|')
    str_pred = str_pred.replace('][''|')
    str_pred = str_pred.replace(']]''|')
    return str_pred

3. Submit documents

For each task output_id in the test set, you can make up to 3 predictions. output_id is the id of the task, with an indicator indicating which output you want to predict for the task. Most tasks have only one output (ie 0), although some tasks have two outputs that must be predicted (ie 0, 1). The file should contain headers and have the following format:

00576224_0,|32|78| |32|78| |00|00|

Data description

The purpose of this competition is to create an algorithm that can solve abstract reasoning tasks. The format of the competition is very different from previous competitions, so please read this information carefully and refer to the supplementary documents as needed.

When viewing tasks, "test-taker" can access the input and output of the demo pair (training pairs), and the input of the test pair. The goal is to construct an output grid corresponding to the test input grid, using 3 trials for each test input. "Building an output grid" includes selecting the height and width of the output grid, and then filling each cell in the grid with a symbol (an integer between 0 and 9, which can be regarded as a color). Only the exact solution (all cells meet the expected answer) can be said to be correct.

An additional information and an interactive application can be found on the abstract and reasoning corpus github page to explore the goals of this competition. It is strongly recommended that you download and browse the interactive application, which is the best way to understand the goals of the competition.

Task file format\

Task files are stored in two ways:

  • training: Contains task files for training (400 tasks). Use these prototypes of your algorithm or train your algorithm to obtain cognitive prior knowledge related to ARC.
  • evaluation: task file containing 400 tasks for evaluation;

The tasks are stored in JSON format, and each JSON file contains a dictionary of two fields:

  • train: Describe input/output pairs, which is a list of pairs (usually three pairs);
  • test: test input/output pair, a list of pairs (usually 1 pair);

A pair is a dictionary with two fields:-"input": the input "grid" of the pair;-"output": the output "grid" of the pair.

A "grid" is a rectangular matrix between 0-9, the smallest possible grid size is 1*1, and the largest is 30 * 30;

Note: For this competition, we retain the language of the ARC data set, but this will cause some ambiguity. Both training and evaluation tasks include training and testing pairs of input and output. The leaderboard will be scored using 100 invisible test tasks. The test task will contain training pairs of input and output, but you only get the task test input. Your algorithm must predict the test output. For the 100 tasks used to score leaderboards, most only one test input requires a corresponding output prediction, although for a few tasks, you will be required to make predictions on two test inputs. The output_id shown in the example in sample_submission.csv represents the task id as the test input id.


  • training: a folder of task files containing training data for 400 tasks;
  • Evaluation: A folder of task files, containing 400 tasks, use these to do the final algorithm evaluation.
  • test:
    • The test folder on the data page contains the first 100 tasks copied from the evaluation folder, but there is no answer; its purpose is to ensure that the code works properly when synchronized re-run.
    • The synchronously re-run test folder contains 100 (unviewed) tasks for scoring the model; these same unviewed 100 tasks are used for public and private leaderboards. You will not have access to these tasks, and your public/private score will be calculated during the rerun of the synchronization.
  • sample_submission.csv: The sample submission file in the correct format; the output_id column contains the task to be predicted and the test output id; the output_id list of the sample submission on this data page will be different from the rerun notebook synchronization.

Top1 program interpretation

The main component of the core solution is a DSL , which applies 4 of 142 unary transformations (based on 42 different functions, some of which have multiple variants). I effectively enumerate the transformations by reducing repetition, and then combine them by greedily stacking them to fit the training samples. Everything is implemented in C++ (no dependencies) and can be run in parallel. A simple scheduler tries to make the most of the 9 hours/16gb memory budget.

1. Transformation selection and engineering

The most important points of image conversion:

  • Cut(Image) -> list of images
    • Try to find a background color and divide the remaining pixels into groups connected by corners;
  • filterCol (image, color) -> image
    • Delete all colors except the given color (set it to 0);
  • colShape (image, color) -> image
    • Change all non-zero pixels to "color".
  • composeGrowing (list of images) -> image
    • Stack the list of images together, treating 0 as transparent. The image with the fewest non-zero pixels is at the top.
  • compress (image) -> image
    • Extract the smallest sub-image containing all non-zero pixels
  • rigid (image, id) ->image
    • Rotate (0/90/180/270) and flip
  • pickMax (list of images, id) -> image
    • Extract the image with the largest attribute, such as id=0 to extract the image with the most non-zero pixels.

I manually solve 100 training tasks and 100 evaluation tasks, and then extract useful functions to construct transformations. Generalize when it looks reasonable (like add all rotations if I use one of them). I did not try to cut and convert, because the given tasks do not seem to represent the tasks needed on the leaderboard.

The transformations stack up very well, and even solved several tasks that I used other transformations (models not available) when solving manually.

2. Integration

In the last model, I ran 4 different configurations and integrated the prediction results.

I run the conversion search depth 3, the depth 3 increases the diagonal flip (multiply by the two diagonal flips), and finally the depth 4 runs until the time or memory runs out. The best forecast is selected based on the following criteria, the highest standard is the most important criterion:

  • Processed the most training samples
  • Minimum depth solution
  • Images stacked with greedy stacker least

3. Tricks

The task of flipping diagonally to increase my sample is a simple technique that gives me significantly better scores . By remapping colors according to some heuristics to preprocess all samples, the effect is surprisingly good.

I believe my main advantage over other competitors is my experience in competitive programming. It allows me to quickly and efficiently write a large number of image conversions in C++, which allows me to search for more conversion combinations compared to Python implementations or other less optimized solutions.

4. Running time

A natural way to make my method run faster is to reduce the search depth. When I used it for a full 9 hours, I was able to run about half of the problem at depth 4 and about 20 times faster at depth 3 (20 times less memory occupied). During the development process, I will run on depth 2, which is 15 times faster than depth 3, while solving 80% of the tasks on the evaluation set.


  1. https://www.kaggle.com/c/abstraction-and-reasoning-challenge/discussion/154597
  2. https://www.kaggle.com/c/abstraction-and-reasoning-challenge/overview


 qq 704220115